Enhance Solr to support per-document results in batch mode
----------------------------------------------------------
Key: SOLR-3018
URL: https://issues.apache.org/jira/browse/SOLR-3018
Project: Solr
Issue Type: Improvement
Components: clients - java
Affects Versions: 4.0
Environment: any
Reporter: Rob Tulloh
It would be useful to have Solr return per-document results instead of a
generic SolrException when multiple documents are passed via
CommonsHttpSolrServer. The API supports adding multiple streams/files to a
request (see SOLR-3010 for an example usage in Jython), but when an error is
detected, an exception is returned to the caller, and the caller must then
determine which document failed to be processed. This is particularly
problematic for simple document extraction when using Solr and Tika to
pre-process documents for indexing. In this case, a batch of documents is
passed to Solr for processing by Tika. If any of the documents fails to be
processed, a SolrException is thrown:
{noformat}
Mon Jan 9 18:04:50 2012 Caught SolrException handling documents [13356414,
23590833, 33917483] (<jclass org.apache.solr.common.SolrException 9>,
org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException:
TIKA-198: Illegal IOException from
org.apache.tika.parser.microsoft.TNEFParser@6d893ae8
org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException:
TIK
{noformat}
Instead of throwing this exception, the API could be configured to return a
response containing one result per document, indicating how the server handled
each document in the batch. A caller could then check the response, extract the
parsed content for the documents that succeeded, and apply special handling to
the documents that failed to parse.
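
To make the request concrete, here is a rough sketch of what a per-document response might look like from the caller's side. Nothing here exists in Solr or SolrJ: the names DocumentResult, doc_id, content, error and split_results are hypothetical and only illustrate the shape being asked for.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DocumentResult:
    """Hypothetical per-document entry in a batch response (not a Solr type)."""
    doc_id: str
    content: Optional[str] = None  # parsed text when extraction succeeded
    error: Optional[str] = None    # server-side error message when it failed

    @property
    def ok(self) -> bool:
        # A document counts as successful when no error was recorded for it.
        return self.error is None


def split_results(
    results: List[DocumentResult],
) -> Tuple[List[DocumentResult], List[DocumentResult]]:
    """Separate successful documents from failed ones, preserving order."""
    succeeded = [r for r in results if r.ok]
    failed = [r for r in results if not r.ok]
    return succeeded, failed
```

With a response of this shape, the caller would index the content of the succeeded list and route the failed list to whatever special handling is appropriate, without having to guess which document caused the error.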
There are reasonable workarounds for this in the current product. First,
callers can pass one document at a time for processing, so there is no
ambiguity about which document a result belongs to. Another approach is to
pass a small batch of documents to Solr/Tika and, if an exception is thrown,
reprocess the documents in that batch one at a time. If the corpus of
documents is largely well-behaved, few retries will be needed to reprocess
failures.
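
The second workaround can be sketched as follows. This is a minimal illustration, not SolrJ code: submit_batch is a hypothetical callable standing in for whatever the caller uses to send documents to Solr/Tika, and it is assumed to raise an exception on failure and to be safe to call again for documents that were already in a failed batch.

```python
from typing import Callable, Dict, Optional, Sequence


def process_with_fallback(
    doc_ids: Sequence[str],
    submit_batch: Callable[[Sequence[str]], None],  # hypothetical sender
    batch_size: int = 10,
) -> Dict[str, Optional[Exception]]:
    """Map each document id to None on success or to the exception it raised."""
    results: Dict[str, Optional[Exception]] = {}
    for start in range(0, len(doc_ids), batch_size):
        batch = list(doc_ids[start:start + batch_size])
        try:
            # Fast path: the whole batch goes through in one request.
            submit_batch(batch)
            for doc_id in batch:
                results[doc_id] = None
        except Exception:
            # The batch failed as a whole, so resubmit its documents
            # one at a time to find out which ones are responsible.
            for doc_id in batch:
                try:
                    submit_batch([doc_id])
                    results[doc_id] = None
                except Exception as exc:
                    results[doc_id] = exc
    return results
```

The cost of a bad document is one wasted batch request plus one request per document in that batch, so smaller batches limit the retry work while larger batches reduce the number of requests when everything parses cleanly.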