[ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771663#comment-16771663
 ] 

Karl Wright commented on CONNECTORS-1563:
-----------------------------------------

Hi Subasini,

Are you now doing the Tika extraction in ManifoldCF, or in Solr?
The text field looks like it contains properly extracted content, along with 
other material you do not want.  Is this correct?

If the extraction is happening in Solr, then I have no idea where this is 
coming from.  If the extraction is happening in ManifoldCF, and you have placed 
a Metadata Adjuster transformer in the pipeline between the Tika Extractor and 
the Solr Output Connector, then I'd say you have set it up to concatenate many 
fields together into a text field.  The Metadata Adjuster has that ability.

How metadata (or content) fields get mapped to the Solr schema is set up in 
your Solr output connection configuration.  The Tika extraction basically 
replaces a binary input document with a character-sequence output document plus 
metadata fields.  That character-sequence document must then be sent to Solr 
through the standard update handler, not the extracting update handler.  So the 
handler should be changed from /update/extract to just /update, and the 
"Use extracting update handler" option should be turned off.  The field name 
used for the extracted content body can also be changed, if desired, in the 
"Schema" part of the configuration.  But the default value works with Solr as 
it is set up by default.
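
To make the difference between the two handlers concrete, here is a rough 
sketch of what a request to the standard /update handler looks like once the 
text has already been extracted.  This is only an illustration, not what 
ManifoldCF does internally.  The base URL, the collection name "mycollection", 
the field name "content", and the helper build_update_request are all 
assumptions made up for the example.  The request is built but not sent.

```python
import json
from urllib.request import Request


def build_update_request(solr_base, collection, doc_id, text, metadata,
                         content_field="content"):
    """Build (but do not send) a POST to Solr's standard /update handler.

    The body is plain JSON holding already-extracted text plus metadata
    fields, so Solr has no binary stream to run Tika on.
    """
    # "content" is an assumed field name; use whatever your schema defines.
    doc = {"id": doc_id, content_field: text}
    doc.update(metadata)
    url = f"{solr_base.rstrip('/')}/{collection}/update?commit=true"
    body = json.dumps([doc]).encode("utf-8")
    return Request(url, data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")


# Hypothetical values, for illustration only.
req = build_update_request(
    "http://localhost:8983/solr", "mycollection",
    "doc-1", "Extracted body text", {"title": "Example"},
)
print(req.full_url)
print(req.data.decode("utf-8"))
```

By contrast, /update/extract expects the raw binary document and runs Tika 
on the Solr side, which is where a zero-length input stream would be rejected.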





> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream 
> must have > 0 bytes
> -----------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1563
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1563
>             Project: ManifoldCF
>          Issue Type: Task
>          Components: Lucene/SOLR connector
>            Reporter: Sneha
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: Document simple history.docx, managed-schema, manifold 
> settings.docx, manifoldcf.log, solr.log, solrconfig.xml
>
>
> I am encountering this problem:
> When I check the "Use the Extract Update Handler:" parameter, I get an 
> error on Solr, i.e. null:org.apache.solr.common.SolrException: 
> org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 
> bytes
> If I ignore the Tika exception, my documents get indexed but don't have a 
> content field on Solr.
> I am using Solr 7.3.1 and ManifoldCF 2.8.1.
> I am using Solr Cell and hence have not configured an external Tika extractor 
> in the ManifoldCF pipeline.
> Please help me with this problem.
> Thanks in advance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
