[ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771663#comment-16771663 ]
Karl Wright commented on CONNECTORS-1563:
-----------------------------------------

Hi Subasini,

Are you now Tika-extracting in ManifoldCF, or in Solr?

The text field looks like it contains properly extracted content, along with other stuff you do not want. Is this correct?

If the extraction is happening in Solr, then I have no idea where this is coming from. If the extraction is happening in ManifoldCF, and you have placed a Metadata Adjuster transformer in the pipeline between the Tika Extractor and the Solr Output Connector, I'd say you had set it up to concatenate many fields together into a text field. The Metadata Adjuster has that ability.

The choice of how metadata (or content) fields get mapped to the Solr schema is set up in your Solr output connection configuration. The Tika extraction basically replaces a binary input document with a character-sequence output document plus metadata fields. The character-sequence output document must then be sent to Solr using the standard update handler, not the extracting update handler. So the handler should be changed from /update/extract to just /update, and the "Use extracting update handler" option should be turned off. The actual field name used for the extracted content body can also be changed, if desired, in the "Schema" part of the configuration. But what is there by default works with Solr as it's set up by default.
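To illustrate the difference between the two handlers: once Tika has already run on the ManifoldCF side, what reaches Solr is plain text plus metadata, so it can be posted to /update as an ordinary document rather than streamed as a binary file to /update/extract. The following is only a minimal sketch of such a request, not what the Solr Output Connector does internally. The core URL, the document id, and the field name "content" are all assumptions chosen for illustration; the real field name is whatever the "Schema" tab of the connection is set to.

```python
import json
import urllib.request


def build_plain_update_request(solr_core_url, doc_id, extracted_text,
                               content_field="content"):
    """Build (url, headers, body) for a POST to Solr's standard /update handler.

    content_field is a hypothetical field name; it must match a field that
    exists in the target Solr schema.
    """
    # Standard handler, not /update/extract: Solr does no Tika parsing here.
    url = solr_core_url.rstrip("/") + "/update?commit=true"
    headers = {"Content-Type": "application/json"}
    # The JSON update format accepts a list of documents.
    body = json.dumps([{"id": doc_id, content_field: extracted_text}])
    return url, headers, body


def send(url, headers, body):
    """Send the request; needs a running Solr instance, so it is not called here."""
    request = urllib.request.Request(
        url, data=body.encode("utf-8"), headers=headers, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical core URL and document, for illustration only.
url, headers, body = build_plain_update_request(
    "http://localhost:8983/solr/mycore", "doc-1", "text already extracted by Tika"
)
print(url)
print(body)
```

Sending already-extracted text to /update/extract instead would ask Solr to run Tika a second time on content that is no longer a binary document, which is why the handler and the checkbox need to be changed together.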

> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
> -----------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1563
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1563
>             Project: ManifoldCF
>          Issue Type: Task
>          Components: Lucene/SOLR connector
>            Reporter: Sneha
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: Document simple history.docx, managed-schema, manifold settings.docx, manifoldcf.log, solr.log, solrconfig.xml
>
>
> I am encountering this problem:
> I have checked "Use the Extract Update Handler:" param then I am getting an error on Solr i.e. null:org.apache.solr.common.SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
> If I ignore tika exception, my documents get indexed but dont have content field on Solr.
> I am using Solr 7.3.1 and manifoldCF 2.8.1
> I am using solr cell and hence not configured external tika extractor in manifoldCF pipeline
> Please help me with this problem
> Thanks in advance

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)