[ 
https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771607#comment-16771607
 ] 

Subasini Rath commented on CONNECTORS-1563:
-------------------------------------------

Hi Karl,
    Could you please tell me which field ManifoldCF writes the actual 
textual content of the document to?

Currently I am using the _text_ field, but I have found that _text_ does not 
contain just the actual content. Some extra values have been added in front 
of it.

In my managed-schema:

<field name="_text_" type="text_general" multiValued="true" indexed="true" stored="true"/>

After indexing in Solr, the value looks like this (the first 4 lines are 
prepended to the content of the file):

"title":["NETWORK PLANNING\u0000"],
        "_text_":[" \n \n stream_size 34070  \n X-Parsed-By 
org.apache.tika.parser.DefaultParser  \n X-Parsed-By 
org.apache.tika.parser.txt.TXTParser  \n stream_content_type application/pdf  
\n stream_name cs.exe?bmsdocid=9.2.1&func=eebms.docdownload  \n 
stream_source_info cs.exe?bmsdocid=9.2.1&func=eebms.docdownload  \n 
Content-Encoding UTF-8  \n resourceName 
cs.exe?bmsdocid=9.2.1&func=eebms.docdownload  \n Content-Type text/plain; 
charset=UTF-8  \n  \n \n  9.2.1 UNCONTROLLED IF PRINTED Page 1 of 13\nCompany 
Policy\nNETWORK\nDocument No Amendment No Approved By Approval Date Review 
Date\n: : : : :\n9.2.1 9 CEO 23/05/2016 23/05/2019\n9.2.1 NETWORK PLANNING\n1.0 
POLICY STATEMENT\nThe company will plan the expansion and augmentation of its 
electrical network to achieve levels of safety, reliability and quality of 
supply commensurate with community, regulator, customer and shareholder 
expectations.\nThe company will coordinate its planning with the NSW 
transmission utility Transgrid and neighbouring distribution utilities to 
develop effective solutions to satisfy load growth within the company’s supply 
area and in adjacent franchise areas where the company’s network has 
influence.\n2.0 PURPOSE\nTo provide principles for planning network
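
To make the observation above concrete, here is a minimal Python sketch that separates the leading "name value" lines from the document text in a _text_ value shaped like the sample. It is only a diagnostic aid, not part of ManifoldCF or Solr. The function name split_preamble and the KNOWN_KEYS list are hypothetical, and the key list is taken only from the sample output above, so it is an assumption rather than a complete set of metadata names.

```python
# Metadata names copied from the sample _text_ value above. This list is an
# assumption for illustration, not an exhaustive set of Tika / Solr keys.
KNOWN_KEYS = {
    "stream_size", "X-Parsed-By", "stream_content_type", "stream_name",
    "stream_source_info", "Content-Encoding", "resourceName", "Content-Type",
}


def split_preamble(text):
    """Return (metadata, body) for a _text_ value shaped like the sample.

    Leading lines of the form "<known key> <value>" are collected into
    metadata; everything from the first other non-blank line on is the body.
    """
    lines = text.split("\n")
    metadata = {}
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped:
            continue  # blank separator lines inside the preamble
        key, _, value = stripped.partition(" ")
        if key not in KNOWN_KEYS:
            break  # first line that is not "known-key value" starts the body
        metadata.setdefault(key, []).append(value)
    else:
        i = len(lines)  # no body line found at all
    body = "\n".join(lines[i:]).strip()
    return metadata, body


# Shortened copy of the value shown above (whitespace approximated).
sample = (
    " \n \n stream_size 34070  \n"
    " X-Parsed-By org.apache.tika.parser.DefaultParser  \n"
    " X-Parsed-By org.apache.tika.parser.txt.TXTParser  \n"
    " stream_content_type application/pdf  \n"
    " Content-Type text/plain; charset=UTF-8  \n  \n \n"
    "  9.2.1 UNCONTROLLED IF PRINTED Page 1 of 13\nCompany Policy\nNETWORK"
)

metadata, body = split_preamble(sample)
print(sorted(metadata))
print(body.splitlines()[0])
```

A body line that happens to start with one of the listed names would be misread as metadata, so this is suitable for inspecting indexed values, not for cleaning them in production.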



Thanks & Regards,
Subasini Rath
O: +91-33 6636-8889 
M: +91 983-1234-341
Email: subasini.r...@endeavourenergy.com.au



> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream 
> must have > 0 bytes
> -----------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1563
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1563
>             Project: ManifoldCF
>          Issue Type: Task
>          Components: Lucene/SOLR connector
>            Reporter: Sneha
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: Document simple history.docx, managed-schema, manifold 
> settings.docx, manifoldcf.log, solr.log, solrconfig.xml
>
>
> I am encountering this problem:
> I have checked "Use the Extract Update Handler:" param then I am getting an 
> error on Solr i.e. null:org.apache.solr.common.SolrException: 
> org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 
> bytes
> If I ignore tika exception, my documents get indexed but dont have content 
> field on Solr.
> I am using Solr 7.3.1 and manifoldCF 2.8.1
> I am using solr cell and hence not configured external tika extractor in 
> manifoldCF pipeline
> Please help me with this problem
> Thanks in advance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
