[jira] Commented: (SOLR-2116) TikaEntityProcessor does not find parser by default

David Smiley (JIRA) Sun, 13 Feb 2011 15:20:26 -0800

    [ https://issues.apache.org/jira/browse/SOLR-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994165#comment-12994165 ]


David Smiley commented on SOLR-2116:
------------------------------------

I encountered this bug and fixed it independently just now, in the same way 
the patch file here does. This is how Solr Cell configures Tika too. I 
encountered it on 3x while using the example-DIH that comes with Solr, which 
includes a core named "tika".

Furthermore, I found a configuration bug in that core's solrconfig.xml: the 
<dataDir> is specified explicitly instead of being left to default to the 
correct place. As a result, this core erroneously uses the sample 
example/solr/data directory.

Can a committer please commit the patch and remove the dataDir in that tika 
core on branch 3x?  This is a bug after all.
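
For reference, the element in question would look roughly like the following 
in the tika core's solrconfig.xml. The value shown is an assumption for 
illustration, not copied from the file; removing the element entirely lets the 
core fall back to its default data directory.

{code:xml}
<!-- Assumed form of the setting; the actual value in the tika core may differ. -->
<dataDir>${solr.data.dir:./solr/data}</dataDir>
{code}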

> TikaEntityProcessor does not find parser by default
> ---------------------------------------------------
>
>                 Key: SOLR-2116
>                 URL: https://issues.apache.org/jira/browse/SOLR-2116
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler, contrib - Solr Cell (Tika extraction)
>    Affects Versions: 3.1, 4.0
>            Reporter: Lance Norskog
>         Attachments: SOLR-2116.patch, pdflist-data-config.xml, pdflist.xml
>
>
> The TikaEntityProcessor does not find the correct document parser by default.
> This occurs with a two-level DIH config file. I have attached 
> pdflist-data-config.xml and pdflist.xml, the XML file that supplies the list 
> of files. To test this, you will need the current 3.x branch or 4.0 trunk.
> # Set up a Tika-enabled Solr
> # Copy any PDF file to /tmp/testfile.pdf
> # Copy pdflist-data-config.xml to your solr/conf
> # Add this snippet to your solrconfig.xml
> {code:xml}
> <requestHandler name="/pdflist"
>       class="org.apache.solr.handler.dataimport.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">pdflist-data-config.xml</str>
>   </lst>
> </requestHandler>
> {code}
> [http://localhost:8983/solr/pdflist?command=full-import] will make one 
> document with the id and text fields populated. If you remove this line:
> {code}
>  parser="org.apache.tika.parser.pdf.PDFParser"
> {code}
> from the TikaEntityProcessor entity, the parser will not be found and you 
> will get a document with the "id" field and nothing else.
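>
> For readers without the attachments, a two-level DIH config of this kind 
> might look roughly like the sketch below. This is not the attached 
> pdflist-data-config.xml: the data source names, the entity names, the 
> /tmp/pdflist.xml location, and the <files>/<file> layout of the list file 
> are all assumptions made for illustration.
> {code:xml}
> <!-- Hypothetical sketch, not the attached file.
>      Assumes a list file shaped like:
>      <files><file id="1" path="/tmp/testfile.pdf"/></files> -->
> <dataConfig>
>   <dataSource name="xml" type="FileDataSource" encoding="UTF-8"/>
>   <dataSource name="bin" type="BinFileDataSource"/>
>   <document>
>     <!-- Outer level: read the list of files -->
>     <entity name="list" dataSource="xml" processor="XPathEntityProcessor"
>             url="/tmp/pdflist.xml" forEach="/files/file" rootEntity="false">
>       <field column="id" xpath="/files/file/@id"/>
>       <field column="path" xpath="/files/file/@path"/>
>       <!-- Inner level: extract text from each listed file with Tika -->
>       <entity name="pdf" dataSource="bin" processor="TikaEntityProcessor"
>               url="${list.path}" format="text"
>               parser="org.apache.tika.parser.pdf.PDFParser">
>         <field column="text" name="text"/>
>       </entity>
>     </entity>
>   </document>
> </dataConfig>
> {code}
> In a config shaped like this, the parser attribute on the inner entity is 
> the line whose removal triggers the problem described above.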

