[ 
https://issues.apache.org/jira/browse/SOLR-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976986#action_12976986
 ] 

Martijn van Groningen edited comment on SOLR-2116 at 1/3/11 5:23 PM:
---------------------------------------------------------------------

I've encountered the same issue in my Solr setup. After some digging I found 
the problem: Tika is simply not loading classes from the lib directory.

When no Tika config is specified in data-config.xml, the 
TikaEntityProcessor loads the TikaConfig as shown below:
{code}
....
String tikaConfigFile = context.getResolvedEntityAttribute("tikaConfig");
if (tikaConfigFile == null) {
  tikaConfig = TikaConfig.getDefaultConfig();
} else {
....
{code}

The problem with loading the TikaConfig this way is that it doesn't use the 
classloader from the SolrResourceLoader and therefore doesn't load any jars 
from the Solr lib directory. The attached patch resolves the issue of no 
content being parsed by Tika. I simply use the constructor that takes a 
ClassLoader as its argument, and I retrieve the classloader from the SolrCore.
{code}
...
String tikaConfigFile = context.getResolvedEntityAttribute("tikaConfig");
if (tikaConfigFile == null) {
  ClassLoader classLoader = context.getSolrCore().getResourceLoader().getClassLoader();
  tikaConfig = new TikaConfig(classLoader);
} else {
...
{code}
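
To illustrate the classloader difference outside of Solr, here is a minimal 
JDK-only sketch. It does not use Tika or Solr at all: a temporary directory 
stands in for the Solr lib directory, and the class name LibDirVisibility, the 
resource name and the helper methods are all made up for this example. It only 
shows that something placed in an extra directory is visible through a 
classloader that knows about that directory, and not through the default one.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class LibDirVisibility {

    // Hypothetical resource name; stands in for anything shipped in a lib-directory jar.
    static final String RESOURCE = "demo-parser-registration.txt";

    /** Creates a temporary directory that plays the role of the Solr lib directory. */
    static Path createFakeLibDir() throws IOException {
        Path dir = Files.createTempDirectory("fake-solr-lib");
        Files.write(dir.resolve(RESOURCE), "demo".getBytes(StandardCharsets.UTF_8));
        return dir;
    }

    /** Returns true if the given classloader can locate the demo resource. */
    static boolean canSee(ClassLoader loader) {
        return loader.getResource(RESOURCE) != null;
    }

    public static void main(String[] args) throws IOException {
        Path libDir = createFakeLibDir();

        // The default loader only knows the normal classpath.
        ClassLoader defaultLoader = ClassLoader.getSystemClassLoader();

        // A loader that additionally knows about the fake lib directory.
        URL[] urls = { libDir.toUri().toURL() };
        try (URLClassLoader libLoader = new URLClassLoader(urls, defaultLoader)) {
            System.out.println("default loader sees resource:   " + canSee(defaultLoader));
            System.out.println("lib-aware loader sees resource: " + canSee(libLoader));
        }
    }
}
```

Assuming the resource name is not already on the classpath, the first check 
should come back negative and the second positive. That is the same asymmetry 
as in the patch: the loader handed to TikaConfig has to be the one that knows 
about the lib directory.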

I haven't added a test that demonstrates this bug, since it only occurs when 
the Tika libs (and their dependencies) are in the Solr lib directory and I 
don't know how to replicate this situation in the Solr build. The 
TestTikaEntityProcessor class doesn't have this problem since all classes are 
on the normal classpath when the build is running.

> TikaEntityProcessor does not find parser by default
> ---------------------------------------------------
>
>                 Key: SOLR-2116
>                 URL: https://issues.apache.org/jira/browse/SOLR-2116
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler, contrib - Solr Cell (Tika 
> extraction)
>    Affects Versions: 3.1, 4.0
>            Reporter: Lance Norskog
>         Attachments: pdflist-data-config.xml, pdflist.xml, SOLR-2116.patch
>
>
> The TikaEntityProcessor does not find the correct document parser by default.
> This is in a two-level DIH config file. I have attached 
> pdflist-data-config.xml and pdflist.xml, the XML file that supplies the file 
> list. To test this, you will need the current 3.x branch or 4.0 trunk.
> # Set up a Tika-enabled Solr
> # Copy any PDF file to /tmp/testfile.pdf
> # Copy pdflist-data-config.xml to your solr/conf
> # Add this snippet to your solrconfig.xml
> {code:xml}
> <requestHandler name="/pdflist"
>       class="org.apache.solr.handler.dataimport.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">pdflist-data-config.xml</str>
>   </lst>
> </requestHandler>
> {code}
> [http://localhost:8983/solr/pdflist?command=full-import] will make one 
> document with the id and text fields populated. If you remove this line:
> {code}
>  parser="org.apache.tika.parser.pdf.PDFParser"
> {code}
> from the TikaEntityProcessor entity, the parser will not be found and you 
> will get a document with the "id" field and nothing else.
