Only a month late to respond, and the response likely won't help.

I agree with Shawn that Tika can be a memory hog.  I try to leave 1GB per 
thread, but your mileage will vary dramatically depending on your docs.  If 
memory were the problem, though, I'd expect you'd see an OOM somewhere in the 
logs...

There have been rare bugs in various parsers, including the PDFParser, in 
various versions of Tika that cause permanent hangs.  I haven't experimented 
with DIH and known trigger files, but I suspect you'd get the behavior that 
you're seeing if this were to happen.

So, short of rolling your own ETL tool in lieu of DIH, hardening DIH to run 
Tika in a different process (tika-server, perhaps; see 
https://issues.apache.org/jira/browse/SOLR-7632), or going big with Hadoop, 
morphlines, etc., your only hope is to upgrade Tika and hope that your problem 
is one of the bugs that we've already identified and fixed.
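
If you do end up rolling your own loader, here is a minimal sketch of one way 
to keep a hung parser from stalling the whole import: run each parse on a 
worker thread and give up after a time limit.  The class name, method name and 
timeout values are all made up for illustration, and the Callable is a 
stand-in for whatever parser call you actually make; no Tika API is shown.  
Note that this only lets the caller move on.  A truly hung parser thread may 
ignore the interrupt and linger, which is why a separate process is the 
sturdier fix.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeLimitedParse {

    // Daemon threads, so a parser that never returns cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "parse-worker");
        t.setDaemon(true);
        return t;
    });

    /**
     * Runs parseTask and waits at most timeoutMillis for its result.
     * Returns the extracted text, or Optional.empty() if the task
     * timed out, threw an exception, or the caller was interrupted.
     */
    public static Optional<String> parseWithTimeout(Callable<String> parseTask,
                                                    long timeoutMillis) {
        Future<String> future = POOL.submit(parseTask);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Only sends an interrupt; a genuinely hung parser may ignore it.
            future.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            // The parser threw; the caller can log the document and skip it.
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for real parser calls.
        System.out.println(parseWithTimeout(() -> "extracted text", 1000));
        System.out.println(parseWithTimeout(() -> {
            Thread.sleep(60_000); // simulates a parser that hangs
            return "never returned";
        }, 200));
    }
}
```

An empty result means "skip this document and log it", so one bad file costs 
you the timeout rather than the whole run.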

If you do go with morphlines...I don't think this has been fixed yet: 
https://github.com/kite-sdk/kite/issues/397

Did you ever figure out what was going wrong?

Best,

         Tim

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Tuesday, July 21, 2015 10:41 AM
To: solr-user@lucene.apache.org
Subject: Re: Data Import Handler Stays Idle

On 7/21/2015 8:17 AM, Paden wrote:
> There are some zip files inside the directory and have been addressed 
> to in the database. I'm thinking those are the one's it's jumping 
> right over. They are not the issue. At least I'm 95% sure. And Shawn 
> if you're still watching I'm sorry I'm using solr-5.1.0.

Have you started Solr with a larger heap than the default 512MB in Solr 5.x?  
Tika can require a lot of memory.  I would have expected there to be 
OutOfMemoryError exceptions in the log if that were the problem, though.

You may need to use the "-m" option on the startup scripts to increase the max 
heap.  Starting with "-m 2g" would be a good idea.

Also, seeing the entire multi-line IOException from the log (which may be 
dozens of lines) could be important.

Thanks,
Shawn
