Only a month late to respond, and the response likely won't help. I agree with Shawn that Tika can be a memory hog. I try to leave 1GB per thread, but your mileage will vary dramatically depending on your docs. I'd expect that you'd get an OOM, though, somewhere...
There have been rare bugs in various parsers, including the PDFParser, in various versions of Tika that cause permanent hangs. I haven't experimented with DIH and known trigger files, but I suspect you'd get the behavior that you're seeing if this were to happen.

So, short of rolling your own ETL'r in lieu of DIH or hardening DIH to run tika in a different process (tika-server, perhaps -- https://issues.apache.org/jira/browse/SOLR-7632) or going big with Hadoop, morphlines, etc, your only hope is to upgrade Tika and hope that that was one of the bugs that we've already identified and fixed.

If you do go with morphlines...I don't think this has been fixed yet: https://github.com/kite-sdk/kite/issues/397

Did you ever figure out what was going wrong?

Best,

Tim

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Tuesday, July 21, 2015 10:41 AM
To: solr-user@lucene.apache.org
Subject: Re: Data Import Handler Stays Idle

On 7/21/2015 8:17 AM, Paden wrote:
> There are some zip files inside the directory and have been addressed
> to in the database. I'm thinking those are the one's it's jumping
> right over. They are not the issue. At least I'm 95% sure. And Shawn
> if you're still watching I'm sorry I'm using solr-5.1.0.

Have you started Solr with a larger heap than the default 512MB in Solr 5.x? Tika can require a lot of memory. I would have expected there to be OutOfMemoryError exceptions in the log if that were the problem, though.

You may need to use the "-m" option on the startup scripts to increase the max heap. Starting with "-m 2g" would be a good idea.

Also, seeing the entire multi-line IOException from the log (which may be dozens of lines) could be important.

Thanks,
Shawn
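
On the "run Tika in a different process" option mentioned above: a minimal sketch of the idea, assuming you roll your own extraction step outside DIH. Each document is parsed in a child JVM that gets killed if it runs past a timeout, so a parser hang costs you one document rather than the whole import. The jar path, the 1g heap cap, and the 60-second timeout are placeholders, not recommendations, and the helper names are made up for illustration.

```python
import subprocess


def run_with_timeout(command, timeout_seconds):
    """Run command in a child process and kill it if it exceeds the timeout.

    Returns (status, output), where status is "ok", "failed", or "timeout".
    """
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before raising,
        # so a hung parser process does not linger.
        return "timeout", ""
    if result.returncode != 0:
        return "failed", result.stderr
    return "ok", result.stdout


def extract_text(path, tika_jar="tika-app.jar", timeout_seconds=60):
    """Extract plain text from one file with the tika-app command line.

    tika_jar is a placeholder path; point it at your own tika-app jar.
    -Xmx1g caps the heap of each child JVM (an assumed value).
    """
    command = ["java", "-Xmx1g", "-jar", tika_jar, "--text", path]
    return run_with_timeout(command, timeout_seconds)
```

Files that come back as "timeout" or "failed" can be logged and skipped, which also gives you a list of trigger files to retest against a newer Tika.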