Yeah, PDF extraction has always been at least somewhat problematic. It has improved over the years, but it is still unlikely to be perfect.

That said, apart from memory usage, I'm not aware of any specific PDF extraction issue that would bring down Solr, as opposed to merely causing a 500 status with an exception from PDF extraction. Some PDF documents, especially graphics-intensive ones, can require a lot of memory, and the rest of Solr can be adversely affected if all available JVM heap is consumed. In that case, the solution is to give the JVM more heap space (for example, by raising the -Xmx setting).

So, what is your specific symptom?
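
If it helps to pin that down, here is a rough Python sketch that posts one PDF at a time and reports whether Solr answered with an error status (Solr is still up, only that document failed) or stopped answering at all (the process may be down or hung, for example after running out of heap). The base URL, the collection1 core name, the /update/extract handler path, and the use of the file path as literal.id are assumptions about a default-style Solr 4.x setup; the function names are made up for illustration, so adjust everything to your installation.

```python
import urllib.error
import urllib.parse
import urllib.request


def classify_outcome(status, connection_failed):
    """Map the result of one request to a rough symptom category."""
    if connection_failed:
        # No HTTP response at all: Solr may be down or unresponsive.
        return "solr-unreachable"
    if status is not None and 200 <= status < 300:
        return "ok"
    if status is not None and status >= 500:
        # Solr answered, so it is still running; this document failed.
        return "extraction-error"
    return "request-error"


def post_pdf(path, base_url="http://localhost:8983/solr/collection1", timeout=60):
    """Send one PDF to the extracting handler and classify what happened.

    base_url and the handler path are assumptions, not taken from the thread.
    """
    url = base_url + "/update/extract?literal.id=" + urllib.parse.quote(path, safe="")
    with open(path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/pdf"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_outcome(response.status, False)
    except urllib.error.HTTPError as error:
        # HTTPError must be caught before URLError, which is its parent class.
        return classify_outcome(error.code, False)
    except (urllib.error.URLError, OSError):
        # Connection refused, reset, or timed out.
        return classify_outcome(None, True)
```

Running post_pdf over the suspect files one by one, and noting the first file after which the result changes to "solr-unreachable", should narrow down which document triggers the problem.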

-- Jack Krupansky

-----Original Message-----
From: Brian McDowell
Sent: Thursday, May 22, 2014 12:24 AM
To: solr-user@lucene.apache.org
Subject: pdfs

Has anyone had issues with indexing pdf files? Some pdfs are bringing down
Solr completely so that it actually needs to be manually restarted. We are
using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
problem because the release notes associated with the new tika version and
also the new pdfbox indicate fixes for pdf issues. It didn't work and now
this issue is causing us to reevaluate using Solr. Any help on this matter
would be greatly appreciated. Thank you!
