Yeah, PDF extraction has always been at least somewhat problematic. It has
improved over the years, but it is still unlikely to be perfect.
That said, I'm not aware of any specific PDF extraction issue that would
bring down Solr, as opposed to returning a 500 status with an exception
from PDF extraction. The one exception is memory usage: some PDF documents,
especially graphics-intensive ones, can require a lot of memory. The rest
of Solr could be adversely affected if all available JVM heap is consumed.
The solution is to give the JVM more heap space.
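
As a rough illustration of the heap point: with the stock Jetty launcher
that ships with Solr 4.x, the heap ceiling is normally raised with the
standard JVM `-Xmx` option, e.g. `java -Xmx2g -jar start.jar` (the 2g figure
is only a placeholder, not a recommendation, and your start command may
differ). The small Java class below is a hypothetical helper, not part of
Solr. Run with the same `-Xmx` value, it prints the heap limit that JVM was
actually given, which can help confirm the setting is being picked up. It
reports on its own JVM only, not on a running Solr instance.

```java
// Hypothetical helper (not part of Solr): prints the heap limit and current
// heap usage of the JVM it runs in, using only java.lang.Runtime.
class HeapReport {

    private static final long MB = 1024L * 1024L;

    // Fraction of the maximum heap currently in use, between 0.0 and 1.0.
    static double usedFraction(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maxBytes must be positive");
        }
        return (double) usedBytes / maxBytes;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();                       // ceiling set by -Xmx
        long used = rt.totalMemory() - rt.freeMemory();  // heap in use right now
        System.out.printf("max heap: %d MB, used: %d MB (%.1f%% of max)%n",
                max / MB, used / MB, 100.0 * usedFraction(used, max));
    }
}
```

If the reported maximum does not change when you change `-Xmx`, the option
is probably not reaching the JVM that you are starting.
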
So, what is your specific symptom?
-- Jack Krupansky
-----Original Message-----
From: Brian McDowell
Sent: Thursday, May 22, 2014 12:24 AM
To: solr-user@lucene.apache.org
Subject: pdfs
Has anyone had issues with indexing PDF files? Some PDFs are bringing down
Solr completely, so that it actually needs to be manually restarted. We are
using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
problem, because the release notes for the new Tika version and the new
PDFBox indicate fixes for PDF issues. It didn't work, and now this issue is
causing us to reevaluate using Solr. Any help on this matter would be
greatly appreciated. Thank you!