Indeed. Another solution is to purchase a server product from ABBYY or Nuance and have it do that work. You will even get OCR. Both offer a Linux SDK.

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, April 16, 2015 7:56 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing PDF and MS Office files

+1 :)

>PS: one more thing - please, tell your management that you will never
>ever successfully process all real-world PDFs and cater for that fact in your
>requirements :-)