If you run pdftotext with a simple fork/exec per document, you will get about 5 MB/s throughput on a single AMD x86_64 machine. Much of that limit comes from the fork/exec itself. I suggest using HTML output with UTF-8 encoding for the extracted text, because that way you get the title, keywords, and similar metadata as HTML meta tags.
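A minimal sketch of the simple per-document approach, in case it helps. It assumes Java 11+ and a pdftotext binary on the PATH that supports the -htmlmeta and -enc options. The class name, the 60-second timeout, and the regular expression are my own illustration, and the exact shape of the HTML head that pdftotext writes is an assumption you should check against your own output.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PdfToHtmlMeta {

    // Assumed shape of a meta tag in the generated HTML head:
    // <meta name="Keywords" content="...">
    private static final Pattern META = Pattern.compile(
            "<meta\\s+name=\"([^\"]*)\"\\s+content=\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    /** Builds the command line: HTML wrapper with meta tags, UTF-8 output. */
    public static List<String> command(Path pdf, Path html) {
        return List.of("pdftotext", "-htmlmeta", "-enc", "UTF-8",
                pdf.toString(), html.toString());
    }

    /** Runs pdftotext once for one document (one fork/exec per call). */
    public static String extract(Path pdf, Path html)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command(pdf, html))
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();
        // Hypothetical timeout so one broken PDF cannot stall the whole run.
        if (!p.waitFor(60, TimeUnit.SECONDS)) {
            p.destroyForcibly();
            throw new IOException("pdftotext timed out on " + pdf);
        }
        if (p.exitValue() != 0) {
            throw new IOException("pdftotext exit code " + p.exitValue());
        }
        return Files.readString(html, StandardCharsets.UTF_8);
    }

    /** Collects meta name/content pairs, with names lower-cased. */
    public static Map<String, String> metaTags(String html) {
        Map<String, String> tags = new LinkedHashMap<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            tags.put(m.group(1).toLowerCase(Locale.ROOT), m.group(2));
        }
        return tags;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: PdfToHtmlMeta <in.pdf> <out.html>");
            return;
        }
        String html = extract(Path.of(args[0]), Path.of(args[1]));
        metaTags(html).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The meta values can then be mapped to Solr fields next to the body text. Each call still pays for one fork/exec, so this is the baseline approach, not the fast one.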
If you have the appetite for something truly great, try:

- Socket server listening for parsing requests
- Pass off accept() sockets to pre-forked children
- In the children, use vfork rather than fork
- tmpfs for the outputted HTML documents
- Tempting to implement using mod_perl and httpd, at least to me.

-----Original Message-----
From: Siegfried Goeschl [mailto:sgoes...@gmx.at]
Sent: Thursday, April 16, 2015 7:53 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF and MS Office files

Hi Vijay,

I know this road too well :-)

For PDF you can fall back to other tools for text extraction

* ps2ascii.ps
* XPDF's pdftotext CLI utility (more comfortable than Ghostscript)
* some other tools exist as well (pdflib)

If you start command line tools from your JVM please have a look at commons-exec :-)

Cheers,

Siegfried Goeschl

PS: one more thing - please, tell your management that you will never ever successfully parse all real-world PDFs and cater for that fact in your requirements :-)

On 16.04.15 13:10, Vijaya Narayana Reddy Bhoomi Reddy wrote:
> Erick,
>
> I tried indexing both ways - SolrJ / Tika's AutoParser as well as
> SolrCell's ExtractRequestHandler. The majority of the PDF and Word
> documents are getting parsed properly and indexed into Solr. However,
> a minority of them keep failing with either a PDFParser or
> OfficeParser error.
>
> Not sure if this behaviour can be modified so that all the documents
> can be indexed. The business requirement we have is to index all the
> documents.
> However, if a small percentage of them fails, not sure what other ways
> exist to index them.
>
> Any help please?
>
> Thanks & Regards
> Vijay
>
> On 15 April 2015 at 15:20, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> There's quite a discussion here:
>> https://issues.apache.org/jira/browse/SOLR-7137
>>
>> But, I personally am not a huge fan of pushing all the work on to
>> Solr. In a production environment the Solr server is responsible for
>> indexing, parsing the docs through Tika, perhaps searching etc. This
>> doesn't scale all that well.
>>
>> So an alternative is to use SolrJ with Tika, which is totally
>> independent of what version of Tika is on the Solr server. Here's an
>> example.
>>
>> http://lucidworks.com/blog/indexing-with-solrj/
>>
>> Best,
>> Erick
>>
>> On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
>> <vijaya.bhoomire...@whishworks.com> wrote:
>>> Thanks everyone for the responses. Now I am able to index PDF
>>> documents successfully. I have implemented manual extraction using
>>> Tika's AutoParser and PDF functionality is working fine. However,
>>> the error with some MS Office Word documents still persists.
>>>
>>> The error message is "java.lang.IllegalArgumentException: This
>>> paragraph is not the first one in the table" which will eventually
>>> result in "Unexpected RuntimeException from
>>> org.apache.tika.parser.microsoft.OfficeParser"
>>>
>>> Upon some reading, it looks like it's a bug with Tika 1.5 and seems
>>> to have been fixed with Tika 1.6
>>> (https://issues.apache.org/jira/browse/TIKA-1251).
>>> I am new to Solr / Tika and hence wondering whether I can change the
>>> Tika library alone to v1.6 without impacting any of the libraries
>>> within Solr 4.10.2? Please let me know your response and how to get
>>> around this issue.
>>>
>>> Many thanks in advance.
>>>
>>> Thanks & Regards
>>> Vijay
>>>
>>> On 15 April 2015 at 05:14, Shyam R <shyam.reme...@gmail.com> wrote:
>>>
>>>> Vijay,
>>>>
>>>> You could try different Excel files with different formats to rule
>>>> out that the issue is with the TIKA version being used.
>>>>
>>>> Thanks
>>>> Murthy
>>>>
>>>> On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes
>>>> <trhodes...@gmail.com> wrote:
>>>>
>>>>> Perhaps the PDF is protected and the content cannot be extracted?
>>>>>
>>>>> I have an unverified suspicion that the Tika shipped with Solr
>>>>> 4.10.2 may not support some/all Office 2013 document formats.
>>>>>
>>>>> On 4/14/2015 8:18 PM, Jack Krupansky wrote:
>>>>>
>>>>>> Try doing a manual extraction request directly to Solr (not via
>>>>>> SolrJ) and use the extractOnly option to see if the content is
>>>>>> actually extracted.
>>>>>>
>>>>>> See:
>>>>>> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
>>>>>>
>>>>>> Also, some PDF files actually have the content as a bitmap image,
>>>>>> so no text is extracted.
>>>>>>
>>>>>> -- Jack Krupansky
>>>>>>
>>>>>> On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi
>>>>>> Reddy <vijaya.bhoomire...@whishworks.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am trying to index PDF and Microsoft Office files (.doc,
>>>>>>> .docx, .ppt, .pptx, .xls, and .xlsx) into Solr. I am facing the
>>>>>>> following issues. Please let me know what is going wrong with
>>>>>>> the indexing process.
>>>>>>>
>>>>>>> I am using Solr 4.10.2 and the default example server
>>>>>>> configuration that comes with the Solr distribution.
>>>>>>>
>>>>>>> PDF Files - Indexing as such works fine, but when I query using
>>>>>>> *.* in the Solr Query console, metadata information is displayed
>>>>>>> properly. However, the PDF content field is empty.
>>>>>>> This is happening for all PDF files I have tried. I have tried
>>>>>>> with some proprietary files, PDF eBooks etc. Whatever be the PDF
>>>>>>> file, content is not being displayed.
>>>>>>>
>>>>>>> MS Office files - For some office files, everything works
>>>>>>> perfect and the extracted content is visible in the query
>>>>>>> console. However, for others, I see the below error message
>>>>>>> during the indexing process.
>>>>>>>
>>>>>>> *Exception in thread "main"
>>>>>>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>>>>>>> org.apache.tika.exception.TikaException: Unexpected
>>>>>>> RuntimeException from
>>>>>>> org.apache.tika.parser.microsoft.OfficeParser*
>>>>>>>
>>>>>>> I am using SolrJ to index the documents and below is the code
>>>>>>> snippet related to indexing. Please let me know where the issue
>>>>>>> is occurring.
>>>>>>>
>>>>>>> static String solrServerURL = "http://localhost:8983/solr";
>>>>>>> static SolrServer solrServer = new HttpSolrServer(solrServerURL);
>>>>>>> static ContentStreamUpdateRequest indexingReq =
>>>>>>>         new ContentStreamUpdateRequest("/update/extract");
>>>>>>>
>>>>>>> indexingReq.addFile(file, fileType);
>>>>>>> indexingReq.setParam("literal.id", literalId);
>>>>>>> indexingReq.setParam("uprefix", "attr_");
>>>>>>> indexingReq.setParam("fmap.content", "content");
>>>>>>> indexingReq.setParam("literal.fileurl", fileURL);
>>>>>>> indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>>>>>>> solrServer.request(indexingReq);
>>>>>>>
>>>>>>> Thanks & Regards
>>>>>>> Vijay
>>>>>>>
>>>>>>> --
>>>>>>> The contents of this e-mail are confidential and for the
>>>>>>> exclusive use of the intended recipient. If you receive this
>>>>>>> e-mail in error please delete it from your system immediately
>>>>>>> and notify us either by e-mail or telephone.
>>>>>>> You should not copy, forward or otherwise disclose the content
>>>>>>> of the e-mail. The views expressed in this communication may not
>>>>>>> necessarily be the view held by WHISHWORKS.
>>>>>>>
>>>>
>>>> --
>>>> Ph: 9845704792
>>>>
>>>
>>> --
>>> The contents of this e-mail are confidential and for the exclusive
>>> use of the intended recipient. If you receive this e-mail in error
>>> please delete it from your system immediately and notify us either
>>> by e-mail or telephone. You should not copy, forward or otherwise
>>> disclose the content of the e-mail. The views expressed in this
>>> communication may not necessarily be the view held by WHISHWORKS.