RE: Indexing PDF and MS Office files

Davis, Daniel (NIH/NLM) [C] Thu, 16 Apr 2015 09:11:02 -0700

If you use pdftotext with a simple fork/exec per document, you will get about 5 
MB/s throughput on a single AMD x86_64. Much of the cost is the fork/exec 
itself. I suggest that you use HTML output and UTF-8 encoding for the PDF, 
because that way you get the title, keywords and so on as HTML meta tags.
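
A minimal sketch of the one-process-per-document approach from a JVM, in case 
it helps. The -htmlmeta and -enc flags are as I understand them from the 
pdftotext man page; the class and method names and the timeout handling are 
my own assumptions, not part of any library.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

class PdfToHtml {
    /** Builds the pdftotext command line: HTML output with meta tags, UTF-8 encoded. */
    static List<String> buildCommand(Path pdf, Path html) {
        return Arrays.asList("pdftotext", "-htmlmeta", "-enc", "UTF-8",
                pdf.toString(), html.toString());
    }

    /** Runs one conversion; returns true if pdftotext exited with status 0 in time. */
    static boolean convert(Path pdf, Path html, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(pdf, html))
                .inheritIO()   // let the tool write its diagnostics to our stderr
                .start();
        if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            p.destroyForcibly();   // a hung conversion must not block the indexer
            return false;
        }
        return p.exitValue() == 0;
    }
}
```

Each call still pays the process start-up cost, which is the overhead 
mentioned above.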


If you have the appetite for something truly great, try:
 - A socket server listening for parsing requests
 - Pass accept()ed sockets off to pre-forked children
 - In the children, use vfork rather than fork
 - Use tmpfs for the generated HTML documents
 - It is tempting, at least to me, to implement this using mod_perl and httpd.
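
For anyone who would rather stay on the JVM, here is a rough sketch of the 
same shape as the outline above. Java has no fork() or vfork(), so a fixed 
thread pool stands in for the pre-forked children, and tmpfs is not shown. 
ParseServer and the parser function are hypothetical names; the function is 
where the real extraction would go.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.function.Function;

class ParseServer implements AutoCloseable {
    private final ServerSocket listener;
    private final ExecutorService workers;          // stands in for pre-forked children
    private final Function<String, String> parser;  // hypothetical: request line -> reply

    ParseServer(int port, int workerCount, Function<String, String> parser)
            throws IOException {
        // Port 0 asks the OS for a free port; bound to loopback only.
        this.listener = new ServerSocket(port, 50, InetAddress.getLoopbackAddress());
        this.workers = Executors.newFixedThreadPool(workerCount);
        this.parser = parser;
        Thread acceptor = new Thread(this::acceptLoop, "acceptor");
        acceptor.setDaemon(true);
        acceptor.start();
    }

    int port() {
        return listener.getLocalPort();
    }

    /** Accepts connections and hands each accepted socket to a worker. */
    private void acceptLoop() {
        while (!listener.isClosed()) {
            try {
                Socket client = listener.accept();
                workers.submit(() -> handle(client));
            } catch (IOException | RejectedExecutionException e) {
                // accept() fails once the listener is closed; the loop then ends
            }
        }
    }

    /** Reads one request line, writes one reply line, closes the connection. */
    private void handle(Socket client) {
        try (Socket c = client;
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(c.getInputStream(), StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(c.getOutputStream(), StandardCharsets.UTF_8), true)) {
            String request = in.readLine();
            if (request != null) {
                out.println(parser.apply(request));
            }
        } catch (IOException e) {
            // a broken client connection only affects that one request
        }
    }

    @Override
    public void close() throws IOException {
        listener.close();
        workers.shutdown();
    }
}
```

The point of the design is the same as with pre-forking: the workers exist 
before the first request arrives, so no request pays for starting one.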

-----Original Message-----
From: Siegfried Goeschl [mailto:sgoes...@gmx.at] 
Sent: Thursday, April 16, 2015 7:53 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF and MS Office files

Hi Vijay,

I know this road too well :-)

For PDF you can fall back to other tools for text extraction:

* ps2ascii.ps
* XPDF's pdftotext CLI utility (more comfortable than Ghostscript)
* some other tools exist as well (pdflib)
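
The fallback idea above could be sketched roughly like this. The Extractor 
interface and all names are hypothetical; each implementation would wrap one 
tool (Tika, pdftotext, and so on). Failures are recorded rather than thrown, 
so one bad document does not stop the batch.

```java
import java.util.List;
import java.util.Optional;

class FallbackExtraction {
    /** Hypothetical abstraction over one extraction tool. */
    interface Extractor {
        String name();
        String extract(String path) throws Exception;
    }

    /**
     * Tries each extractor in order and returns the first non-empty text.
     * Every failed attempt is appended to 'failures' for later reporting.
     */
    static Optional<String> extractWithFallback(
            String path, List<Extractor> chain, List<String> failures) {
        for (Extractor e : chain) {
            try {
                String text = e.extract(path);
                if (text != null && !text.trim().isEmpty()) {
                    return Optional.of(text);
                }
                failures.add(e.name() + ": empty text for " + path);
            } catch (Exception ex) {
                failures.add(e.name() + ": " + ex.getMessage());
            }
        }
        return Optional.empty();   // nothing worked; the caller decides what to do
    }
}
```

An empty result is treated as a failure too, since image-only PDFs tend to 
come back with no text and no exception.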

If you start command line tools from your JVM please have a look at 
commons-exec :-)

Cheers,

Siegfried Goeschl

PS: one more thing - please, tell your management that you will never ever 
successfully parse all real-world PDFs, and cater for that fact in your 
requirements :-)

On 16.04.15 13:10, Vijaya Narayana Reddy Bhoomi Reddy wrote:
> Erick,
>
> I tried indexing both ways - SolrJ / Tika's AutoDetectParser as well as 
> SolrCell's ExtractingRequestHandler. The majority of the PDF and Word 
> documents are getting parsed properly and indexed into Solr. However, 
> a minority of them keep failing with either a PDFParser or an 
> OfficeParser error.
>
> Not sure if this behaviour can be modified so that all the documents 
> can be indexed. The business requirement we have is to index all the 
> documents.
> However, if a small percentage of them fails, not sure what other ways 
> exist to index them.
>
> Any help please?
>
>
> Thanks & Regards
> Vijay
>
>
>
> On 15 April 2015 at 15:20, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> There's quite a discussion here:
>> https://issues.apache.org/jira/browse/SOLR-7137
>>
>> But, I personally am not a huge fan of pushing all the work on to 
>> Solr. In a production environment the Solr server is then responsible 
>> for indexing, parsing the docs through Tika, perhaps searching etc. 
>> This doesn't scale all that well.
>>
>> So an alternative is to use SolrJ with Tika, which is totally 
>> independent of what version of Tika is on the Solr server. Here's an 
>> example.
>>
>> http://lucidworks.com/blog/indexing-with-solrj/
>>
>> Best,
>> Erick
>>
>> On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy 
>> <vijaya.bhoomire...@whishworks.com> wrote:
>>> Thanks everyone for the responses. Now I am able to index PDF 
>>> documents successfully. I have implemented manual extraction using 
>>> Tika's AutoDetectParser, and the PDF functionality is working fine. 
>>> However, the error with some MS Office Word documents still persists.
>>>
>>> The error message is "java.lang.IllegalArgumentException: This 
>>> paragraph is not the first one in the table", which eventually 
>>> results in "Unexpected RuntimeException from 
>>> org.apache.tika.parser.microsoft.OfficeParser".
>>>
>>> Upon some reading, it looks like it's a bug in Tika 1.5 that seems 
>>> to have been fixed in Tika 1.6 
>>> ( https://issues.apache.org/jira/browse/TIKA-1251 ).
>>> I am new to Solr / Tika and hence wondering whether I can change the 
>>> Tika library alone to v1.6 without impacting any of the libraries 
>>> within Solr 4.10.2? Please let me know your response and how to get 
>>> around this issue.
>>>
>>> Many thanks in advance.
>>>
>>> Thanks & Regards
>>> Vijay
>>>
>>>
>>> On 15 April 2015 at 05:14, Shyam R <shyam.reme...@gmail.com> wrote:
>>>
>>>> Vijay,
>>>>
>>>> You could try different Excel files in different formats to rule 
>>>> out an issue with the Tika version being used.
>>>>
>>>> Thanks
>>>> Murthy
>>>>
>>>> On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes <trhodes...@gmail.com> wrote:
>>>>
>>>>> Perhaps the PDF is protected and the content cannot be extracted?
>>>>>
>>>>> I have an unverified suspicion that the Tika shipped with Solr 
>>>>> 4.10.2 may not support some/all Office 2013 document formats.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 4/14/2015 8:18 PM, Jack Krupansky wrote:
>>>>>
>>>>>> Try doing a manual extraction request directly to Solr (not via 
>>>>>> SolrJ) and use the extractOnly option to see if the content is 
>>>>>> actually extracted.
>>>>>>
>>>>>> See:
>>>>>> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
>>>>>>
>>>>>> Also, some PDF files actually have the content as a bitmap image, 
>>>>>> so no text is extracted.
>>>>>>
>>>>>>
>>>>>> -- Jack Krupansky
>>>>>>
>>>>>> On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy 
>>>>>> <vijaya.bhoomire...@whishworks.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am trying to index PDF and Microsoft Office files (.doc, .docx, 
>>>>>>> .ppt, .pptx, .xls, and .xlsx) into Solr. I am facing the following 
>>>>>>> issues. Please let me know what is going wrong with the indexing 
>>>>>>> process.
>>>>>>>
>>>>>>> I am using Solr 4.10.2 with the default example server 
>>>>>>> configuration that comes with the Solr distribution.
>>>>>>>
>>>>>>> PDF files - Indexing as such works fine, and when I query using *:* 
>>>>>>> in the Solr Query console, metadata information is displayed 
>>>>>>> properly. However, the PDF content field is empty. This is 
>>>>>>> happening for all PDF files I have tried. I have tried with some 
>>>>>>> proprietary files, PDF eBooks etc. Whatever the PDF file, the 
>>>>>>> content is not being displayed.
>>>>>>>
>>>>>>> MS Office files - For some Office files, everything works 
>>>>>>> perfectly and the extracted content is visible in the query 
>>>>>>> console. However, for others, I see the below error message during 
>>>>>>> the indexing process.
>>>>>>>
>>>>>>> Exception in thread "main" 
>>>>>>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: 
>>>>>>> org.apache.tika.exception.TikaException: Unexpected 
>>>>>>> RuntimeException from org.apache.tika.parser.microsoft.OfficeParser
>>>>>>>
>>>>>>>
>>>>>>> I am using SolrJ to index the documents and below is the code 
>>>>>>> snippet related to indexing. Please let me know where the issue is 
>>>>>>> occurring.
>>>>>>>
>>>>>>> static String solrServerURL = "http://localhost:8983/solr";
>>>>>>> static SolrServer solrServer = new HttpSolrServer(solrServerURL);
>>>>>>> static ContentStreamUpdateRequest indexingReq =
>>>>>>>     new ContentStreamUpdateRequest("/update/extract");
>>>>>>>
>>>>>>> indexingReq.addFile(file, fileType);
>>>>>>> indexingReq.setParam("literal.id", literalId);
>>>>>>> indexingReq.setParam("uprefix", "attr_");
>>>>>>> indexingReq.setParam("fmap.content", "content");
>>>>>>> indexingReq.setParam("literal.fileurl", fileURL);
>>>>>>> indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>>>>>>> solrServer.request(indexingReq);
>>>>>>>
>>>>>>> Thanks & Regards
>>>>>>> Vijay
>>>>>>>
>>>>>>> --
>>>>>>> The contents of this e-mail are confidential and for the exclusive 
>>>>>>> use of the intended recipient. If you receive this e-mail in error 
>>>>>>> please delete it from your system immediately and notify us either 
>>>>>>> by e-mail or telephone. You should not copy, forward or otherwise 
>>>>>>> disclose the content of the e-mail. The views expressed in this 
>>>>>>> communication may not necessarily be the view held by WHISHWORKS.
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Ph: 9845704792
>>>>
>>>
>>
>
