Re: Indexing PDF and MS Office files

Erick Erickson Wed, 15 Apr 2015 07:22:06 -0700

There's quite a discussion here: https://issues.apache.org/jira/browse/SOLR-7137


But I'm personally not a huge fan of pushing all the work onto Solr. In a
production environment the Solr server ends up responsible for indexing, parsing
the docs through Tika, perhaps searching, etc. This doesn't scale all that well.

So an alternative is to use SolrJ with Tika, which is totally independent of
what version of Tika is on the Solr server. Here's an example.

http://lucidworks.com/blog/indexing-with-solrj/
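
As a rough illustration of that split (this is not the code from the linked
post), here is a minimal sketch in plain Java. The names TextExtractor, Indexer
and indexAll are hypothetical stand-ins: in a real setup the extractor would
wrap a Tika parser running in the client JVM and the indexer would wrap a SolrJ
client. Neither library is used here so the sketch runs without dependencies.
It also assumes you want one unparseable file to be recorded and skipped rather
than abort the whole run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClientSideIndexing {

    /** Hypothetical stand-in for the parsing step (e.g. Tika running on the client). */
    public interface TextExtractor {
        String extract(String fileName) throws Exception;
    }

    /** Hypothetical stand-in for the indexing step (e.g. a SolrJ client sending plain fields). */
    public interface Indexer {
        void index(Map<String, String> fields) throws Exception;
    }

    /** Extracts each file on the client, indexes the text, and returns the files that failed. */
    public static List<String> indexAll(List<String> fileNames,
                                        TextExtractor extractor,
                                        Indexer indexer) {
        List<String> failed = new ArrayList<>();
        for (String name : fileNames) {
            try {
                Map<String, String> fields = new LinkedHashMap<>();
                fields.put("id", name);
                fields.put("content", extractor.extract(name));
                indexer.index(fields);
            } catch (Exception e) {
                // A parser failure on one document should not stop the rest of the batch.
                failed.add(name);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<Map<String, String>> indexed = new ArrayList<>();
        // Fake extractor that simulates a parser bug on .doc files.
        TextExtractor fake = name -> {
            if (name.endsWith(".doc")) {
                throw new IllegalArgumentException("simulated parser failure");
            }
            return "text of " + name;
        };
        List<String> failed =
                indexAll(Arrays.asList("a.pdf", "b.doc", "c.docx"), fake, indexed::add);
        System.out.println("indexed=" + indexed.size() + " failed=" + failed);
    }
}
```

The point of the two interfaces is that the parsing side can be swapped or
upgraded on the client without touching anything on the Solr server.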

Best,
Erick

On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
<vijaya.bhoomire...@whishworks.com> wrote:
> Thanks everyone for the responses. Now I am able to index PDF documents
> successfully. I have implemented manual extraction using Tika's
> AutoDetectParser and PDF functionality is working fine. However, the error
> with some MS Office Word documents still persists.
>
> The error message is "java.lang.IllegalArgumentException: This paragraph is
> not the first one in the table" which will eventually result in "Unexpected
> RuntimeException from org.apache.tika.parser.microsoft.OfficeParser"
>
> Upon some reading, it looks like it's a bug in Tika 1.5 that seems to have
> been fixed in Tika 1.6 ( https://issues.apache.org/jira/browse/TIKA-1251 ).
> I am new to Solr / Tika and hence wondering whether I can change the Tika
> library alone to v1.6 without impacting any of the libraries within Solr
> 4.10.2? Please let me know your response and how to get around this
> issue.
>
> Many thanks in advance.
>
> Thanks & Regards
> Vijay
>
>
> On 15 April 2015 at 05:14, Shyam R <shyam.reme...@gmail.com> wrote:
>
>> Vijay,
>>
>> You could try different Excel files in different formats to rule out an
>> issue with the Tika version being used.
>>
>> Thanks
>> Murthy
>>
>> On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes <trhodes...@gmail.com>
>> wrote:
>>
>> > Perhaps the PDF is protected and the content can not be extracted?
>> >
>> > I have an unverified suspicion that the Tika shipped with Solr 4.10.2 may
>> > not support some/all Office 2013 document formats.
>> >
>> >
>> >
>> >
>> >
>> > On 4/14/2015 8:18 PM, Jack Krupansky wrote:
>> >
>> >> Try doing a manual extraction request directly to Solr (not via SolrJ) and
>> >> use the extractOnly option to see if the content is actually extracted.
>> >>
>> >> See:
>> >> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
>> >>
>> >> Also, some PDF files actually have the content as a bitmap image, so no
>> >> text is extracted.
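
The extractOnly check suggested above can be sketched in plain Java without
SolrJ. This is only a sketch under assumptions: the base URL is the
http://localhost:8983/solr used in the original snippet, the wt=json parameter
is added just to make the response easier to read, the file is sent as the raw
POST body with its MIME type in the Content-Type header, and the class and
method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExtractOnlyCheck {

    /** Builds the extract-handler URL with extractOnly=true, so nothing gets indexed. */
    public static String extractOnlyUrl(String solrBaseUrl) {
        String base = solrBaseUrl.endsWith("/")
                ? solrBaseUrl.substring(0, solrBaseUrl.length() - 1)
                : solrBaseUrl;
        return base + "/update/extract?extractOnly=true&wt=json";
    }

    /** POSTs the raw file bytes to Solr and returns the response body as text. */
    public static String extract(String solrBaseUrl, Path file, String contentType)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(extractOnlyUrl(solrBaseUrl)).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", contentType);
        try (OutputStream out = conn.getOutputStream()) {
            Files.copy(file, out);
        }
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                body.write(buf, 0, n);
            }
        }
        return body.toString("UTF-8");
    }

    public static void main(String[] args) throws IOException {
        String base = "http://localhost:8983/solr"; // assumed, from the original snippet
        if (args.length < 2) {
            // Without arguments, only show the URL that would be called.
            System.out.println("Would POST to: " + extractOnlyUrl(base));
            return;
        }
        // Usage: java ExtractOnlyCheck <file> <mime-type>
        System.out.println(extract(base, Paths.get(args[0]), args[1]));
    }
}
```

If the response contains the document text, extraction is working and the empty
content field is more likely a schema or field-mapping question; if it contains
only metadata, the file itself (protected or image-only PDF) is the suspect.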
>> >>
>> >>
>> >> -- Jack Krupansky
>> >>
>> >> On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy <
>> >> vijaya.bhoomire...@whishworks.com> wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
>> >>> .pptx, .xls, and .xlsx) into Solr. I am facing the following issues.
>> >>> Please let me know what is going wrong with the indexing process.
>> >>>
>> >>> I am using Solr 4.10.2 with the default example server configuration
>> >>> that comes with the Solr distribution.
>> >>>
>> >>> PDF files - Indexing as such works fine, and when I query using *:* in the
>> >>> Solr Query console, metadata information is displayed properly. However,
>> >>> the PDF content field is empty. This is happening for all PDF files I have
>> >>> tried, including some proprietary files, PDF eBooks, etc. Whatever the
>> >>> PDF file, content is not being displayed.
>> >>>
>> >>> MS Office files - For some Office files, everything works perfectly and
>> >>> the extracted content is visible in the query console. However, for
>> >>> others, I see the error message below during the indexing process.
>> >>>
>> >>> *Exception in thread "main"
>> >>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>> >>> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
>> >>> org.apache.tika.parser.microsoft.OfficeParser*
>> >>>
>> >>>
>> >>> I am using SolrJ to index the documents and below is the code snippet
>> >>> related to indexing. Please let me know where the issue is occurring.
>> >>>
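>> >>> import org.apache.solr.client.solrj.SolrServer;
>> >>> import org.apache.solr.client.solrj.impl.HttpSolrServer;
>> >>> import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
>> >>> import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
>> >>>
>> >>> static String solrServerURL = "http://localhost:8983/solr";
>> >>> static SolrServer solrServer = new HttpSolrServer(solrServerURL);
>> >>> // Request against Solr Cell, which runs Tika on the server side.
>> >>> static ContentStreamUpdateRequest indexingReq =
>> >>>     new ContentStreamUpdateRequest("/update/extract");
>> >>>
>> >>> indexingReq.addFile(file, fileType);
>> >>> indexingReq.setParam("literal.id", literalId);
>> >>> indexingReq.setParam("uprefix", "attr_");
>> >>> indexingReq.setParam("fmap.content", "content");
>> >>> indexingReq.setParam("literal.fileurl", fileURL);
>> >>> // Commit immediately, waiting for flush and for a new searcher.
>> >>> indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>> >>> solrServer.request(indexingReq);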
>> >>>
>> >>> Thanks & Regards
>> >>> Vijay
>> >>>
>> >>> --
>> >>> The contents of this e-mail are confidential and for the exclusive use of
>> >>> the intended recipient. If you receive this e-mail in error please delete
>> >>> it from your system immediately and notify us either by e-mail or
>> >>> telephone. You should not copy, forward or otherwise disclose the content
>> >>> of the e-mail. The views expressed in this communication may not
>> >>> necessarily be the view held by WHISHWORKS.
>> >>>
>> >>>
>> >
>>
>>
>> --
>> Ph: 9845704792
>>
>
