Re: Indexing PDF and MS Office files

Vijaya Narayana Reddy Bhoomi Reddy Thu, 16 Apr 2015 06:17:48 -0700

For the MS Word documents, one common pattern I noticed is that all of the
failed documents contain embedded images, such as scanned signatures. They
read like letterhead documents where someone scanned a signature and then
embedded the image in the document alongside the text.

The documents that were indexed successfully contained no images, so I am
wondering whether the embedded images are causing the issue.
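
Since the documents themselves cannot be shared, one way to check a pattern like this is to group the failures by their root-cause exception and see whether a single signature accounts for most of them. Below is a minimal sketch of that idea, not code from this thread: the `Extractor` interface is a hypothetical stand-in for whatever performs the real extraction (for example a wrapper around a Tika parse call), and the class and method names are made up for illustration.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Groups extraction failures by root cause so recurring errors stand out. */
class FailureTriage {

    /** Hypothetical stand-in for the real extraction call (e.g. a Tika parse). */
    interface Extractor {
        String extract(Path file) throws Exception;
    }

    /** Walks the cause chain down to the innermost exception. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    /** Builds a grouping key of the form "exception.class.Name: message". */
    static String signature(Throwable t) {
        Throwable root = rootCause(t);
        return root.getClass().getName() + ": " + root.getMessage();
    }

    /**
     * Runs the extractor on every file. A failure on one file is recorded
     * and the loop moves on, so a single bad document does not abort the
     * batch. Returns a map from failure signature to the files that hit it.
     */
    static Map<String, List<Path>> triage(List<Path> files, Extractor extractor) {
        Map<String, List<Path>> failures = new TreeMap<>();
        for (Path file : files) {
            try {
                extractor.extract(file);
            } catch (Exception e) {
                failures.computeIfAbsent(signature(e), k -> new ArrayList<>()).add(file);
            }
        }
        return failures;
    }
}
```

Calling `FailureTriage.triage(files, extractor)` over the whole batch gives one entry per distinct root cause. If the files with embedded images all land under the same signature, that single stack trace is probably the one worth attaching to the Jira issue, and the file list can stay private.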


Thanks & Regards
Vijay



On 16 April 2015 at 12:58, Vijaya Narayana Reddy Bhoomi Reddy <
vijaya.bhoomire...@whishworks.com> wrote:

> Thanks Tim.
>
> I shall raise a Jira with the stack trace information.
>
> Thanks & Regards
> Vijay
>
>
> On 16 April 2015 at 12:54, Allison, Timothy B. <talli...@mitre.org> wrote:
>
>> This sounds like a Tika issue, let's move discussion to that list.
>>
>> If you are still having problems after you upgrade to Tika 1.8, please at
>> least submit the stack traces (if you can) to the Tika jira.  We may be
>> able to find a document that triggers that stack trace in govdocs1 or the
>> slice of CommonCrawl that Julien Nioche contributed to our eval effort.
>>
>> Tika is not perfect and it will fail on some files, but we are always
>> working to improve it.
>>
>> Best,
>>
>>           Tim
>>
>> -----Original Message-----
>> From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:
>> vijaya.bhoomire...@whishworks.com]
>> Sent: Thursday, April 16, 2015 7:44 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Indexing PDF and MS Office files
>>
>> Thanks Allison.
>>
>> I tried with the mentioned changes, but still no luck. I am using the code
>> from the Lucidworks site provided by Erick and have now included the
>> changes you mentioned. The issue still persists, with a small percentage of
>> documents (both PDF and MS Office documents) failing. Unfortunately, these
>> documents are proprietary and client-confidential and hence I am not sure
>> whether they can be uploaded into Jira.
>>
>> These files open normally in Adobe Reader and MS Office tools.
>>
>> Thanks & Regards
>> Vijay
>>
>>
>> On 16 April 2015 at 12:33, Allison, Timothy B. <talli...@mitre.org>
>> wrote:
>>
>> > I entirely agree with Erick -- it is best to isolate Tika in its own jvm
>> > if you can -- bad things can happen if you don't [1] [2].
>> >
>> > Erick's blog on SolrJ is fantastic.  If you want to have Tika parse
>> > embedded documents/attachments, make sure to set the parser in the
>> > ParseContext before parsing:
>> >
>> > ParseContext context = new ParseContext();
>> > //add this line:
>> > context.set(Parser.class, _autoParser);
>> > InputStream input = new FileInputStream(file);
>> >
>> > Tika 1.8 is soon to be released.  If that doesn't fix your problems,
>> > please submit stacktraces (and docs, if possible) to the Tika jira, and
>> > we'll try to make the fixes.
>> >
>> > Cheers,
>> >
>> >             Tim
>> >
>> > [1] http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf
>> > [2] http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
>> >
>> > -----Original Message-----
>> > From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:
>> > vijaya.bhoomire...@whishworks.com]
>> > Sent: Thursday, April 16, 2015 7:10 AM
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: Indexing PDF and MS Office files
>> >
>> > Erick,
>> >
>> > I tried indexing both ways: SolrJ with Tika's AutoParser, and SolrCell's
>> > ExtractRequestHandler. The majority of the PDF and Word documents are
>> > parsed properly and indexed into Solr. However, a minority of them keep
>> > failing with either a PDFParser or an OfficeParser error.
>> >
>> > Not sure if this behaviour can be modified so that all the documents can
>> > be indexed. The business requirement we have is to index all the
>> > documents. However, if a small percentage of them fail, I am not sure what
>> > other ways exist to index them.
>> >
>> > Any help please?
>> >
>> >
>> > Thanks & Regards
>> > Vijay
>> >
>> >
>> >
>> > On 15 April 2015 at 15:20, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>> >
>> > > There's quite a discussion here:
>> > > https://issues.apache.org/jira/browse/SOLR-7137
>> > >
>> > > But, I personally am not a huge fan of pushing all the work on to Solr.
>> > > In a production environment the Solr server is responsible for indexing,
>> > > parsing the docs through Tika, perhaps searching etc. This doesn't scale
>> > > all that well.
>> > >
>> > > So an alternative is to use SolrJ with Tika, which is totally
>> > > independent of what version of Tika is on the Solr server. Here's an
>> > > example.
>> > >
>> > > http://lucidworks.com/blog/indexing-with-solrj/
>> > >
>> > > Best,
>> > > Erick
>> > >
>> > > On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
>> > > <vijaya.bhoomire...@whishworks.com> wrote:
>> > > > Thanks everyone for the responses. Now I am able to index PDF documents
>> > > > successfully. I have implemented manual extraction using Tika's
>> > > > AutoParser and PDF functionality is working fine. However, the error
>> > > > with some MS Office Word documents still persists.
>> > > >
>> > > > The error message is "java.lang.IllegalArgumentException: This
>> > > > paragraph is not the first one in the table" which will eventually
>> > > > result in "Unexpected RuntimeException from
>> > > > org.apache.tika.parser.microsoft.OfficeParser"
>> > > >
>> > > > Upon some reading, it looks like it's a bug in Tika 1.5 that seems to
>> > > > have been fixed in Tika 1.6
>> > > > ( https://issues.apache.org/jira/browse/TIKA-1251 ).
>> > > > I am new to Solr / Tika and hence wondering whether I can change the
>> > > > Tika library alone to v1.6 without impacting any of the libraries
>> > > > within Solr 4.10.2? Please let me know your response and how to get
>> > > > around this issue.
>> > > >
>> > > > Many thanks in advance.
>> > > >
>> > > > Thanks & Regards
>> > > > Vijay
>> > > >
>> > > >
>> > > > On 15 April 2015 at 05:14, Shyam R <shyam.reme...@gmail.com> wrote:
>> > > >
>> > > >> Vijay,
>> > > >>
>> > > >> You could try different Excel files with different formats to rule
>> > > >> out an issue with the Tika version being used.
>> > > >>
>> > > >> Thanks
>> > > >> Murthy
>> > > >>
>> > > >> On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes <
>> trhodes...@gmail.com>
>> > > >> wrote:
>> > > >>
>> > > >> > Perhaps the PDF is protected and the content cannot be extracted?
>> > > >> >
>> > > >> > I have an unverified suspicion that the Tika shipped with Solr
>> > > >> > 4.10.2 may not support some/all Office 2013 document formats.
>> > > >> >
>> > > >> >
>> > > >> >
>> > > >> >
>> > > >> >
>> > > >> > On 4/14/2015 8:18 PM, Jack Krupansky wrote:
>> > > >> >
>> > > >> >> Try doing a manual extraction request directly to Solr (not via
>> > > >> >> SolrJ) and use the extractOnly option to see if the content is
>> > > >> >> actually extracted.
>> > > >> >>
>> > > >> >> See:
>> > > >> >> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
>> > > >> >>
>> > > >> >> Also, some PDF files actually have the content as a bitmap image,
>> > > >> >> so no text is extracted.
>> > > >> >>
>> > > >> >>
>> > > >> >> -- Jack Krupansky
>> > > >> >>
>> > > >> >> On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi
>> > Reddy
>> > > <
>> > > >> >> vijaya.bhoomire...@whishworks.com> wrote:
>> > > >> >>
>> > > >> >>> Hi,
>> > > >> >>>
>> > > >> >>> I am trying to index PDF and Microsoft Office files (.doc, .docx,
>> > > >> >>> .ppt, .pptx, .xls, and .xlsx) into Solr. I am facing the following
>> > > >> >>> issues. Please let me know what is going wrong with the indexing
>> > > >> >>> process.
>> > > >> >>>
>> > > >> >>> I am using Solr 4.10.2 with the default example server
>> > > >> >>> configuration that comes with the Solr distribution.
>> > > >> >>>
>> > > >> >>> PDF files - Indexing as such works fine, and when I query using
>> > > >> >>> *.* in the Solr Query console, metadata information is displayed
>> > > >> >>> properly. However, the PDF content field is empty. This is
>> > > >> >>> happening for all PDF files I have tried, including some
>> > > >> >>> proprietary files, PDF eBooks etc. Whatever the PDF file, content
>> > > >> >>> is not being displayed.
>> > > >> >>>
>> > > >> >>> MS Office files - For some Office files, everything works
>> > > >> >>> perfectly and the extracted content is visible in the query
>> > > >> >>> console. However, for others, I see the below error message during
>> > > >> >>> the indexing process.
>> > > >> >>>
>> > > >> >>> *Exception in thread "main"
>> > > >> >>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>> > > >> >>> org.apache.tika.exception.TikaException: Unexpected RuntimeException
>> > > >> >>> from org.apache.tika.parser.microsoft.OfficeParser*
>> > > >> >>>
>> > > >> >>>
>> > > >> >>> I am using SolrJ to index the documents and below is the code
>> > > >> >>> snippet related to indexing. Please let me know where the issue is
>> > > >> >>> occurring.
>> > > >> >>>
>> > > >> >>> static String solrServerURL = "http://localhost:8983/solr";
>> > > >> >>> static SolrServer solrServer = new HttpSolrServer(solrServerURL);
>> > > >> >>> static ContentStreamUpdateRequest indexingReq =
>> > > >> >>>     new ContentStreamUpdateRequest("/update/extract");
>> > > >> >>>
>> > > >> >>> indexingReq.addFile(file, fileType);
>> > > >> >>> indexingReq.setParam("literal.id", literalId);
>> > > >> >>> indexingReq.setParam("uprefix", "attr_");
>> > > >> >>> indexingReq.setParam("fmap.content", "content");
>> > > >> >>> indexingReq.setParam("literal.fileurl", fileURL);
>> > > >> >>> indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>> > > >> >>> solrServer.request(indexingReq);
>> > > >> >>>
>> > > >> >>> Thanks & Regards
>> > > >> >>> Vijay
>> > > >> >>>
>> > > >> >>> --
>> > > >> >>> The contents of this e-mail are confidential and for the
>> exclusive
>> > > use
>> > > >> of
>> > > >> >>> the intended recipient. If you receive this e-mail in error
>> please
>> > > >> delete
>> > > >> >>> it from your system immediately and notify us either by e-mail
>> or
>> > > >> >>> telephone. You should not copy, forward or otherwise disclose
>> the
>> > > >> content
>> > > >> >>> of the e-mail. The views expressed in this communication may
>> not
>> > > >> >>> necessarily be the view held by WHISHWORKS.
>> > > >> >>>
>> > > >> >>>
>> > > >> >
>> > > >>
>> > > >>
>> > > >> --
>> > > >> Ph: 9845704792
>> > > >>
>> > > >
>> > >
>> >
>> >
>>
>>
>
>

