Re: Indexing PDF and MS Office files

Erick Erickson Tue, 14 Apr 2015 09:56:49 -0700

Looks like this is just a file that Tika can't handle, based on this line:

bq: org.apache.tika.exception.TikaException: Unexpected
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser


You might be able to get some joy by parsing this file from Java yourself,
to see whether a more recent Tika fixes it. Here's some sample code:

http://lucidworks.com/blog/indexing-with-solrj/
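
To sketch the general shape (an illustration only, not code taken from that
post): run the extraction one file at a time, catch whatever the parser
throws, and record the innermost "Caused by" so a single bad document doesn't
abort the whole run. The class, the Extractor interface, and the method names
below are all made up for the example; in practice Extractor would wrap
whatever Tika call you end up using.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: extracts text file by file and isolates parser failures. */
class SafeExtraction {

    /** Stand-in for the real parser call (assumption: e.g. a thin wrapper around Tika). */
    interface Extractor {
        String extract(String fileName) throws Exception;
    }

    /** Walks the "Caused by" chain and returns the innermost exception. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    /**
     * Runs the extractor on every file. Files that parse are returned with
     * their text; files that fail are recorded in failures, keyed by file
     * name, with the root-cause class and message.
     */
    static Map<String, String> extractAll(List<String> files,
                                          Extractor extractor,
                                          Map<String, String> failures) {
        Map<String, String> extracted = new LinkedHashMap<>();
        for (String file : files) {
            try {
                extracted.put(file, extractor.extract(file));
            } catch (Exception e) {
                // Keep going: note why this file failed and move on to the next one.
                Throwable root = rootCause(e);
                failures.put(file, root.getClass().getName() + ": " + root.getMessage());
            }
        }
        return extracted;
    }
}
```

Run against a trace like yours, the entry recorded for the failing .doc would
presumably be the IllegalArgumentException from POI rather than the generic
TikaException wrapper, which makes it easier to tell which files to retry
with a newer Tika.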

Best,
Erick

On Tue, Apr 14, 2015 at 9:44 AM, Vijaya Narayana Reddy Bhoomi Reddy
<vijaya.bhoomire...@whishworks.com> wrote:
> Andrea,
>
> Yes, I am using the stock schema.xml that comes with the example server of
> Solr 4.10.2. Hence I am not sure why the PDF content is not getting
> extracted and put into the content field in the index.
>
> Please find the log information for the parsing error below.
>
>
> org.apache.solr.common.SolrException; org.apache.solr.common.SolrException:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.microsoft.OfficeParser@138b0c5
> at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
> at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> at
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
> at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
> at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
> at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
> at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> at
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> at org.eclipse.jetty.server.Server.handle(Server.java:368)
> at
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
> at
> org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
> at
> org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
> at
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:953)
> at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
> at
> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
> at
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> at java.lang.Thread.run(Unknown Source)
> Caused by: org.apache.tika.exception.TikaException: Unexpected
> RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
> ... 32 more
> Caused by: java.lang.IllegalArgumentException: This paragraph is not the
> first one in the table
> at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:932)
> at
> org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:188)
> at
> org.apache.tika.parser.microsoft.WordExtractor.handleHeaderFooter(WordExtractor.java:172)
> at
> org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:98)
> at
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:199)
> at
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> ... 35 more
>
> ERROR - 2015-04-14 14:51:21.151; org.apache.solr.common.SolrException;
> null:org.apache.solr.common.SolrException:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.microsoft.OfficeParser@138b0c5
>
>
> Thanks & Regards
> Vijay
>
>
> On 14 April 2015 at 17:06, Vijaya Narayana Reddy Bhoomi Reddy <
> vijaya.bhoomire...@whishworks.com> wrote:
>
>> Hi,
>>
>> Here are the solrconfig.xml and the error log from the Solr logs for your
>> reference. As mentioned earlier, I didn't make any changes to
>> solrconfig.xml; I am using the file as it came out of the box with the
>> default installation.
>>
>> Please let me know your thoughts on why these issues are occurring.
>>
>> Thanks & Regards
>> Vijay
>>
>> On 14 April 2015 at 15:57, Vijaya Narayana Reddy Bhoomi Reddy <
>> vijaya.bhoomire...@whishworks.com> wrote:
>>
>>> Hi,
>>>
>>> I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
>>> .pptx, .xls, and .xlsx) into Solr. I am facing the following issues.
>>> Please let me know what is going wrong with the indexing process.
>>>
>>> I am using Solr 4.10.2 with the default example server configuration
>>> that comes with the Solr distribution.
>>>
>>> PDF files - Indexing as such works fine, and when I query using *:* in
>>> the Solr Query console, the metadata information is displayed properly.
>>> However, the PDF content field is empty. This happens for every PDF
>>> file I have tried, including some proprietary files, PDF eBooks,
>>> etc. Whatever the PDF file, the content is not displayed.
>>>
>>> MS Office files - For some Office files, everything works perfectly and
>>> the extracted content is visible in the query console. However, for others,
>>> I see the error message below during the indexing process.
>>>
>>> *Exception in thread "main"
>>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>>> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
>>> org.apache.tika.parser.microsoft.OfficeParser*
>>>
>>>
>>> I am using SolrJ to index the documents and below is the code snippet
>>> related to indexing. Please let me know where the issue is occurring.
>>>
>>> static String solrServerURL = "http://localhost:8983/solr";
>>> static SolrServer solrServer = new HttpSolrServer(solrServerURL);
>>> static ContentStreamUpdateRequest indexingReq =
>>>         new ContentStreamUpdateRequest("/update/extract");
>>>
>>> indexingReq.addFile(file, fileType);
>>> indexingReq.setParam("literal.id", literalId);
>>> indexingReq.setParam("uprefix", "attr_");
>>> indexingReq.setParam("fmap.content", "content");
>>> indexingReq.setParam("literal.fileurl", fileURL);
>>> indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>>> solrServer.request(indexingReq);
>>>
>>> Thanks & Regards
>>> Vijay
>>>
>>>
>>>
>>
>
