Re: Indexing PDF and MS Office files

Vijaya Narayana Reddy Bhoomi Reddy Tue, 14 Apr 2015 09:46:20 -0700

Andrea,

Yes, I am using the stock schema.xml that ships with the example server of
Solr 4.10.2, so I am not sure why the PDF content is not being extracted and
put into the content field in the index.

Please find the log output for the parsing error below.

org.apache.solr.common.SolrException; org.apache.solr.common.SolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser@138b0c5
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:368)
at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:953)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.TikaException: Unexpected
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
... 32 more
Caused by: java.lang.IllegalArgumentException: This paragraph is not the
first one in the table
at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:932)
at
org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:188)
at
org.apache.tika.parser.microsoft.WordExtractor.handleHeaderFooter(WordExtractor.java:172)
at
org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:98)
at
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:199)
at
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
... 35 more
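
Reading the trace above from the bottom up, the innermost "Caused by" is the
IllegalArgumentException raised in org.apache.poi.hwpf.usermodel.Range.getTable,
which Tika then wraps in a TikaException and Solr in a SolrException. As a
minimal sketch of how such a chain can be unwound in plain Java (the class
name CauseChain and the stand-in exceptions in main are hypothetical and use
only JDK types, not the real Solr/Tika classes):

```java
final class CauseChain {

    /** Returns the innermost cause of t, or t itself if it has no cause. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        // Follow getCause() links; the identity check guards against a
        // throwable that reports itself as its own cause.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-ins that mirror the three layers visible in the log.
        Throwable poi = new IllegalArgumentException(
                "This paragraph is not the first one in the table");
        Throwable tika = new Exception(
                "Unexpected RuntimeException from OfficeParser", poi);
        Throwable solr = new RuntimeException(tika);

        Throwable root = rootCause(solr);
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```

This only applies where the full exception chain is available, as it is in
the server-side log; it makes no claim about what the SolrJ client receives.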

Thanks & Regards
Vijay

On 14 April 2015 at 17:06, Vijaya Narayana Reddy Bhoomi Reddy <
vijaya.bhoomire...@whishworks.com> wrote:

> Hi,
>
> Here are the solrconfig.xml and the error log from the Solr logs for your
> reference. As mentioned earlier, I didn't make any changes to
> solrconfig.xml; I am using the file that came out of the box with the
> default installation.
>
> Please let me know your thoughts on why these issues are occurring.
>
> Thanks & Regards
> Vijay
>
>
> *Vijay Bhoomireddy*, Big Data Architect
>
> 1000 Great West Road, Brentford, London, TW8 9DW
> *T:  +44 20 3475 7980*
> *M: **+44 7481 298 360*
> *W: *www.whishworks.com <http://www.whishworks.com/>
>
> On 14 April 2015 at 15:57, Vijaya Narayana Reddy Bhoomi Reddy <
> vijaya.bhoomire...@whishworks.com> wrote:
>
>> Hi,
>>
>> I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
>> .pptx, .xls, and .xlsx) into Solr and am facing the following issues.
>> Please let me know what is going wrong with the indexing process.
>>
>> I am using Solr 4.10.2 with the default example server configuration
>> that comes with the Solr distribution.
>>
>> PDF files - Indexing itself works fine, and when I query using *:* in
>> the Solr query console, the metadata is displayed properly.
>> However, the PDF content field is empty. This happens for every PDF
>> file I have tried, including some proprietary files and PDF eBooks.
>> Whatever the PDF file, the content is not displayed.
>>
>> MS Office files - For some Office files, everything works perfectly and
>> the extracted content is visible in the query console. However, for others,
>> I see the error message below during the indexing process.
>>
>> *Exception in thread "main"
>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
>> org.apache.tika.parser.microsoft.OfficeParser*
>>
>>
>> I am using SolrJ to index the documents and below is the code snippet
>> related to indexing. Please let me know where the issue is occurring.
>>
>> static String solrServerURL = "http://localhost:8983/solr";
>> static SolrServer solrServer = new HttpSolrServer(solrServerURL);
>> static ContentStreamUpdateRequest indexingReq =
>>         new ContentStreamUpdateRequest("/update/extract");
>>
>> indexingReq.addFile(file, fileType);
>> // Set the document id explicitly.
>> indexingReq.setParam("literal.id", literalId);
>> // Prefix fields that are not defined in the schema with attr_.
>> indexingReq.setParam("uprefix", "attr_");
>> // Map the extracted body text to the "content" field.
>> indexingReq.setParam("fmap.content", "content");
>> indexingReq.setParam("literal.fileurl", fileURL);
>> // Commit right after this request.
>> indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>> solrServer.request(indexingReq);
>>
>> Thanks & Regards
>> Vijay
>>
>>
>>
>
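
Since the quoted report says the indexing run stops with 'Exception in thread
"main"' on some Office files, one possible way to keep a batch going is to
catch the failure per file and collect the paths that could not be indexed.
The following is only a sketch under stated assumptions: BatchIndexer and
FileIndexer are hypothetical names, and FileIndexer stands in for whatever
code actually sends one file to Solr (for example the SolrJ calls quoted
above), so no SolrJ API is used here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BatchIndexer {

    /** Hypothetical stand-in for the code that sends a single file to Solr. */
    interface FileIndexer {
        void index(String path) throws Exception;
    }

    /**
     * Tries to index every path and returns the ones that failed,
     * instead of aborting the whole run on the first exception.
     */
    static List<String> indexAll(List<String> paths, FileIndexer indexer) {
        List<String> failed = new ArrayList<>();
        for (String path : paths) {
            try {
                indexer.index(path);
            } catch (Exception e) {
                // Record the failure and continue with the next file.
                System.err.println("Skipping " + path + ": " + e.getMessage());
                failed.add(path);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Fake indexer that simulates a parser failure for .doc files only.
        FileIndexer fake = path -> {
            if (path.endsWith(".doc")) {
                throw new IllegalStateException("simulated parser failure");
            }
        };
        List<String> failed =
                indexAll(Arrays.asList("a.pdf", "b.doc", "c.docx"), fake);
        System.out.println("Failed: " + failed);
    }
}
```

This does not fix the underlying parser error; it only isolates the affected
files so they can be examined separately.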

