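It looks like this is just a file that Tika can't handle, based on this line:

    org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser

The innermost "Caused by" in your log is a java.lang.IllegalArgumentException ("This paragraph is not the first one in the table") thrown from POI's Range.getTable while WordExtractor was handling a header/footer, so the failure is inside the Word parsing itself rather than in your Solr setup.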
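A wrapper such as TikaException tends to hide the exception that actually matters, so when a parse failure is caught in Java code it can help to walk the cause chain down to the bottom. The following is only a minimal sketch using the plain JDK: the class name RootCause and the method rootCauseOf are made up for illustration, and the exceptions built in main are stand-ins that imitate the nesting in the log, not the real Solr or Tika classes.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

class RootCause {
    /** Follows getCause() links and returns the innermost exception. */
    static Throwable rootCauseOf(Throwable t) {
        // Identity-based set guards against a cause chain that loops back on itself.
        Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        Throwable current = t;
        while (current.getCause() != null && seen.add(current)) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-ins mimicking the nesting in the quoted log (assumption: plain
        // JDK exceptions are used here instead of SolrException/TikaException).
        Throwable poi = new IllegalArgumentException(
                "This paragraph is not the first one in the table");
        Throwable tika = new RuntimeException(
                "Unexpected RuntimeException from OfficeParser", poi);
        Throwable solr = new RuntimeException("extraction failed", tika);

        Throwable root = rootCauseOf(solr);
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```

Note that this only helps where the original exception object is available, for example when parsing locally. An error reported back over HTTP generally arrives as a message string, so the server log remains the place to look in that case.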
You might be able to get some joy by parsing this file from Java directly, to see whether a more recent Tika fixes it. Here's some sample code:
http://lucidworks.com/blog/indexing-with-solrj/

Best,
Erick

On Tue, Apr 14, 2015 at 9:44 AM, Vijaya Narayana Reddy Bhoomi Reddy <vijaya.bhoomire...@whishworks.com> wrote:
> Andrea,
>
> Yes, I am using the stock schema.xml that comes with the example server of
> Solr-4.10.2, hence I am not sure why the PDF content is not getting extracted
> and put into the content field in the index.
>
> Please find the log information for the parsing error below.
>
> org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
>     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
>     at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
>     at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
>     at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
>     at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
>     at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>     at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>     at org.eclipse.jetty.server.Server.handle(Server.java:368)
>     at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
>     at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
>     at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
>     at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
>     at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:953)
>     at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
>     at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
>     at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>     at java.lang.Thread.run(Unknown Source)
> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
>     ... 32 more
> Caused by: java.lang.IllegalArgumentException: This paragraph is not the first one in the table
>     at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:932)
>     at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:188)
>     at org.apache.tika.parser.microsoft.WordExtractor.handleHeaderFooter(WordExtractor.java:172)
>     at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:98)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:199)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     ... 35 more
>
> ERROR - 2015-04-14 14:51:21.151; org.apache.solr.common.SolrException; null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
>     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
>     at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
>     at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
>     at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
>     at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
>     at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>     at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>     at org.eclipse.jetty.server.Server.handle(Server.java:368)
>     at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
>     at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
>     at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
>     at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
>     at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:953)
>     at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
>     at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
>     at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>     at java.lang.Thread.run(Unknown Source)
> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
>     ... 32 more
> Caused by: java.lang.IllegalArgumentException: This paragraph is not the first one in the table
>     at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:932)
>     at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:188)
>     at org.apache.tika.parser.microsoft.WordExtractor.handleHeaderFooter(WordExtractor.java:172)
>     at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:98)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:199)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     ... 35 more
>
> Thanks & Regards
> Vijay
>
> On 14 April 2015 at 17:06, Vijaya Narayana Reddy Bhoomi Reddy <vijaya.bhoomire...@whishworks.com> wrote:
>
>> Hi,
>>
>> Here are the solrconfig.xml and the error log from the Solr logs for your
>> reference. As mentioned earlier, I didn't make any changes to
>> solrconfig.xml; I am using the out-of-the-box file that came with the
>> default installation.
>>
>> Please let me know your thoughts on why these issues are occurring.
>>
>> Thanks & Regards
>> Vijay
>>
>> *Vijay Bhoomireddy*, Big Data Architect
>>
>> On 14 April 2015 at 15:57, Vijaya Narayana Reddy Bhoomi Reddy <vijaya.bhoomire...@whishworks.com> wrote:
>>
>>> Hi,
>>>
>>> I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
>>> .pptx, .xls, and .xlsx) into Solr. I am facing the following issues.
>>> Please let me know what is going wrong with the indexing process.
>>>
>>> I am using Solr 4.10.2 with the default example server configuration
>>> that comes with the Solr distribution.
>>>
>>> PDF files - Indexing as such works fine, and when I query using *:* in
>>> the Solr query console, metadata information is displayed properly.
>>> However, the content field is empty. This happens for all PDF files I
>>> have tried, including some proprietary files and PDF eBooks. Whatever
>>> the PDF file, its content is not displayed.
>>>
>>> MS Office files - For some Office files, everything works perfectly and
>>> the extracted content is visible in the query console. However, for
>>> others, I see the error message below during the indexing process.
>>>
>>> Exception in thread "main"
>>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>>> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
>>> org.apache.tika.parser.microsoft.OfficeParser
>>>
>>> I am using SolrJ to index the documents, and below is the code snippet
>>> related to indexing. Please let me know where the issue is occurring.
>>>
>>> static String solrServerURL = "http://localhost:8983/solr";
>>> static SolrServer solrServer = new HttpSolrServer(solrServerURL);
>>> static ContentStreamUpdateRequest indexingReq =
>>>         new ContentStreamUpdateRequest("/update/extract");
>>>
>>> indexingReq.addFile(file, fileType);
>>> indexingReq.setParam("literal.id", literalId);
>>> // Prefix fields that are not defined in the schema with "attr_"
>>> indexingReq.setParam("uprefix", "attr_");
>>> // Map the extracted body text to the "content" field
>>> indexingReq.setParam("fmap.content", "content");
>>> indexingReq.setParam("literal.fileurl", fileURL);
>>> // Commit as part of this request (waitFlush = true, waitSearcher = true)
>>> indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>>> solrServer.request(indexingReq);
>>>
>>> Thanks & Regards
>>> Vijay
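
The 'Exception in thread "main"' in the original report suggests that the indexing program stops at the first document the server rejects. One way around that is to catch the failure per file and carry on, collecting the failures for later inspection. The following is only a sketch under stated assumptions: BatchIndexSketch, FileIndexer and indexAll are hypothetical names, and the Solr call itself is represented by a callback so that the example runs with the JDK alone. In real code the callback would presumably build a fresh ContentStreamUpdateRequest for each file (as far as I know, addFile appends to the request's streams, so reusing one static request across files is probably not what you want) and pass it to solrServer.request(...).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BatchIndexSketch {
    /**
     * Hypothetical hook: one call indexes one file. In a real program this
     * would wrap the SolrJ request shown in the quoted snippet.
     */
    interface FileIndexer {
        void index(String path) throws Exception;
    }

    /** Tries every file; returns path -> error summary for the ones that failed. */
    static Map<String, String> indexAll(List<String> paths, FileIndexer indexer) {
        Map<String, String> failures = new LinkedHashMap<>();
        for (String path : paths) {
            try {
                indexer.index(path);
            } catch (Exception e) {
                // Record the failure and continue with the remaining files.
                failures.put(path, e.getClass().getSimpleName() + ": " + e.getMessage());
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Fake indexer standing in for the Solr call: it rejects one file name.
        FileIndexer fake = path -> {
            if (path.endsWith("bad.doc")) {
                throw new IllegalStateException("parser failed");
            }
        };
        Map<String, String> failures =
                indexAll(Arrays.asList("a.pdf", "bad.doc", "c.docx"), fake);
        System.out.println("Failed files: " + failures);
    }
}
```

The files that end up in the failure map are then good candidates for parsing locally against a newer Tika.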