Andrea,

Yes, I am using the stock schema.xml that comes with the example server of Solr 4.10.2, so I am not sure why the PDF content is not being extracted and put into the content field in the index.
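One way to narrow this down might be to ask the extracting handler what Tika actually returns for a PDF, without indexing anything. The sketch below only builds the request URL for such a dry run, using the handler's `extractOnly` and `extractFormat` parameters. The class and method names (`ExtractOnlyUrl`, `build`, `dryRunParams`) are made up for illustration, and the base URL is taken from the SolrJ snippet quoted further down.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Solr or SolrJ.
final class ExtractOnlyUrl {

    /** URL-encodes one query component as UTF-8. */
    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every JVM.
            throw new IllegalStateException(e);
        }
    }

    /** Appends the /update/extract path and the encoded parameters to the base URL. */
    static String build(String solrBaseUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(solrBaseUrl).append("/update/extract");
        char separator = '?';
        for (Map.Entry<String, String> p : params.entrySet()) {
            url.append(separator).append(enc(p.getKey())).append('=').append(enc(p.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    /** Parameters for a dry run: the extracted text is returned, nothing is indexed. */
    static Map<String, String> dryRunParams() {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("extractOnly", "true");
        params.put("extractFormat", "text");
        return params;
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8983/solr", dryRunParams()));
    }
}
```

If posting a PDF to that URL returns the body text, extraction is working and the empty field is more likely a schema or field-mapping question. If the text is empty there too, the problem is on the Tika side. The same two parameters could presumably be set on the existing SolrJ request with `setParam`.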
Please find the log information for the parsing error below.

org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:368)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
    at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:953)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
    ... 32 more
Caused by: java.lang.IllegalArgumentException: This paragraph is not the first one in the table
    at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:932)
    at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:188)
    at org.apache.tika.parser.microsoft.WordExtractor.handleHeaderFooter(WordExtractor.java:172)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:98)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:199)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 35 more

ERROR - 2015-04-14 14:51:21.151; org.apache.solr.common.SolrException; null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:368)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
    at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:953)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
    ... 32 more
Caused by: java.lang.IllegalArgumentException: This paragraph is not the first one in the table
    at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:932)
    at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:188)
    at org.apache.tika.parser.microsoft.WordExtractor.handleHeaderFooter(WordExtractor.java:172)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:98)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:199)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 35 more

Thanks & Regards
Vijay

On 14 April 2015 at 17:06, Vijaya Narayana Reddy Bhoomi Reddy <vijaya.bhoomire...@whishworks.com> wrote:

> Hi,
>
> Here are the solrconfig.xml and the error log from the Solr logs for your
> reference. As mentioned earlier, I didn't make any changes to
> solrconfig.xml; I am using the out-of-the-box file that came
> with the default installation.
>
> Please let me know your thoughts on why these issues are occurring.
>
> Thanks & Regards
> Vijay
>
> Vijay Bhoomireddy, Big Data Architect
>
> 1000 Great West Road, Brentford, London, TW8 9DW
> T: +44 20 3475 7980
> M: +44 7481 298 360
> W: www.whishworks.com
>
> On 14 April 2015 at 15:57, Vijaya Narayana Reddy Bhoomi Reddy <vijaya.bhoomire...@whishworks.com> wrote:
>
>> Hi,
>>
>> I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
>> .pptx, .xls, and .xlsx) into Solr. I am facing the following issues.
>> Please let me know what is going wrong with the indexing
>> process.
>>
>> I am using Solr 4.10.2 with the default example server configuration
>> that comes with the Solr distribution.
>>
>> PDF files - Indexing as such works fine, and when I query using *.* in
>> the Solr Query console, the metadata is displayed properly.
>> However, the content field is empty. This happens for all PDF
>> files I have tried, including some proprietary files and PDF eBooks.
>> Whatever the PDF file, the content is not displayed.
>>
>> MS Office files - For some Office files, everything works perfectly and
>> the extracted content is visible in the query console. However, for others,
>> I see the error message below during the indexing process.
>>
>> Exception in thread "main"
>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
>> org.apache.tika.parser.microsoft.OfficeParser
>>
>>
>> I am using SolrJ to index the documents, and below is the code snippet
>> related to indexing. Please let me know where the issue is occurring.
>>
>> static String solrServerURL = "http://localhost:8983/solr";
>> static SolrServer solrServer = new HttpSolrServer(solrServerURL);
>> static ContentStreamUpdateRequest indexingReq =
>>         new ContentStreamUpdateRequest("/update/extract");
>>
>> indexingReq.addFile(file, fileType);
>> indexingReq.setParam("literal.id", literalId);
>> indexingReq.setParam("uprefix", "attr_");
>> indexingReq.setParam("fmap.content", "content");
>> indexingReq.setParam("literal.fileurl", fileURL);
>> indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>> solrServer.request(indexingReq);
>>
>> Thanks & Regards
>> Vijay

--
The contents of this e-mail are confidential and for the exclusive use of the intended recipient. If you receive this e-mail in error please delete it from your system immediately and notify us either by e-mail or telephone. You should not copy, forward or otherwise disclose the content of the e-mail. The views expressed in this communication may not necessarily be the view held by WHISHWORKS.
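
The `Exception in thread "main"` in the first message suggests the indexing program stops at the first Office file that Tika cannot parse. Below is a minimal sketch, under stated assumptions, of carrying on past such files and recording the innermost exception for each. `BatchIndexer`, `FileIndexer`, `indexAll` and `rootCause` are hypothetical names. The `FileIndexer` callback stands in for the SolrJ request in the quoted snippet, which is not reproduced here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; the real indexing call is supplied by the caller.
final class BatchIndexer {

    /** Stand-in for whatever sends one file to Solr (an assumption, not SolrJ API). */
    interface FileIndexer {
        void index(String path) throws Exception;
    }

    /** Follows getCause() until the innermost exception is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    /** Tries every path; returns failed path -> "ExceptionType: message" of the root cause. */
    static Map<String, String> indexAll(List<String> paths, FileIndexer indexer) {
        Map<String, String> failures = new LinkedHashMap<>();
        for (String path : paths) {
            try {
                indexer.index(path);
            } catch (Exception e) {
                Throwable root = rootCause(e);
                failures.put(path, root.getClass().getSimpleName() + ": " + root.getMessage());
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Fake indexer that fails for .doc files, mimicking the nested exceptions in the log.
        Map<String, String> failures = indexAll(Arrays.asList("ok.pdf", "bad.doc"), path -> {
            if (path.endsWith(".doc")) {
                throw new RuntimeException("Unexpected RuntimeException",
                        new IllegalArgumentException("This paragraph is not the first one in the table"));
            }
        });
        System.out.println(failures);
    }
}
```

On the client side the remote exception likely carries only the server's message, so the nested POI cause would still have to be read from the Solr log. The loop mainly keeps one unparseable .doc file from aborting the whole run.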