Hi, so I'm back from my long vacation :)
I'm trying to bring up a fresh Solr 6.6 standalone instance on a Windows 2012 R2 server.

Replaced:
poi-*3.15-beta1 ---> poi-*3.16
tika-*1.13 ---> tika-*1.15

I tried to index one txt file and got the error below (with the POI and Tika jars that come out of the box, this txt file indexes without errors):

SimplePostTool: WARNING: Response: <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 500 Server Error</title>
</head>
<body><h2>HTTP ERROR 500</h2>
<p>Problem accessing /solr/v20170703xxx/update/extract. Reason:
<pre> Server Error</pre></p><h3>Caused by:</h3><pre>java.lang.NoClassDefFoundError: org/apache/commons/compress/archivers/ArchiveStreamProvider
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(Unknown Source)
    at java.security.SecureClassLoader.defineClass(Unknown Source)
    at java.net.URLClassLoader.defineClass(Unknown Source)
    at java.net.URLClassLoader.access$100(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at org.apache.tika.parser.pkg.ZipContainerDetector.detectArchiveFormat(ZipContainerDetector.java:112)
    at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:83)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:115)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    at org.eclipse.jetty.server.Server.handle(Server.java:534)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
    at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.compress.archivers.ArchiveStreamProvider
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    ... 51 more
</pre>
<h3>Caused by:</h3><pre>java.lang.ClassNotFoundException: org.apache.commons.compress.archivers.ArchiveStreamProvider
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(Unknown Source)
    at java.security.SecureClassLoader.defineClass(Unknown Source)
    at java.net.URLClassLoader.defineClass(Unknown Source)
    at java.net.URLClassLoader.access$100(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at org.apache.tika.parser.pkg.ZipContainerDetector.detectArchiveFormat(ZipContainerDetector.java:112)
    at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:83)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:115)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    at org.eclipse.jetty.server.Server.handle(Server.java:534)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
    at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
    at java.lang.Thread.run(Unknown Source)
</pre>
</body>
</html>
SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 500 for URL: http://localhost:80/solr/v20170703xxx/update/extract?resource.name=xxxxxx
1 files indexed.
COMMITting Solr index changes to http://localhost:80/solr/v20170703xxx/update...
Time spent: 0:00:00.350

On Mon, Jun 5, 2017 at 7:41 PM, Allison, Timothy B. <talli...@mitre.org> wrote:

> https://issues.apache.org/jira/browse/SOLR-10335 is tracking the upgrade
> in Solr to Tika 1.15. Please chime in on that issue.
>
> You should be able to swap in POI 3.16 (final) wherever you had earlier
> versions, make sure to include: poi, poi-scratchpad, poi-ooxml,
> poi-ooxml-schemas. And make sure to include tika-parsers (1.15),
> tika-core, tika-java7, tika-xmp. Also, include commons-collections4 (which
> is new in POI w Tika 1.14). (I assume you have already added curvesapi?)
>
> -----Original Message-----
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Saturday, June 3, 2017 5:39 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
>
> Great Tim.
>
> What do I need to do to integrate it on my current installation?
>
>
> On May 31, 2017 16:24, "Allison, Timothy B." <talli...@mitre.org> wrote:
>
> Apache Tika 1.15 is now available.
>
> -----Original Message-----
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Tuesday, May 9, 2017 7:45 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
>
> Probably better to ask on the Tika list. We'll push the release asap
> after PDFBox 2.0.6 is out. Andreas plans to cut the release candidate for
> PDFBox this Friday. Tika will probably have an RC by Monday 5/15, with the
> release happening later in the week...That's if there are no surprises...[2]
>
> You can get a recent build if you'd like to test [1].
>
> Best,
>
> Tim
>
> [1] https://builds.apache.org/view/Tika/job/Tika-trunk/
> [2] If you are curious, for the comparison reports btwn PDFBox 2.0.5 and
> 2.0.6-SNAPSHOT on ~500k pdfs, see:
> http://162.242.228.174/reports/reports_pdfbox_2_0_6.tar.gz
>
> -----Original Message-----
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Tuesday, May 9, 2017 7:17 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
>
> Is there any news regarding Tika 1.15? Maybe it's already available for
> download somewhere
>
> G.
>
> On Wed, Apr 12, 2017 at 6:57 PM, Allison, Timothy B. <talli...@mitre.org>
> wrote:
>
> > The release candidate for POI was just cut...unfortunately, I think
> > after Nick Burch fixed the 'PolylineTo' issue...thank you, btw, for
> > opening that!
> >
> > That'll be done within a week unless there are surprises. Once that's
> > out, I have to update a few things, but I'd think we'd have a
> > candidate for Tika a week later, then a week for release.
> >
> > You can get nightly builds here: https://builds.apache.org/
> >
> > Please ask on the POI or Tika users lists for how to get the
> > latest/latest running, and thank you, again, for opening the issue on
> > POI's Bugzilla.
> >
> > Best,
> >
> > Tim
> >
> > -----Original Message-----
> > From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> > Sent: Wednesday, April 12, 2017 1:00 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
> >
> > When will 1.15 be released? Maybe you have some beta version and I
> > could test it :)
> >
> > SAX sounds interesting, and from info that I found in google it could
> > solve my issues.
> >
> > On Tue, Apr 11, 2017 at 10:48 PM, Allison, Timothy B.
> > <talli...@mitre.org> wrote:
> >
> > > It depends. We've been trying to make parsers more, erm, flexible,
> > > but there are some problems from which we cannot recover.
> > >
> > > Tl;dr there isn't a short answer. :(
> > >
> > > My sense is that DIH/ExtractingDocumentHandler is intended to get
> > > people up and running with Solr easily but it is not really a great
> > > idea for production. See Erick's gem:
> > > https://lucidworks.com/2012/02/14/indexing-with-solrj/
> > >
> > > As for the Tika portion... at the very least, Tika _shouldn't_ cause
> > > the ingesting process to crash. At most, it should fail at the file
> > > level and not cause greater havoc. In practice, if you're
> > > processing millions of files from the wild, you'll run into bad
> > > behavior and need to defend against permanent hangs, oom, memory leaks.
> > >
> > > Also, at the least, if there's an exception with an embedded file,
> > > Tika should catch it and keep going with the rest of the file. If
> > > this doesn't happen let us know! We are aware that some types of
> > > embedded file stream problems were causing parse failures on the
> > > entire file, and we now catch those in Tika 1.15-SNAPSHOT and don't
> > > let them percolate up through the parent file (they're reported in
> > > the metadata though).
> > >
> > > Specifically for your stack traces:
> > >
> > > For your initial problem with the missing class exceptions -- I
> > > thought we used to catch those in docx and log them. I haven't been
> > > able to track this down, though. I can look more if you have a need.
> > >
> > > For "Caused by: org.apache.poi.POIXMLException: Invalid 'Row_Type'
> > > name 'PolylineTo' ", this problem might go away if we implemented a
> > > pure SAX parser for vsdx. We just did this for docx and pptx
> > > (coming in 1.15) and these are more robust to variation because they
> > > aren't requiring a match with the ooxml schema. I haven't looked
> > > much at vsdx, but that _might_ help.
> > >
> > > For "TODO Support v5 Pointers", this isn't supported and would
> > > require contributions. However, I agree that POI shouldn't throw a
> > > Runtime exception. Perhaps open an issue in POI, or maybe we should
> > > catch this special example at the Tika level?
> > >
> > > For "Caused by: java.lang.ArrayIndexOutOfBoundsException:", the POI
> > > team _might_ be able to modify the parser to ignore a stream if
> > > there's an exception, but that's often a sign that something needs
> > > to be fixed with the parser. In short, the solution will come from
> > > POI.
> > >
> > > Best,
> > >
> > > Tim
> > >
> > > -----Original Message-----
> > > From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> > > Sent: Tuesday, April 11, 2017 1:56 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
> > >
> > > Thanks for your responses.
> > > Is there any way to ignore parsing errors and continue indexing?
> > > Right now Solr/Tika stops parsing the whole document if it finds any
> > > exception
> > >
> > > On Apr 11, 2017 19:51, "Allison, Timothy B." <talli...@mitre.org>
> > > wrote:
> > >
> > > > You might want to drop a note to the dev or user's list on Apache
> > > > POI.
> > > >
> > > > I'm not extremely familiar with the vsd(x) portion of our code base.
> > > >
> > > > The first item ("PolylineTo") may be caused by a mismatch btwn
> > > > your doc and the ooxml spec.
> > > >
> > > > The second item appears to be an unsupported feature.
> > > >
> > > > The third item may be an area for improvement within our
> > > > codebase...I can't tell just from the stacktrace.
> > > >
> > > > You'll probably get more helpful answers over on POI. Sorry, I
> > > > can't help with this...
> > > >
> > > > Best,
> > > >
> > > > Tim
> > > >
> > > > P.S.
> > > > > 3.1. ooxml-schemas-1.3.jar instead of
> > > > > poi-ooxml-schemas-3.15.jar
> > > >
> > > > You shouldn't need both. Ooxml-schemas-1.3.jar should be a super
> > > > set of poi-ooxml-schemas-3.15.jar
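
A note on the NoClassDefFoundError at the top of this message: the JVM could not load org.apache.commons.compress.archivers.ArchiveStreamProvider, which usually means no jar on the classpath contains that class. One possible explanation (an assumption, not something confirmed in this thread) is that the commons-compress jar sitting next to the swapped-in Tika jars is older than the release that introduced the class; I believe that was commons-compress 1.13, but please verify. A minimal sketch for checking which jars in a directory actually contain the class; the function name jars_containing and the directory argument are made up for illustration:

```python
import sys
import zipfile
from pathlib import Path


def jars_containing(lib_dir, class_name):
    """Return the sorted file names of .jar files under lib_dir that contain class_name."""
    # A class a.b.C is stored inside a jar as the zip entry a/b/C.class
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(lib_dir).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    hits.append(jar.name)
        except zipfile.BadZipFile:
            # Skip files that are named .jar but are not valid zip archives
            continue
    return hits


if __name__ == "__main__":
    # First argument: directory to scan (defaults to the current directory)
    lib_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    cls = "org.apache.commons.compress.archivers.ArchiveStreamProvider"
    found = jars_containing(lib_dir, cls)
    if found:
        print("found in:", ", ".join(found))
    else:
        print("no jar under", lib_dir, "contains", cls)
```

Pointing it at the directory that holds the Tika and POI jars (typically contrib/extraction/lib in a stock Solr install) shows whether any jar there provides the class. If none does, a newer commons-compress jar is probably needed alongside the Tika 1.15 jars.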