Is there any news regarding Tika 1.15? Maybe it's already available for download somewhere?
G.

On Wed, Apr 12, 2017 at 6:57 PM, Allison, Timothy B. <talli...@mitre.org> wrote:

> The release candidate for POI was just cut...unfortunately, I think after
> Nick Burch fixed the 'PolylineTo' issue...thank you, btw, for opening that!
>
> That'll be done within a week unless there are surprises. Once that's
> out, I have to update a few things, but I'd think we'd have a candidate for
> Tika a week later, then a week for release.
>
> You can get nightly builds here: https://builds.apache.org/
>
> Please ask on the POI or Tika users lists for how to get the latest/latest
> running, and thank you, again, for opening the issue on POI's Bugzilla.
>
> Best,
>
> Tim
>
> -----Original Message-----
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Wednesday, April 12, 2017 1:00 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
>
> when 1.15 will be released? maybe you have some beta version and I could
> test it :)
>
> SAX sounds interesting, and from info that I found in google it could
> solve my issues.
>
> On Tue, Apr 11, 2017 at 10:48 PM, Allison, Timothy B. <talli...@mitre.org>
> wrote:
>
> > It depends. We've been trying to make parsers more, erm, flexible,
> > but there are some problems from which we cannot recover.
> >
> > Tl;dr there isn't a short answer. :(
> >
> > My sense is that DIH/ExtractingDocumentHandler is intended to get
> > people up and running with Solr easily but it is not really a great
> > idea for production. See Erick's gem:
> > https://lucidworks.com/2012/02/14/indexing-with-solrj/
> >
> > As for the Tika portion... at the very least, Tika _shouldn't_ cause
> > the ingesting process to crash. At most, it should fail at the file
> > level and not cause greater havoc. In practice, if you're processing
> > millions of files from the wild, you'll run into bad behavior and need
> > to defend against permanent hangs, oom, memory leaks.
> >
> > Also, at the least, if there's an exception with an embedded file,
> > Tika should catch it and keep going with the rest of the file. If
> > this doesn't happen let us know! We are aware that some types of
> > embedded file stream problems were causing parse failures on the
> > entire file, and we now catch those in Tika 1.15-SNAPSHOT and don't
> > let them percolate up through the parent file (they're reported in the
> > metadata though).
> >
> > Specifically for your stack traces:
> >
> > For your initial problem with the missing class exceptions -- I
> > thought we used to catch those in docx and log them. I haven't been
> > able to track this down, though. I can look more if you have a need.
> >
> > For "Caused by: org.apache.poi.POIXMLException: Invalid 'Row_Type'
> > name 'PolylineTo' ", this problem might go away if we implemented a
> > pure SAX parser for vsdx. We just did this for docx and pptx (coming
> > in 1.15) and these are more robust to variation because they aren't
> > requiring a match with the ooxml schema. I haven't looked much at
> > vsdx, but that _might_ help.
> >
> > For "TODO Support v5 Pointers", this isn't supported and would require
> > contributions. However, I agree that POI shouldn't throw a Runtime
> > exception. Perhaps open an issue in POI, or maybe we should catch
> > this special example at the Tika level?
> >
> > For "Caused by: java.lang.ArrayIndexOutOfBoundsException:", the POI
> > team _might_ be able to modify the parser to ignore a stream if
> > there's an exception, but that's often a sign that something needs to
> > be fixed with the parser. In short, the solution will come from POI.
> >
> > Best,
> >
> > Tim
> >
> > -----Original Message-----
> > From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> > Sent: Tuesday, April 11, 2017 1:56 PM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
> >
> > Thanks for your responses.
> > Is there any possibility to ignore parsing errors and continue
> > indexing? Right now Solr/Tika stops parsing the whole document if it
> > finds any exception.
> >
> > On Apr 11, 2017 19:51, "Allison, Timothy B." <talli...@mitre.org> wrote:
> >
> > > You might want to drop a note to the dev or user's list on Apache POI.
> > >
> > > I'm not extremely familiar with the vsd(x) portion of our code base.
> > >
> > > The first item ("PolylineTo") may be caused by a mismatch btwn your
> > > doc and the ooxml spec.
> > >
> > > The second item appears to be an unsupported feature.
> > >
> > > The third item may be an area for improvement within our
> > > codebase...I can't tell just from the stacktrace.
> > >
> > > You'll probably get more helpful answers over on POI. Sorry, I
> > > can't help with this...
> > >
> > > Best,
> > >
> > > Tim
> > >
> > > P.S.
> > > > 3.1. ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar
> > >
> > > You shouldn't need both. Ooxml-schemas-1.3.jar should be a superset
> > > of poi-ooxml-schemas-3.15.jar
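
For anyone reading this thread later: the advice above about failing "at the file level" and defending against permanent hangs can be sketched roughly as follows. This is only an illustration using the JDK's java.util.concurrent; `GuardedIngest`, `Extractor`, and `Result` are hypothetical names made up for this sketch, not Tika, POI, or Solr APIs, and the `Extractor` stands in for whatever parsing call you actually use. Note that an in-process timeout like this does not truly protect against OOM or memory leaks; running the parser in a separate process is likely more robust for that.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class GuardedIngest {

    /** Hypothetical stand-in for a real extraction call (for example, a Tika parse). */
    interface Extractor {
        String extract(String path) throws Exception;
    }

    /** Outcome for one file: extracted text on success, or an error description. */
    static final class Result {
        final String text;
        final String error;
        Result(String text, String error) { this.text = text; this.error = error; }
        boolean ok() { return error == null; }
    }

    /** Processes every path; a failure or hang on one file never aborts the rest. */
    static Map<String, Result> ingest(List<String> paths, Extractor extractor, long timeoutMillis) {
        Map<String, Result> results = new LinkedHashMap<>();
        for (String path : paths) {
            // Fresh daemon worker per file, so a stuck parse cannot block later
            // files or keep the JVM alive at shutdown.
            ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "extract-" + path);
                t.setDaemon(true);
                return t;
            });
            Callable<String> task = () -> extractor.extract(path);
            Future<String> future = worker.submit(task);
            try {
                String text = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
                results.put(path, new Result(text, null));
            } catch (TimeoutException e) {
                future.cancel(true); // interrupt the worker; a truly stuck parser may ignore this
                results.put(path, new Result(null, "timed out after " + timeoutMillis + " ms"));
            } catch (ExecutionException e) {
                // The parser threw: record the cause for this file and move on.
                results.put(path, new Result(null, String.valueOf(e.getCause())));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                results.put(path, new Result(null, "interrupted"));
                break;
            } finally {
                worker.shutdownNow();
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Fake extractor that fails on one (made-up) file name.
        Extractor fake = p -> {
            if (p.contains("broken")) throw new IllegalStateException("Invalid 'Row_Type'");
            return "text of " + p;
        };
        ingest(Arrays.asList("a.vsdx", "broken.vsdx", "b.vsdx"), fake, 1000)
            .forEach((p, r) -> System.out.println(p + " -> " + (r.ok() ? r.text : "FAILED: " + r.error)));
    }
}
```

The per-file results can then be used to index the successes (e.g. via SolrJ, as in the article linked above) and to log or retry the failures separately.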