When will 1.15 be released? Maybe you have a beta version that I could test :)
SAX sounds interesting, and from what I found on Google it could solve my issues.

On Tue, Apr 11, 2017 at 10:48 PM, Allison, Timothy B. <talli...@mitre.org> wrote:

> It depends. We've been trying to make parsers more, erm, flexible, but
> there are some problems from which we cannot recover.
>
> Tl;dr there isn't a short answer. :(
>
> My sense is that DIH/ExtractingDocumentHandler is intended to get people
> up and running with Solr easily but it is not really a great idea for
> production. See Erick's gem:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> As for the Tika portion... at the very least, Tika _shouldn't_ cause the
> ingesting process to crash. At most, it should fail at the file level and
> not cause greater havoc. In practice, if you're processing millions of
> files from the wild, you'll run into bad behavior and need to defend
> against permanent hangs, oom, memory leaks.
>
> Also, at the least, if there's an exception with an embedded file, Tika
> should catch it and keep going with the rest of the file. If this doesn't
> happen let us know! We are aware that some types of embedded file stream
> problems were causing parse failures on the entire file, and we now catch
> those in Tika 1.15-SNAPSHOT and don't let them percolate up through the
> parent file (they're reported in the metadata though).
>
> Specifically for your stack traces:
>
> For your initial problem with the missing class exceptions -- I thought we
> used to catch those in docx and log them. I haven't been able to track
> this down, though. I can look more if you have a need.
>
> For "Caused by: org.apache.poi.POIXMLException: Invalid 'Row_Type' name
> 'PolylineTo' ", this problem might go away if we implemented a pure SAX
> parser for vsdx. We just did this for docx and pptx (coming in 1.15) and
> these are more robust to variation because they aren't requiring a match
> with the ooxml schema. I haven't looked much at vsdx, but that _might_
> help.
>
> For "TODO Support v5 Pointers", this isn't supported and would require
> contributions. However, I agree that POI shouldn't throw a Runtime
> exception. Perhaps open an issue in POI, or maybe we should catch this
> special example at the Tika level?
>
> For "Caused by: java.lang.ArrayIndexOutOfBoundsException:", the POI team
> _might_ be able to modify the parser to ignore a stream if there's an
> exception, but that's often a sign that something needs to be fixed with
> the parser. In short, the solution will come from POI.
>
> Best,
>
> Tim
>
> -----Original Message-----
> From: Gytis Mikuciunas [mailto:gyt...@gmail.com]
> Sent: Tuesday, April 11, 2017 1:56 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
>
> Thanks for your responses.
> Are there any possibilities to ignore parsing errors and continue indexing?
> because now solr/tika stops parsing whole document if it finds any
> exception
>
> On Apr 11, 2017 19:51, "Allison, Timothy B." <talli...@mitre.org> wrote:
>
> > You might want to drop a note to the dev or user's list on Apache POI.
> >
> > I'm not extremely familiar with the vsd(x) portion of our code base.
> >
> > The first item ("PolylineTo") may be caused by a mismatch btwn your
> > doc and the ooxml spec.
> >
> > The second item appears to be an unsupported feature.
> >
> > The third item may be an area for improvement within our codebase...I
> > can't tell just from the stacktrace.
> >
> > You'll probably get more helpful answers over on POI. Sorry, I can't
> > help with this...
> >
> > Best,
> >
> > Tim
> >
> > P.S.
> > > 3.1. ooxml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar
> >
> > You shouldn't need both. Ooxml-schemas-1.3.jar should be a super set
> > of poi-ooxml-schemas-3.15.jar
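
For anyone who drives extraction from their own client code instead of DIH, the "fail at the file level and keep going" idea from the thread could look roughly like the sketch below. It is plain Java with no Tika or Solr calls: `TolerantIndexer`, `TextExtractor`, `Result` and `extractAll` are hypothetical names invented for this illustration, and the extractor is only a stand-in for whatever parsing call is actually used. It catches exceptions and linkage errors (such as missing classes) per file, and it does not defend against hangs or out-of-memory errors, which would need a timeout or a separate process.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TolerantIndexer {

    /** Hypothetical stand-in for a real extraction call; this is NOT a Tika API. */
    interface TextExtractor {
        String extract(String path) throws Exception;
    }

    /** Outcome of one batch: extracted text per file, plus an error message per failed file. */
    static final class Result {
        final Map<String, String> extracted = new LinkedHashMap<>();
        final Map<String, String> failures = new LinkedHashMap<>();
    }

    /**
     * Runs the extractor on every path. A failure on one file is recorded
     * and the loop moves on, so one bad document does not stop the batch.
     */
    static Result extractAll(List<String> paths, TextExtractor extractor) {
        Result result = new Result();
        for (String path : paths) {
            try {
                result.extracted.put(path, extractor.extract(path));
            } catch (Exception | LinkageError e) {
                // LinkageError covers NoClassDefFoundError (a missing jar).
                // OutOfMemoryError is deliberately not caught here.
                result.failures.put(path, e.toString());
            }
        }
        return result;
    }
}
```

The successfully extracted entries can then be sent to Solr, while the failures map gives a list of files to inspect or report upstream.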