Mine too, but I know properly closed tags are important for many use cases. Maybe we could add some tracking of open tags to XHTMLContentHandler, plus a new method to close them?
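To make the open-tag idea concrete, here is a minimal sketch using only the JDK's SAX classes. It is a standalone decorator, not a patch to Tika's XHTMLContentHandler: the class name `OpenTagTrackingHandler` and the methods `depth()` and `closeDownTo()` are hypothetical names made up for illustration. The assumption is that the caller records the depth before handing the stream to a parser and, if that parser throws, closes everything opened since then before trying the next parser.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical decorator (not a Tika class): remembers which elements are still open. */
class OpenTagTrackingHandler extends DefaultHandler {
    private final ContentHandler delegate;
    // Each entry holds {uri, localName, qName} of an element opened but not yet closed.
    private final Deque<String[]> open = new ArrayDeque<>();

    OpenTagTrackingHandler(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        open.push(new String[] {uri, localName, qName});
        delegate.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (!open.isEmpty()) {
            open.pop();
        }
        delegate.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        delegate.characters(ch, start, length);
    }

    /** Number of elements currently open. */
    int depth() {
        return open.size();
    }

    /** Emits endElement events, innermost first, until only targetDepth elements remain open. */
    void closeDownTo(int targetDepth) throws SAXException {
        while (open.size() > targetDepth) {
            String[] e = open.pop();
            delegate.endElement(e[0], e[1], e[2]);
        }
    }
}
```

With this, a fallback wrapper could call `depth()` before the first parser runs and `closeDownTo(thatDepth)` in its catch block, so the `<div>` and `<p>` left open by the failed parser are closed while `<body>` stays open for the next one.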
2018-02-07 12:59 GMT-02:00 Allison, Timothy B. <talli...@mitre.org>:

> Do we worry about properly closing tags on an exception?
>
> <body>
> <div parser="parser1">
> <p>
> kaboom
> <div parser="parser2">
> ....
>
> My focus is normally text so broken tags aren't a problem for me...but
> others?
>
> -----Original Message-----
> From: Luís Filipe Nassif [mailto:lfcnas...@gmail.com]
> Sent: Monday, February 5, 2018 5:34 PM
> To: dev@tika.apache.org
> Subject: Re: Not-yet-broken breaking changes for Tika 2?
>
> From a forensic use case it is better just saying we are trying another
> parser and not resetting the content handler, because the first parser can
> extract relevant content before the exception.
>
> To not spool everything to temp files to re-read the stream, I think we
> can create an optional setinputstreamfactory() method in TikaInputStream,
> so the user can implement an InputStreamFactory interface with a
> getInputStream method, if he does not want to pay a performance hit with
> temp files for everything.
>
> Luis
>
> On Feb 5, 2018 4:52 PM, "Chris Mattmann" <mattm...@apache.org> wrote:
>
> I think we should just say, OK now we're trying a different parser....
>
> On 2/5/18, 9:51 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:
>
> To my mind, the real challenge is what to do with content that should
> be ignored...
>
> If the strategy is back-off-on-exception (try the DOCX parser, but if
> there's an exception, use the Zip parser), what do we do with the sax
> elements that have already been written? Do we need a new handler type
> that has a reset() method?
>
> Or do we just say, hey, now we're trying a different parser...
>
> -----Original Message-----
> From: Mattmann, Chris A (1761) [mailto:chris.a.mattm...@jpl.nasa.gov]
> Sent: Monday, February 5, 2018 12:29 PM
> To: dev@tika.apache.org
> Subject: Re: Not-yet-broken breaking changes for Tika 2?
>
> Our solution is just to run the parser 2x....yes I get it will induce
> overhead, but as a start, why not?
> In short just run through the stream 2x....
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Associate Chief Technology and Innovation Officer, OCIO Manager,
> Advanced IT Research and Open Source Projects Office (1761) Manager, NSF
> and Open Source Programs and Applications Office (8212) NASA Jet Propulsion
> Laboratory Pasadena, CA 91109 USA
> Office: 180-503E, Mailstop: 180-502
> Email: chris.a.mattm...@nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Director, Information Retrieval and Data Science Group (IRDS) Adjunct
> Associate Professor, Computer Science Department University of Southern
> California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> On 2/5/18, 9:25 AM, "Nick Burch" <apa...@gagravarr.org> wrote:
>
> On Mon, 5 Feb 2018, Chris Mattmann wrote:
> > Let's have a go at implementing it! You know my thoughts (make it like
> > OODT ;) )
>
> I'm still keen to hear how we can do the text content like OODT!
>
> I have tried to copy the OODT model for the proposed metadata case though
> :)
>
> Nick
>
> > On 2/5/18, 8:37 AM, "Nick Burch" <apa...@gagravarr.org> wrote:
> >
> > Ping - anyone got any thoughts on the proposed metadata parser stuff, and
> > any ideas on the content part?
> >
> > On Tue, 2 Jan 2018, Nick Burch wrote:
> > > On Thu, 26 Oct 2017, Chris Mattmann wrote:
> > >> On collision, the precedence order defines what key takes precedence and
> > >> _overwrites_ the other. Overwrite is but one option (you could save *all*
> > >> the values it’s a multi-valued key structure so…)
> > >
> > > OK, I think that's fine.
> > > I've had a go at updating the wiki for the metadata case:
> > > https://wiki.apache.org/tika/CompositeParserDiscussion#Supplementary.2FAdditive
> > > And example Tika Config settings for it
> > > https://wiki.apache.org/tika/CompositeParserDiscussion#line-20
> > > If people are happy with how that sounds/looks, I can have a stab at
> > > implementing it, as I *think* it's quite easy
> > >
> > >
> > > However... that still leaves the Context (XHTML SAX events) case to solve!
> > >
> > > Anyone have any ideas on how we can append to or cancel/reset the Content
> > > Handler series of SAX events when we move onto a second+ parser for a file?
> > >
> > > Thanks
> > > Nick
> > >
> > >> On 10/26/17, 9:43 AM, "Nick Burch" <apa...@gagravarr.org> wrote:
> > >>
> > >> On Thu, 26 Oct 2017, Chris Mattmann wrote:
> > >> > My general approach to conflicting metadata is simply to define
> > >> > precedence orders.
> > >> >
> > >> > For example here is one documented from OODT:
> > >> >
> > >> > https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
> > >> >
> > >> > We can do similar things with Tika, e.g.,
> > >> >
> > >> > [CoreMetadata.PROPERTIES]
> > >> > [ImageParser.METADATA]
> > >> > [TikaOCR.METADATA]
> > >>
> > >> What happens if two different parsers both output the same bit of
> > >> metadata though? eg Tim's example of one giving dc:creator of Tim and the
> > >> second giving dc:creator of Chris?
> > >>
> > >>
> > >> Secondly, what about the XHTML sax events stream? I think that's
> > >> probably the harder case...
> > >>
> > >> Nick
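
For the metadata collision question quoted above (two parsers both emitting dc:creator), the two options mentioned in the thread, overwrite by precedence or keep all values, could look roughly like the sketch below. It uses plain Java collections rather than Tika's Metadata class, the class and method names (`MetadataMerge`, `mergeOverwrite`, `mergeKeepAll`) are hypothetical, and it assumes the sources are passed in order from highest to lowest precedence.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not a Tika class) showing two collision policies. */
class MetadataMerge {

    /** On collision, the values from the highest-precedence source win. */
    static Map<String, List<String>> mergeOverwrite(List<Map<String, List<String>>> byPrecedence) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> source : byPrecedence) {
            for (Map.Entry<String, List<String>> e : source.entrySet()) {
                // Earlier (higher-precedence) sources have already claimed their keys.
                merged.putIfAbsent(e.getKey(), new ArrayList<>(e.getValue()));
            }
        }
        return merged;
    }

    /** On collision, keep every value, ordered by source precedence. */
    static Map<String, List<String>> mergeKeepAll(List<Map<String, List<String>>> byPrecedence) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> source : byPrecedence) {
            for (Map.Entry<String, List<String>> e : source.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).addAll(e.getValue());
            }
        }
        return merged;
    }
}
```

Under the overwrite policy, a first parser reporting dc:creator of Tim and a second reporting Chris would leave only Tim; under keep-all, both values survive in precedence order.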