Hi,

On Fri, Mar 28, 2014 at 5:32 AM, Stefano Fornari <stefano.forn...@gmail.com> wrote:
> On #1 I am still wondering why for indexing we need structure information.
> is there any particular reason? wouldn't make more sense to get just the
> text by default and only optionally getting the structure?

The trouble is that then each parser would need to have code for producing both text and XHTML. Since the overhead of producing XHTML instead of just text is pretty low, and since it's very easy for clients that only care about the text output to just strip out the markup, it made more sense to design the system to always produce XHTML.

The same applies to document metadata. All parsers produce as much metadata as they can, but most clients will just ignore most or all of the returned metadata fields. However, since the overhead of producing all the information is lower than that of adding explicit options to control which metadata needs to be extracted and returned, it makes sense to just let clients filter out those bits that they don't care about.

> On #2, I expected the code you presented would not work. And in fact the
> pattern is quite odd, isn't it? What is the reason of throwing the
> exception if limiting the text read is a legal use case? (I am asking just
> to understand the background).

Yes, the pattern is a bit awkward and generally shouldn't be recommended, as it uses an exception to control the flow of the program. However, in this case we considered it worth doing as the alternative would have been far more complicated. Basically we wanted to avoid having to modify each parser implementation (even those implemented outside Tika...) to keep track of how much content has already been extracted, and instead do that just once in the WriteOutContentHandler class. However, the only way for the WriteOutContentHandler to signal that parsing should be stopped is by throwing a SAXException, which is what we're doing here. By catching the exception and inspecting it with isWriteLimitReached() the client can determine whether this is what happened.

BR,

Jukka Zitting
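
For reference, the write-limit pattern described above can be sketched with nothing but the SAX classes that ship with the JDK. This is not Tika's code: the names WriteLimitSketch, LimitedTextHandler, LimitReachedException and extractText are hypothetical, and the handler only imitates what the message says WriteOutContentHandler does. It also illustrates the first point, since a client that only wants text can simply ignore the element events and keep the character data.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WriteLimitSketch {

    /** Hypothetical marker exception, so the client can tell a limit stop from a real error. */
    static class LimitReachedException extends SAXException {
        LimitReachedException(String message) {
            super(message);
        }
    }

    /** Keeps character data only (markup is dropped) and aborts parsing at the limit. */
    static class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int limit;

        LimitedTextHandler(int limit) {
            this.limit = limit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            int remaining = limit - text.length();
            if (length <= remaining) {
                text.append(ch, start, length);
            } else {
                // Keep what still fits, then stop the parser the only way SAX allows.
                text.append(ch, start, remaining);
                throw new LimitReachedException("write limit of " + limit + " characters reached");
            }
        }

        String text() {
            return text.toString();
        }
    }

    /** Assumed analogue of isWriteLimitReached(): checks the exception and its causes. */
    static boolean isWriteLimitReached(Throwable e) {
        for (Throwable t = e; t != null; t = t.getCause()) {
            if (t instanceof LimitReachedException) {
                return true;
            }
        }
        return false;
    }

    /** Returns at most {@code limit} characters of plain text from well-formed XHTML. */
    static String extractText(String xhtml, int limit) {
        LimitedTextHandler handler = new LimitedTextHandler(limit);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (SAXException e) {
            if (!isWriteLimitReached(e)) {
                // A genuine parse failure, not our own stop signal.
                throw new IllegalArgumentException("could not parse input", e);
            }
            // Otherwise the limit was hit; the text collected so far is still usable.
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.text();
    }
}
```

The parser itself knows nothing about the limit; only the handler counts characters, which is the design trade-off described above: one place to track the limit, at the cost of using an exception as the stop signal.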