Hi,

On Fri, Mar 28, 2014 at 5:32 AM, Stefano Fornari
<stefano.forn...@gmail.com> wrote:
> On #1 I am still wondering why for indexing we need structure information.
> is there any particular reason? wouldn't make more sense to get just the
> text by default and only optionally getting the structure?

The trouble is that then each parser would need to have code for
producing both text and XHTML. Since the overhead of producing XHTML
instead of just text is pretty low, and since it's very easy for
clients that only care about the text output to just strip out the
markup, it made more sense to design the system to always produce
XHTML.
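
To illustrate how little a text-only client has to do, here is a
minimal sketch using only the JDK's SAX classes. TextOnlySketch,
TextCollector and extractText are hypothetical names made up for this
example, and the XHTML string stands in for the SAX events a parser
would emit. In Tika itself, BodyContentHandler fills this role.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not Tika code: keeps text, drops markup.
final class TextOnlySketch {

    /** Collects character data and ignores every element event. */
    static final class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Feeds the XHTML through a SAX parser and returns only the text. */
    static String extractText(String xhtml) {
        TextCollector handler = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        // Prints "Hello Tika"
        System.out.println(extractText(
                "<html><body><p>Hello <b>Tika</b></p></body></html>"));
    }
}
```

The client ignores startElement and endElement entirely, so the
structure information costs it nothing beyond the handler above.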

The same applies for document metadata. All parsers produce as much
metadata as they can, but most clients will just ignore most or all of
the returned metadata fields. However, since the overhead of producing
all the information is lower than that of adding explicit options to
control which metadata needs to be extracted and returned, it makes
sense to just let clients filter out those bits that they don't
care about.

> On #2, I expected the code you presented would not work. And in fact the
> pattern is quite odd, isn't it? What is the reason of throwing the
> exception if limiting the text read is a legal use case? (I am asking just
> to understand the background).

Yes, the pattern is a bit awkward and generally shouldn't be
recommended as it uses an exception to control the flow of the
program. However, in this case we considered it worth doing as the
alternative would have been far more complicated.

Basically we wanted to avoid having to modify each parser
implementation (even those implemented outside Tika...) to keep track
of how much content has already been extracted and instead do that
just once in the WriteOutContentHandler class. However, the only way
for the WriteOutContentHandler to signal that parsing should be
stopped is by throwing a SAXException, which is what we're doing here.
By catching the exception and inspecting it with isWriteLimitReached()
the client can determine whether this is what happened.
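
The pattern can be sketched with plain JDK SAX classes as follows.
This is only an analogue under stated assumptions, not the actual
WriteOutContentHandler code: WriteLimitSketch, LimitedTextCollector,
isLimitReached and extractAtMost are hypothetical names, and the
sketch checks a flag on the handler instead of inspecting the
exception the way isWriteLimitReached() does.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not Tika code: the handler alone enforces the limit.
final class WriteLimitSketch {

    /** Collects text and aborts parsing once the limit is exceeded. */
    static final class LimitedTextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int limit;
        private boolean limitReached = false;

        LimitedTextCollector(int limit) {
            this.limit = limit;
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            int remaining = limit - text.length();
            if (length > remaining) {
                // Keep what still fits, then stop the parser the only
                // way a SAX handler can: by throwing a SAXException.
                text.append(ch, start, remaining);
                limitReached = true;
                throw new SAXException("write limit reached");
            }
            text.append(ch, start, length);
        }

        boolean isLimitReached() {
            return limitReached;
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Returns at most {@code limit} characters of text from the input. */
    static String extractAtMost(String xhtml, int limit) {
        LimitedTextCollector handler = new LimitedTextCollector(limit);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (SAXException e) {
            // Expected when the limit was hit; anything else is a real error.
            if (!handler.isLimitReached()) {
                throw new IllegalStateException("Parse failed", e);
            }
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("Parse failed", e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        // Prints "Hello"
        System.out.println(extractAtMost("<p>Hello world</p>", 5));
    }
}
```

Note that the code producing the SAX events needs no knowledge of the
limit at all, which is the point of doing the bookkeeping in one
handler.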

BR,

Jukka Zitting
