Re: Multiple documents per input stream

Ken Krugler Wed, 23 Sep 2009 10:49:07 -0700

Hi Jukka,

On Mon, Sep 14, 2009 at 3:53 PM, Ken Krugler
<[email protected]> wrote:

Has this been discussed previously? Just curious, as I'd thought about changing my mbox parser to handle incremental calls to parse(), and save state in the context object being passed in. This would require a small
change to how I call the parser, as it would then be a loop (while
(is.available() > 0) { parser.parse(is, xxx); })


See TIKA-252 [1] for a related feature request.

Tika has been designed to deal with documents as single entities,
since there is no comprehensive composite document abstraction that we
could easily use. Trying to solve that problem you quickly end up with
questions about whether an inline image should be treated the same as
a file attachment, or whether things like <img> tags in HTML documents
should be resolved and the images included in the parse output. It's
not an unsolvable problem, but it's complex enough that so far we've
scoped the issue outside Tika.

OK, and I agree that trying to deal with embedded documents is a tough problem.

My particular issue is that I'm using Tika in Bixo as the general parser, via the AutoDetectParser.

Which means I need to be able to generically extract the title, author, last modified date, etc. from the metadata, without having to know any specific details about the XHTML output.

So one way to slice the above problem would be to only worry about correct handling of "container" document formats, where sub-docs are all peers and typically contain standard metadata such as title, author, last modified date, etc.


I'll look into the options you outline below, for current releases.

Longer term it would be great to not have to worry about handling two different cases - e.g. by being able to call


while (parser.parse(is, handler, metadata, context)) {
        <process the doc>
}

Though I think this would also require passing in metadata like RESOURCE_NAME_KEY, CONTENT_TYPE and CONTENT_ENCODING via context, to avoid having to worry about selectively clearing out metadata. But I think that would be better anyway, versus the co-mingling of input & output data in the metadata container.
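
To make the loop idea concrete, here is a minimal stand-alone sketch of a parse call that returns true once per sub-document and false at end of stream. It does not use Tika at all: the class IncrementalParseSketch, the parseNext() method, the "separatorSeen" and "resourceName" context keys, and the plain Map objects standing in for the metadata and context are all hypothetical, and the mbox handling is deliberately naive (a line starting with "From " begins a message).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IncrementalParseSketch {

    // Hypothetical context key: set when a call already consumed the next "From " line.
    static final String SEPARATOR_SEEN = "separatorSeen";

    /**
     * Parses one message and returns true, or returns false once the stream is exhausted.
     * The metadata map is output only; parser state and caller hints live in the context.
     */
    static boolean parseNext(BufferedReader in, StringBuilder body,
                             Map<String, String> metadata,
                             Map<String, Object> context) throws IOException {
        metadata.clear();
        body.setLength(0);

        // Skip ahead to a separator unless the previous call already read it.
        boolean atMessage = Boolean.TRUE.equals(context.remove(SEPARATOR_SEEN));
        String line;
        while (!atMessage && (line = in.readLine()) != null) {
            atMessage = line.startsWith("From ");
        }
        if (!atMessage) {
            return false;
        }

        // Header lines ("Name: value") run until the first empty line.
        while ((line = in.readLine()) != null && !line.isEmpty()) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                metadata.put(line.substring(0, colon).trim(),
                             line.substring(colon + 1).trim());
            }
        }

        // The body runs until the next separator, which is remembered for the next call.
        while ((line = in.readLine()) != null) {
            if (line.startsWith("From ")) {
                context.put(SEPARATOR_SEEN, Boolean.TRUE);
                break;
            }
            body.append(line).append('\n');
        }
        return true;
    }

    /** Drives the loop and collects the Subject header of every message. */
    static List<String> subjects(String mbox) {
        List<String> result = new ArrayList<>();
        Map<String, String> metadata = new HashMap<>();
        Map<String, Object> context = new HashMap<>();
        context.put("resourceName", "inbox.mbox"); // input hint, hypothetical key
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new StringReader(mbox))) {
            while (parseNext(in, body, metadata, context)) {
                result.add(metadata.get("Subject")); // <process the doc>
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

Because the input hints sit in the context and the metadata map is cleared on every call, the caller never has to decide which metadata entries to keep between sub-documents.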


Thanks,

-- Ken

However, within the current Tika design there are a couple of options
that you could pursue:

* As suggested in TIKA-252, you could extend the PackageParser to
embed per-component metadata into the produced XHTML output. Your
application would then need to detect the component boundaries and the
included metadata from the XHTML output.

* Alternatively you could inject a custom delegate parser that
intercepts each component stream and handles it separately without
producing output to be included in the top-level parse result.
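
As a rough illustration of the second option, the sketch below walks a zip container with the JDK's ZipInputStream and hands each component stream to a delegate, which keeps the per-component results to itself instead of writing them into a shared top-level output. It is stand-alone and does not show how a delegate is actually wired into Tika: the ComponentHandler interface and the parseContainer(), collect() and zipOf() helpers are hypothetical names made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class DelegateInterceptSketch {

    /** Hypothetical stand-in for a delegate parser that receives each component stream. */
    interface ComponentHandler {
        void handle(String name, InputStream component) throws IOException;
    }

    /** Walks a zip container and passes every file entry to the delegate. */
    static void parseContainer(InputStream container, ComponentHandler delegate)
            throws IOException {
        try (ZipInputStream zip = new ZipInputStream(container)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    // The delegate reads only the current entry; read() returns -1 at its end.
                    delegate.handle(entry.getName(), zip);
                }
            }
        }
    }

    /** Reads a stream to its end as UTF-8 text without closing it. */
    static String readText(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Example delegate: keeps each component as its own document, keyed by entry name. */
    static Map<String, String> collect(byte[] zipBytes) {
        Map<String, String> docs = new LinkedHashMap<>();
        try {
            parseContainer(new ByteArrayInputStream(zipBytes),
                    (name, in) -> docs.put(name, readText(in)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return docs;
    }

    /** Builds a small in-memory zip so the sketch can be run without any files. */
    static byte[] zipOf(Map<String, String> files) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, String> file : files.entrySet()) {
                zip.putNextEntry(new ZipEntry(file.getKey()));
                zip.write(file.getValue().getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Calling collect(zipOf(files)) returns one entry per component, so each sub-document can be processed on its own rather than being recovered from merged XHTML.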

[1] https://issues.apache.org/jira/browse/TIKA-252

BR,

Jukka Zitting


--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
