Hi Jukka,

On Mon, Sep 14, 2009 at 3:53 PM, Ken Krugler
<[email protected]> wrote:
Has this been discussed previously? Just curious, as I'd thought about
changing my mbox parser to handle incremental calls to parse(), and
save state in the context object being passed in. This would require a
small change to how I call the parser, as it would then be a loop
(while (is.available() > 0) { parser.parse(is, xxx); })

See TIKA-252 [1] for a related feature request.

Tika has been designed to deal with documents as single entities,
since there is no comprehensive composite document abstraction that we
could easily use. Trying to solve that problem you quickly end up with
questions about whether an inline image should be treated the same as
a file attachment, or whether things like <img> tags in HTML documents
should be resolved and the images included in the parse output. It's
not an unsolvable problem, but it's complex enough that so far we've
scoped the issue outside Tika.

OK, and I agree that trying to deal with embedded documents is a tough problem.

My particular issue is that I'm using Tika in Bixo as the general parser, via the AutoDetectParser.

Which means I need to be able to generically extract the title, author, last modified date, etc. from the metadata, without having to know any specific details about the XHTML output.

So one way to slice the above problem would be to only worry about correct handling of "container" document formats, where sub-docs are all peers and typically contain standard metadata such as title, author, last modified date, etc.

I'll look into the options you outline below, for current releases.

Longer term it would be great to not have to worry about handling two different cases - e.g. by being able to call

while (parser.parse(is, handler, metadata, context)) {
        <process the doc>
}

Though I think this would also require passing in metadata like RESOURCE_NAME_KEY, CONTENT_TYPE and CONTENT_ENCODING via context, to avoid having to worry about selectively clearing out metadata. But I think that would be better anyway, versus the co-mingling of input & output data in the metadata container.
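
To make the idea concrete, here is a minimal, self-contained sketch of such a loop. None of it is Tika API: IncrementalParser, ToyMboxParser, and MboxDemo are hypothetical names invented for illustration, the "title" and "author" keys are placeholders, and plain maps stand in for the context and metadata objects. The toy parser keeps its state in a field rather than in the context object, purely to keep the sketch short. Input hints go in through a separate read-only map, and each sub-document gets a fresh output map, so nothing has to be selectively cleared between calls.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface (not part of Tika): each call parses one sub-document.
interface IncrementalParser {
    // Returns false when the input is exhausted and no document was parsed.
    // `hints` is input only; `out` receives the metadata of this sub-document.
    boolean parseNext(BufferedReader in, Map<String, String> hints,
                      Map<String, String> out) throws IOException;
}

// Toy mbox-style parser: every message begins with a "From " separator line.
class ToyMboxParser implements IncrementalParser {
    // True when the previous call already consumed the next separator line.
    private boolean separatorConsumed = false;

    public boolean parseNext(BufferedReader in, Map<String, String> hints,
                             Map<String, String> out) throws IOException {
        String line;
        if (!separatorConsumed) {
            // Skip ahead to the separator that starts the next message.
            while ((line = in.readLine()) != null && !line.startsWith("From ")) {
                // ignore anything before the separator
            }
            if (line == null) {
                return false; // no more messages
            }
        }
        separatorConsumed = false;
        boolean inHeaders = true;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("From ")) {
                separatorConsumed = true; // remember it for the next call
                break;
            }
            if (inHeaders) {
                if (line.isEmpty()) {
                    inHeaders = false; // blank line ends the header block
                } else if (line.startsWith("Subject: ")) {
                    out.put("title", line.substring("Subject: ".length()));
                } else if (line.startsWith("From: ")) {
                    out.put("author", line.substring("From: ".length()));
                }
            }
        }
        return true;
    }
}

class MboxDemo {
    // Runs the parse loop and returns one metadata map per sub-document.
    static List<Map<String, String>> parseAll(String mbox, Map<String, String> hints) {
        BufferedReader in = new BufferedReader(new StringReader(mbox));
        IncrementalParser parser = new ToyMboxParser();
        List<Map<String, String>> docs = new ArrayList<>();
        try {
            while (true) {
                Map<String, String> metadata = new HashMap<>(); // fresh output per document
                if (!parser.parseNext(in, hints, metadata)) {
                    break;
                }
                docs.add(metadata); // <process the doc>
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return docs;
    }
}
```

The caller has only one code path in this sketch: a single-document format would simply return true once and then false.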

Thanks,

-- Ken

However, within the current Tika design there are a couple of options
you could pursue:

* As suggested in TIKA-252, you could extend the PackageParser to
embed per-component metadata into the produced XHTML output. Your
application would then need to detect the component boundaries and the
included metadata from the XHTML output.

* Alternatively you could inject a custom delegate parser that
intercepts each component stream and handles it separately without
producing output to be included in the top-level parse result.
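
The second option can be sketched roughly as follows. This is a self-contained illustration using only the JDK's zip classes, not Tika code: ComponentHandler, ToyPackageParser, CollectingHandler, and ZipDemo are hypothetical stand-ins, and how a delegate is actually injected into the real PackageParser is not shown here. The point is only the shape of the approach: the container parser walks the components and hands each component stream to a delegate, which records its results on the side instead of contributing to the top-level output.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical stand-in for a delegate parser (not a Tika type).
interface ComponentHandler {
    // Receives one component; must not close the stream it is given.
    void handle(String name, InputStream component) throws IOException;
}

// Stand-in for a package parser: walks zip entries and delegates each one.
class ToyPackageParser {
    static void parse(InputStream zip, ComponentHandler delegate) {
        try (ZipInputStream zin = new ZipInputStream(zip)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    // Reads from zin stop at the end of the current entry.
                    delegate.handle(entry.getName(), zin);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Delegate that handles every component separately and keeps its own records.
class CollectingHandler implements ComponentHandler {
    final List<String> records = new ArrayList<>();

    public void handle(String name, InputStream component) throws IOException {
        long size = 0;
        byte[] buffer = new byte[4096];
        int n;
        while ((n = component.read(buffer)) != -1) {
            size += n; // a real delegate would parse the component here
        }
        records.add(name + ":" + size);
    }
}

class ZipDemo {
    // Builds an in-memory zip from {name, content} pairs, for demonstration.
    static byte[] zipOf(String[][] entries) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            for (String[] e : entries) {
                zout.putNextEntry(new ZipEntry(e[0]));
                zout.write(e[1].getBytes(StandardCharsets.UTF_8));
                zout.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }
}
```

In this sketch the application reads per-component results from the delegate's own list after the top-level parse returns, so it never has to locate component boundaries in the XHTML output.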

[1] https://issues.apache.org/jira/browse/TIKA-252

BR,

Jukka Zitting

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
