Hi Jukka,
On Mon, Sep 14, 2009 at 3:53 PM, Ken Krugler wrote:
Has this been discussed previously? Just curious, as I'd thought about
changing my mbox parser to handle incremental calls to parse(), and save
state in the context object being passed in. This would require a small
change to how I call the parser, as it would then be a loop (while
(is.available() > 0) { parser.parse(is, xxx); })
See TIKA-252 [1] for a related feature request.
Tika has been designed to deal with documents as single entities,
since there is no comprehensive composite document abstraction that we
could easily use. Trying to solve that problem you quickly end up with
questions about whether an inline image should be treated the same as
a file attachment, or whether things like <img> tags in HTML documents
should be resolved and the images included in the parse output. It's
not an unsolvable problem, but it's complex enough that so far we've
scoped the issue outside Tika.
OK, and I agree that trying to deal with embedded documents is a tough
problem.
My particular issue is that I'm using Tika in Bixo as the general
parser, via the AutoDetectParser.
Which means I need to be able to generically extract the title,
author, last modified date, etc. from the metadata, without having to
know any specific details about the XHTML output.
So one way to slice the above problem would be to only worry about
correct handling of "container" document formats, where sub-docs are
all peers and typically contain standard metadata such as title,
author, last modified date, etc.
I'll look into the options you outline below, for current releases.
Longer term it would be great to not have to worry about handling two
different cases - e.g. by being able to call
while (parser.parse(is, handler, metadata, context)) {
    // process the doc
}
Though I think this would also require passing in metadata like
RESOURCE_NAME_KEY, CONTENT_TYPE and CONTENT_ENCODING via context, to
avoid having to worry about selectively clearing out metadata. But I
think that would be better anyway, versus the co-mingling of input &
output data in the metadata container.
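A rough sketch of what such a loop could look like, using only the JDK. Nothing here is Tika API: parseNext, ParsedDoc, the "resourceName" context key and STATE_KEY are names invented for illustration, and the mbox handling is deliberately simplified (any line starting with "From " is treated as a message separator). Input hints and parser state travel in the context map, so each call can start from an empty output metadata map.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names exist in Tika.
class IncrementalMboxSketch {

    // Made-up context key: set when the previous call already consumed a separator line.
    static final String STATE_KEY = "mbox.pendingSeparator";

    // One parsed message: output-only metadata plus the body text.
    static final class ParsedDoc {
        final Map<String, String> metadata = new HashMap<>();
        final StringBuilder body = new StringBuilder();
    }

    // Parses the next message into doc; returns false once the stream holds no more messages.
    static boolean parseNext(BufferedReader in, ParsedDoc doc, Map<String, Object> context)
            throws IOException {
        boolean atMessage = Boolean.TRUE.equals(context.remove(STATE_KEY));
        String line;
        // Skip ahead to the next separator unless the previous call already read it.
        while (!atMessage && (line = in.readLine()) != null) {
            atMessage = line.startsWith("From ");
        }
        if (!atMessage) {
            return false;
        }
        boolean inHeaders = true;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("From ")) {
                context.put(STATE_KEY, Boolean.TRUE); // remember it for the next call
                break;
            }
            if (!inHeaders) {
                doc.body.append(line).append('\n');
            } else if (line.isEmpty()) {
                inHeaders = false; // a blank line ends the header block
            } else {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    doc.metadata.put(line.substring(0, colon).trim(),
                                     line.substring(colon + 1).trim());
                }
            }
        }
        return true;
    }

    // Caller-side loop: the same code path works for one message or many.
    static List<ParsedDoc> parseAll(String mbox, String resourceName) {
        Map<String, Object> context = new HashMap<>();
        context.put("resourceName", resourceName); // input hint, kept out of the output metadata
        List<ParsedDoc> docs = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(mbox))) {
            ParsedDoc doc = new ParsedDoc();
            while (parseNext(in, doc, context)) {
                docs.add(doc);
                doc = new ParsedDoc(); // fresh metadata, nothing to clear selectively
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return docs;
    }

    public static void main(String[] args) {
        String mbox = "From a@x Mon Sep 14 2009\nSubject: first\n\nhello\n"
                    + "From b@y Mon Sep 14 2009\nSubject: second\n\nworld\n";
        for (ParsedDoc doc : parseAll(mbox, "sample.mbox")) {
            System.out.println(doc.metadata.get("Subject") + " -> " + doc.body.toString().trim());
        }
    }
}
```

Returning a boolean from the parse call also sidesteps the is.available() test, which only estimates how many bytes can be read without blocking and so is not a reliable end-of-stream check.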
Thanks,
-- Ken
However, within the current Tika design there are a couple of options
you could try:
* As suggested in TIKA-252, you could extend the PackageParser to
embed per-component metadata into the produced XHTML output. Your
application would then need to detect the component boundaries and the
included metadata from the XHTML output.
* Alternatively you could inject a custom delegate parser that
intercepts each component stream and handles it separately without
producing output to be included in the top-level parse result.
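The second option could be sketched roughly as follows, with java.util.zip standing in for the package format. ComponentDelegate, walkPackage, summarize and sampleZip are hypothetical names rather than Tika classes, and how a delegate would actually be wired into the PackageParser is not shown. The point is only the shape: the package walker hands each component stream to the delegate, which deals with it on its own and contributes nothing to the top-level output.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch: plain JDK types standing in for a package parser and its delegate.
class DelegateSketch {

    // Stand-in for a delegate parser; it must not close the stream it is given.
    interface ComponentDelegate {
        void handle(String name, InputStream component) throws IOException;
    }

    // Walks the package and passes every non-directory entry to the delegate.
    static void walkPackage(InputStream in, ComponentDelegate delegate) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    // Reads on zip return -1 at the end of the current entry,
                    // so the delegate only ever sees one component.
                    delegate.handle(entry.getName(), zip);
                }
            }
        }
    }

    // Example delegate: records each component separately instead of merging the text.
    static List<String> summarize(byte[] zipBytes) {
        List<String> seen = new ArrayList<>();
        try {
            walkPackage(new ByteArrayInputStream(zipBytes), (name, component) -> {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = component.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                seen.add(name + ": " + new String(buf.toByteArray(), StandardCharsets.UTF_8));
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return seen;
    }

    // Builds a small in-memory zip so the sketch runs without any files on disk.
    static byte[] sampleZip() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            String[][] files = {{"a.txt", "hello"}, {"b.txt", "world"}};
            for (String[] f : files) {
                zip.putNextEntry(new ZipEntry(f[0]));
                zip.write(f[1].getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        summarize(sampleZip()).forEach(System.out::println);
    }
}
```

In an application like the one described above, the delegate would be the place to pull per-component title, author and date and emit one record per component.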
[1] https://issues.apache.org/jira/browse/TIKA-252
BR,
Jukka Zitting
--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378