Hi Jukka,
On Wed, Sep 23, 2009 at 7:38 PM, Ken Krugler
<[email protected]> wrote:
Longer term it would be great to not have to worry about handling two
different cases - e.g. by being able to call
    while (parser.parse(is, handler, metadata, context)) {
        <process the doc>
    }
Though I think this would also require passing in metadata like
RESOURCE_NAME_KEY, CONTENT_TYPE and CONTENT_ENCODING via context, to
avoid having to worry about selectively clearing out metadata. But I
think that would be better anyway, versus the co-mingling of input &
output data in the metadata container.
The second option I gave in my earlier message is now a bit more
straightforward with the parsing context option introduced recently in
Tika trunk. You can now explicitly pass a delegate parser to be used
to process any component documents:
    Parser myComponentParser = new Parser() {
        public void parse(...) throws ... {
            // Process the component document stream
            // in any way you like, optionally passing the
            // extracted text also to the top level parser
            // through the given ContentHandler
        }
    };

    Map<String, Object> context = new HashMap<String, Object>();
    context.put(Parser.class.getName(), myComponentParser);
    parser.parse(stream, handler, metadata, context);
In this example myComponentParser.parse() would get called once for
each component document inside a package.
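
To make the call pattern concrete, here is a minimal sketch that runs
without Tika on the classpath. Everything in it is a hypothetical
stand-in (SketchParser, PackageParser, and the '|'-separated "package"
string are assumptions made for illustration, not Tika's API). The only
idea taken from the example above is that the package parser looks up
its delegate in the context map under the parser class name and calls
it once per component document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DelegateSketch {

    // Hypothetical stand-in for a parser interface; NOT Tika's real Parser.
    interface SketchParser {
        void parse(String document, List<String> output,
                   Map<String, Object> context);
    }

    // Hypothetical package parser: treats the input as component documents
    // separated by '|' and hands each one to the delegate found in the
    // context map. Without a delegate it does nothing.
    static class PackageParser implements SketchParser {
        public void parse(String document, List<String> output,
                          Map<String, Object> context) {
            Object delegate = context.get(SketchParser.class.getName());
            if (!(delegate instanceof SketchParser)) {
                return;
            }
            for (String component : document.split("\\|")) {
                ((SketchParser) delegate).parse(component, output, context);
            }
        }
    }

    // Registers a delegate, parses a three-part "package", and returns the
    // component documents the delegate was called with, in call order.
    static List<String> run() {
        final List<String> seen = new ArrayList<String>();

        SketchParser myComponentParser = new SketchParser() {
            public void parse(String document, List<String> output,
                              Map<String, Object> context) {
                // Record the component and pass its text on to the
                // shared output, as a delegate optionally might.
                seen.add(document);
                output.add("text of " + document);
            }
        };

        Map<String, Object> context = new HashMap<String, Object>();
        context.put(SketchParser.class.getName(), myComponentParser);

        List<String> output = new ArrayList<String>();
        new PackageParser().parse("first|second|third", output, context);
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this sketch the delegate sees each component in isolation, which is
also where any per-component bookkeeping would have to live.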
OK, thanks.
Though I don't think this would address the fundamental question of
how to generically extract metadata like the title from compound
documents, right?
You'd still have to know something about how the delegate parser
embeds this information in the actual XHTML output.
Thanks,
-- Ken
--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378