Hi,
On Wed, Sep 23, 2009 at 7:38 PM, Ken Krugler
<[email protected]> wrote:
> Longer term it would be great to not have to worry about handling two
> different cases - e.g. by being able to call
>
> while (parser.parse(is, handler, metadata, context)) {
> <process the doc>
> }
>
> Though I think this would also require passing in metadata like
> RESOURCE_NAME_KEY, CONTENT_TYPE and CONTENT_ENCODING via context, to avoid
> having to worry about selectively clearing out metadata. But I think that
> would be better anyway, versus the co-mingling of input & output data in the
> metadata container.
The second option I gave in my earlier message is now a bit more
straightforward, thanks to the parsing context option recently
introduced in Tika trunk. You can now explicitly pass a delegate parser
that will be used to process any component documents:

    Parser myComponentParser = new Parser() {
        public void parse(...) throws ... {
            // Process the component document stream
            // in any way you like, optionally passing the
            // extracted text also to the top level parser
            // through the given ContentHandler
        }
    };

    Map<String, Object> context = new HashMap<String, Object>();
    context.put(Parser.class.getName(), myComponentParser);
    parser.parse(stream, handler, metadata, context);

In this example myComponentParser.parse() would get called once for
each component document inside a package.
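
To make the "called once per component" behaviour concrete, here is a
minimal self-contained sketch of the same delegation pattern using only
the JDK. Note that it does not use Tika at all: ComponentParser,
parsePackage(), sampleZip() and run() are hypothetical names made up
for this illustration, and a zip archive stands in for the package
format. The only thing it has in common with the snippet above is the
idea of looking up the delegate in the context map by class name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class DelegateSketch {

    /** Hypothetical stand-in for a parser interface; this is NOT Tika's Parser. */
    interface ComponentParser {
        void parse(InputStream stream, String name, Map<String, Object> context)
                throws IOException;
    }

    /** Hypothetical package parser: hands each zip entry to the delegate found in the context. */
    static void parsePackage(InputStream stream, Map<String, Object> context)
            throws IOException {
        Object delegate = context.get(ComponentParser.class.getName());
        try (ZipInputStream zip = new ZipInputStream(stream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory() && delegate instanceof ComponentParser) {
                    // Reads on the zip stream stop at the end of the current entry
                    ((ComponentParser) delegate).parse(zip, entry.getName(), context);
                }
            }
        }
    }

    /** Builds a small in-memory package with two component documents. */
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("a.txt"));
            zip.write("alpha".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("b.txt"));
            zip.write("beta".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    /** Registers a delegate in the context and records what it was called with. */
    static List<String> run() throws IOException {
        List<String> seen = new ArrayList<>();
        ComponentParser myComponentParser = (stream, name, ctx) -> {
            ByteArrayOutputStream text = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            int n;
            while ((n = stream.read(buffer)) != -1) {
                text.write(buffer, 0, n);
            }
            seen.add(name + "=" + text.toString("UTF-8"));
        };
        Map<String, Object> context = new HashMap<>();
        context.put(ComponentParser.class.getName(), myComponentParser);
        parsePackage(new ByteArrayInputStream(sampleZip()), context);
        return seen;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(run()); // one element per component document
    }
}
```

The delegate never closes the stream it is given, so the package parser
can move on to the next entry after each call.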
BR,
Jukka Zitting