Hi,

On Wed, Sep 23, 2009 at 7:38 PM, Ken Krugler
<[email protected]> wrote:
> Longer term it would be great to not have to worry about handling two
> different cases - e.g. by being able to call
>
> while (parser.parse(is, handler, metadata, context)) {
>        <process the doc>
> }
>
> Though I think this would also require passing in metadata like
> RESOURCE_NAME_KEY, CONTENT_TYPE and CONTENT_ENCODING via context, to avoid
> having to worry about selectively clearing out metadata. But I think that
> would be better anyway, versus the co-mingling of input & output data in the
> metadata container.

The second option I gave in my earlier message is now a bit more
straightforward thanks to the parsing context argument recently
introduced in Tika trunk. You can now explicitly pass a delegate parser
that will be used to process any component documents:

    Parser myComponentParser = new Parser() {
        public void parse(...) throws ... {
            // Process the component document stream
            // in any way you like, optionally passing the
            // extracted text also to the top level parser
            // through the given ContentHandler
        }
    };

    Map<String, Object> context = new HashMap<String, Object>();
    context.put(Parser.class.getName(), myComponentParser);
    parser.parse(stream, handler, metadata, context);

In this example myComponentParser.parse() would get called once for
each component document inside a package.
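
To make the lookup mechanism concrete without depending on the Tika
API, here is a minimal self-contained sketch of the same pattern. All
names in it (DelegateSketch, ComponentParser, parsePackage) are
hypothetical stand-ins, not Tika classes, and the behaviour when no
delegate is registered is an assumption of the sketch rather than a
statement about what Tika does:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DelegateSketch {

    /** Hypothetical stand-in for a parser interface; this is NOT Tika's Parser. */
    interface ComponentParser {
        void parse(String document, List<String> handler, Map<String, Object> context);
    }

    /**
     * Hypothetical package parser. Each list element plays the role of
     * one component document inside a package.
     */
    static void parsePackage(List<String> components, List<String> handler,
                             Map<String, Object> context) {
        // Look up the delegate under the interface's class name,
        // mirroring the context.put(...) call in the message above.
        Object delegate = context.get(ComponentParser.class.getName());
        for (String component : components) {
            if (delegate instanceof ComponentParser) {
                // The delegate is invoked once per component document.
                ((ComponentParser) delegate).parse(component, handler, context);
            }
            // Assumption of this sketch: with no delegate registered,
            // component documents are simply skipped.
        }
    }

    public static void main(String[] args) {
        final List<String> seen = new ArrayList<String>();
        ComponentParser myComponentParser = new ComponentParser() {
            public void parse(String document, List<String> handler,
                              Map<String, Object> context) {
                seen.add(document);                 // process the component on its own
                handler.add("text of " + document); // optionally forward text upwards
            }
        };

        Map<String, Object> context = new HashMap<String, Object>();
        context.put(ComponentParser.class.getName(), myComponentParser);

        List<String> handler = new ArrayList<String>();
        parsePackage(Arrays.asList("a.txt", "b.txt"), handler, context);
        System.out.println(seen.size() + " component documents processed");
    }
}
```

The point of keying the map on the class name is that the caller and
the package parser only need to agree on the interface, not on any
extra constant.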

BR,

Jukka Zitting
