Re: Multiple documents per input stream

Ken Krugler Sun, 27 Sep 2009 06:01:33 -0700

Hi Jukka,

On Wed, Sep 23, 2009 at 7:38 PM, Ken Krugler wrote:

Longer term it would be great to not have to worry about handling two
different cases - e.g. by being able to call
while (parser.parse(is, handler, metadata, context)) {
       <process the doc>
}

Though I think this would also require passing in metadata like
RESOURCE_NAME_KEY, CONTENT_TYPE and CONTENT_ENCODING via context, to
avoid having to worry about selectively clearing out metadata. But I
think that would be better anyway, versus the co-mingling of input &
output data in the metadata container.


The second option I gave in my earlier message is now a bit more
straightforward with the parsing context option introduced recently in
Tika trunk. You can now explicitly pass a delegate parser to be used
to process any component documents:

   Parser myComponentParser = new Parser() {
       public void parse(...) throws ... {
           // Process the component document stream
           // in any way you like, optionally passing the
           // extracted text also to the top level parser
           // through the given ContentHandler
       }
   };

   Map<String, Object> context = new HashMap<String, Object>();
   context.put(Parser.class.getName(), myComponentParser);
   parser.parse(stream, handler, metadata, context);

In this example myComponentParser.parse() would get called once for
each component document inside a package.
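
A self-contained sketch of that per-component callback, for illustration
only. It does not use the real Tika classes: ToyParser and
ToyPackageParser are hypothetical stand-ins, the parse() signature is
reduced to a stream plus the context map, and a zip archive is assumed
as the package format. The only part taken from the example above is the
idea of looking up the delegate in the context under its class name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical stand-in for a parser interface (not the Tika Parser API).
interface ToyParser {
    void parse(InputStream stream, Map<String, Object> context) throws IOException;
}

// Toy package parser: walks a zip stream and hands every entry to the
// delegate registered in the context, if there is one.
class ToyPackageParser implements ToyParser {
    public void parse(InputStream stream, Map<String, Object> context) throws IOException {
        ToyParser delegate = (ToyParser) context.get(ToyParser.class.getName());
        ZipInputStream zip = new ZipInputStream(stream);
        while (zip.getNextEntry() != null) {
            if (delegate != null) {
                // Reads on the zip stream stop at the end of the current entry.
                delegate.parse(zip, context);
            }
        }
    }
}

class DelegateDemo {
    // Builds an in-memory zip with one text entry per argument.
    static byte[] zipOf(String... contents) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ZipOutputStream zip = new ZipOutputStream(bytes);
        for (int i = 0; i < contents.length; i++) {
            zip.putNextEntry(new ZipEntry("doc" + i + ".txt"));
            zip.write(contents[i].getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        zip.close();
        return bytes.toByteArray();
    }

    // Parses the archive and returns the text the delegate saw, one item per component.
    static List<String> collect(byte[] archive) throws IOException {
        final List<String> seen = new ArrayList<String>();
        ToyParser delegate = new ToyParser() {
            public void parse(InputStream stream, Map<String, Object> context) throws IOException {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[1024];
                int n;
                while ((n = stream.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                seen.add(new String(out.toByteArray(), StandardCharsets.UTF_8));
            }
        };
        Map<String, Object> context = new HashMap<String, Object>();
        context.put(ToyParser.class.getName(), delegate);
        new ToyPackageParser().parse(new ByteArrayInputStream(archive), context);
        return seen;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(collect(zipOf("first", "second")));
    }
}
```

Note that the delegate only reads from the stream it is given and never
closes it, since the package parser still needs it for the next entry.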


OK, thanks.

Though I don't think this would address the fundamental question of
how to generically extract metadata like the title from compound
documents, right?

You'd still have to know something about how the delegate parser
embeds this information in the actual XHTML output.
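
To make the concern concrete, here is a rough sketch of what a caller
would have to do today. It assumes, and this is exactly the kind of
knowledge I mean, that the delegate emits the title as a <title>
element in its XHTML SAX events. TitleCapturingHandler is a made-up
helper, not anything in Tika, and the demo feeds it a hand-written
XHTML string through the JDK SAX parser instead of a real parser's
output.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: remembers the text of the first <title> element it sees.
class TitleCapturingHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private boolean inTitle = false;
    private String title = null;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Only the first title is captured; later ones are ignored.
        if (title == null && isTitle(localName, qName)) {
            inTitle = true;
            buffer.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inTitle) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (inTitle && isTitle(localName, qName)) {
            title = buffer.toString().trim();
            inTitle = false;
        }
    }

    private static boolean isTitle(String localName, String qName) {
        return "title".equals(localName) || "title".equals(qName);
    }

    // Returns null when no <title> element was seen.
    String getTitle() {
        return title;
    }
}

class TitleDemo {
    // Runs the handler over an XHTML string and returns the captured title.
    static String titleOf(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        TitleCapturingHandler handler = new TitleCapturingHandler();
        factory.newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.getTitle();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Quarterly report</title></head>"
                + "<body><p>Body text</p></body></html>";
        System.out.println(titleOf(xhtml));
    }
}
```

If a delegate put the title somewhere else (a meta element, a heading,
or only in the Metadata object), this handler would silently return
null, which is why I'd prefer a generic mechanism.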


Thanks,

-- Ken


--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
