Hi,

On 10/19/07, Keith R. Bennett <[EMAIL PROTECTED]> wrote:
> The summary and outlinks information may need the text from multiple SAX
> events, so their implementation may not be trivial, unless we accumulate all
> parsed text in a single string, and then inspect that string (as you did in
> ParserPostProcessor).
>
> Therefore, since fulltext, summary, and outlinks all benefit from all text
> being in a single string, why not create a single implementation of
> ContentHandler that populates all of them? Then the full text string would
> be in only one place in Tika.

The summary and outLinks implementation based on SAX events may be more complex but it's still doable, so I'd rather focus on making that work. The more places we have in Tika that read the content into a single string, the harder it will be to support really large documents.

BR,

Jukka Zitting
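
To illustrate the SAX-event approach discussed above, here is a minimal sketch using only the JDK's SAX classes. It is not Tika code: the class name `StreamingSummaryHandler`, the `parse` helper, and the `maxSummaryLength` parameter are all hypothetical, and it assumes outlinks appear as `href` attributes on XHTML `a` elements rather than as bare URLs in the text. The point is only that memory use stays bounded by the summary length and the number of links, not by the document size.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not part of Tika): gathers outlinks and a bounded
// summary directly from SAX events, without buffering the full text.
class StreamingSummaryHandler extends DefaultHandler {
    private final int maxSummaryLength;
    private final StringBuilder summary = new StringBuilder();
    private final List<String> outlinks = new ArrayList<>();

    StreamingSummaryHandler(int maxSummaryLength) {
        this.maxSummaryLength = maxSummaryLength;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // Assumption: a link target arrives whole, as the href attribute
        // of a single startElement event for an <a> element.
        if ("a".equalsIgnoreCase(qName)) {
            String href = atts.getValue("href");
            if (href != null) {
                outlinks.add(href);
            }
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may be split across several events; append only as much
        // as the summary still needs and drop the rest.
        int remaining = maxSummaryLength - summary.length();
        if (remaining > 0) {
            summary.append(ch, start, Math.min(length, remaining));
        }
    }

    String getSummary() {
        return summary.toString();
    }

    List<String> getOutlinks() {
        return outlinks;
    }

    // Convenience helper for the sketch: parse a well-formed XHTML string.
    static StreamingSummaryHandler parse(String xhtml, int maxSummaryLength)
            throws Exception {
        StreamingSummaryHandler handler =
                new StreamingSummaryHandler(maxSummaryLength);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
    }
}
```

Calling `StreamingSummaryHandler.parse(xhtml, 200)` and then reading `getSummary()` and `getOutlinks()` would give both results from a single pass. Detecting URLs that appear in plain text and straddle two `characters` events would need extra carry-over state, which is the added complexity mentioned above.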
