Jukka -
I committed a temporary fix that disabled the use of the
ParserPostProcessor, but that removed the functionality altogether, so if
you like we can discuss how to restore this functionality as an option.
The summary and outlinks information may need the text from multiple SAX
events, so their implementation may not be trivial, unless we accumulate all
parsed text in a single string, and then inspect that string (as you did in
ParserPostProcessor).
Therefore, since fulltext, summary, and outlinks all benefit from all text
being in a single string, why not create a single implementation of
ContentHandler that populates all of them? Then the full text string would
be in only one place in Tika.
This could also shield the user from some complexity -- this handler would
create the StringWriter itself. Also, memory would be saved because the
same string would be used by both Metadata.get("fullText") and
XyzContentHandler.getFullText().
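
To make the idea concrete, here is a rough sketch of such a handler. It is
only an illustration, not existing Tika code: the class name
FulltextContentHandler, the getSummary() and getOutlinks() methods, and the
URL pattern are all assumptions. It relies only on the JDK's SAX and regex
classes.

```java
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: accumulates all character data in one place and
// derives the summary and outlinks from that accumulated text.
class FulltextContentHandler extends DefaultHandler {

    // Deliberately simple URL pattern, for illustration only.
    private static final Pattern URL =
            Pattern.compile("https?://[^\\s<>\"]+");

    // The handler creates the writer itself, so callers do not have to.
    private final StringWriter writer = new StringWriter();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive split across several SAX events; append it all.
        writer.write(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        writer.write(ch, start, length);
    }

    // Returns everything seen so far as a single string.
    public String getFullText() {
        return writer.toString();
    }

    // Assumed summary rule: the first maxLength characters of the text.
    public String getSummary(int maxLength) {
        String text = getFullText().trim();
        return text.length() <= maxLength
                ? text
                : text.substring(0, maxLength);
    }

    // Scans the accumulated text, so a URL split across events is found.
    public List<String> getOutlinks() {
        List<String> links = new ArrayList<>();
        Matcher matcher = URL.matcher(getFullText());
        while (matcher.find()) {
            links.add(matcher.group());
        }
        return links;
    }
}
```

A caller would pass an instance wherever a ContentHandler is accepted and
call the getters once parsing is done. Because the URL match runs over the
accumulated text rather than over individual characters() calls, a link
that happens to be split across two SAX events is still picked up.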
If this idea sounds good, what would you suggest naming this handler?
FulltextContentHandler? DefaultContentHandler? Something else?
- Keith
Jukka Zitting wrote:
>
> Hi,
>
> On 10/18/07, Keith R. Bennett <[EMAIL PROTECTED]> wrote:
>> After removing those things, the ParserPostProcessor doesn't do anything.
>> Do you want to remove it altogether? We could also just not instantiate it
>> -- in TikaConfig, we would add the parser implementation without wrapping it
>> in a ParserPostProcessor.
>
> I'd be OK replacing it with SummaryContentHandler and
> OutLinksContentHandler, i.e. ContentHandler classes that would extract
> the summary text and any matched URIs from the text content. This way
> we'd still have all the functionality in Tika.
>
> BR,
>
> Jukka Zitting
>
>
--
View this message in context:
http://www.nabble.com/Fulltext-Metadata-Property--tf4643633.html#a13300082
Sent from the Apache Tika - Development mailing list archive at Nabble.com.