On 5/3/07, Jukka Zitting <[EMAIL PROTECTED]> wrote:
... * processing pipeline: There was a quick idea on possibly organizing
the Tika framework as a pipeline of content detection and extraction
components....
I thought a bit more about that, and maybe a dual-channel pipeline
structure, with generalized filters, might be interesting (ASCII art
ahead):
       +------------+        +------------+        +------------+
-------+            +--------+            +--------+            +-------
-------+     F1     +--------+     F2     +--------+     F3     +-------
       |            |        |            |        |            |
-------+            +--------+            +--------+            |
       +------------+        +------------+        +------------+

----------  Extracted content, events, metadata, filter options
----------

----------  Binary data

By dual-channel I mean that each filter outputs extracted content,
metadata and events (language changes, etc.) on one "channel", and *can*
output the binary stream as well, on a (conceptually) separate
channel. The last filter in the chain usually only outputs the first
channel.
I think this might be very useful, for example to chain filters which
have different ways of deciding how to process the input stream, and
getting the aggregated metadata which describes their "decisions"
after they have all examined the input.
Or to insert a filter which only cares about detecting the input
encoding, but doesn't know much about content extraction.
In practice, the dual-channel could be implemented by simply adding a
bytes(...) method to the standard ContentHandler interface - but how
we do it is not too important at this design stage.
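
To make the bytes(...) idea a bit more concrete, here is a minimal sketch. It assumes a hypothetical DualChannelHandler interface whose bytes(byte[], int, int) signature is modelled on the existing characters(char[], int, int) callback; neither the interface name nor that signature comes from SAX or Tika, and CollectingHandler is only there to show both channels arriving at one receiver.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical extension of the SAX ContentHandler. The bytes(...) signature
// is an assumption, modelled on characters(char[], int, int).
interface DualChannelHandler extends ContentHandler {
    void bytes(byte[] data, int start, int length) throws SAXException;
}

// Collects both channels so we can inspect what an upstream filter emitted.
class CollectingHandler extends DefaultHandler implements DualChannelHandler {
    private final StringBuilder text = new StringBuilder();
    private final ByteArrayOutputStream binary = new ByteArrayOutputStream();

    // First channel: extracted character content.
    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // Second channel: the relayed binary stream.
    @Override
    public void bytes(byte[] data, int start, int length) {
        binary.write(data, start, length);
    }

    String text() { return text.toString(); }
    byte[] binary() { return binary.toByteArray(); }
}

class DualChannelDemo {
    public static void main(String[] args) {
        CollectingHandler handler = new CollectingHandler();

        // A filter relays the raw input on the binary channel...
        byte[] raw = "hello".getBytes(StandardCharsets.UTF_8);
        handler.bytes(raw, 0, raw.length);

        // ...and emits whatever it extracted on the content channel.
        char[] extracted = "hello".toCharArray();
        handler.characters(extracted, 0, extracted.length);

        System.out.println(handler.text() + " / " + handler.binary().length + " bytes");
    }
}
```

A handler that does not care about the binary channel could simply implement bytes(...) as a no-op, which is how the last filter in the chain would typically be consumed.
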
By "generalized filters" I mean that the interface to all filters is
the same: the Tika pipeline doesn't necessarily impose a two-phase
process, it just chains a series of filters which collaborate to
analyze the input stream.
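
As a rough illustration of such a chain, the sketch below gives every filter the same single-method interface and passes one shared object down the line. All names (Flow, Filter, BomEncodingFilter, PlainTextFilter, Pipeline) and the metadata keys are hypothetical, and the UTF-8 BOM check merely stands in for real encoding detection. The first filter only records its decision and relays the bytes; the second uses that decision to extract text and, being last, drops the binary channel.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Everything that travels down the pipeline (hypothetical type).
class Flow {
    final Map<String, String> metadata = new LinkedHashMap<>(); // first channel
    final StringBuilder content = new StringBuilder();          // first channel
    byte[] binary;                                              // second channel, may be null

    Flow(byte[] binary) { this.binary = binary; }
}

// The one interface shared by all filters.
interface Filter {
    void apply(Flow flow);
}

// Only cares about encoding: records its decision, relays the bytes untouched.
class BomEncodingFilter implements Filter {
    @Override
    public void apply(Flow flow) {
        byte[] b = flow.binary;
        boolean utf8Bom = b != null && b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF;
        flow.metadata.put("encoding", utf8Bom ? "UTF-8" : "unknown");
    }
}

// Extracts text if an earlier filter identified the encoding. As the last
// filter in the chain, it does not relay the binary channel.
class PlainTextFilter implements Filter {
    @Override
    public void apply(Flow flow) {
        if (flow.binary != null && "UTF-8".equals(flow.metadata.get("encoding"))) {
            // Skip the 3-byte BOM that BomEncodingFilter detected.
            flow.content.append(new String(
                    flow.binary, 3, flow.binary.length - 3, StandardCharsets.UTF_8));
            flow.metadata.put("extractor", "plain-text");
        }
        flow.binary = null;
    }
}

class Pipeline {
    // Runs the filters in order over a single shared Flow.
    static Flow run(List<Filter> filters, byte[] input) {
        Flow flow = new Flow(input);
        for (Filter f : filters) {
            f.apply(flow);
        }
        return flow;
    }
}

class PipelineDemo {
    public static void main(String[] args) {
        byte[] input = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        Flow out = Pipeline.run(
                List.of(new BomEncodingFilter(), new PlainTextFilter()), input);
        System.out.println(out.metadata + " " + out.content);
    }
}
```

Because the metadata map is shared, the caller ends up with the aggregated "decisions" of every filter once the chain has run, and filters can be reordered or inserted without changing the interface.
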
I haven't done many reality checks on this yet, but I think allowing
the binary stream to be relayed to multiple filters in the chain could
help make things more modular, while adding little complexity.
The main idea is to keep as much information as possible available as
far down the pipeline as possible, to make filters more independent of
each other.
-Bertrand