[RAT] Pipelines...
Essentially, Rat is simple.

A source (perhaps a file system or a compressed archive) is walked, producing documents. Each document (perhaps a file in a file system, or a resource in an archive) flows through a pipeline: a series of processing steps, each enriching it with various meta-data. An end point collates the data.

It seems to me that the current code fails to express this ...

At the moment, IDocumentAnalyser[1] is implemented by most steps in the pipeline (and other stuff too), wired together in a potentially flexible fashion. This now seems over-engineered to me.

I think a concrete Pipeline would be more obvious, with controlled extension points at each step of the processing.

Opinions...?
Objections...?

Robert

[1] http://svn.apache.org/viewvc/creadur/rat/trunk/apache-rat-core/src/main/java/org/apache/rat/document/IDocumentAnalyser.java?view=markup
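To make the proposal concrete, here is a minimal sketch of what such a pipeline might look like. It is not taken from the Rat code base: the names PipelineSketch, Document, Step, EndPoint and Pipeline are all hypothetical, the source is just an in-memory list, and the two steps are toy stand-ins for real analysis. It only illustrates the shape described above: walk a source, pass each document through a fixed series of steps that add meta-data, then hand it to an end point that collates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; none of these types exist in Rat.
public class PipelineSketch {

    /** A document produced by walking a source, plus the meta-data gathered so far. */
    static final class Document {
        final String name;
        final String content;
        final Map<String, String> metaData = new LinkedHashMap<>();

        Document(String name, String content) {
            this.name = name;
            this.content = content;
        }
    }

    /** One controlled extension point: a step that enriches a document with meta-data. */
    interface Step {
        void process(Document document);
    }

    /** The end point that collates the enriched documents. */
    interface EndPoint {
        void collate(Document document);
    }

    /** A concrete pipeline: a fixed, ordered list of steps followed by one end point. */
    static final class Pipeline {
        private final List<Step> steps;
        private final EndPoint endPoint;

        Pipeline(List<Step> steps, EndPoint endPoint) {
            this.steps = new ArrayList<>(steps);
            this.endPoint = endPoint;
        }

        /** Walks the source; each document passes through every step, then reaches the end point. */
        void run(Iterable<Document> source) {
            for (Document document : source) {
                for (Step step : steps) {
                    step.process(document);
                }
                endPoint.collate(document);
            }
        }
    }

    /** Runs two toy steps over an in-memory source and returns one report line per document. */
    public static List<String> demo() {
        List<Document> source = Arrays.asList(
                new Document("A.java", "Licensed to the Apache Software Foundation"),
                new Document("notes.txt", "todo"));

        // Toy steps: guess a document type from the name, and a license from the content.
        Step typeStep = d -> d.metaData.put("type", d.name.endsWith(".java") ? "source" : "other");
        Step licenseStep = d -> d.metaData.put("license",
                d.content.contains("Apache Software Foundation") ? "ASF" : "unknown");

        List<String> report = new ArrayList<>();
        EndPoint collector = d -> report.add(d.name + " " + d.metaData);

        new Pipeline(Arrays.asList(typeStep, licenseStep), collector).run(source);
        return report;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

The point of the sketch is that the order of processing lives in one place (Pipeline.run), and the only things a contributor can vary are the steps and the end point, rather than every component implementing the same general-purpose interface.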
Re: [RAT] Pipelines...
On 8/5/2013 10:11 AM, Robert Burrell Donkin wrote:
> Essentially, Rat is simple.
>
> A source (perhaps a file system or a compressed archive) is walked, producing
> documents. Each document (perhaps a file in a file system, or a resource in
> an archive) flows through a pipeline - a series of processing steps, enriching
> with various meta-data. An end point collates the data.
>
> It seems to me that the current code fails to express this
>
> ...
>
> At the moment, IDocumentAnalyser[1] is implemented by most steps in the
> pipeline (and other stuff too), wired together in a potentially flexible
> fashion. This now seems over-engineered to me.
>
> I think a concrete Pipeline would be more obvious, with controlled extension
> points at each step of the processing.
>
> Opinions...?
> Objections...?

Hi,

It may be overkill ( :-) ), but the Apache UIMA project is built around this very idea: components are assembled into a pipeline, and a shared object (called the CAS - Common Analysis Structure/System) is passed to each "annotator" component, which may add arbitrary metadata to the CAS.

For an intro, see the getting started parts of the documentation at uima.apache.org.

-Marshall Schor

>
> Robert
> [1]
> http://svn.apache.org/viewvc/creadur/rat/trunk/apache-rat-core/src/main/java/org/apache/rat/document/IDocumentAnalyser.java?view=markup
>
Re: [RAT] Pipelines...
On 08/05/13 15:47, Marshall Schor wrote:
> It may be overkill ( :-) ), however, the Apache UIMA project has this very
> idea of enabling assembly of components in a pipeline, and passing a thing
> (called the CAS - Common Analysis Structure/System) to each "annotator"
> component, which may add arbitrary metadata info to the CAS.
>
> For intro, see the getting started parts of the documentation at
> uima.apache.org.

Quite possibly overkill but interesting :-)

Thanks for the link, Marshall, and glad to see UIMA seems to be going strong :-)

Robert