Sergey Beryozkin created BEAM-2994:
--------------------------------------

             Summary: Refactor TikaIO
                 Key: BEAM-2994
                 URL: https://issues.apache.org/jira/browse/BEAM-2994
             Project: Beam
          Issue Type: Task
          Components: sdk-java-extensions
    Affects Versions: 2.2.0
            Reporter: Sergey Beryozkin
            Assignee: Reuven Lax
             Fix For: 2.2.0


TikaIO is currently implemented as a BoundedSource with an asynchronous 
BoundedReader that returns each document's text in chunks, as Strings. 
These chunks reach the pipeline functions unordered and with no link back to 
the document they came from.

It was decided in the recent beam-dev thread that TikaIO should initially 
support only the case where a single composite bean per file, capturing the 
file content, location (or name) and metadata, flows to the pipeline. This 
avoids the need to implement TikaIO as a BoundedSource/Reader.
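
For illustration only, the composite bean described above might look roughly 
like the sketch below. The class name ParsedDocument, its field names and the 
flat Map<String, String> metadata are assumptions made for this example, not 
the agreed design; Tika's own Metadata type allows multiple values per key, 
which is simplified away here. The bean is Serializable on the assumption that 
it must be encodable as a pipeline element.

```java
import java.io.Serializable;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bean: one instance per parsed file. Names are illustrative.
final class ParsedDocument implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String location;               // file location (or name)
    private final String content;                // full extracted text of the file
    private final Map<String, String> metadata;  // simplified single-valued metadata

    ParsedDocument(String location, String content, Map<String, String> metadata) {
        this.location = location;
        this.content = content;
        // Defensive copy so later changes to the caller's map do not leak in.
        this.metadata = Collections.unmodifiableMap(new LinkedHashMap<>(metadata));
    }

    String getLocation() {
        return location;
    }

    String getContent() {
        return content;
    }

    Map<String, String> getMetadata() {
        return metadata;
    }
}
```

Because the content, location and metadata travel together in one element, a 
downstream function can always tell which file a piece of text came from, 
which the chunk-per-String approach does not allow.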

Enhancing TikaIO to support streaming content into pipelines may be 
considered in the next phase, based on specific use cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
