[ https://issues.apache.org/jira/browse/BEAM-2994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reuven Lax updated BEAM-2994:
-----------------------------
    Fix Version/s:     (was: 2.2.0)
                   2.3.0

> Refactor TikaIO
> ---------------
>
>                 Key: BEAM-2994
>                 URL: https://issues.apache.org/jira/browse/BEAM-2994
>             Project: Beam
>          Issue Type: Task
>          Components: sdk-java-extensions
>    Affects Versions: 2.2.0
>            Reporter: Sergey Beryozkin
>            Assignee: Sergey Beryozkin
>             Fix For: 2.3.0
>
>
> TikaIO is currently implemented as a BoundedSource with an asynchronous
> BoundedReader that returns each document's text chunks as Strings. These
> chunks are eventually passed to the pipeline functions unordered and not
> linked to the original documents.
> It was decided in the recent beam-dev thread that TikaIO should initially
> support only the case where a single composite bean per file, capturing the
> file content, location (or name) and metadata, flows to the pipeline. This
> avoids the need to implement TikaIO as a BoundedSource/Reader.
> Enhancing TikaIO to support streaming of the content into the pipelines
> may be considered in the next phase, based on the specific use-cases...
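
For illustration only, the "single composite bean per file" described in the issue could look roughly like the sketch below. The class name ParsedFile and its accessors are hypothetical and are not taken from the issue or from the Beam code base. Metadata is simplified here to a single-valued String map, whereas Tika's own metadata can hold several values per key.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical composite bean: one immutable instance per parsed file,
// carrying the file location, the extracted text and the metadata together.
final class ParsedFile {
    private final String location;
    private final String content;
    private final Map<String, String> metadata;

    ParsedFile(String location, String content, Map<String, String> metadata) {
        this.location = Objects.requireNonNull(location, "location");
        this.content = Objects.requireNonNull(content, "content");
        // Defensive copy so later changes to the caller's map do not leak in.
        this.metadata = Collections.unmodifiableMap(new LinkedHashMap<>(metadata));
    }

    String getLocation() {
        return location;
    }

    String getContent() {
        return content;
    }

    Map<String, String> getMetadata() {
        return metadata;
    }

    // Value semantics, so two results for the same file compare equal.
    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof ParsedFile)) {
            return false;
        }
        ParsedFile other = (ParsedFile) o;
        return location.equals(other.location)
            && content.equals(other.content)
            && metadata.equals(other.metadata);
    }

    @Override
    public int hashCode() {
        return Objects.hash(location, content, metadata);
    }

    @Override
    public String toString() {
        return "ParsedFile{location=" + location
            + ", contentLength=" + content.length()
            + ", metadata=" + metadata + "}";
    }
}
```

Because each element carries its own location, downstream pipeline functions can always tell which document a piece of text came from, which is the link the chunk-based reader loses.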