Sergey Beryozkin created BEAM-2994:
--------------------------------------

             Summary: Refactor TikaIO
                 Key: BEAM-2994
                 URL: https://issues.apache.org/jira/browse/BEAM-2994
             Project: Beam
          Issue Type: Task
          Components: sdk-java-extensions
    Affects Versions: 2.2.0
            Reporter: Sergey Beryozkin
            Assignee: Reuven Lax
             Fix For: 2.2.0

TikaIO is currently implemented as a BoundedSource with an asynchronous BoundedReader that returns each document's text chunks as individual Strings. These chunks eventually reach the pipeline functions unordered and with no link back to the document they came from.

It was decided in the recent beam-dev thread that TikaIO should initially support only the case where a single composite bean per file, capturing the file's content, location (or name) and metadata, flows into the pipeline. This avoids the need to implement TikaIO as a BoundedSource/BoundedReader.

Enhancing TikaIO to stream content into pipelines may be considered in the next phase, based on specific use cases.
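
To make the "single composite bean per file" idea concrete, here is a minimal sketch in plain Java. It is not the TikaIO design or API: the names CompositeBeanSketch, ParsedFile and toBean are hypothetical, and the sketch assumes the file's chunks are already available in order. It only illustrates how a bean keeps content, location and metadata together, so nothing reaches the pipeline detached from its source file.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CompositeBeanSketch {

    // Hypothetical bean (not part of TikaIO): one immutable instance per parsed file.
    public static final class ParsedFile {
        private final String location;
        private final String content;
        private final Map<String, String> metadata;

        public ParsedFile(String location, String content, Map<String, String> metadata) {
            this.location = location;
            this.content = content;
            // Defensive copy so later changes to the caller's map do not leak in.
            this.metadata = Collections.unmodifiableMap(new LinkedHashMap<>(metadata));
        }

        public String getLocation() { return location; }
        public String getContent() { return content; }
        public Map<String, String> getMetadata() { return metadata; }
    }

    // Hypothetical helper: joins the chunks of ONE file, in the given order,
    // and binds the result to that file's location and metadata.
    public static ParsedFile toBean(String location,
                                    List<String> orderedChunks,
                                    Map<String, String> metadata) {
        return new ParsedFile(location, String.join("", orderedChunks), metadata);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", "application/pdf");

        // Illustrative values only; no real file is read here.
        ParsedFile file = toBean("/docs/report.pdf",
                                 Arrays.asList("Hello, ", "world"),
                                 metadata);

        System.out.println(file.getLocation() + " -> " + file.getContent()
                + " " + file.getMetadata());
    }
}
```

With one such element per file, downstream functions can always tell which document a piece of text belongs to, which is the property the chunk-per-String approach lacks.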