Filed JIRAs for the proposed features and linked with the doc:
https://issues.apache.org/jira/browse/BEAM-2511 TextIO should support reading a PCollection of filenames
https://issues.apache.org/jira/browse/BEAM-2512 TextIO should support watching for new files
https://issues.apache.org/jira/browse/BEAM-2513 TextIO should support watching files for new entries

On Fri, Jun 23, 2017 at 4:32 PM Eugene Kirpichov <[email protected]>
wrote:

> Hi all,
>
> I've written up a proposal for incrementally delivering a bunch of useful
> new features in TextIO based on Splittable DoFn. It's applicable to other
> file-based connectors; TextIO is just one good example. Let me know what
> you think!
>
> https://s.apache.org/textio-sdf
>
> Copy of abstract:
>
> Users have often expressed interest in several new features for reading
> files - in particular, incremental reading of log files (streaming of new
> files matching a pattern and new entries in each file) and reading a
> PCollection of filenames (in particular, an unbounded collection arriving
> from a stream such as PubSub or Kafka).
>
> Splittable DoFn <http://s.apache.org/splittable-do-fn> (SDF) enables
> these features. This document proposes an API for them, using the example
> of TextIO, and proposes a plan for delivering them subject to
> availability of SDF in different runners. Some availability constraints are
> circumvented by Running Splittable DoFn via Source API
> <http://s.apache.org/sdf-via-source>.
>
> TL;DR Read a collection of filepatterns arriving on PubSub via
> files.apply(TextIO.readEach()). Tail a filepattern via
> TextIO.read().watchForNewFiles().watchFilesForNewEntries(). Coming to a
> Beam SDK near you in small pieces.
>
> I think I'm gonna start working on the first steps of the proposed plan,
> in parallel with this discussion, because I'm excited :)
>
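
For readers skimming the thread, here is a rough illustration of what
"tailing a filepattern" means in the TL;DR above: pick up new files that
match a pattern, and pick up new entries appended to files already seen.
This is not Beam code and not the proposed implementation. It is a
stdlib-only Python sketch, and FilepatternTailer and poll() are
hypothetical names invented for the illustration. It assumes UTF-8 text
files that are only ever appended to.

```python
import glob


class FilepatternTailer:
    """Hypothetical helper (not a Beam API): polls a glob pattern for
    new files and for new lines appended to files it has already seen."""

    def __init__(self, pattern):
        self.pattern = pattern
        self.offsets = {}  # path -> number of bytes already consumed

    def poll(self):
        """Return complete lines that appeared since the previous poll."""
        new_lines = []
        for path in sorted(glob.glob(self.pattern)):
            # A file matched for the first time is read from offset 0,
            # which covers the "watch for new files" half.
            start = self.offsets.get(path, 0)
            with open(path, "rb") as f:
                f.seek(start)
                chunk = f.read()
            # Consume only up to the last newline, so a partially written
            # trailing line is held back until a later poll. This covers
            # the "watch files for new entries" half.
            end = chunk.rfind(b"\n") + 1
            self.offsets[path] = start + end
            new_lines.extend(
                line.decode("utf-8") for line in chunk[:end].splitlines()
            )
        return new_lines
```

Calling poll() repeatedly on something like
FilepatternTailer("/var/log/app/*.log") yields each complete line once.
The per-file byte offset plays a role loosely similar to the restriction
a Splittable DoFn would track, but that analogy is mine and is not taken
from the proposal.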
