Filed JIRAs for the proposed features and linked with the doc:

https://issues.apache.org/jira/browse/BEAM-2511 TextIO should support reading a PCollection of filenames
https://issues.apache.org/jira/browse/BEAM-2512 TextIO should support watching for new files
https://issues.apache.org/jira/browse/BEAM-2513 TextIO should support watching files for new entries
On Fri, Jun 23, 2017 at 4:32 PM Eugene Kirpichov <[email protected]> wrote:

> Hi all,
>
> I've written up a proposal for incrementally delivering a bunch of useful
> new features in TextIO based on Splittable DoFn. It's applicable to other
> file-based connectors, TextIO is just one good example. Let me know what
> you think!
>
> https://s.apache.org/textio-sdf
>
> Copy of abstract:
>
> Users have often expressed interest in several new features for reading
> files - in particular, incremental reading of log files (streaming of new
> files matching a pattern and new entries in each file) and reading a
> PCollection of filenames (in particular, an unbounded collection arriving
> from a stream such as PubSub or Kafka).
>
> Splittable DoFn <http://s.apache.org/splittable-do-fn> (SDF) enables
> these features. This document proposes an API for them, using the example
> of TextIO, and proposes a plan for delivering them subject to
> availability of SDF in different runners. Some availability constraints are
> circumvented by Running Splittable DoFn via Source API
> <http://s.apache.org/sdf-via-source>.
>
> TL;DR Read a collection of filepatterns arriving on PubSub via
> files.apply(TextIO.readEach()). Tail a filepattern via
> TextIO.read().watchForNewFiles().watchFilesForNewEntries(). Coming to a
> Beam SDK near you in small pieces.
>
> I think I'm gonna start working on the first steps of the proposed plan,
> in parallel with this discussion, because I'm excited :)
