+1 This is a really nice doc and plan.

On Tue, Jun 27, 2017 at 1:49 AM, Aljoscha Krettek <[email protected]> wrote:

> +1
>
> This sounds very good and there is a clear implementation path!
>
> > On 24. Jun 2017, at 20:55, Jean-Baptiste Onofré <[email protected]> wrote:
> >
> > Fair enough ;)
> >
> > Let me review the different Jira and provide some feedback.
> >
> > Regards
> > JB
> >
> > On Jun 24, 2017, 20:54, at 20:54, Eugene Kirpichov <[email protected]> wrote:
> >> Hi JB,
> >> I haven't yet thought about how this work can be parallelized. For now
> >> I'd like to just get feedback on the approach :)
> >> But glad that you're willing to help out - let's discuss this too a bit
> >> later!
> >>
> >> On Sat, Jun 24, 2017 at 11:51 AM Jean-Baptiste Onofré <[email protected]> wrote:
> >>
> >>> Thanks Eugene
> >>>
> >>> I will pick up some.
> >>>
> >>> Regards
> >>> JB
> >>>
> >>> On Jun 24, 2017, 20:00, at 20:00, Eugene Kirpichov <[email protected]> wrote:
> >>>> Filed JIRAs for the proposed features and linked with the doc:
> >>>> https://issues.apache.org/jira/browse/BEAM-2511 TextIO should support
> >>>> reading a PCollection of filenames
> >>>> https://issues.apache.org/jira/browse/BEAM-2512 TextIO should support
> >>>> watching for new files
> >>>> https://issues.apache.org/jira/browse/BEAM-2513 TextIO should support
> >>>> watching files for new entries
> >>>>
> >>>> On Fri, Jun 23, 2017 at 4:32 PM Eugene Kirpichov <[email protected]> wrote:
> >>>>
> >>>>> Hi all,
> >>>>>
> >>>>> I've written up a proposal for incrementally delivering a bunch of
> >>>>> useful new features in TextIO based on Splittable DoFn. It's
> >>>>> applicable to other file-based connectors, TextIO is just one good
> >>>>> example. Let me know what you think!
> >>>>>
> >>>>> https://s.apache.org/textio-sdf
> >>>>>
> >>>>> Copy of abstract:
> >>>>>
> >>>>> Users have often expressed interest in several new features for
> >>>>> reading files - in particular, incremental reading of log files
> >>>>> (streaming of new files matching a pattern and new entries in each
> >>>>> file) and reading a PCollection of filenames (in particular, an
> >>>>> unbounded collection arriving from a stream such as PubSub or Kafka).
> >>>>>
> >>>>> Splittable DoFn <http://s.apache.org/splittable-do-fn> (SDF) enables
> >>>>> these features. This document proposes an API for them, using the
> >>>>> example of TextIO, and proposes a plan for delivering them subject to
> >>>>> availability of SDF in different runners. Some availability
> >>>>> constraints are circumvented by Running Splittable DoFn via Source API
> >>>>> <http://s.apache.org/sdf-via-source>.
> >>>>>
> >>>>> TL;DR Read a collection of filepatterns arriving on PubSub via
> >>>>> files.apply(TextIO.readEach()). Tail a filepattern via
> >>>>> TextIO.read().watchForNewFiles().watchFilesForNewEntries(). Coming to
> >>>>> a Beam SDK near you in small pieces.
> >>>>>
> >>>>> I think I'm gonna start working on the first steps of the proposed
> >>>>> plan, in parallel with this discussion, because I'm excited :)
> >>>>>
> >>>
> >
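
For anyone skimming the thread: the two "tailing" behaviours named in the quoted TL;DR (watching a filepattern for new files, and watching each file for new entries) can be pictured as plain polling. Below is a minimal sketch of that idea in standard-library Python. It is not Beam code, and it makes no claim about how TextIO or Splittable DoFn would actually implement these features; the names `poll_new_files`, `read_new_entries` and `demo` are hypothetical and exist only for this illustration.

```python
import glob
import os
import tempfile


def poll_new_files(pattern, seen):
    """Return files matching `pattern` that are not yet in `seen`, and record them."""
    new_files = sorted(set(glob.glob(pattern)) - seen)
    seen.update(new_files)
    return new_files


def read_new_entries(path, offset):
    """Return (complete lines appended after byte `offset`, updated offset)."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Consume only up to the last newline, so a half-written line
    # is left in place for the next poll.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8").splitlines()
    return lines, offset + end


def demo():
    """Simulate two polling rounds over a temporary directory of log files."""
    with tempfile.TemporaryDirectory() as d:
        pattern = os.path.join(d, "*.log")
        seen, offsets, out = set(), {}, []

        def poll():
            # Step 1: pick up files that newly match the pattern.
            for path in poll_new_files(pattern, seen):
                offsets[path] = 0
            # Step 2: read whatever was appended to each known file.
            for path in sorted(offsets):
                lines, offsets[path] = read_new_entries(path, offsets[path])
                out.extend(lines)

        a = os.path.join(d, "a.log")
        with open(a, "w") as f:
            f.write("one\ntwo\npar")  # last line is still incomplete
        poll()

        with open(a, "a") as f:
            f.write("tial\n")  # completes the pending line
        with open(os.path.join(d, "b.log"), "w") as f:
            f.write("three\n")  # a new file appears
        poll()
        return out
```

Calling `demo()` runs both rounds: the first picks up `a.log` and its two complete lines, the second picks up the completed third line plus the newly created `b.log`. The per-file byte offset is the only state carried between polls, which is loosely why a resumable, restriction-based model is a natural fit for this kind of reading.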
