Thanks for the clarification Ismaël and Eugene. +1 for deprecating existing
FooIO.readAll() transforms in favor of FooIO.readFiles().

On Wed, Jan 30, 2019 at 3:25 PM Eugene Kirpichov <kirpic...@google.com>
wrote:

> TextIO.read() and AvroIO.read() indeed perform better than match() +
> readMatches() + readFiles(), due to DWR - so for these two in particular I
> would not recommend such a refactoring.
> However, new file-based IOs that do not support DWR should only provide
> readFiles(). Those that do, should provide read() and readFiles(). When SDF
> supports DWR, then readFiles() will be enough in all cases.
> In general there's no need for readAll() for new file-based IOs - it is
> always equivalent to matchAll() + readMatches() + readFiles() including
> performance-wise. It was included in TextIO/AvroIO before readFiles() was a
> thing.
>
> On Wed, Jan 30, 2019 at 2:41 PM Chamikara Jayalath <chamik...@google.com>
> wrote:
>
>> On Wed, Jan 30, 2019 at 2:37 PM Chamikara Jayalath <chamik...@google.com>
>> wrote:
>>
>>>
>>>
>>> On Wed, Jan 30, 2019 at 2:33 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>
>>>> Ups slight typo, in the first line of the previous email I meant read
>>>> instead of readAll
>>>>
>>>> On Wed, Jan 30, 2019 at 11:32 PM Ismaël Mejía <ieme...@gmail.com>
>>>> wrote:
>>>> >
>>>> > Reuven is right for the example, readAll at this moment may be faster
>>>> > and also supports Dynamic Work Rebalancing (DWR), but the performance
>>>> > of the other approach may (and must) be improved to be equal, once the
>>>> > internal implementation of TextIO.read moves to a SDF version instead
>>>> > of the FileBasedSource one, and once that runners support DWR through
>>>> > SDF. Of course all of this is future work. Probably Eugene can
>>>> > eventually chime in to give more details in practical performance in
>>>> > his tests in Dataflow.
>>>> >
>>>> > Really interesting topic, but I want to bring back the discussion to
>>>> > the subject of the thread. I think there is some confusion after
>>>> > Jeff's example which should have been:
>>>> >
>>>> >       return input
>>>> >           .apply(TextIO.readAll());
>>>> >
>>>> > to:
>>>> >
>>>> >       return input
>>>> >           .apply(FileIO.match().filepattern(fileSpec))
>>>> >           .apply(FileIO.readMatches())
>>>> >           .apply(TextIO.readFiles());
>>>> >
>>>> > This is the question we are addressing, do we need a readAll transform
>>>> > that replaces the 3 steps or no?
>>>>
>>>
>>> Ismaël, I'm not quite sure how these two are equal.
>>> readFiles() transform returns a PCollection of ReadableFile objects. Users
>>> are expected to read these files in a subsequent ParDo and produce a
>>> PCollection of proper type. FooIO.ReadAll() transforms on the other hand
>>> are tailored to each IO connector and return a PCollection of objects of
>>> type that are supported to be returned by that IO connector.
>>>
>>
>> I assume you meant FileIO.readFiles()  here. Or did you mean
>> TextIO.readFiles() ? If so that seems very similar to TextIO.readAll().
>>
>>>
>>>
>>>
>>>> >
>>>> > On Wed, Jan 30, 2019 at 9:03 PM Robert Bradshaw <rober...@google.com>
>>>> wrote:
>>>> > >
>>>> > > Yes, this is precisely the goal of SDF.
>>>> > >
>>>> > >
>>>> > > On Wed, Jan 30, 2019 at 8:41 PM Kenneth Knowles <k...@google.com>
>>>> wrote:
>>>> > > >
>>>> > > > So is the latter is intended for splittable DoFn but not yet
>>>> using it? The promise of SDF is precisely this composability, isn't it?
>>>> > > >
>>>> > > > Kenn
>>>> > > >
>>>> > > > On Wed, Jan 30, 2019 at 10:16 AM Jeff Klukas <jklu...@mozilla.com>
>>>> wrote:
>>>> > > >>
>>>> > > >> Reuven - Is TextIO.read().from() a more complex case than the
>>>> topic Ismaël is bringing up in this thread? I'm surprised to hear that the
>>>> two examples have different performance characteristics.
>>>> > > >>
>>>> > > >> Reading through the implementation, I guess the fundamental
>>>> difference is whether a given configuration expands to TextIO.ReadAll or to
>>>> io.Read. AFAICT, that detail and the subsequent performance impact is not
>>>> documented.
>>>> > > >>
>>>> > > >> If the above is correct, perhaps it's an argument for IOs to
>>>> provide higher-level methods in cases where they can optimize performance
>>>> compared to what a user might naively put together.
>>>> > > >>
>>>> > > >> On Wed, Jan 30, 2019 at 12:35 PM Reuven Lax <re...@google.com>
>>>> wrote:
>>>> > > >>>
>>>> > > >>> Jeff, what you did here is not simply a refactoring. These two
>>>> are quite different, and will likely have different performance
>>>> characteristics.
>>>> > > >>>
>>>> > > >>> The first evaluates the wildcard, and allows the runner to pick
>>>> appropriate bundling. Bundles might contain multiple files (if they are
>>>> small), and the runner can split the files as appropriate. In the case of
>>>> the Dataflow runner, these bundles can be further split dynamically.
>>>> > > >>>
>>>> > > >>> The second chops of the files inside the the PTransform, and
>>>> processes each chunk in a ParDo. TextIO.readFiles currently chops up each
>>>> file into 64mb chunks (hardcoded), and then processes each chunk in a 
>>>> ParDo.
>>>> > > >>>
>>>> > > >>> Reuven
>>>> > > >>>
>>>> > > >>>
>>>> > > >>> On Wed, Jan 30, 2019 at 9:18 AM Jeff Klukas <
>>>> jklu...@mozilla.com> wrote:
>>>> > > >>>>
>>>> > > >>>> I would prefer we move towards option [2]. I just tried the
>>>> following refactor in my own code from:
>>>> > > >>>>
>>>> > > >>>>       return input
>>>> > > >>>>           .apply(TextIO.read().from(fileSpec));
>>>> > > >>>>
>>>> > > >>>> to:
>>>> > > >>>>
>>>> > > >>>>       return input
>>>> > > >>>>           .apply(FileIO.match().filepattern(fileSpec))
>>>> > > >>>>           .apply(FileIO.readMatches())
>>>> > > >>>>           .apply(TextIO.readFiles());
>>>> > > >>>>
>>>> > > >>>> Yes, the latter is more verbose but not ridiculously so, and
>>>> it's also more instructive about what's happening.
>>>> > > >>>>
>>>> > > >>>> When I first started working with Beam, it took me a while to
>>>> realize that TextIO.read().from() would accept a wildcard. The more verbose
>>>> version involves a method called "filepattern" which makes this much more
>>>> obvious. It also leads me to understand that I could use the same
>>>> FileIO.match() machinery to do other things with filesystems other than
>>>> read file contents.
>>>> > > >>>>
>>>> > > >>>> On Wed, Jan 30, 2019 at 11:26 AM Ismaël Mejía <
>>>> ieme...@gmail.com> wrote:
>>>> > > >>>>>
>>>> > > >>>>> Hello,
>>>> > > >>>>>
>>>> > > >>>>> A ‘recent’ pattern of use in Beam is to have in file based
>>>> IOs a
>>>> > > >>>>> `readAll()` implementation that basically matches a
>>>> `PCollection` of
>>>> > > >>>>> file patterns and reads them, e.g. `TextIO`, `AvroIO`.
>>>> `ReadAll` is
>>>> > > >>>>> implemented by a expand function that matches files with
>>>> FileIO and
>>>> > > >>>>> then reads them using a format specific `ReadFiles` transform
>>>> e.g.
>>>> > > >>>>> TextIO.ReadFiles, AvroIO.ReadFiles. So in the end `ReadAll`
>>>> in the
>>>> > > >>>>> Java implementation is just an user friendly API to hide
>>>> FileIO.match
>>>> > > >>>>> + ReadFiles.
>>>> > > >>>>>
>>>> > > >>>>> Most recent IOs do NOT implement ReadAll to encourage the more
>>>> > > >>>>> composable approach of File + ReadFiles, e.g. XmlIO and
>>>> ParquetIO.
>>>> > > >>>>>
>>>> > > >>>>> Implementing ReadAll as a wrapper is relatively easy and is
>>>> definitely
>>>> > > >>>>> user friendly, but it has an  issue, it may be error-prone
>>>> and it adds
>>>> > > >>>>> more code to maintain (mostly ‘repeated’ code). However
>>>> `readAll` is a
>>>> > > >>>>> more abstract pattern that applies not only to File based IOs
>>>> so it
>>>> > > >>>>> makes sense for example in other transforms that map a
>>>> `Pcollection`
>>>> > > >>>>> of read requests and is the basis for SDF composable style
>>>> APIs like
>>>> > > >>>>> the recent `HBaseIO.readAll()`.
>>>> > > >>>>>
>>>> > > >>>>> So the question is should we:
>>>> > > >>>>>
>>>> > > >>>>> [1] Implement `readAll` in all file based IOs to be user
>>>> friendly and
>>>> > > >>>>> assume the (minor) maintenance cost
>>>> > > >>>>>
>>>> > > >>>>> or
>>>> > > >>>>>
>>>> > > >>>>> [2] Deprecate `readAll` from file based IOs and encourage
>>>> users to use
>>>> > > >>>>> FileIO + `readFiles` (less maintenance and encourage
>>>> composition).
>>>> > > >>>>>
>>>> > > >>>>> I just checked quickly in the python code base but I did not
>>>> find if
>>>> > > >>>>> the File match + ReadFiles pattern applies, but it would be
>>>> nice to
>>>> > > >>>>> see what the python guys think on this too.
>>>> > > >>>>>
>>>> > > >>>>> This discussion comes from a recent slack conversation with
>>>> Łukasz
>>>> > > >>>>> Gajowy, and we wanted to settle into one approach to make the
>>>> IO
>>>> > > >>>>> signatures consistent, so any opinions/preferences?
>>>>
>>>

Reply via email to