Re: splitIntoBundles vs. generateInitialSplits

Stas Levin Mon, 20 Mar 2017 23:53:17 -0700

Indeed, take a look at https://issues.apache.org/jira/browse/BEAM-1272.


On Tue, Mar 21, 2017 at 8:20 AM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> It makes sense.
>
> Regards
> JB
>
> On 03/20/2017 11:14 PM, Ismaël Mejía wrote:
> > This is a forgotten one. Stas, did you create a JIRA about this one? I
> > think this change should also be tagged as First version release,
> > because this is an API change and can break stuff if we do it later
> > on.
> >
> > On Wed, Jan 11, 2017 at 4:30 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >> Hi Eugene and Stas,
> >>
> >> Just back from a couple of days off and jumping into this discussion.
> >>
> >> I agree with Stas: it's worth creating a Jira about that. The only
> >> "semantic" difference is unbounded vs. bounded source, but the
> >> behavior is the same.
> >>
> >> Regards
> >> JB
> >>
> >>
> >> On 01/11/2017 04:26 PM, Stas Levin wrote:
> >>>
> >>> Eugene, that makes a lot of sense to me.
> >>>
> >>> Do you think it's worth filing a Jira ticket?
> >>>
> >>> On Wed, Jan 11, 2017 at 2:58 AM Eugene Kirpichov
> >>> <kirpic...@google.com.invalid> wrote:
> >>>
> >>> I agree that the methods are named somewhat confusingly, and ideally
> >>> would be named the same. Both of the names miss some aspect of the
> >>> underlying concept.
> >>>
> >>> The underlying concept is to split the source into smaller sub-sources
> >>> which, if you read all of them, would yield the same data as the
> >>> original one.
> >>> "splitIntoBundles" assumes that 1 source = 1 bundle, which is
> >>> completely false in streaming, and only partially true in batch (I'm
> >>> talking about the Dataflow runner).
> >>> "generateInitialSplits" assumes that this splitting happens only
> >>> "initially", i.e. at job startup time. This is currently true in
> >>> practice for all existing runners, but it doesn't have to be - we could
> >>> conceivably call it again at some point during the job if we see that
> >>> some of the sub-sources are still too large.
> >>>
> >>> The analogous method in Splittable DoFn
> >>> (https://s.apache.org/splittable-do-fn) is called @SplitRestriction,
> >>> but there are no restrictions in the source API, only sources.
> >>>
> >>> Perhaps both should be called simply "split", or "splitIntoSubSources".
> >>>
> >>> On Mon, Jan 9, 2017 at 2:12 PM Stas Levin <stasle...@gmail.com> wrote:
> >>>
> >>>> Definitely seems like the formatting got lost in translation, sorry
> >>>> about that :)
> >>>>
> >>>> I guess both cases (methods) create splits, which are essentially a
> >>>> list of bounded/unbounded source instances, each responsible for
> >>>> reading certain segments (physical or otherwise) of the data.
> >>>>
> >>>> On Mon, Jan 9, 2017 at 11:51 PM Stephen Sisk <s...@google.com.invalid>
> >>>> wrote:
> >>>>
> >>>>> hi!
> >>>>>
> >>>>> I think your strikethrough got lost due to this being a text-only
> >>>>> email list. To make sure, I think you're asking the following:
> >>>>> "would it be reasonable to think of splitIntoBundles as
> >>>>> generateSplits?"
> >>>>>
> >>>>> (i.e., you strikethrough'd Initial)
> >>>>>
> >>>>> They are very similar and I definitely also think of them as
> >>>>> occupying the same niche. I'll let someone else who was around for
> >>>>> naming discuss whether it was intentional or not. Conceptually, the
> >>>>> way that bounded vs streaming are handled means that they are doing
> >>>>> slightly different things: a bounded source is really kind of
> >>>>> creating physical chunks of the data, whereas the streaming source is
> >>>>> creating conceptual divisions of the data that will be used later.
> >>>>> I'm not sure that's worth the confusion caused by the differences.
> >>>>>
> >>>>> One thing to clarify - splitIntoBundles does have an "Initial" aspect
> >>>>> to it. I don't believe there is a publicly defined/written down order
> >>>>> the Source & Reader methods are called in, but a runner trying to
> >>>>> get efficiency would be able to use splitIntoBundles during job
> >>>>> startup to be able to split up the work before creating readers
> >>>>> rather than after creating readers and waiting to use
> >>>>> splitAtFraction.
> >>>>>
> >>>>> S
> >>>>>
> >>>>> On Sun, Jan 8, 2017 at 6:06 AM Stas Levin <stasle...@gmail.com> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> A short terminology question regarding "bundle", and
> >>>>>> particularly splitIntoBundles vs. generateInitialSplits.
> >>>>>>
> >>>>>> In *BoundedSource* we have:
> >>>>>> List<? extends BoundedSource<T>> *splitIntoBundles*(...)
> >>>>>>
> >>>>>> In *UnboundedSource* we have:
> >>>>>> List<? extends UnboundedSource<OutputT, CheckpointMarkT>>
> >>>>>> *generateInitialSplits*(...)
> >>>>>>
> >>>>>> I was wondering if the names were intentionally made different, i.e.
> >>>>>> "into bundles" vs "into splits"?
> >>>>>> In a way these two methods carry out a very similar task. Would it be
> >>>>>> reasonable to think of *splitIntoBundles* as *generate*Initial*Splits*?
> >>>>>>
> >>>>>> (strikethrough due to "initial" not being applicable in the case of
> >>>>>> bounded sources)
> >>>>>>
> >>>>>> Regards,
> >>>>>> Stas
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >> --
> >> Jean-Baptiste Onofré
> >> jbono...@apache.org
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
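
The concept Eugene describes in the thread (split a source into sub-sources
that together read the same data as the original, possibly splitting again
later if a sub-source is still too large) can be sketched roughly as below.
This is only an illustration under stated assumptions: RangeSource, its
split(long) method and read() are hypothetical names invented for this
sketch. They are not Beam's BoundedSource or UnboundedSource API, and the
sketch ignores readers, checkpoints and pipeline options entirely.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a source over the range [start, end); not a Beam class. */
final class RangeSource {
    final long start; // inclusive
    final long end;   // exclusive

    RangeSource(long start, long end) {
        if (start > end) {
            throw new IllegalArgumentException("start must not exceed end");
        }
        this.start = start;
        this.end = end;
    }

    /**
     * Splits this source into consecutive sub-sources of at most desiredSize
     * elements each. Nothing here is tied to job startup, so the method can
     * be called again on any sub-source that is still too large.
     */
    List<RangeSource> split(long desiredSize) {
        if (desiredSize <= 0) {
            throw new IllegalArgumentException("desiredSize must be positive");
        }
        List<RangeSource> subSources = new ArrayList<>();
        for (long s = start; s < end; s += desiredSize) {
            subSources.add(new RangeSource(s, Math.min(s + desiredSize, end)));
        }
        return subSources;
    }

    /** Reads every element of this source, in order. */
    List<Long> read() {
        List<Long> out = new ArrayList<>();
        for (long i = start; i < end; i++) {
            out.add(i);
        }
        return out;
    }

    public static void main(String[] args) {
        RangeSource original = new RangeSource(0, 10);
        List<Long> combined = new ArrayList<>();
        for (RangeSource sub : original.split(4)) {
            // Re-split each sub-source to mimic a later, non-initial split.
            for (RangeSource smaller : sub.split(2)) {
                combined.addAll(smaller.read());
            }
        }
        // The sub-sources together read the same data as the original.
        System.out.println(combined.equals(original.read()));
    }
}
```

In this sketch a single name, split, serves both the first split and any
later one, which is the naming direction suggested in the thread.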
