Indeed, take a look at https://issues.apache.org/jira/browse/BEAM-1272.
On Tue, Mar 21, 2017 at 8:20 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> It makes sense.
>
> Regards
> JB
>
> On 03/20/2017 11:14 PM, Ismaël Mejía wrote:
> > This is a forgotten one, Stas did you create a JIRA about this one? I
> > think this change should also be tagged as First version release,
> > because this is an API change and can break stuff if we do it later
> > on.
> >
> > On Wed, Jan 11, 2017 at 4:30 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >> Hi Eugene and Stas,
> >>
> >> Just back from a couple of days off and jumping on this discussion.
> >>
> >> I agree with Stas: it's worth creating a Jira about that. The only
> >> "semantic" difference is unbounded vs bounded source, but the behavior is
> >> the same.
> >>
> >> Regards
> >> JB
> >>
> >>
> >> On 01/11/2017 04:26 PM, Stas Levin wrote:
> >>>
> >>> Eugene, that makes a lot of sense to me.
> >>>
> >>> Do you think it's worth filing a Jira ticket?
> >>>
> >>> On Wed, Jan 11, 2017 at 2:58 AM Eugene Kirpichov
> >>> <kirpic...@google.com.invalid> wrote:
> >>>
> >>> I agree that the methods are named somewhat confusingly, and ideally would
> >>> be named the same. Both of the names miss some aspect of the underlying
> >>> concept.
> >>>
> >>> The underlying concept is to split the source into smaller sub-sources which,
> >>> if you read all of them, would have read the same data as the original one.
> >>> "splitIntoBundles" assumes that 1 source = 1 bundle, which is completely
> >>> false in streaming, and only partially true in batch (I'm talking about the
> >>> Dataflow runner).
> >>> "generateInitialSplits" assumes that this splitting happens only
> >>> "initially", i.e. at job startup time. This is currently true in practice
> >>> for all existing runners, but it doesn't have to be - we could conceivably
> >>> call it again at some point during the job if we see that some of the
> >>> sub-sources are still too large.
> >>>
> >>> The analogous method in Splittable DoFn (
> >>> https://s.apache.org/splittable-do-fn) is called @SplitRestriction, but
> >>> there are no restrictions in source API, only sources.
> >>>
> >>> Perhaps both should be called simply "split", or "splitIntoSubSources".
> >>>
> >>> On Mon, Jan 9, 2017 at 2:12 PM Stas Levin <stasle...@gmail.com> wrote:
> >>>
> >>>> Definitely seems like the formatting got lost in translation, sorry about
> >>>> that :)
> >>>>
> >>>> I guess both cases (methods) create splits, which are essentially a list of
> >>>> bounded/unbounded source instances, each responsible for reading certain
> >>>> segments (physical or otherwise) of the data.
> >>>>
> >>>> On Mon, Jan 9, 2017 at 11:51 PM Stephen Sisk <s...@google.com.invalid> wrote:
> >>>>
> >>>>> hi!
> >>>>>
> >>>>> I think your strikethrough got lost due to this being a text-only email
> >>>>> list. To make sure, I think you're asking the following:
> >>>>> "would it be reasonable to think of splitIntoBundles as generateSplits?"
> >>>>> (ie, you strikethrough'd Initial)
> >>>>>
> >>>>> They are very similar and I definitely also think of them as occupying the
> >>>>> same niche. I'll let someone else who was around for naming discuss whether
> >>>>> it was intentional or not. Conceptually, the way that bounded vs streaming
> >>>>> are handled means that they are doing slightly different things: a bounded
> >>>>> source is really kind of creating physical chunks of the data, whereas the
> >>>>> streaming source is creating conceptual divisions of the data that will be
> >>>>> used later. I'm not sure that's worth the confusion caused by the
> >>>>> differences.
> >>>>>
> >>>>> One thing to clarify - splitIntoBundles does have an "Initial" aspect to
> >>>>> it. I don't believe there is a publicly defined/written down order the
> >>>>> Sources & Reader methods are called in, but a runner trying to get
> >>>>> efficiency would be able to use splitIntoBundles during job startup to be
> >>>>> able to split up the work before creating readers rather than after
> >>>>> creating readers and waiting to use splitAtFraction.
> >>>>>
> >>>>> S
> >>>>>
> >>>>> On Sun, Jan 8, 2017 at 6:06 AM Stas Levin <stasle...@gmail.com> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> A short terminology question regarding "bundle", and
> >>>>>> particularly splitIntoBundles vs. generateInitialSplits.
> >>>>>>
> >>>>>> In *BoundedSource* we have:
> >>>>>> List<? extends BoundedSource<T>> *splitIntoBundles*(...)
> >>>>>>
> >>>>>> In *UnboundedSource* we have:
> >>>>>> List<? extends UnboundedSource<OutputT, CheckpointMarkT>>
> >>>>>> *generateInitialSplits*(...)
> >>>>>>
> >>>>>> I was wondering if the names were intentionally made different, i.e. "into
> >>>>>> bundles" vs "into splits"?
> >>>>>> In a way these two methods carry out a very similar task, would it be
> >>>>>> reasonable to think of *splitIntoBundles* as *generate*Initial*Splits*?
> >>>>>> (strikethrough due to "initial" not being applicable in the case of bounded
> >>>>>> sources)
> >>>>>>
> >>>>>> Regards,
> >>>>>> Stas
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >> --
> >> Jean-Baptiste Onofré
> >> jbono...@apache.org
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
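
For anyone picking this thread up later, the underlying concept Eugene describes in the quoted discussion (split a source into sub-sources which, read together, yield the same data as the original) might be sketched roughly as below. This is only an illustration under stated assumptions: RangeSource, its split(...) and read() methods, and SplitSketch are hypothetical names invented for this sketch and are not part of the Beam source API. A plain range of longs stands in for the data.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a source over [start, end); NOT a Beam class. */
final class RangeSource {
    final long start; // inclusive
    final long end;   // exclusive

    RangeSource(long start, long end) {
        if (start > end) {
            throw new IllegalArgumentException("start must be <= end");
        }
        this.start = start;
        this.end = end;
    }

    /**
     * Splits this source into contiguous sub-sources holding at most
     * desiredSize elements each. Nothing here is tied to job startup:
     * the method can be called again on any sub-source that is still too large.
     */
    List<RangeSource> split(long desiredSize) {
        if (desiredSize <= 0) {
            throw new IllegalArgumentException("desiredSize must be positive");
        }
        List<RangeSource> subSources = new ArrayList<>();
        for (long s = start; s < end; s += desiredSize) {
            subSources.add(new RangeSource(s, Math.min(s + desiredSize, end)));
        }
        return subSources;
    }

    /** "Reads" the source by listing every element in its range. */
    List<Long> read() {
        List<Long> out = new ArrayList<>();
        for (long i = start; i < end; i++) {
            out.add(i);
        }
        return out;
    }
}

final class SplitSketch {
    public static void main(String[] args) {
        RangeSource original = new RangeSource(0, 10);

        // Read every sub-source in order and concatenate the results.
        List<Long> fromSubSources = new ArrayList<>();
        for (RangeSource sub : original.split(4)) {
            fromSubSources.addAll(sub.read());
        }

        // The sub-sources together cover exactly the original data.
        System.out.println(fromSubSources.equals(original.read())); // prints "true"
    }
}
```

Note that the sketch has no notion of "bundle" or "initial": the same split(...) applies to a whole source or to one of its sub-sources, which is the argument made in the thread for a neutral name such as "split".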