Re: splitIntoBundles vs. generateInitialSplits

Jean-Baptiste Onofré Wed, 11 Jan 2017 07:30:22 -0800

Hi Eugene and Stas,

Just back from a couple of days off and jumping into this discussion.

I agree with Stas: it's worth creating a Jira about that. The only "semantic" difference is unbounded vs bounded source, but the behavior is the same.


Regards
JB

On 01/11/2017 04:26 PM, Stas Levin wrote:

Eugene, that makes a lot of sense to me.

Do you think it's worth filing a Jira ticket?

On Wed, Jan 11, 2017 at 2:58 AM Eugene Kirpichov
<[email protected]> wrote:

I agree that the methods are named somewhat confusingly, and ideally would
be named the same. Both of the names miss some aspect of the underlying
concept.

The underlying concept is to split the source into smaller sub-sources which,
if you read all of them, would yield the same data as the original one.
"splitIntoBundles" assumes that 1 source = 1 bundle, which is completely
false in streaming, and only partially true in batch (I'm talking about the
Dataflow runner).
"generateInitialSplits" assumes that this splitting happens only
"initially", i.e. at job startup time. This is currently true in practice
for all existing runners, but it doesn't have to be - we could conceivably
call it again at some point during the job if we see that some of the
sub-sources are still too large.

The analogous method in Splittable DoFn
(https://s.apache.org/splittable-do-fn) is called @SplitRestriction, but
there are no restrictions in the source API, only sources.

Perhaps both should be called simply "split", or "splitIntoSubSources".
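
To make the underlying concept concrete, here is a minimal sketch. It
assumes a hypothetical RangeSource class over a range of integers; it is
not part of the Beam SDK, and the method name "split" is only the name
suggested above, not an existing API. The point it illustrates is the
contract: reading all sub-sources yields the same data as reading the
original source.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical source over the integer range [start, end); not a Beam class. */
class RangeSource {
    final long start;
    final long end;

    RangeSource(long start, long end) {
        this.start = start;
        this.end = end;
    }

    /** Splits this source into sub-sources of at most desiredSize elements each. */
    List<RangeSource> split(long desiredSize) {
        if (desiredSize <= 0) {
            throw new IllegalArgumentException("desiredSize must be positive");
        }
        List<RangeSource> subSources = new ArrayList<>();
        for (long s = start; s < end; s += desiredSize) {
            subSources.add(new RangeSource(s, Math.min(s + desiredSize, end)));
        }
        return subSources;
    }

    /** Reads every element of this source. */
    List<Long> read() {
        List<Long> out = new ArrayList<>();
        for (long i = start; i < end; i++) {
            out.add(i);
        }
        return out;
    }

    public static void main(String[] args) {
        RangeSource original = new RangeSource(0, 10);
        List<Long> fromSubSources = new ArrayList<>();
        for (RangeSource sub : original.split(4)) {
            fromSubSources.addAll(sub.read());
        }
        // The concatenated sub-source reads should equal a read of the original.
        System.out.println(fromSubSources.equals(original.read()));
    }
}
```

Nothing in this contract says when or how often split is called, which is
why neither "bundles" nor "initial" is needed in the name.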

On Mon, Jan 9, 2017 at 2:12 PM Stas Levin <[email protected]> wrote:

Definitely seems like the formatting got lost in translation, sorry about
that :)

I guess both cases (methods) create splits, which are essentially a list of
bounded/unbounded source instances, each responsible for reading certain
segments (physical or otherwise) of the data.

On Mon, Jan 9, 2017 at 11:51 PM Stephen Sisk <[email protected]>
wrote:

hi!

I think your strikethrough got lost due to this being a text-only email
list. To make sure, I think you're asking the following:
"would it be reasonable to think of splitIntoBundles as generateSplits?"
(i.e., you struck through "Initial")

They are very similar and I definitely also think of them as occupying the
same niche. I'll let someone else who was around for naming discuss whether
it was intentional or not. Conceptually, the way that bounded vs streaming
are handled means that they are doing slightly different things: a bounded
source is really kind of creating physical chunks of the data, whereas the
streaming source is creating conceptual divisions of the data that will be
used later. I'm not sure that's worth the confusion caused by the
differences.
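
As an illustration of the "conceptual divisions" case, here is a small
sketch. It assumes a hypothetical PartitionedStream class whose data
arrives on numbered partitions; the class, its divide method, and the
round-robin assignment are all assumptions made for the example, not Beam
APIs. No data is chunked at split time: each sub-source only records which
partitions it will read later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical unbounded-style source; PartitionedStream is not a Beam class. */
class PartitionedStream {
    /** Ids of the stream partitions this source will read from later. */
    final List<Integer> partitions;

    PartitionedStream(List<Integer> partitions) {
        this.partitions = partitions;
    }

    /** Assigns partition ids round-robin to at most desiredNumSplits sub-sources. */
    List<PartitionedStream> divide(int desiredNumSplits) {
        // Never create more sub-sources than partitions, and always at least one.
        int n = Math.max(1, Math.min(desiredNumSplits, partitions.size()));
        List<List<Integer>> groups = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            groups.add(new ArrayList<>());
        }
        for (int i = 0; i < partitions.size(); i++) {
            groups.get(i % n).add(partitions.get(i));
        }
        List<PartitionedStream> subSources = new ArrayList<>();
        for (List<Integer> group : groups) {
            subSources.add(new PartitionedStream(group));
        }
        return subSources;
    }

    public static void main(String[] args) {
        PartitionedStream stream = new PartitionedStream(Arrays.asList(0, 1, 2, 3, 4));
        for (PartitionedStream sub : stream.divide(2)) {
            System.out.println(sub.partitions);
        }
    }
}
```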

One thing to clarify - splitIntoBundles does have an "Initial" aspect to
it. I don't believe there is a publicly defined/written down order the
Sources & Reader methods are called in, but a runner trying to get
efficiency would be able to use splitIntoBundles during job startup to be
able to split up the work before creating readers rather than after
creating readers and waiting to use splitAtFraction.
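
A rough sketch of the two timings, using toy stand-ins rather than Beam
classes: the Range and Reader types below, and their splitUpFront and
splitAtFraction methods, are hypothetical and only mimic the idea. The
first option divides the work before any reader exists; the second lets a
reader that is already running give away its unread tail.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-ins for illustration only; none of these are Beam classes. */
class SplitTimingSketch {
    /** A half-open range [start, end) standing in for a bounded source. */
    static class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }

        /** Up-front split into roughly equal parts, done before any reader exists. */
        List<Range> splitUpFront(int parts) {
            List<Range> out = new ArrayList<>();
            long size = end - start;
            for (int i = 0; i < parts; i++) {
                out.add(new Range(start + size * i / parts, start + size * (i + 1) / parts));
            }
            return out;
        }
    }

    /** A reader that is already consuming a range and can hand off its tail. */
    static class Reader {
        final long start;
        long end;
        long next; // position of the next unread element

        Reader(Range range) {
            this.start = range.start;
            this.end = range.end;
            this.next = range.start;
        }

        boolean advance() {
            if (next >= end) {
                return false;
            }
            next++;
            return true;
        }

        /** Returns the unread tail beyond the given fraction, or null if the split is refused. */
        Range splitAtFraction(double fraction) {
            long splitPos = start + (long) Math.floor((end - start) * fraction);
            if (splitPos < next || splitPos >= end) {
                return null; // would give away data already read, or nothing at all
            }
            Range residual = new Range(splitPos, end);
            end = splitPos; // this reader keeps only [start, splitPos)
            return residual;
        }
    }

    public static void main(String[] args) {
        Range source = new Range(0, 100);

        // Option 1: split at job startup, then create one reader per part.
        List<Range> initial = source.splitUpFront(4);
        System.out.println("up-front parts: " + initial.size());

        // Option 2: start a single reader and split it while it is running.
        Reader reader = new Reader(source);
        for (int i = 0; i < 10; i++) {
            reader.advance();
        }
        Range residual = reader.splitAtFraction(0.5);
        System.out.println("residual: [" + residual.start + ", " + residual.end + ")");
    }
}
```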

S

On Sun, Jan 8, 2017 at 6:06 AM Stas Levin <[email protected]> wrote:

Hi,

A short terminology question regarding "bundle", and
particularly splitIntoBundles vs. generateInitialSplits.

In *BoundedSource* we have:
List<? extends BoundedSource<T>> *splitIntoBundles*(...)

In *UnboundedSource* we have:
List<? extends UnboundedSource<OutputT, CheckpointMarkT>>
*generateInitialSplits*(...)

I was wondering if the names were intentionally made different, i.e. "into
bundles" vs "into splits"?
In a way these two methods carry out a very similar task; would it be
reasonable to think of *splitIntoBundles* as *generate*Initial*Splits*?
(strikethrough due to "initial" not being applicable in the case of bounded
sources)

Regards,
Stas


--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com
