Hi Eugene and Stas,
I'm just back from a couple of days off and jumping into this discussion.
I agree with Stas: it's worth creating a Jira ticket for this. The only
"semantic" difference is unbounded vs bounded source, but the behavior
is the same.
Regards
JB
On 01/11/2017 04:26 PM, Stas Levin wrote:
Eugene, that makes a lot of sense to me.
Do you think it's worth filing a Jira ticket?
On Wed, Jan 11, 2017 at 2:58 AM Eugene Kirpichov
<kirpic...@google.com.invalid> wrote:
I agree that the methods are named somewhat confusingly, and ideally would
be named the same. Both of the names miss some aspect of the underlying
concept.
The underlying concept is to split the source into smaller sub-sources such
that reading all of them yields the same data as reading the original one.
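To make that invariant concrete, here is a minimal sketch. RangeSource is a hypothetical toy class, not a Beam class, and the size-based split parameter is an assumption made purely for illustration. It only shows that reading every sub-source gives back exactly what the original source would have produced.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy source that "reads" the integers in [start, end). Not a Beam class. */
final class RangeSource {
  final long start; // inclusive
  final long end;   // exclusive

  RangeSource(long start, long end) {
    if (start > end) {
      throw new IllegalArgumentException("start must not exceed end");
    }
    this.start = start;
    this.end = end;
  }

  /** Splits this source into contiguous sub-sources of at most desiredSize elements each. */
  List<RangeSource> split(long desiredSize) {
    if (desiredSize <= 0) {
      throw new IllegalArgumentException("desiredSize must be positive");
    }
    List<RangeSource> subSources = new ArrayList<>();
    for (long s = start; s < end; s += desiredSize) {
      subSources.add(new RangeSource(s, Math.min(s + desiredSize, end)));
    }
    return subSources;
  }

  /** Reads every element of this source. */
  List<Long> read() {
    List<Long> out = new ArrayList<>();
    for (long i = start; i < end; i++) {
      out.add(i);
    }
    return out;
  }

  /** Reads the given sub-sources in order and concatenates their elements. */
  static List<Long> readAll(List<RangeSource> sources) {
    List<Long> out = new ArrayList<>();
    for (RangeSource source : sources) {
      out.addAll(source.read());
    }
    return out;
  }
}
```

For any RangeSource, RangeSource.readAll(source.split(n)) should equal source.read(); that equality is the property a split method has to preserve regardless of what it ends up being called.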
"splitIntoBundles" assumes that 1 source = 1 bundle, which is completely
false in streaming, and only partially true in batch (I'm talking about the
Dataflow runner).
"generateInitialSplits" assumes that this splitting happens only
"initially", i.e. at job startup time. This is currently true in practice
for all existing runners, but it doesn't have to be - we could conceivably
call it again at some point during the job if we see that some of the
sub-sources are still too large.
The analogous method in Splittable DoFn (
https://s.apache.org/splittable-do-fn) is called @SplitRestriction, but
there are no restrictions in the Source API, only sources.
Perhaps both should be called simply "split", or "splitIntoSubSources".
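As a rough illustration of what a common name could look like, here is a sketch with a single "split" method shared by a bounded and an unbounded toy source. Everything in it is hypothetical: SplittableSource, ToyBoundedSource, ToyUnboundedSource and the desiredNumSubSources parameter are made up for this example and are not the Beam classes or their real signatures.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical common interface; not part of the Beam SDK. */
interface SplittableSource<S extends SplittableSource<S>> {
  /** One name for both cases: break this source into smaller sub-sources. */
  List<S> split(int desiredNumSubSources);
}

/** Toy bounded source over offsets [start, end): splitting yields contiguous chunks. */
final class ToyBoundedSource implements SplittableSource<ToyBoundedSource> {
  final long start;
  final long end;

  ToyBoundedSource(long start, long end) {
    this.start = start;
    this.end = end;
  }

  @Override
  public List<ToyBoundedSource> split(int n) {
    if (n <= 0) {
      throw new IllegalArgumentException("n must be positive");
    }
    List<ToyBoundedSource> result = new ArrayList<>();
    long size = end - start;
    for (int i = 0; i < n; i++) {
      long from = start + size * i / n;
      long to = start + size * (i + 1) / n;
      if (from < to) { // skip empty chunks when n exceeds the number of offsets
        result.add(new ToyBoundedSource(from, to));
      }
    }
    return result;
  }
}

/** Toy unbounded source: a set of never-ending partitions, assigned round-robin on split. */
final class ToyUnboundedSource implements SplittableSource<ToyUnboundedSource> {
  final List<Integer> partitions;

  ToyUnboundedSource(List<Integer> partitions) {
    this.partitions = new ArrayList<>(partitions);
  }

  @Override
  public List<ToyUnboundedSource> split(int n) {
    if (n <= 0) {
      throw new IllegalArgumentException("n must be positive");
    }
    List<List<Integer>> groups = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      groups.add(new ArrayList<>());
    }
    for (int i = 0; i < partitions.size(); i++) {
      groups.get(i % n).add(partitions.get(i));
    }
    List<ToyUnboundedSource> result = new ArrayList<>();
    for (List<Integer> group : groups) {
      if (!group.isEmpty()) {
        result.add(new ToyUnboundedSource(group));
      }
    }
    return result;
  }
}
```

The bounded toy carves up a finite offset range while the unbounded toy only divides responsibility for partitions, yet both fit the same method name without mentioning bundles or "initial".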
On Mon, Jan 9, 2017 at 2:12 PM Stas Levin <stasle...@gmail.com> wrote:
Definitely seems like the formatting got lost in translation, sorry about
that :)
I guess both cases (methods) create splits, which are essentially a list of
bounded/unbounded source instances, each responsible for reading certain
segments (physical or otherwise) of the data.
On Mon, Jan 9, 2017 at 11:51 PM Stephen Sisk <s...@google.com.invalid>
wrote:
hi!
I think your strikethrough got lost due to this being a text-only email
list. To make sure, I think you're asking the following:
"would it be reasonable to think of splitIntoBundles as generateSplits?"
(i.e., you struck through "Initial")
They are very similar and I definitely also think of them as occupying the
same niche. I'll let someone else who was around for naming discuss whether
it was intentional or not. Conceptually, the way that bounded vs streaming
are handled means that they are doing slightly different things: a bounded
source is really kind of creating physical chunks of the data, whereas the
streaming source is creating conceptual divisions of the data that will be
used later. I'm not sure that's worth the confusion caused by the
differences.
One thing to clarify - splitIntoBundles does have an "Initial" aspect to
it. I don't believe there is a publicly defined/written down order in which
the Source & Reader methods are called, but a runner aiming for efficiency
could use splitIntoBundles during job startup to split up the work before
creating readers, rather than creating readers first and then relying on
splitAtFraction.
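A toy sketch of the two paths, in case it helps. OffsetSource and OffsetReader are hypothetical stand-ins, not the Beam classes, and the parameters and return conventions here (a size argument, null on a failed split, -1 at end of input) are assumptions of this example only.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy source over offsets [start, end); not a Beam class. */
final class OffsetSource {
  final long start; // inclusive
  final long end;   // exclusive

  OffsetSource(long start, long end) {
    this.start = start;
    this.end = end;
  }

  /** Up-front path: split the work before any reader exists. */
  List<OffsetSource> splitIntoBundles(long desiredSize) {
    if (desiredSize <= 0) {
      throw new IllegalArgumentException("desiredSize must be positive");
    }
    List<OffsetSource> bundles = new ArrayList<>();
    for (long s = start; s < end; s += desiredSize) {
      bundles.add(new OffsetSource(s, Math.min(s + desiredSize, end)));
    }
    return bundles;
  }

  OffsetReader createReader() {
    return new OffsetReader(this);
  }
}

/** Hypothetical toy reader that can hand off the unread tail of its range. */
final class OffsetReader {
  private final long start;
  private long end;  // shrinks after a successful split
  private long next; // next offset to return

  OffsetReader(OffsetSource source) {
    this.start = source.start;
    this.end = source.end;
    this.next = source.start;
  }

  /** Returns the next offset, or -1 once the (possibly shrunk) range is exhausted. */
  long advance() {
    return next < end ? next++ : -1;
  }

  /**
   * Late path: split at a fraction of the reader's current range. On success the reader
   * keeps [start, splitPos) and the returned source covers [splitPos, end). Returns null
   * if the split position was already read or either side would be empty.
   */
  OffsetSource splitAtFraction(double fraction) {
    long splitPos = start + (long) Math.floor(fraction * (end - start));
    if (splitPos <= start || splitPos < next || splitPos >= end) {
      return null;
    }
    OffsetSource residual = new OffsetSource(splitPos, end);
    end = splitPos;
    return residual;
  }
}
```

In this toy, the up-front split always succeeds because nothing has been read yet, whereas the late split can be refused once the reader has moved past the requested position.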
S
On Sun, Jan 8, 2017 at 6:06 AM Stas Levin <stasle...@gmail.com> wrote:
Hi,
A short terminology question regarding "bundle", and
particularly splitIntoBundles vs. generateInitialSplits.
In *BoundedSource* we have:
List<? extends BoundedSource<T>> *splitIntoBundles*(...)
In *UnboundedSource* we have:
List<? extends UnboundedSource<OutputT, CheckpointMarkT>> *generateInitialSplits*(...)
I was wondering if the names were intentionally made different, i.e. "into
bundles" vs "into splits"?
In a way these two methods carry out a very similar task. Would it be
reasonable to think of *splitIntoBundles* as *generateInitialSplits*, with
"Initial" struck through?
(The strikethrough is because "initial" does not apply to bounded sources.)
Regards,
Stas
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com