Hi Lukasz, Thanks for your response.
> Each call to split may return a different set of sub sources but they always represent the entire original source. Inconsistent sets of sub-sources prevent runners/engines from calling the split API in a distributed manner during runtime. Besides, the splitAtFraction(double fraction) is only available in BoundedSources. How do you perform dynamic splitting for UnboundedSources? Another question: will Source transforms eventually become deprecated and be replaced by the SplittableParDo? Thanks, Shen On Mon, Mar 26, 2018 at 3:41 PM, Lukasz Cwik <lc...@google.com> wrote: > Contractually, the sources returned by splitting must represent the > original source. Each call to split may return a different set of sub > sources but they always represent the entire original source. > > Note that Dataflow does call split effectively during translation and then > later calls APIs on sources to perform dynamic splitting[1]. > > Note, that this is being replaced with SplittableDoFn. Worthwhile to look > at this doc[2] and presentation[3]. > > 1: https://github.com/apache/beam/blob/28665490f6ad0cad091f4f936a8f11 > 3617fd3f27/sdks/java/core/src/main/java/org/apache/beam/sdk/ > io/BoundedSource.java#L387 > 2: https://s.apache.org/splittable-do-fn > 3: https://conferences.oreilly.com/strata/strata-ca/ > public/schedule/detail/63696 > > > > On Mon, Mar 26, 2018 at 11:33 AM Shen Li <cs.she...@gmail.com> wrote: > >> Hi, >> >> Does the split API in Bounded/UnboundedSource guarantee to return the >> same result if invoked in different parallel instances in a distributed >> environment? >> >> For example, assume the original source can split into 3 sub-sources. Say >> the runner creates 3 parallel source operator instances (perhaps running in >> different servers) and uses each instance to handle 1 of the 3 sub-sources. >> In this case, if each operator instance invokes the split method in a >> distributed manner, will they get the same split result? >> >> My understanding is that the current API does not guarantee the 3 >> operator instances will receive the same split result. It is possible that >> 1 of the 3 instances receive 4 sub-sources and the other two receives 3. >> Or, even if they all get 3 sub-sources, there could be gaps and overlaps in >> the data streams. If so, shall we add an API to indicate that whether a >> source can split at runtime? >> >> One solution is to avoid this problem is to split the source at >> translation time and directly pass sub-sources to operator instances. But >> this is not ideal. The server runs the translation might not have access to >> the source (DB, KV, MQ, etc). Or the application may want to dynamically >> change the source parallel width at runtime. Hence, the runner/engine >> sometimes have to split the source during runtime in a distributed >> environment. >> >> Thanks, >> Shen >> >>