Source split consistency in distributed environment

Shen Li Mon, 26 Mar 2018 11:33:16 -0700

Hi,

Does the split API in Bounded/UnboundedSource guarantee to return the same
result if invoked in different parallel instances in a distributed
environment?


For example, assume the original source can split into 3 sub-sources. Say
the runner creates 3 parallel source operator instances (perhaps running in
different servers) and uses each instance to handle 1 of the 3 sub-sources.
In this case, if each operator instance invokes the split method in a
distributed manner, will they get the same split result?

My understanding is that the current API does not guarantee the 3 operator
instances will receive the same split result. It is possible that 1 of the
3 instances receive 4 sub-sources and the other two receives 3. Or, even if
they all get 3 sub-sources, there could be gaps and overlaps in the data
streams. If so, shall we add an API to indicate that whether a source can
split at runtime?

One solution is to avoid this problem is to split the source at translation
time and directly pass sub-sources to operator instances. But this is not
ideal. The server runs the translation might not have access to the source
(DB, KV, MQ, etc). Or the application may want to dynamically change the
source parallel width at runtime. Hence, the runner/engine sometimes have
to split the source during runtime in a distributed environment.

Thanks,
Shen

Source split consistency in distributed environment

Reply via email to