I'll just note that Beam already supports the (experimental)
@RequiresTimeSortedInput annotation, which has several limitations: it
orders only by element timestamp, not by some time-related user field,
and of course it lacks retractions. Arbitrary sorting seems hard, even
per key. It seems it will always have to be somewhat time-bounded, as
otherwise it might require unbounded state. The batch case, on the other
hand, typically has a way to order inputs arbitrarily at virtually zero
cost, as many implementations use sort-merge-group to perform reduction
operations.
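
To make the time-bounded point concrete, here is a minimal sketch in plain
Python. It is not Beam API and not how any runner implements
@RequiresTimeSortedInput; the class name PerKeyTimeSorter, its methods, and
the allowed_lateness parameter are all hypothetical. It buffers elements per
key and releases them in timestamp order only once a watermark has passed, so
state stays bounded by how far the watermark lags behind the input.

```python
import heapq
from collections import defaultdict
from itertools import count


class PerKeyTimeSorter:
    """Hypothetical helper: per-key buffer that emits in timestamp order."""

    def __init__(self, allowed_lateness=0):
        self.allowed_lateness = allowed_lateness
        self._buffers = defaultdict(list)  # key -> min-heap of (ts, seq, value)
        self._seq = count()                # tie-breaker, so values are never compared
        self._watermark = float("-inf")

    def add(self, key, timestamp, value):
        """Buffer one element. Returns False if it can no longer be ordered."""
        if timestamp <= self._watermark - self.allowed_lateness:
            return False  # everything up to the cutoff was already released
        heapq.heappush(self._buffers[key], (timestamp, next(self._seq), value))
        return True

    def advance_watermark(self, watermark):
        """Release buffered elements up to the cutoff, sorted within each key."""
        self._watermark = max(self._watermark, watermark)
        cutoff = self._watermark - self.allowed_lateness
        released = []
        for key, heap in list(self._buffers.items()):
            while heap and heap[0][0] <= cutoff:
                ts, _, value = heapq.heappop(heap)
                released.append((key, ts, value))
            if not heap:
                del self._buffers[key]  # drop empty state for this key
        return released

    def buffered(self):
        """Number of elements currently held in state."""
        return sum(len(heap) for heap in self._buffers.values())


sorter = PerKeyTimeSorter()
sorter.add("a", 3, "x3")
sorter.add("a", 1, "x1")
sorter.add("b", 2, "y2")
sorter.add("a", 2, "x2")
first = sorter.advance_watermark(2)
```

Without the watermark cutoff, the buffer for a key could never be flushed
safely, which is the unbounded-state problem described above.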
Jan
On 5/11/21 5:56 PM, Kenneth Knowles wrote:
Per-key ordered delivery makes a ton of sense. I'd guess CDC has the
same needs as retractions, so that the changelog can be applied in
order as it arrives. And since it is per-key you still get parallelism.
Global ordering is quite different. I know that SQL and DataFrames
have global sorting operations. The question has always been how
"embarrassingly parallel" processing interacts with sorting/ordering. I
imagine some other systems have these features, so we can look at how
they are used?
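
As a rough illustration of per-key ordered delivery for a changelog, here is
a small sketch in plain Python. It is not Beam API, and it assumes something
the thread does not state: that each change carries a per-key sequence number
starting at 0. The class PerKeyChangelogApplier and the "upsert"/"delete"
operation names are hypothetical.

```python
from collections import defaultdict


class PerKeyChangelogApplier:
    """Hypothetical helper: applies changes to a table in per-key order."""

    VALID_OPS = ("upsert", "delete")

    def __init__(self):
        self.table = {}
        self._next_seq = defaultdict(int)  # key -> next sequence number expected
        self._pending = defaultdict(dict)  # key -> {seq: (op, value)}

    def receive(self, key, seq, op, value=None):
        """Accept a change in any arrival order; apply it once it is next in line."""
        if op not in self.VALID_OPS:
            raise ValueError(f"unknown operation: {op!r}")
        if seq < self._next_seq[key]:
            return  # already applied, treat as a duplicate
        self._pending[key][seq] = (op, value)
        self._drain(key)

    def _drain(self, key):
        """Apply all consecutive pending changes for this key only."""
        pending = self._pending[key]
        while self._next_seq[key] in pending:
            op, value = pending.pop(self._next_seq[key])
            if op == "upsert":
                self.table[key] = value
            else:  # "delete"
                self.table.pop(key, None)
            self._next_seq[key] += 1


applier = PerKeyChangelogApplier()
applier.receive("u1", 1, "upsert", "v1")  # held back: seq 0 for u1 not seen yet
held_back = dict(applier.table)
applier.receive("u2", 0, "upsert", "w0")  # other keys are not blocked by u1
applier.receive("u1", 0, "upsert", "v0")  # unblocks seq 0 and then seq 1
after_catch_up = dict(applier.table)
applier.receive("u1", 2, "delete")
```

Ordering is tracked independently per key, so a gap in one key's changelog
never stalls another key, which is where the parallelism comes from.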
Kenn
On Mon, May 10, 2021 at 4:39 PM Sam Rohde wrote:
Awesome, thanks Pablo!
On Mon, May 10, 2021 at 4:05 PM Pablo Estrada wrote:
CDC would also benefit. I am working on a proposal for this
that is concerned with streaming pipelines, and per-key
ordered delivery. I will share with you as soon as I have a
draft.
Best
-P.
On Mon, May 10, 2021 at 2:56 PM Reuven Lax wrote:
There has been talk, but nothing concrete.
On Mon, May 10, 2021 at 1:42 PM Sam Rohde wrote:
Hi All,
I was wondering if there had been any plans for
creating ordered PCollections in the Beam model? Or if
there might be plans for them in the future?
I know that Beam SQL and Beam DataFrames would
directly benefit from an ordered PCollection.
Regards,
Sam