Re: Ordered PCollections eventually?

Jan Lukavský Tue, 11 May 2021 09:19:35 -0700

I'll just note that Beam already supports the (experimental) @RequiresTimeSortedInput annotation, which has several limitations: mostly that it orders only by timestamp and not by some time-related user field, and of course it is missing retractions. Arbitrary sorting seems hard, even per key; it seems it will always have to be somewhat time-bounded, as otherwise it might require unbounded state. The batch case, on the other hand, typically has a way to order inputs arbitrarily at virtually zero cost, as many implementations use sort-merge-group to perform reduction operations.
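
To make the "time-bounded" point concrete, here is a minimal plain-Python sketch of a per-key sort buffer that releases elements in timestamp order only once a watermark has passed them. It is not Beam code and not how any runner implements @RequiresTimeSortedInput; the class name, the allowed_lateness parameter, and the drop-late-elements policy are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict


class PerKeyTimeSortBuffer:
    """Hypothetical helper: buffers elements per key and emits them sorted by
    timestamp once the watermark says nothing earlier can still arrive."""

    def __init__(self, allowed_lateness=0):
        self.allowed_lateness = allowed_lateness
        self._heaps = defaultdict(list)  # key -> min-heap of (timestamp, seq, value)
        self._seq = 0                    # tie-breaker so values are never compared
        self._watermark = float("-inf")

    def add(self, key, timestamp, value):
        # An element at or behind the release point can no longer be emitted
        # in order, so this sketch simply drops it (an assumed policy).
        if timestamp <= self._watermark - self.allowed_lateness:
            return False
        heapq.heappush(self._heaps[key], (timestamp, self._seq, value))
        self._seq += 1
        return True

    def advance_watermark(self, watermark):
        """Return {key: [(timestamp, value), ...]} for everything safe to emit."""
        self._watermark = max(self._watermark, watermark)
        cutoff = self._watermark - self.allowed_lateness
        released = {}
        for key in list(self._heaps):
            heap = self._heaps[key]
            out = []
            while heap and heap[0][0] <= cutoff:
                ts, _, value = heapq.heappop(heap)
                out.append((ts, value))
            if out:
                released[key] = out
            if not heap:
                del self._heaps[key]  # state is freed as the watermark advances
        return released

    def buffered(self):
        """Number of elements currently held in state across all keys."""
        return sum(len(heap) for heap in self._heaps.values())
```

The state held per key is bounded by how far elements can run ahead of the watermark. Without such a bound (that is, sorting by an arbitrary field with no watermark-like signal), nothing could ever be released safely and the buffer would grow without limit.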

Jan


On 5/11/21 5:56 PM, Kenneth Knowles wrote:

Per-key ordered delivery makes a ton of sense. I'd guess CDC has the same needs as retractions, so that the changelog can be applied in order as it arrives. And since it is per-key you still get parallelism.
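
As a rough illustration of why per-key ordering is enough for a changelog, the plain-Python sketch below shards events by key, orders each key's changes independently, and applies them. The event shape (key, sequence number, op, value), the function name, and the sharding scheme are assumptions for illustration only, not part of any Beam API or of the proposal mentioned in this thread.

```python
from collections import defaultdict


def apply_changelog_per_key(events, num_shards=2):
    """Apply (key, seq, op, value) change events, ordering only within each key.

    op is "upsert" or "delete"; seq is an assumed per-key sequence number.
    """
    # Shard by key: every change for a given key lands in the same shard,
    # so shards never need to coordinate with one another.
    shards = [defaultdict(list) for _ in range(num_shards)]
    for key, seq, op, value in events:
        shards[hash(key) % num_shards][key].append((seq, op, value))

    table = {}
    for shard in shards:  # each shard could run on a different worker
        for key, changes in shard.items():
            # Ordering is only required among the changes of a single key.
            for _, op, value in sorted(changes, key=lambda change: change[0]):
                if op == "delete":
                    table.pop(key, None)
                else:
                    table[key] = value
    return table
```

The final table depends only on the order of changes within each key, so the arrival order across keys, and the number of shards, can vary freely.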
Global ordering is quite different. I know that SQL and DataFrames have global sorting operations. The question has always been how "embarrassingly parallel" processing interacts with sorting/ordering. I imagine some other systems have these features, so we can look at how they are used?

Kenn
On Mon, May 10, 2021 at 4:39 PM Sam Rohde <sro...@google.com<mailto:sro...@google.com>> wrote:
    Awesome, thanks Pablo!

    On Mon, May 10, 2021 at 4:05 PM Pablo Estrada <pabl...@google.com
    <mailto:pabl...@google.com>> wrote:

        CDC would also benefit. I am working on a proposal for this
        that is concerned with streaming pipelines, and per-key
        ordered delivery. I will share with you as soon as I have a
        draft.
        Best
        -P.

        On Mon, May 10, 2021 at 2:56 PM Reuven Lax <re...@google.com
        <mailto:re...@google.com>> wrote:

            There has been talk, but nothing concrete.

            On Mon, May 10, 2021 at 1:42 PM Sam Rohde
            <sro...@google.com <mailto:sro...@google.com>> wrote:

                Hi All,

                I was wondering if there had been any plans for
                creating ordered PCollections in the Beam model? Or if
                there might be plans for them in the future?

                I know that Beam SQL and Beam DataFrames would
                directly benefit from an ordered PCollection.

                Regards,
                Sam
