Hi all,

I'm trying to get a better understanding of Beam's internals in order to integrate the Euphoria API as a DSL ([1]). While trying to wrap Euphoria's output abstractions, I came across an issue that I'm currently stuck on. The details are not important to this question, but it basically boils down to the following: how could I write a Pipeline that works like the terasort benchmark ([2])? That is, I have a randomly distributed dataset (let's assume the batch case for simplicity), and I want to sort it so that the output consists of N totally sorted partitions. This implies that I can somehow compare the partitions (or partition IDs) on output, so that the following holds: for any two partitions X and Y, if partition X is less than partition Y, then every element of partition X is less than or equal to every element of partition Y.

So far, I have not been able to find a clean solution in Beam. I can do a group-by-key operation (where the *key* would be the partition ID) and then sort the data within each key. But I have trouble outputting the sorted data from a ParDo, because in theory it can run in parallel, so I can either lose the ordering or run into concurrency issues.
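
To make the requirement concrete, here is a minimal sketch in plain Python (deliberately not Beam code) of the range-partitioning scheme I have in mind. The function names, the sampling step, and the parameters are only illustrative assumptions; none of them come from Beam, Euphoria, or the Hadoop terasort example.

```python
import bisect
import random


def choose_split_points(sample, num_partitions):
    """Pick up to num_partitions - 1 ascending boundaries from a sample."""
    ordered = sorted(sample)
    if not ordered:
        return []
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def partition_id(element, split_points):
    """Map an element to its range partition with a binary search.

    Elements equal to a boundary go to the higher partition.
    """
    return bisect.bisect_right(split_points, element)


def total_sort(data, num_partitions, sample_size=100, seed=0):
    """Return num_partitions sorted lists, ordered by partition ID."""
    rng = random.Random(seed)
    sample = rng.sample(data, min(sample_size, len(data)))
    splits = choose_split_points(sample, num_partitions)

    # "Group by key", where the key is the partition ID.
    partitions = [[] for _ in range(num_partitions)]
    for element in data:
        partitions[partition_id(element, splits)].append(element)

    # Sort within each key; ordering across keys follows from the splits.
    return [sorted(p) for p in partitions]


if __name__ == "__main__":
    values = [random.randint(0, 10_000) for _ in range(1_000)]
    parts = total_sort(values, num_partitions=4)
    for pid, part in enumerate(parts):
        print(pid, len(part), part[:3])
```

Concatenating the partitions in order of partition ID gives the fully sorted dataset, which is exactly the property I would like the output of the Pipeline to preserve.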

Would anyone have an idea about how to do this?

Thanks for any comments,

 Jan

[1] https://github.com/seznam/euphoria

[2] https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/examples/terasort/package-summary.html
