Hi all,
I'm trying to get a better understanding of Beam's internals for the sake
of integration with the Euphoria API as a DSL ([1]). While trying to wrap
Euphoria's output abstractions, I came across an issue that I'm
currently a little stuck on. The issue itself is not important to
this question, but it basically boils down to the following: how could I
write a Pipeline that works like the terasort benchmark ([2])? That is,
I have a randomly distributed dataset (let's assume the batch case for
simplicity), and I want to sort it so that on output I will have N
totally sorted partitions. This implies that I can somehow compare the
partitions (or partition IDs) on output, so that the following holds:
For any two partitions X and Y, if partition X is less than partition Y,
then all elements in partition X are less than or equal to all elements
in partition Y.
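
To make the property above concrete, here is a minimal sketch in plain
Python (not Beam code). The function name is made up for illustration,
and it assumes the partitions are given as in-memory lists already
ordered by partition ID.

```python
def is_totally_sorted(partitions):
    """Return True if the partitions, taken in partition-ID order,
    form one globally sorted sequence."""
    previous_max = None
    for part in partitions:
        # Each partition must be sorted internally.
        if any(a > b for a, b in zip(part, part[1:])):
            return False
        if part:
            # The smallest element here must not be below the largest
            # element seen in any earlier partition.
            if previous_max is not None and part[0] < previous_max:
                return False
            previous_max = part[-1]
    return True
```

Empty partitions are allowed and simply skipped by the check.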
So far, I have not been able to find a clean solution in Beam. I can do
a group-by-key operation (where the *key* would be partition Id), and
then sort the data within the key. But I have issues outputting the
sorted data by a ParDo (because it can run in parallel in theory, and
therefore I can either lose the sorting, or run into concurrency issues).
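
The group-by-key idea can be sketched roughly as follows, again in plain
Python rather than Beam. This is only an illustration: assigning
partition IDs by range partitioning over pre-chosen split points is an
assumption here, and both function names are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def assign_partition(value, split_points):
    # Assumption: partition IDs come from range partitioning.
    # split_points must be sorted; N partitions need N - 1 split points.
    return bisect_right(split_points, value)


def sort_into_partitions(values, split_points):
    # "Group by key", where the key is the partition ID.
    groups = defaultdict(list)
    for value in values:
        groups[assign_partition(value, split_points)].append(value)
    # Sort within each key and return the partitions in ID order.
    partition_count = len(split_points) + 1
    return [sorted(groups.get(pid, [])) for pid in range(partition_count)]
```

For example, sort_into_partitions([5, 1, 9, 3, 7, 3], [3, 7]) yields
three partitions that satisfy the ordering property. What this sketch
does not address is the open part of the question: how to emit each
sorted group from a ParDo without losing its order.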
Would anyone have an idea about how to do this?
Thanks for any comments,
Jan
[1] https://github.com/seznam/euphoria
[2] https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/examples/terasort/package-summary.html