Terasort-like pipeline

Jan Lukavský Wed, 19 Jul 2017 02:46:42 -0700

Hi all,

I'm trying to get a better understanding of Beam's internals for the sake of integration with the Euphoria API as a DSL ([1]), and while trying to wrap Euphoria's abstractions of outputs, I came across a little issue that I'm currently a little stuck with. The issue is not important to this question, but it basically boils down to the following: how could I write a Pipeline that works like the terasort benchmark ([2])? That is, I have a randomly distributed dataset (let's suppose the batch case for simplicity), and I want to sort it so that on output I will have N totally sorted partitions. This implies that I can somehow compare the partitions (or partition IDs) on output, so that the following holds: for each pair of partitions X and Y, if partition X is less than partition Y, then all elements in partition X are less than or equal to all elements in partition Y.

So far, I have not been able to find a clean solution in Beam. I can do a group-by-key operation (where the *key* would be the partition ID), and then sort the data within each key. But I have issues outputting the sorted data from a ParDo (because it can run in parallel in theory, and therefore I can either lose the sorting or run into concurrency issues).
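
To make the goal concrete, here is a minimal sketch in plain Python (not Beam code) of the partition-then-sort idea described above. It assumes the partition ID comes from split points chosen on a random sample, which is my reading of how terasort-style range partitioning works; the function names, the sample size and the seed are all hypothetical and only serve the illustration.

```python
import bisect
import random
from collections import defaultdict


def choose_split_points(sample, num_partitions):
    """Pick num_partitions - 1 boundaries from a sample of the data."""
    ordered = sorted(sample)
    if not ordered:
        return []  # no data: everything falls into partition 0
    step = len(ordered) / num_partitions
    return [ordered[int(i * step)] for i in range(1, num_partitions)]


def partition_id(element, split_points):
    # Elements equal to a split point go to the higher partition, so
    # partition p only holds x with split[p-1] <= x < split[p].
    return bisect.bisect_right(split_points, element)


def range_partition_sort(data, num_partitions, sample_size=100, seed=0):
    """Return num_partitions lists, each sorted, ordered by partition ID."""
    rng = random.Random(seed)  # fixed seed (assumption) for repeatability
    sample = rng.sample(data, min(sample_size, len(data)))
    split_points = choose_split_points(sample, num_partitions)

    # The "group-by-key" step: the key is the partition ID.
    groups = defaultdict(list)
    for element in data:
        groups[partition_id(element, split_points)].append(element)

    # The "sort within the key" step.
    return [sorted(groups[p]) for p in range(num_partitions)]


if __name__ == "__main__":
    values = [random.randint(0, 1000) for _ in range(50)]
    for pid, part in enumerate(range_partition_sort(values, 4)):
        print(pid, part)
```

Concatenating the partitions in order of partition ID gives the fully sorted dataset, which is exactly the property I am after. The sketch says nothing about the part I am stuck on, namely writing each sorted group to its output from a ParDo without losing the order.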


Would anyone have an idea about how to do this?

Thanks for any comments,

 Jan

[1] https://github.com/seznam/euphoria

[2] https://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/examples/terasort/package-summary.html
