Apache Beam has no concept of an ordered PCollection. The most common
solution is to pair each element with one of N dummy keys, sort the values
for each key yourself in a custom combiner, and then partition the output
by dummy key.

Data -> PairWithNumberIn[0,N] -> Combine(sort values using custom combiner)
-> PartitionByKey --> WriteForKey0
                  \-> WriteForKey1
                  ...
                  \-> WriteForKeyN

Note that if you want to write all the data to a single file, your combiner
will run into memory issues and perform poorly, since there will be only a
single sorter and a single writer.
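
To make the dummy-key idea concrete, here is a rough sketch in plain Python
that simulates the three steps locally. It does not use the Beam API. The
function names and the range-based split points (`boundaries`) are my own
assumptions for illustration; range-based keys are one way to make the
per-key outputs line up into a global order when read in key order.

```python
from bisect import bisect_right
from collections import defaultdict


def assign_key(value, boundaries):
    """Return the dummy key (shard index) for a value.

    `boundaries` is a sorted list of split points chosen by the caller
    (hypothetical; not something Beam provides). Values below
    boundaries[0] get key 0, the next range gets key 1, and so on.
    """
    return bisect_right(boundaries, value)


def sort_by_dummy_key(values, boundaries):
    """Locally simulate: pair with key -> combine (sort) per key -> partition."""
    shards = defaultdict(list)
    for value in values:
        shards[assign_key(value, boundaries)].append(value)
    # Each shard is sorted on its own, as a per-key combiner would do.
    return {key: sorted(shard) for key, shard in sorted(shards.items())}


shards = sort_by_dummy_key([42, 7, 19, 3, 88, 61, 25], boundaries=[20, 50])

# Reading the shards in key order gives a globally sorted sequence.
merged = [value for key in sorted(shards) for value in shards[key]]
```

Each shard stands in for one WriteForKey branch. With a single shard (no
split points) everything lands in one list, which is the single-sorter case
described above.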

There have been some previous discussions [1] about ordering with stricter
constraints than general ordering. They may apply to your use case and
would be worth a look.

1: https://lists.apache.org/list.html?u...@beam.apache.org:lte=18M:ordering

On Wed, Jul 18, 2018 at 11:46 AM Allie Chen <yifangc...@google.com> wrote:

> Greetings!
>
> I have a quick question. Is there an OrderedPCollection concept in Python
> SDK? Say I have a PCollection of objects that I am going to write to a
> file, but I have to keep them in a certain order. Sorting them within one
> worker is just too costly. Is there a more efficient way?
>
> Thanks for your help!
>
> Allie
>
