Thanks, Lukasz, that helps!
On 2018/07/18 19:44:50, Lukasz Cwik <lc...@google.com> wrote:
> Apache Beam has no concept of an ordered PCollection. The most common
> solution is to use a combiner where you sort your values yourself using N
> dummy keys and then partition the output based upon the dummy key.
>
> Data -> PairWithNumberIn[0,N] -> Combine(sort values using custom combiner)
> -> PartitionByKey --> WriteForKey0
>                   \-> WriteForKey1
>                   ...
>                   \-> WriteForKeyN
>
> Note that if you want to write all the data to a single file, you'll have
> memory issues with your combiner and poor performance, since you'll
> have a single sorter and writer.
>
> There have been some previous discussions[1] about ordering with
> stricter constraints than general ordering that may apply to your use case
> and would be worthwhile to take a look at.
>
> 1: https://lists.apache.org/list.html?u...@beam.apache.org:lte=18M:ordering
>
> On Wed, Jul 18, 2018 at 11:46 AM Allie Chen <yifangc...@google.com> wrote:
>
> > Greetings!
> >
> > I have a quick question. Is there an OrderedPCollection concept in the
> > Python SDK? Say I have a PCollection of objects that I am going to write
> > to a file, but I have to keep them in a certain order. Sorting them
> > within one worker is just too costly. Is there a more efficient way?
> >
> > Thanks for your help!
> >
> > Allie
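
For readers of the archive, the data flow in the quoted reply can be sketched in plain Python. This is only an illustration of the idea, not Beam code and not necessarily what Lukasz had in mind. The function names and the `boundaries` parameter are hypothetical, and it assumes the dummy keys are assigned by value range (the reply does not say how keys are chosen), so that shards read in key order are globally sorted. In a real pipeline the three steps would roughly correspond to `beam.Map`, `beam.CombinePerKey` and `beam.Partition`.

```python
import bisect
from collections import defaultdict


def assign_key(value, boundaries):
    """Pair a value with a dummy key in [0, N], where N = len(boundaries).

    Assumption: keys are range buckets, so every value under key k is
    smaller than every value under key k + 1.
    """
    return bisect.bisect_right(boundaries, value)


def sort_per_key(values, boundaries):
    """Group values by dummy key and sort each group independently.

    Each group stands in for the work one combiner would do, so no
    single sorter has to hold the whole data set.
    """
    groups = defaultdict(list)
    for value in values:
        groups[assign_key(value, boundaries)].append(value)
    # Return shards in key order; each shard would go to its own writer.
    return {key: sorted(groups[key]) for key in sorted(groups)}


def concatenate(shards):
    """Read the shards back in key order to get one ordered sequence."""
    ordered = []
    for key in sorted(shards):
        ordered.extend(shards[key])
    return ordered
```

With `boundaries = [20, 60]` there are three keys (0, 1, 2), so three independent sorters and writers. Using a single key reproduces the single-file case the reply warns about: one sorter holds everything in memory.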