Thanks, Lukasz, that helps!
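
For anyone finding this thread later, here is a minimal sketch of how the approach quoted below might look in the Python SDK. It is an illustration under assumptions, not a tested production pipeline: NUM_SHARDS, the helper names, and the output prefix are hypothetical, and it swaps the custom combiner for GroupByKey followed by sorted(), which is simpler but holds each key's values in memory on one worker. Each shard is written as a single newline-joined element so that the order inside a file does not depend on how the sink orders elements.

```python
import random

NUM_SHARDS = 3  # "N" in the quoted mail; hypothetical value


def pair_with_shard(value, num_shards=NUM_SHARDS):
    """Attach a random dummy key in [0, num_shards) to the value."""
    return random.randrange(num_shards), value


def sort_shard(keyed):
    """Sort all values that share one dummy key."""
    key, values = keyed
    return key, sorted(values)


def format_shard(keyed):
    """Join one shard's sorted values into a single newline-separated string."""
    _, values = keyed
    return "\n".join(str(v) for v in values)


def run(input_values, output_prefix):
    """Wire the helpers into a Beam pipeline (needs apache_beam installed)."""
    import apache_beam as beam

    with beam.Pipeline() as p:
        shards = (
            p
            | "Data" >> beam.Create(input_values)
            | "PairWithShard" >> beam.Map(pair_with_shard)
            # Stand-in for the custom sorting combiner from the quoted mail.
            | "GroupByShard" >> beam.GroupByKey()
            | "SortShard" >> beam.Map(sort_shard)
            | "PartitionByKey" >> beam.Partition(lambda kv, n: kv[0], NUM_SHARDS)
        )
        for i, shard in enumerate(shards):
            (
                shard
                | f"Format{i}" >> beam.Map(format_shard)
                | f"Write{i}" >> beam.io.WriteToText(
                    f"{output_prefix}-{i}", num_shards=1)
            )
```

Calling run([5, 3, 8, 1, 9, 2], "sorted") would produce one output file per dummy key. Because the keys are random, each file is sorted on its own, but there is no ordering across files.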

On 2018/07/18 19:44:50, Lukasz Cwik <lc...@google.com> wrote: 
> Apache Beam has no concept of an ordered PCollection. The most common
> solution is to assign each element one of N dummy keys, sort the values for
> each key yourself in a custom combiner, and then partition the output by
> dummy key:
> Data -> PairWithNumberIn[0,N] -> Combine(sort values using custom combiner)
> -> PartitionByKey --> WriteForKey0
>                   \-> WriteForKey1
>                   ...
>                   \-> WriteForKeyN
> Note that if you want to write all the data to a single file, you'll have
> memory issues with your combiner and have poor performance since you'll
> have a single sorter and writer.
> 
> There have been some previous discussions[1] about ordering under stricter
> constraints than general ordering. They may apply to your use case and
> would be worth a look.
> 
> 1: https://lists.apache.org/list.html?u...@beam.apache.org:lte=18M:ordering
> 
> On Wed, Jul 18, 2018 at 11:46 AM Allie Chen <yifangc...@google.com> wrote:
> 
> > Greetings!
> >
> > I have a quick question. Is there an OrderedPCollection concept in Python
> > SDK? Say I have a PCollection of objects that I am going to write to a
> > file, but I have to keep them in a certain order. Sorting them within one
> > worker is just too costly. Is there a more efficient way?
> >
> > Thanks for your help!
> >
> > Allie
> >
> 
