Re: Ordered PCollection

2018-07-18 Thread yifangchen
Thanks, Lukasz, that helps!

On 2018/07/18 19:44:50, Lukasz Cwik wrote:
> Apache Beam has no concept of an ordered PCollection. The most common
> solution is to pair each element with one of N dummy keys, sort the values
> for each key yourself in a custom combiner, and then partition the output
> by dummy key.
>
> Data -> PairWithNumberIn[0,N]
>      -> Combine(sort values using custom combiner)
>      -> PartitionByKey --> WriteForKey0
>                        \-> WriteForKey1
>                        ...
>                        \-> WriteForKeyN
>
> Note that if you want to write all the data to a single file, you'll run
> into memory issues in your combiner and get poor performance, since you'll
> have a single sorter and a single writer.
>
> There have been some previous discussions [1] about ordering under stricter
> constraints than general ordering. They may apply to your use case and are
> worth a look.
> 
> 1: https://lists.apache.org/list.html?u...@beam.apache.org:lte=18M:ordering
> 
> On Wed, Jul 18, 2018 at 11:46 AM Allie Chen wrote:
> 
> > Greetings!
> >
> > I have a quick question. Is there an OrderedPCollection concept in the
> > Python SDK? Say I have a PCollection of objects that I am going to write to a
> > file, but I have to keep them in a certain order. Sorting them within one
> > worker is just too costly. Is there a more efficient way?
> >
> > Thanks for your help!
> >
> > Allie
> >
> 


Re: Ordered PCollection

2018-07-18 Thread Lukasz Cwik
Apache Beam has no concept of an ordered PCollection. The most common
solution is to pair each element with one of N dummy keys, sort the values
for each key yourself in a custom combiner, and then partition the output
by dummy key.

Data -> PairWithNumberIn[0,N]
     -> Combine(sort values using custom combiner)
     -> PartitionByKey --> WriteForKey0
                       \-> WriteForKey1
                       ...
                       \-> WriteForKeyN

Note that if you want to write all the data to a single file, you'll run
into memory issues in your combiner and get poor performance, since you'll
have a single sorter and a single writer.
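
A minimal sketch of the pattern above in the Python SDK, not a definitive implementation. The names pair_with_dummy_key, partition_by_key, format_shard, write_sorted_shards, SortValuesFn, and output_prefix are made up for illustration, and assigning the dummy key at random is an assumption (the diagram only says a number in [0,N]). Each key's values are collected and sorted in memory, so every output file is sorted on its own, but nothing here orders elements across files.

```python
import random


def pair_with_dummy_key(element, num_keys):
    """Pair an element with a dummy key in [0, num_keys).

    Random assignment is an assumption of this sketch.
    """
    return random.randrange(num_keys), element


def partition_by_key(keyed_shard, num_partitions):
    """Partition function for beam.Partition: route a shard by its dummy key."""
    key, _ = keyed_shard
    return key


def format_shard(keyed_shard):
    """Join one key's sorted values into a single newline-separated record.

    Keeping the shard as one element matters: a PCollection is unordered,
    so emitting the sorted values one by one would lose the ordering again.
    """
    _, values = keyed_shard
    return "\n".join(str(value) for value in values)


def write_sorted_shards(data, num_keys, output_prefix):
    """Apply the dummy-key sort pattern to the PCollection `data`."""
    # Imported here so the helpers above can be used without Beam installed.
    import apache_beam as beam

    class SortValuesFn(beam.CombineFn):
        """Custom combiner: collect all values for a key, sort at the end.

        Assumes the values are mutually comparable and fit in memory.
        """

        def create_accumulator(self):
            return []

        def add_input(self, accumulator, element):
            accumulator.append(element)
            return accumulator

        def merge_accumulators(self, accumulators):
            merged = []
            for accumulator in accumulators:
                merged.extend(accumulator)
            return merged

        def extract_output(self, accumulator):
            return sorted(accumulator)

    shards = (
        data
        | "PairWithDummyKey" >> beam.Map(pair_with_dummy_key, num_keys)
        | "SortPerKey" >> beam.CombinePerKey(SortValuesFn())
        | "PartitionByKey" >> beam.Partition(partition_by_key, num_keys)
    )
    for key in range(num_keys):
        _ = (
            shards[key]
            | f"FormatKey{key}" >> beam.Map(format_shard)
            | f"WriteForKey{key}"
            >> beam.io.WriteToText(f"{output_prefix}-key{key}", num_shards=1)
        )
```

Calling write_sorted_shards with num_keys=1 collapses this to the single sorter and single writer case described above, with the memory and performance costs that come with it.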

There have been some previous discussions [1] about ordering under stricter
constraints than general ordering. They may apply to your use case and are
worth a look.

1: https://lists.apache.org/list.html?u...@beam.apache.org:lte=18M:ordering

On Wed, Jul 18, 2018 at 11:46 AM Allie Chen wrote:

> Greetings!
>
> I have a quick question. Is there an OrderedPCollection concept in the
> Python SDK? Say I have a PCollection of objects that I am going to write to a
> file, but I have to keep them in a certain order. Sorting them within one
> worker is just too costly. Is there a more efficient way?
>
> Thanks for your help!
>
> Allie
>


Ordered PCollection

2018-07-18 Thread Allie Chen
Greetings!

I have a quick question. Is there an OrderedPCollection concept in the
Python SDK? Say I have a PCollection of objects that I am going to write to a
file, but I have to keep them in a certain order. Sorting them within one
worker is just too costly. Is there a more efficient way?

Thanks for your help!

Allie