Re: GroupByKey with sorted values within key

Etienne Chauchot Thu, 31 May 2018 07:06:37 -0700
I also totally agree with what Luke said, including the top-100 use case.
@Kenn, regarding the naming, I find it misleading too: to people who do not
have a strong big data background, it could sound as if the whole PCollection
is sorted, whereas it is only sorted locally within each group or partition,
simply because a global sort does not scale.
Etienne
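
To make the "sorted locally, not globally" distinction concrete, here is a
minimal sketch in plain Python. It does not use the Beam API; the function
name and the sample data are made up for illustration. It only assumes that
"GroupByKey with sorted values" means grouping first and then sorting inside
each group.

```python
from collections import defaultdict


def group_and_sort_values(pairs):
    """Group (key, value) pairs by key, then sort the values inside each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Sorting happens per key only; no ordering is imposed across keys.
    return {key: sorted(values) for key, values in groups.items()}


# Hypothetical sample data.
pairs = [("b", 3), ("a", 9), ("b", 1), ("a", 2), ("b", 2)]
result = group_and_sort_values(pairs)

# Reading the groups back to back does not give a globally sorted sequence.
flattened = [value for values in result.values() for value in values]
```

Each group comes out ordered, but concatenating the groups does not, which is
the part the name can hide from newcomers.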
On Wednesday, 30 May 2018 at 08:08 -0700, Kenneth Knowles wrote:
> I can see a few usability issues here. Totally agree w/ Luke, just noting:
> 
>  - The naming is slightly misleading because SortValues is actually already 
> GBK+SortValues.
>  - It also makes things look less supported when they are in the extensions/ 
> folder. I'd say we should have a better
> place to put such a library if it is the official public implementation. The 
> word "extensions" doesn't seem
> particularly accurate or meaningful to me.
> 
> Q: Does SortValues have a defined & documented URN yet?
> 
> Kenn
> 
> On Wed, May 30, 2018 at 7:52 AM Lukasz Cwik <[email protected]> wrote:
> > Each runner can choose to override the SortValues PTransform with their own 
> > internal offering. For example Spark
> > overrides global combine[1] during pipeline translation. If Spark detected 
> > the SortValues PTransform during
> > translation, it could override the offering with something that used 
> > repartitionAndSortWithinPartitions.
> > 
> > GroupByKeyAndSortValuesOnly inside Dataflow exists to support a specific 
> > use case. Users should rely on SortValues
> > as it is the public implementation for sorting.
> > 
> > 1: 
> > https://github.com/apache/beam/blob/85dcab56268fbac923ffd5885489ee154f097fc5/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java#L200
> > 
> > As a side note, it's uncommon to need to sort all values; usually the top 
> > 100 suffices, and that can be implemented much more efficiently with a 
> > combiner than with a full sort.
> > 
> > On Wed, May 30, 2018 at 3:38 AM <[email protected]> wrote:
> > > Hi,
> > >  I have a question. I am trying to translate “GroupByKey with sorted 
> > > values within key” from dsl-euphoria to Beam. I am aware of the Java SDK 
> > > extension SortValues, but it doesn’t offer sufficient abstraction for 
> > > runners. 
> > > 
> > > I noticed that DataflowRunner has a translation of batch GroupByKey to 
> > > GroupByKeyAndSortValuesOnly. Is it being considered for Beam core, so 
> > > that, for example, SparkRunner could translate “GroupByKey with sorted 
> > > values within key” using its internals, such as 
> > > repartitionAndSortWithinPartitions?
> > > Thank you.
> > > Marek Simunek
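
Luke's side note about top 100 can be sketched the same way. This is a
plain-Python illustration of the combiner idea, not Beam's own implementation:
the class and function names are hypothetical. Each key keeps a bounded
min-heap of at most n values, and partial results can be merged, so the full
list of values for a key never has to be held or sorted.

```python
import heapq


class TopNAccumulator:
    """Keeps the n largest values seen so far in a bounded min-heap."""

    def __init__(self, n):
        if n < 1:
            raise ValueError("n must be at least 1")
        self.n = n
        self.heap = []  # the smallest retained value sits at heap[0]

    def add(self, value):
        if len(self.heap) < self.n:
            heapq.heappush(self.heap, value)
        elif value > self.heap[0]:
            # Evict the smallest retained value in favour of the new one.
            heapq.heapreplace(self.heap, value)

    def merge(self, other):
        # Combining two partial results is just adding the other's values.
        for value in other.heap:
            self.add(value)
        return self

    def extract(self):
        # Only the n retained values are sorted, never the whole group.
        return sorted(self.heap, reverse=True)


def top_n_per_key(pairs, n):
    """Return the n largest values for each key, largest first."""
    accumulators = {}
    for key, value in pairs:
        acc = accumulators.get(key)
        if acc is None:
            acc = accumulators[key] = TopNAccumulator(n)
        acc.add(value)
    return {key: acc.extract() for key, acc in accumulators.items()}
```

Memory per key stays bounded by n, and because merge is associative the
partial accumulators can be built on different workers before being combined.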