I also totally agree with what Luke said, including the top 100 use case.
@Kenn, regarding the above, I find the name misleading too: to people who
do not have a strong big data background,
it could sound like the PCollection is sorted as a whole, whereas it is only
sorted locally within each group or partition,
simply because a global sort does not scale.
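
To make the distinction concrete, here is a rough sketch in plain Python (not
Beam code; the function names and sample data are made up for illustration).
It assumes the input is a list of (key, value) pairs. The first function sorts
values only within each key, which is the "local" sort discussed above; the
second keeps just the N largest values per key with a bounded heap, which is
roughly the idea behind doing top 100 with a combiner instead of a full sort.

```python
import heapq
from collections import defaultdict


def sort_values_per_key(pairs):
    """Group (key, value) pairs by key and sort only within each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Each group is sorted independently; nothing is ordered across keys.
    return {key: sorted(values) for key, values in groups.items()}


def top_n_per_key(pairs, n):
    """Keep the n largest values per key using a min-heap of at most n items."""
    heaps = defaultdict(list)
    for key, value in pairs:
        heap = heaps[key]
        if len(heap) < n:
            heapq.heappush(heap, value)
        elif value > heap[0]:
            # Replace the smallest retained value with the larger newcomer.
            heapq.heapreplace(heap, value)
    # Report each key's retained values from largest to smallest.
    return {key: sorted(heap, reverse=True) for key, heap in heaps.items()}


# Hypothetical sample data, only for illustration.
pairs = [("a", 3), ("b", 9), ("a", 1), ("b", 4), ("a", 2), ("b", 7)]
print(sort_values_per_key(pairs))
print(top_n_per_key(pairs, 2))
```

The top-N version never holds more than N values per key in memory, which is
why it tends to be cheaper than sorting every value when only the head of the
list is needed.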
Etienne
On Wednesday, May 30, 2018 at 08:08 -0700, Kenneth Knowles wrote:
> I can see a few usability issues here. Totally agree w/ Luke, just noting:
>
> - The naming is slightly misleading because SortValues is actually already
> GBK+SortValues.
> - It also makes things look less supported when they are in the extensions/
> folder. I'd say we should have a better
> place to put such a library if it is the official public implementation. The
> word "extensions" doesn't seem
> particularly accurate or meaningful to me.
>
> Q: Does SortValues have a defined & documented URN yet?
>
> Kenn
>
> On Wed, May 30, 2018 at 7:52 AM Lukasz Cwik <[email protected]> wrote:
> > Each runner can choose to override the SortValues PTransform with their own
> > internal offering. For example Spark
> > overrides global combine[1] during pipeline translation. If Spark detected
> > the SortValues PTransform during
> > translation, it could override the offering with something that used
> > repartitionAndSortWithinPartitions.
> >
> > GroupByKeyAndSortValuesOnly inside Dataflow exists to support a specific
> > use case. Users should rely on SortValues
> > as it is the public implementation for sorting.
> >
> > 1: https://github.com/apache/beam/blob/85dcab56268fbac923ffd5885489ee154f097fc5/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java#L200
> >
> > As a side note, it is uncommon to need all values sorted; usually the top
> > 100 suffices, and that can be implemented much
> > more efficiently with a combiner than with a full sort.
> > On Wed, May 30, 2018 at 3:38 AM <[email protected]> wrote:
> > > Hi,
> > > I have a question. In dsl-euphoria, I am trying to translate
> > > “GroupByKey with sorted values within key” to
> > > Beam. I am aware of the Java SDK extension SortValues, but it doesn’t offer
> > > sufficient abstraction for runners.
> > >
> > > I noticed that DataflowRunner translates batch GroupByKey
> > > to GroupByKeyAndSortValuesOnly. Is it being
> > > considered to have this in Beam core so that, for example, SparkRunner could
> > > translate “GroupByKey with sorted values
> > > within key” using its own internals, such as
> > > repartitionAndSortWithinPartitions?
> > > Thank you.
> > > Marek Simunek