SortValues does not have a defined & documented URN yet. I expect one to be defined once a runner actually provides such an override. To my knowledge, no runner publicly provides one today.

On Wed, May 30, 2018 at 8:08 AM Kenneth Knowles <k...@google.com> wrote:

> I can see a few usability issues here. Totally agree w/ Luke, just noting:
>
>  - The naming is slightly misleading because SortValues is actually
> already GBK+SortValues.
>  - It also makes things look less supported when they are in the
> extensions/ folder. I'd say we should have a better place to put such a
> library if it is the official public implementation. The word "extensions"
> doesn't seem particularly accurate or meaningful to me.
>
> Q: Does SortValues have a defined & documented URN yet?
>
> Kenn
>
> On Wed, May 30, 2018 at 7:52 AM Lukasz Cwik <lc...@google.com> wrote:
>
>> Each runner can choose to override the SortValues PTransform with their
>> own internal offering. For example Spark overrides global combine[1] during
>> pipeline translation. If Spark detected the SortValues PTransform during
>> translation, it could override the offering with something that used
>> repartitionAndSortWithinPartitions.
>>
>> GroupByKeyAndSortValuesOnly inside Dataflow exists to support a specific
>> use case. Users should rely on SortValues as it is the public
>> implementation for sorting.
>>
>> 1:
>> https://github.com/apache/beam/blob/85dcab56268fbac923ffd5885489ee154f097fc5/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java#L200
>>
>> As a side note, its uncommon where you need to sort all values, usually
>> top 100 suffices and can be implemented much more efficiently with a
>> combiner when compared to sorting.
>>
>> On Wed, May 30, 2018 at 3:38 AM <marek-simu...@seznam.cz> wrote:
>>
>>> Hi,
>>> I have question I am trying to do translation in dsl-euphoria for
>>> “GroupByKey with sorted values within key” to Beam. I am aware of java sdk
>>> extensions SortValues, but it doesn’t have sufficient abstraction for
>>> runners.
>>>
>>> I noticed that in DataflowRunner there is translation of batch
>>> GroupByKey to GroupByKeyAndSortValuesOnly but is it considered to have it
>>> in beam core so for example SparkRunner could translate “GroupByKey with
>>> sorted values within key” with their internals such as
>>> repartitionAndSortWithinPartitions.
>>>
>>> Thank you.
>>> Marek Simunek
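
To illustrate the side note in the quoted thread (top 100 via a combiner versus sorting every value), here is a minimal plain-Java sketch of a mergeable bounded top-N accumulator. It is not Beam API and not how any runner implements it: the class name BoundedTopN and its methods are hypothetical, made up for this example. The point is only that each partial result holds at most N elements and two partial results can be merged, which is what lets a combiner avoid materializing and sorting all values for a key.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/**
 * Hypothetical plain-Java sketch (not Beam API) of a mergeable "top N"
 * accumulator: the kind of bounded state a combiner could keep per key
 * instead of sorting every value.
 */
final class BoundedTopN<T> {
  private final int limit;
  private final Comparator<T> order;
  // Min-heap under `order`: the smallest retained element is at the head,
  // so it is the one evicted when a larger value arrives.
  private final PriorityQueue<T> heap;

  BoundedTopN(int limit, Comparator<T> order) {
    if (limit <= 0) {
      throw new IllegalArgumentException("limit must be positive");
    }
    this.limit = limit;
    this.order = order;
    this.heap = new PriorityQueue<>(order);
  }

  /** Adds one value while retaining at most `limit` elements. */
  void add(T value) {
    if (heap.size() < limit) {
      heap.add(value);
    } else if (order.compare(value, heap.peek()) > 0) {
      heap.poll();
      heap.add(value);
    }
  }

  /** Folds another partial result into this one, as a combiner merge step would. */
  void merge(BoundedTopN<T> other) {
    for (T value : other.heap) {
      add(value);
    }
  }

  /** Returns the retained elements, largest first. */
  List<T> sortedDescending() {
    List<T> out = new ArrayList<>(heap);
    out.sort(order.reversed());
    return out;
  }

  public static void main(String[] args) {
    BoundedTopN<Integer> left =
        new BoundedTopN<Integer>(3, Comparator.<Integer>naturalOrder());
    BoundedTopN<Integer> right =
        new BoundedTopN<Integer>(3, Comparator.<Integer>naturalOrder());
    for (int v : new int[] {5, 1, 9, 3}) {
      left.add(v);
    }
    for (int v : new int[] {8, 2, 7}) {
      right.add(v);
    }
    left.merge(right);
    System.out.println(left.sortedDescending()); // prints [9, 8, 7]
  }
}
```

Memory per key stays bounded by the limit regardless of how many values arrive, whereas a full sort has to hold or spill every value. In an actual pipeline one would more likely use the SDK's own Top transform than hand-roll this.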