I can see a few usability issues here. I totally agree with Luke; I'm just noting:

 - The naming is slightly misleading because SortValues is actually already
GBK+SortValues.
 - It also makes things look less supported when they are in the
extensions/ folder. I'd say we should have a better place to put such a
library if it is the official public implementation. The word "extensions"
doesn't seem particularly accurate or meaningful to me.

Q: Does SortValues have a defined & documented URN yet?

Kenn
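
For readers following the thread, the semantics under discussion ("GroupByKey with sorted values within key") can be sketched in plain Java. This is only an illustration of the two logical steps; it uses no Beam APIs, and the names GroupAndSortSketch, Rec, and groupAndSort are hypothetical, not part of any SDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GroupAndSortSketch {
    /** Hypothetical record type: primary key, secondary (sort) key, payload. */
    static final class Rec {
        final String key;
        final int sortKey;
        final String value;

        Rec(String key, int sortKey, String value) {
            this.key = key;
            this.sortKey = sortKey;
            this.value = value;
        }
    }

    /** Groups records by primary key, then sorts each group by secondary key. */
    static Map<String, List<String>> groupAndSort(List<Rec> input) {
        Map<String, List<Rec>> grouped = new TreeMap<>();
        // Step 1, the "GroupByKey" part: collect records per primary key.
        for (Rec r : input) {
            grouped.computeIfAbsent(r.key, k -> new ArrayList<>()).add(r);
        }
        // Step 2, the "SortValues" part: order each group by its secondary key.
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<Rec>> e : grouped.entrySet()) {
            List<Rec> recs = e.getValue();
            recs.sort(Comparator.comparingInt((Rec r) -> r.sortKey));
            List<String> values = new ArrayList<>();
            for (Rec r : recs) {
                values.add(r.value);
            }
            result.put(e.getKey(), values);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Rec> input = Arrays.asList(
            new Rec("a", 2, "second"),
            new Rec("b", 1, "only"),
            new Rec("a", 1, "first"));
        System.out.println(groupAndSort(input));
    }
}
```

A runner with a native "shuffle and sort" primitive could fuse the two steps into one, which is the kind of override discussed in the quoted message below.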

On Wed, May 30, 2018 at 7:52 AM Lukasz Cwik <[email protected]> wrote:

> Each runner can choose to override the SortValues PTransform with its
> own internal offering. For example, Spark overrides global combine[1] during
> pipeline translation. If Spark detected the SortValues PTransform during
> translation, it could override the offering with something that uses
> repartitionAndSortWithinPartitions.
>
> GroupByKeyAndSortValuesOnly inside Dataflow exists to support a specific
> use case. Users should rely on SortValues as it is the public
> implementation for sorting.
>
> 1:
> https://github.com/apache/beam/blob/85dcab56268fbac923ffd5885489ee154f097fc5/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java#L200
>
> As a side note, it's uncommon to need all values sorted. Usually the
> top 100 suffices, and that can be implemented much more efficiently with a
> combiner than with a full sort.
>
> On Wed, May 30, 2018 at 3:38 AM <[email protected]> wrote:
>
>> Hi,
>>  I have a question. In dsl-euphoria, I am trying to translate
>> “GroupByKey with sorted values within key” to Beam. I am aware of the Java
>> SDK extension SortValues, but it doesn’t provide sufficient abstraction for
>> runners.
>>
>> I noticed that DataflowRunner translates batch GroupByKey to
>> GroupByKeyAndSortValuesOnly. Has adding this to Beam core been considered,
>> so that, for example, SparkRunner could translate “GroupByKey with sorted
>> values within key” using its own internals, such as
>> repartitionAndSortWithinPartitions?
>>
>> Thank you.
>> Marek Simunek
>>
>

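The side note in the quoted message, that a top-N result is cheaper than a full sort, can be illustrated with a bounded min-heap. This is a minimal plain-Java sketch, not Beam's combiner implementation; TopNSketch and topN are hypothetical names. For m values it does O(m log n) work and keeps only n elements in memory, versus O(m log m) and all m elements for a full sort. A real combiner would also need to merge partial accumulators, which is omitted here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class TopNSketch {
    /** Returns the n largest values in descending order using a bounded min-heap. */
    static List<Integer> topN(Iterable<Integer> values, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Min-heap: the smallest of the current top-n candidates sits at the head.
        PriorityQueue<Integer> heap = new PriorityQueue<>(n);
        for (Integer v : values) {
            if (heap.size() < n) {
                heap.add(v);
            } else if (v > heap.peek()) {
                // v beats the weakest candidate, so replace it.
                heap.poll();
                heap.add(v);
            }
        }
        // Only the n survivors are sorted, not the whole input.
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(5, 1, 9, 3, 7, 9, 2);
        System.out.println(topN(values, 3));
    }
}
```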