Hey all,
I was poking around in the implementation of `Distinct` and I'm confused about
why it is written the way it is.
Reproduced here:
@ptransform_fn
@typehints.with_input_types(T)
@typehints.with_output_types(T)
def Distinct(pcoll):  # pylint: disable=invalid-name
  """Produces a PCollection containing distinct elements of a
  PCollection."""
  return (
      pcoll
      | 'ToPairs' >> Map(lambda v: (v, None))
      | 'Group' >> CombinePerKey(lambda vs: None)
      | 'Distinct' >> Keys())
Could anyone clarify why we'd use a `CombinePerKey` instead of just using
`GroupByKey`?
Cheers,
Joey