I had a question today from one of our users about Beam’s Sample
transform (a Combine with an internal top-like function that produces a
uniform sample of size n from a PCollection). They also wanted to obtain
the rest of the PCollection (the non-sampled elements) as an output.

My suggestion was to use the sample as a side input (since it was
small) and then reprocess the collection to filter out the sampled
elements. However, I wonder if this is the ‘best’ solution.
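
To make the suggestion concrete, here is a minimal sketch of the two-pass
idea in plain Python rather than Beam code. The function name
split_sample_and_rest and the per-element id are assumptions of mine for
illustration: the id is there because filtering by value alone would also
drop unsampled duplicates of a sampled value. In Beam terms, the sample
would come from the Sample transform and be passed as a side input to the
filtering step.

```python
import random


def split_sample_and_rest(elements, n, seed=None):
    """Two-pass split: draw a uniform sample of size n, then filter the rest.

    Hypothetical helper, not part of Beam.
    """
    rng = random.Random(seed)
    # Attach a unique id to every element so duplicate values stay distinct.
    keyed = list(enumerate(elements))
    # Pass 1: the uniform sample (stands in for the Sample transform).
    sample = rng.sample(keyed, min(n, len(keyed)))
    # The small set of sampled ids plays the role of the side input.
    sampled_ids = {key for key, _ in sample}
    # Pass 2: reprocess the whole collection and keep the non-sampled elements.
    rest = [value for key, value in keyed if key not in sampled_ids]
    return [value for _, value in sample], rest


if __name__ == "__main__":
    sample, rest = split_sample_and_rest(["a", "b", "b", "c", "d"], 2, seed=7)
    print("sample:", sample)
    print("rest:  ", rest)
```

The cost of this approach is the second read of the collection plus
whatever is needed to give elements a stable identity.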

I was also wondering: if Combine is essentially GbK + ParDo, why don’t
we have a Combine function with multiple outputs (maybe an evolution of
CombineWithContext)? I know this sounds weird, and I have probably not
thought enough about the issues or the performance of the translation,
but I wanted to see what others think. Does this make sense? Do you see
any pros/cons, or have other ideas?
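
As a rough picture of what a multi-output combine could do for this use
case, here is a single-pass sketch in plain Python using reservoir
sampling (Algorithm R), which I am swapping in for illustration; I am not
claiming this is how Sample is implemented. The name reservoir_split is
hypothetical, and the sketch ignores accumulator merging across workers,
which a real Combine would have to handle.

```python
import random


def reservoir_split(elements, n, seed=None):
    """Single pass: keep a uniform sample of size n, emit everything else.

    Hypothetical helper, not part of Beam.
    """
    rng = random.Random(seed)
    reservoir, rest = [], []
    for i, item in enumerate(elements):
        if len(reservoir) < n:
            # Fill the reservoir with the first n elements.
            reservoir.append(item)
            continue
        # Pick a slot in [0, i]; the new item is kept with probability n / (i + 1).
        j = rng.randint(0, i)
        if j < n:
            # The evicted element goes to the second output.
            rest.append(reservoir[j])
            reservoir[j] = item
        else:
            # The rejected element goes straight to the second output.
            rest.append(item)
    return reservoir, rest


if __name__ == "__main__":
    sample, rest = reservoir_split(range(20), 5, seed=1)
    print("sample:", sample)
    print("rest:  ", rest)
```

Here the non-sampled elements fall out of the same pass that builds the
sample, which is the appeal of a second output.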

Thanks,
Ismaël
