Sorry, I should have said that you should Flatten and do a GroupByKey, not
a CoGroupByKey, making the pipeline look like:
PCollectionA -> Flatten -> GroupByKey -> ParDo(EmitOnlyFirstElementPerKey)
PCollectionB -/
A CoGroupByKey, by contrast, will have one iterable per PCollection, each
containing zero or more elements.
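
To make the difference in output shape concrete, here is a minimal plain-Python
sketch that only models the grouping semantics. It does not use the Beam SDK;
the function names flatten_then_group and co_group, the tags "a" and "b", and
the sample data are all made up for illustration.

```python
from collections import defaultdict


def flatten_then_group(*collections):
    """Model Flatten -> GroupByKey: one merged list of values per key."""
    grouped = defaultdict(list)
    for collection in collections:
        for key, value in collection:
            grouped[key].append(value)
    return dict(grouped)


def co_group(**tagged_collections):
    """Model CoGroupByKey: per key, one list per input (possibly empty)."""
    keys = {key for coll in tagged_collections.values() for key, _ in coll}
    result = {key: {tag: [] for tag in tagged_collections} for key in keys}
    for tag, collection in tagged_collections.items():
        for key, value in collection:
            result[key][tag].append(value)
    return result


# Hypothetical sample data: "k2" appears in both collections.
collection_a = [("k1", "a1"), ("k2", "a2")]
collection_b = [("k2", "b2"), ("k3", "b3")]

merged = flatten_then_group(collection_a, collection_b)
joined = co_group(a=collection_a, b=collection_b)
```

In the merged result a duplicated key simply has more than one value in its
single list, whereas in the joined result every key carries one list per
input, and a list may be empty when that input had no element for the key.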
Think this should solve my problem.
Thanks Evan and Luke!
On Thu, 11 Aug 2022 at 1:49 AM, Luke Cwik via user
wrote:
> Use CoGroupByKey to join the two PCollections and emit only the first
> value of each iterable with the key.
>
> Duplicates will appear as iterables with more than one value
Use CoGroupByKey to join the two PCollections and emit only the first value
of each iterable with the key.
Duplicates will appear as iterables with more than one value while keys
without duplicates will have iterables containing exactly one value.
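
One caveat when emitting "the first value": Beam does not guarantee the order
of values inside a grouped iterable, so the first value is not necessarily the
one that came from CollectionA. If the CollectionA element must win, one
option is to tag each element with its source before merging and pick by tag.
The sketch below models that idea in plain Python rather than with the Beam
SDK; the tagging scheme and the name merge_preferring_a are assumptions made
for illustration, not something proposed in this thread.

```python
from collections import defaultdict


def merge_preferring_a(collection_a, collection_b):
    """Merge two lists of (key, value) pairs, keeping A's value on key clashes."""
    grouped = defaultdict(list)
    # Tag each value with its origin (0 = A, 1 = B) before "flattening".
    for priority, collection in enumerate((collection_a, collection_b)):
        for key, value in collection:
            grouped[key].append((priority, value))
    # Emit one value per key: the lowest priority number wins, so A beats B
    # regardless of the order in which the values were grouped.
    return {
        key: min(tagged, key=lambda pair: pair[0])[1]
        for key, tagged in grouped.items()
    }


# Hypothetical sample data: "k2" appears in both collections.
result = merge_preferring_a(
    [("k1", "a1"), ("k2", "a2")],
    [("k2", "b2"), ("k3", "b3")],
)
```

In a real pipeline the same idea would mean mapping each PCollection to
tagged values before the Flatten and selecting by tag in the final ParDo.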
On Wed, Aug 10, 2022 at 12:25 PM Shivam Singhal
Hi Shivam,
When you say "merge the PCollections" do you mean Flatten, or somehow join?
CoGroupByKey[1] would be a good choice if you need to join based on key.
You would then be able to implement application logic to keep 1 of the 2
records if there is a way to decipher an element from
I have two PCollections, CollectionA & CollectionB of type KV.
I would like to merge them into one PCollection but CollectionA &
CollectionB might have some elements with the same key. In those repeated
cases, I would like to keep the element from CollectionA & drop the
repeated element from