[JAVA] Handling repeated elements when merging two pcollections
I have two PCollections, CollectionA and CollectionB, both of type KV with Byte[] values.

I would like to merge them into one PCollection, but CollectionA and CollectionB might have some elements with the same key. In those cases, I would like to keep the element from CollectionA and drop the duplicate from CollectionB.

Does anyone know a simple method to do this?

Thanks,
Shivam Singhal
Re: [JAVA] Handling repeated elements when merging two pcollections
Hi Shivam,

When you say "merge the PCollections", do you mean Flatten, or some kind of join?

CoGroupByKey [1] would be a good choice if you need to join based on key. You could then implement application logic to keep one of the two records, provided you can tell an element of CollectionA from an element of CollectionB by examining the elements alone.

If there is no natural way to tell which element to keep by examining the elements themselves, you could nest the data one level deeper: wrap each value in an inner KV whose key identifies the source collection, while both collections stay keyed by the same outer key (k1). Then when you CoGroupByKey, the elements are grouped on the shared k1, and the source PCollection can be read from the key of the inner KV.

Thanks,
Evan

[1] https://beam.apache.org/documentation/transforms/java/aggregation/cogroupbykey/

On Wed, Aug 10, 2022 at 3:25 PM Shivam Singhal wrote:
> I have two PCollections, CollectionA & CollectionB [...]
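As an illustration of the join-then-choose approach described above, here is a minimal plain-Java sketch of what a CoGroupByKey followed by a selection step would compute. It deliberately avoids the Beam API so that it runs on its own. The class name PreferCollectionA, the String key and value types, and the helpers group, choose, and merge are all assumptions made for the example. In a real pipeline the two per-key lists would correspond to the per-input iterables that CoGroupByKey produces for each key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Plain-Java model (no Beam dependency) of "join on key, then prefer CollectionA".
final class PreferCollectionA {

    /** Per-key rule: keep A's values if A has any for this key, otherwise fall back to B's. */
    static <V> List<V> choose(List<V> fromA, List<V> fromB) {
        return fromA.isEmpty() ? fromB : fromA;
    }

    /** Groups key/value pairs by key; stands in for one input's side of the join. */
    static <V> Map<String, List<V>> group(List<Map.Entry<String, V>> pairs) {
        Map<String, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, V> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    /** Joins both inputs on key and applies choose() to every key seen in either input. */
    static <V> Map<String, List<V>> merge(List<Map.Entry<String, V>> a,
                                          List<Map.Entry<String, V>> b) {
        Map<String, List<V>> groupedA = group(a);
        Map<String, List<V>> groupedB = group(b);

        // Union of keys, sorted only to make the printed result deterministic.
        TreeSet<String> keys = new TreeSet<>(groupedA.keySet());
        keys.addAll(groupedB.keySet());

        Map<String, List<V>> merged = new LinkedHashMap<>();
        for (String key : keys) {
            merged.put(key, choose(
                groupedA.getOrDefault(key, List.of()),
                groupedB.getOrDefault(key, List.of())));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> a =
            List.of(Map.entry("k1", "a1"), Map.entry("k2", "a2"));
        List<Map.Entry<String, String>> b =
            List.of(Map.entry("k2", "b2"), Map.entry("k3", "b3"));
        System.out.println(merge(a, b)); // {k1=[a1], k2=[a2], k3=[b3]}
    }
}
```

Because each input keeps its own list per key, the rule never has to inspect the elements themselves to know where they came from.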
Re: [JAVA] Handling repeated elements when merging two pcollections
Use CoGroupByKey to join the two PCollections and emit only the first value of each iterable with the key.

Duplicates will appear as iterables with more than one value, while keys without duplicates will have iterables containing exactly one value.

On Wed, Aug 10, 2022 at 12:25 PM Shivam Singhal wrote:
> I have two PCollections, CollectionA & CollectionB [...]
Re: [JAVA] Handling repeated elements when merging two pcollections
I think this should solve my problem.

Thanks, Evan and Luke!

On Thu, 11 Aug 2022 at 1:49 AM, Luke Cwik via user wrote:
> Use CoGroupByKey to join the two PCollections [...]
Re: [JAVA] Handling repeated elements when merging two pcollections
Sorry, I should have said that you should Flatten and do a GroupByKey, not a CoGroupByKey, making the pipeline like:

PCollectionA -> Flatten -> GroupByKey -> ParDo(EmitOnlyFirstElementPerKey)
PCollectionB -/

The CoGroupByKey will have one iterable per PCollection, containing zero or more elements depending on how many elements each PCollection had for that key.

So yes, you could solve it with CoGroupByKey, but Flatten + GroupByKey is much simpler.

On Wed, Aug 10, 2022 at 1:31 PM Shivam Singhal wrote:
> Think this should solve my problem. [...]
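One caveat when combining this pipeline shape with the "keep CollectionA" requirement: Beam does not guarantee the order of values within a grouped iterable, so taking the first element alone cannot ensure that the CollectionA element wins. Tagging each value with its source before flattening, as suggested earlier in the thread, makes the choice explicit. The sketch below models that in plain Java without the Beam API. The names FlattenThenGroup, Source, Tagged, tag, and merge are hypothetical, and String keys and values are assumed for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model (no Beam dependency) of tag -> Flatten -> GroupByKey -> pick by source.
final class FlattenThenGroup {

    enum Source { A, B }

    /** A value annotated with the collection it came from (hypothetical helper type). */
    static final class Tagged {
        final Source source;
        final String value;

        Tagged(Source source, String value) {
            this.source = source;
            this.value = value;
        }
    }

    /** Wraps every value with its source tag, keeping the original key. */
    static List<Map.Entry<String, Tagged>> tag(Source source,
                                               List<Map.Entry<String, String>> pairs) {
        List<Map.Entry<String, Tagged>> out = new ArrayList<>();
        for (Map.Entry<String, String> p : pairs) {
            out.add(Map.entry(p.getKey(), new Tagged(source, p.getValue())));
        }
        return out;
    }

    static Map<String, List<String>> merge(List<Map.Entry<String, String>> a,
                                           List<Map.Entry<String, String>> b) {
        // "Flatten": B is added first on purpose, to show the result does not rely on order.
        List<Map.Entry<String, Tagged>> flattened = new ArrayList<>(tag(Source.B, b));
        flattened.addAll(tag(Source.A, a));

        // "GroupByKey": collect all tagged values per key (TreeMap only for stable printing).
        Map<String, List<Tagged>> grouped = new TreeMap<>();
        for (Map.Entry<String, Tagged> e : flattened) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }

        // Per-key step: keep A's values when any exist, otherwise keep B's.
        Map<String, List<String>> merged = new TreeMap<>();
        for (Map.Entry<String, List<Tagged>> g : grouped.entrySet()) {
            boolean hasA = g.getValue().stream().anyMatch(t -> t.source == Source.A);
            Source keep = hasA ? Source.A : Source.B;
            List<String> kept = new ArrayList<>();
            for (Tagged t : g.getValue()) {
                if (t.source == keep) {
                    kept.add(t.value);
                }
            }
            merged.put(g.getKey(), kept);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> a =
            List.of(Map.entry("k1", "a1"), Map.entry("k2", "a2"));
        List<Map.Entry<String, String>> b =
            List.of(Map.entry("k2", "b2"), Map.entry("k3", "b3"));
        System.out.println(merge(a, b)); // {k1=[a1], k2=[a2], k3=[b3]}
    }
}
```

If a key can occur more than once within CollectionA itself, this version keeps all of A's values for that key. Emitting only one of them would need an additional rule that the thread does not specify.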