If I’m reading a Parquet or Avro file, though, I don’t have a KV<K, Data1>; I just have a Data. Do I need to run a ParDo just to extract the keys for this to work?
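(Editor's note: the key-extraction step asked about here is just a ParDo that builds the composite string key from the record's primary-key fields. That string encoding is plain Java; below is a minimal sketch, with Map<String, Object> standing in for an Avro GenericRecord and made-up field names, not code from the thread.)

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CompositeKey {
    // Joins the values of the primary-key fields, in a fixed field order,
    // with a delimiter that should not occur in the data (\u0001 here).
    // Map<String, Object> stands in for an Avro GenericRecord.
    static String encodeKey(Map<String, Object> record, List<String> pkFields) {
        return pkFields.stream()
                .map(field -> String.valueOf(record.get(field)))
                .collect(Collectors.joining("\u0001"));
    }

    public static void main(String[] args) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("region", "EMEA");
        record.put("accountId", 42);
        record.put("amount", 99.5);
        // Only the key fields participate; other fields are ignored.
        String key = encodeKey(record, Arrays.asList("region", "accountId"));
        System.out.println(key.replace('\u0001', '|'));
    }
}
```

Inside a DoFn this would become something like context.output(KV.of(encodeKey(record, pkFields), record)). The per-record work is comparable to what a Flink KeySelector does; the main extra cost in Beam is that the keyed elements are coded as KVs at the shuffle boundary.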
    PCollection<GenericRecord> data;
    PCollection<KV<String, GenericRecord>> keyedData =
        "data ParDo'ed to create a KV for each GenericRecord, extracting
         possibly multiple primary-key fields encoded as a string"

Then do the stuff below. This seems pretty expensive (serialization-wise) compared with the Flink KeySelector, for example. Or is it similar in practice?

Thanks Thomas.

From: Thomas Groh [mailto:[email protected]]
Sent: Wednesday, March 22, 2017 1:53 PM
To: [email protected]
Subject: Re: Apache Beam cogroup help

This would be implemented via a CoGroupByKey
(https://beam.apache.org/documentation/sdks/javadoc/0.6.0/org/apache/beam/sdk/transforms/join/CoGroupByKey.html).

Your transform logic will be mostly the same; after applying the extraction (the right side of k1 and k2 in your example), you should have two PCollections of KVs:

    PCollection<KV<K, Data1>> k1;
    PCollection<KV<K, Data2>> k2;

You can construct a KeyedPCollectionTuple containing the two PCollections:

    final TupleTag<Data1> data1Tag = new TupleTag<>();
    final TupleTag<Data2> data2Tag = new TupleTag<>();
    KeyedPCollectionTuple<K> coGroupTuple =
        KeyedPCollectionTuple.of(data1Tag, k1).and(data2Tag, k2);

Then apply the CoGroupByKey:

    PCollection<KV<K, CoGbkResult>> coGrouped =
        coGroupTuple.apply(CoGroupByKey.<K>create());

Then you can run an arbitrary ParDo to combine the elements as appropriate. You'll need to reuse the TupleTags created above to extract the per-PCollection outputs.
As a simple example where the elements have a shared supertype CombinedData, and you'd like to add them to a single output list:

    PCollection<KV<K, List<CombinedData>>> combined =
        coGrouped.apply(ParDo.of(
            new DoFn<KV<K, CoGbkResult>, KV<K, List<CombinedData>>>() {
              @ProcessElement
              public void process(ProcessContext context) {
                List<CombinedData> all = new ArrayList<>();
                for (Data1 d1 : context.element().getValue().getAll(data1Tag)) {
                  all.add(d1);
                }
                for (Data2 d2 : context.element().getValue().getAll(data2Tag)) {
                  all.add(d2);
                }
                context.output(KV.of(context.element().getKey(), all));
              }
            }));

On Wed, Mar 22, 2017 at 10:35 AM, Newport, Billy <[email protected]> wrote:

Trying to port Flink code to Apache Beam, but I'm having trouble decoding the documentation. I have Flink code which looks like:

    DataSet<GenericRecord> d1 = ... // read Parquet
    DataSet<GenericRecord> d2 = ... // read Avro
    KeyExtractor<GenericRecord> k1 = ... // extracts an object containing the key fields from d1 records
    KeyExtractor<GenericRecord> k2 = ... // extracts an object containing the key fields from d2 records
    CoGroup<GenericRecord, GenericRecord, GenericRecord> grouper = ... // combines values for equal keys into a combined list for that key
    DataSet<GenericRecord> combined = d1.coGroup(d2).where(k1).equalTo(k2).with(grouper);

What's the Beam equivalent? Thanks
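(Editor's note: for readers of the archive, the join semantics discussed in this thread — Flink's coGroup and Beam's CoGroupByKey — can be modeled in plain Java without either framework: for every key present in either input, the grouper sees the full list of matches from each side, and either list may be empty. A self-contained sketch under those assumptions, with strings standing in for the GenericRecords:)

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class CoGroupSketch {
    // Models co-group semantics: for every key present in either input,
    // pair the list of matching values from side 1 with those from side 2.
    // Either list may be empty (a full-outer-join shape).
    static Map<String, List<List<String>>> coGroup(
            Map<String, List<String>> side1, Map<String, List<String>> side2) {
        Set<String> keys = new TreeSet<>();
        keys.addAll(side1.keySet());
        keys.addAll(side2.keySet());
        Map<String, List<List<String>>> out = new LinkedHashMap<>();
        for (String key : keys) {
            out.put(key, Arrays.asList(
                    side1.getOrDefault(key, Collections.emptyList()),
                    side2.getOrDefault(key, Collections.emptyList())));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> parquetSide = Map.of("k1", List.of("p1", "p2"));
        Map<String, List<String>> avroSide = Map.of("k1", List.of("a1"), "k2", List.of("a2"));
        // k1 sees values from both sides; k2 has an empty Parquet side.
        System.out.println(coGroup(parquetSide, avroSide));
    }
}
```

In Beam, CoGbkResult.getAll(tag) plays the role of these per-side lists; in Flink, the CoGroupFunction receives the two Iterables directly.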
