[GitHub] [beam] robertwb commented on a change in pull request #13208: [BEAM-10703] Add an option to GroupIntoBatches to output ShardedKeys. Update Dataflow pipeline translation accordingly.

GitBox Tue, 03 Nov 2020 05:34:48 -0800


robertwb commented on a change in pull request #13208:
URL: https://github.com/apache/beam/pull/13208#discussion_r516178126




##########
File path: 
runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/GroupIntoBatchesOverride.java
##########
@@ -92,9 +96,58 @@ public void process(ProcessContext c) {
     }
   }
 
+  static class BatchGroupIntoBatchesWithShardedKeyOverrideFactory<K, V>

Review comment:
       Yeah, GBK before batch stateful DoFns should definitely be done on the 
runner. 
   
   Should we be using the GBK + Iterators.partition on all runners, not just 
Dataflow?  

##########
File path: 
runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/GroupIntoBatchesOverride.java
##########
@@ -103,43 +156,76 @@ public void process(ProcessContext c) {
     }
 
     @Override
-    public PTransformReplacement<PCollection<KV<K, V>>, PCollection<KV<K, 
Iterable<V>>>>
+    public PTransformReplacement<PCollection<KV<K, V>>, 
PCollection<KV<ShardedKey<K>, Iterable<V>>>>
         getReplacementTransform(
             AppliedPTransform<
-                    PCollection<KV<K, V>>, PCollection<KV<K, Iterable<V>>>, 
GroupIntoBatches<K, V>>
+                    PCollection<KV<K, V>>,
+                    PCollection<KV<ShardedKey<K>, Iterable<V>>>,
+                    GroupIntoBatches<K, V>.WithShardedKey>
                 transform) {
       return PTransformReplacement.of(
           PTransformReplacements.getSingletonMainInput(transform),
-          new StreamingGroupIntoBatches(runner, transform.getTransform()));
+          new StreamingGroupIntoBatchesWithShardedKey<>(runner, 
transform.getTransform()));
     }
 
     @Override
     public Map<PCollection<?>, ReplacementOutput> mapOutputs(
-        Map<TupleTag<?>, PCollection<?>> outputs, PCollection<KV<K, 
Iterable<V>>> newOutput) {
+        Map<TupleTag<?>, PCollection<?>> outputs,
+        PCollection<KV<ShardedKey<K>, Iterable<V>>> newOutput) {
       return ReplacementOutputs.singleton(outputs, newOutput);
     }
   }
 
   /**
-   * Specialized implementation of {@link GroupIntoBatches} for unbounded 
Dataflow pipelines. The
-   * override does the same thing as the original transform but additionally 
record the input to add
-   * corresponding properties during the graph translation.
+   * Specialized implementation of {@link GroupIntoBatches.WithShardedKey} for 
unbounded Dataflow
+   * pipelines. The override does the same thing as the original transform but 
additionally records
+   * the input of {@code GroupIntoBatchesDoFn} in order to append relevant 
step properties during
+   * the graph translation.
    */
-  static class StreamingGroupIntoBatches<K, V>
-      extends PTransform<PCollection<KV<K, V>>, PCollection<KV<K, 
Iterable<V>>>> {
+  static class StreamingGroupIntoBatchesWithShardedKey<K, V>
+      extends PTransform<PCollection<KV<K, V>>, PCollection<KV<ShardedKey<K>, 
Iterable<V>>>> {
 
     private final transient DataflowRunner runner;
-    private final GroupIntoBatches<K, V> original;
+    private final GroupIntoBatches<K, V>.WithShardedKey original;
 
-    public StreamingGroupIntoBatches(DataflowRunner runner, 
GroupIntoBatches<K, V> original) {
+    public StreamingGroupIntoBatchesWithShardedKey(
+        DataflowRunner runner, GroupIntoBatches<K, V>.WithShardedKey original) 
{
       this.runner = runner;
       this.original = original;
     }
 
     @Override
-    public PCollection<KV<K, Iterable<V>>> expand(PCollection<KV<K, V>> input) 
{
-      runner.maybeRecordPCollectionWithAutoSharding(input);
-      return input.apply(original);
+    public PCollection<KV<ShardedKey<K>, Iterable<V>>> 
expand(PCollection<KV<K, V>> input) {
+      PCollection<KV<ShardedKey<K>, V>> intermediate_input = ShardKeys(input);
+
+      runner.maybeRecordPCollectionWithAutoSharding(intermediate_input);
+
+      if (original.getMaxBufferingDuration() != null) {

Review comment:
       Sounds like privacy is getting in the way of writing clean code here. 
What about making it package private, and adding a (package) helper to identify 
it? 

##########
File path: 
runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/GroupIntoBatchesOverride.java
##########
@@ -210,20 +211,23 @@ public StreamingGroupIntoBatchesWithShardedKey(
     }
   }
 
-  private static <K, V> PCollection<KV<ShardedKey<K>, V>> 
ShardKeys(PCollection<KV<K, V>> input) {
+  private static <K, V> PCollection<KV<ShardedKey<K>, V>> 
shardKeys(PCollection<KV<K, V>> input) {
     KvCoder<K, V> inputCoder = (KvCoder<K, V>) input.getCoder();
     org.apache.beam.sdk.coders.Coder<K> keyCoder =
         (org.apache.beam.sdk.coders.Coder<K>) 
inputCoder.getCoderArguments().get(0);
     org.apache.beam.sdk.coders.Coder<V> valueCoder =
         (org.apache.beam.sdk.coders.Coder<V>) 
inputCoder.getCoderArguments().get(1);
     return input
         .apply(
-            "ShardKeys",
+            "Shard Keys",
             MapElements.via(
                 new SimpleFunction<KV<K, V>, KV<ShardedKey<K>, V>>() {
                   @Override
                   public KV<ShardedKey<K>, V> apply(KV<K, V> input) {
-                    return KV.of(ShardedKey.of(input.getKey(), new byte[0]), 
input.getValue());
+                    long tid = Thread.currentThread().getId();

Review comment:
       Thanks. These are likely to be re-used for all workers. Can you add in a 
statically initialized (for each worker) random long as well? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [beam] robertwb commented on a change in pull request #13208: [BEAM-10703] Add an option to GroupIntoBatches to output ShardedKeys. Update Dataflow pipeline translation accordingly.

Reply via email to