scwhittle commented on PR #31608:
URL: https://github.com/apache/beam/pull/31608#issuecomment-2175861387

   The DataflowRunner overrides the Pub/Sub write transform with 
org.apache.beam.runners.dataflow.DataflowRunner.StreamingPubsubIOWrite, so 
org.apache.beam.runners.dataflow.worker.PubsubSink is used.  For now, it would 
be nice to reject use of the ordering key with the DataflowRunner unless the 
experiment to use the Beam implementation is present.
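
   A minimal sketch of the kind of check meant here, in plain Java with no 
Beam dependencies.  The class, method, and parameter names are hypothetical, 
and the experiment name is passed in rather than assumed; this is not the 
DataflowRunner's actual validation code.

```java
import java.util.List;

public class OrderingKeyGuard {
  /**
   * Rejects a pipeline that writes ordering keys on the DataflowRunner unless
   * the experiment selecting the Beam sink implementation is present.
   * All names here are illustrative assumptions, not Beam APIs.
   */
  static void validate(
      boolean isDataflowRunner,
      boolean writesOrderingKey,
      List<String> experiments,
      String beamSinkExperiment) {
    if (isDataflowRunner
        && writesOrderingKey
        && !experiments.contains(beamSinkExperiment)) {
      throw new IllegalArgumentException(
          "Ordering keys are not supported by the Dataflow-native Pub/Sub sink; "
              + "enable experiment '" + beamSinkExperiment + "' to use the Beam sink.");
    }
  }
}
```

   Failing at pipeline construction time would surface the problem to the 
user instead of letting the key be silently dropped at publish time.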
   
   To add support for it to Dataflow: it appears that when 
PUBSUB_SERIALIZED_ATTRIBUTES_FN is set, it maps bytes to a PubsubMessage, which 
already includes the ordering key.  But for the ordering key to be respected 
when publishing, additional changes would be needed in the Dataflow service 
backend.  Currently it looks like the key would just be dropped, and if it 
were respected, the service would also need to be updated to ensure batching 
doesn't occur across ordering keys.
   
   > User configuration of the number of output shards or the use of a single 
output shard for messages with ordering keys (due to 1 MBps throughput limit 
per ordering key) is an open topic.
   
   Are you considering producing to a single ordering key from multiple 
distinct grouped-by keys in parallel?  Doesn't that defeat the purpose of the 
ordering provided?  I'm also not sure it would increase throughput beyond 
the 1 MBps per-ordering-key limit.  An alternative would be to group by a 
partitioning of the ordering keys (via deterministic hash buckets, for example) 
and then batch just within a bundle.
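
   To make the alternative concrete, here is a rough sketch in plain Java 
(no Beam dependencies).  The class and method names, the use of CRC32 as the 
hash, and the (orderingKey, payload) pair representation are all assumptions 
made for illustration, not what the PR does.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

public class OrderingKeyBuckets {
  /** Maps an ordering key to a stable bucket in [0, numBuckets). */
  static int bucketFor(String orderingKey, int numBuckets) {
    CRC32 crc = new CRC32();
    crc.update(orderingKey.getBytes(StandardCharsets.UTF_8));
    // getValue() is a non-negative long, so the remainder is non-negative.
    return (int) (crc.getValue() % numBuckets);
  }

  /**
   * Splits one bundle of {orderingKey, payload} pairs into publish batches.
   * A batch never mixes ordering keys, and messages for a given key keep
   * their arrival order both within and across batches.
   */
  static List<List<String[]>> batchWithinBundle(List<String[]> bundle, int maxBatchSize) {
    Map<String, List<String[]>> open = new LinkedHashMap<>();
    List<List<String[]>> batches = new ArrayList<>();
    for (String[] msg : bundle) {
      List<String[]> batch = open.computeIfAbsent(msg[0], k -> new ArrayList<>());
      batch.add(msg);
      if (batch.size() == maxBatchSize) {
        // Emit the full batch and start a fresh one for this key.
        batches.add(batch);
        open.remove(msg[0]);
      }
    }
    // Flush the partially filled batches at the end of the bundle.
    batches.addAll(open.values());
    return batches;
  }
}
```

   Grouping by bucketFor(orderingKey, numBuckets) sends every message for a 
given ordering key to the same shard, so a single worker produces to that key 
and ordering is preserved, while the number of buckets bounds the shard count.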
   

