nehsyc commented on a change in pull request #14723:
URL: https://github.com/apache/beam/pull/14723#discussion_r637613617



##########
File path: sdks/python/apache_beam/io/gcp/datastore/v1new/datastoreio.py
##########
@@ -276,15 +277,33 @@ class _Mutate(PTransform):
   Only idempotent Datastore mutation operations (upsert and delete) are
   supported, as the commits are retried when failures occur.
   """
-  def __init__(self, mutate_fn):
+
+  # Default hint for the expected number of workers in the ramp-up throttling
+  # step for write or delete operations.
+  _DEFAULT_HINT_NUM_WORKERS = 500

Review comment:
    Thanks, Cham, for looping me in.
   
   I am not too sure about the default number-of-workers hint. My _intuition_ is that it could impact Dataflow autoscaling behavior in the following ways:
   
   - default workers >> the actual workers a Dataflow job starts with:
   The ramp-up process would be slow (probably much slower than necessary), and Dataflow may
     - upscale the workers aggressively due to the slow progress if `throttling-msecs` is not reported. In your test this was not a big problem since you set a maximum number of workers, but imagine how many workers Dataflow might allocate if no limit were set.
     - or cap the worker pool at a small size due to throttling if `throttling-msecs` is reported. In your other test, I would imagine the rate limit was low (i.e., BUDGET / 500) even though there was only one worker, which might already trigger throttling signals and therefore prevent upscaling beyond one worker (see the sketch after this list).
    
   - default workers << the actual workers a Dataflow job starts with:
   Dataflow may
     - upscale the workers moderately if `throttling-msecs` is not reported.
     - cap the workers, but not by too much, if `throttling-msecs` is reported.

     However, the mutation rate could then increase faster than the recommended best practice, possibly overwhelming Firestore.
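
   To make the BUDGET / 500 point above concrete, here is a rough sketch of how a per-worker budget could be derived from the hint under a 500/50/5-style ramp-up. The function name and the exact growth schedule here are my assumptions, not necessarily what this PR implements:

   ```python
   # Illustrative only: assumes the global budget starts at 500 ops/sec
   # and grows 1.5x every 5 minutes, split evenly across the hint.
   def per_worker_budget(minutes_since_start,
                         hint_num_workers=500,
                         base_budget=500.0,
                         growth=1.5,
                         interval_min=5):
     growth_steps = max(0, (minutes_since_start - interval_min) // interval_min)
     return base_budget * growth**growth_steps / hint_num_workers

   # With the default hint of 500 but only one actual worker, that worker
   # (i.e., the whole job) is limited to 1 op/sec for the first 10 minutes:
   print(per_worker_budget(0))   # 1.0
   print(per_worker_budget(20))  # 3.375
   ```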
   
   Ideally, the per-worker budget would be set by the runner given a global limit; I don't know whether that is already a thing in Beam. If we have to keep this default number of workers, can we make it a configurable option and override it with `num_workers` when supplied by the user? Otherwise it might be safe to assume that a Dataflow job always starts with a small number of workers (or a single worker). `throttling-msecs` might still be needed to avoid excessive upscaling. But anyway, these are just my thoughts :p We should probably run more experiments to verify.
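
   If it were made configurable, I would imagine something along these lines. The `hint_num_workers` parameter is hypothetical here, purely to illustrate the override idea:

   ```python
   # Hypothetical API sketch: `hint_num_workers` is NOT an existing
   # parameter of WriteToDatastore in this PR.
   import apache_beam as beam
   from apache_beam.io.gcp.datastore.v1new.datastoreio import WriteToDatastore
   from apache_beam.options.pipeline_options import PipelineOptions

   options = PipelineOptions()
   num_workers = options.get_all_options().get('num_workers')

   entities = []  # in practice, v1new types.Entity objects

   with beam.Pipeline(options=options) as p:
     _ = (
         p
         | beam.Create(entities)
         | WriteToDatastore(
             'my-project',
             # Use the user's num_workers when given, else the 500 default.
             hint_num_workers=num_workers or 500))
   ```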
   
   (Another idea is to keep the default number-of-workers hint and keep reporting `throttling-msecs`, but at a reduced frequency. The purpose would be to give Dataflow some chance to grow the number of workers while still making it aware of some implicit rate limit. But I would not bother with this if a simpler solution works.)
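
   Reduced-frequency reporting could be as simple as sampling the throttle events. The counter namespace and the sampling scheme below are guesses, purely illustrative:

   ```python
   import random
   from apache_beam.metrics.metric import Metrics

   # Assumes Dataflow keys off a counter named 'throttling-msecs'.
   throttling_msecs = Metrics.counter('datastoreio', 'throttling-msecs')

   def report_throttle(sleep_ms, sample_rate=0.1):
     # Surface only a fraction of throttle events so Dataflow still sees
     # a throttling signal without capping the worker pool too hard.
     if random.random() < sample_rate:
       throttling_msecs.inc(sleep_ms)
   ```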




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

