danthev commented on a change in pull request #14723:
URL: https://github.com/apache/beam/pull/14723#discussion_r638246638
##########
File path: sdks/python/apache_beam/io/gcp/datastore/v1new/datastoreio.py
##########
@@ -276,15 +277,33 @@ class _Mutate(PTransform):
Only idempotent Datastore mutation operations (upsert and delete) are
supported, as the commits are retried when failures occur.
"""
- def __init__(self, mutate_fn):
+
+ # Default hint for the expected number of workers in the ramp-up throttling
+ # step for write or delete operations.
+ _DEFAULT_HINT_NUM_WORKERS = 500
Review comment:
That's a good point: simply keeping or dropping `throttling-msecs` might
under- or overshoot the target...
The hint is already configurable, and with the warning messages and ramp-up
as a separate step it should be easy for the user to recognize the throttling
and adjust the value. Even egregious misconfiguration becomes irrelevant after
~1 hour as the ramp-up is exponential, so expected impact is low.
It's also possible to turn off ramp-up, though that is obviously not
recommended.
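For reference on the "~1 hour" point, here is a rough sketch of how the
per-worker budget grows under the 500/50/5 ramp-up guidance the throttling
step follows (the constants are Datastore's documented ramp-up rule; the
function and names below are only illustrative, not this PR's implementation):

```python
# Illustrative sketch of 500/50/5 ramp-up: start the whole pipeline at
# 500 ops/s split across the expected workers, then grow the budget by 50%
# every 5 minutes. Names and structure are for illustration only.
BASE_BUDGET_OPS_PER_SEC = 500
RAMP_UP_INTERVAL_MIN = 5


def per_worker_budget(elapsed_min, hint_num_workers):
  """Approximate per-worker ops/s budget after `elapsed_min` minutes."""
  growth_steps = max(0, (elapsed_min - RAMP_UP_INTERVAL_MIN) / RAMP_UP_INTERVAL_MIN)
  return BASE_BUDGET_OPS_PER_SEC / hint_num_workers * 1.5**growth_steps


# Even a hint that is off by 100x stops mattering after roughly an hour,
# since 1.5**11 is about 86:
print(per_worker_budget(60, hint_num_workers=500))  # ~86 ops/s per worker
print(per_worker_budget(60, hint_num_workers=5))    # ~8650 ops/s per worker
```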
However, the dynamic worker count under autoscaling certainly complicates
this. The crux of the issue is that we can't get an up-to-date worker count
from the runner, or from Beam in general, so we're going by a (configurable)
rough estimate. I've looked into auto-filling the value from `maxNumWorkers`
when the runner is Dataflow, but that introduces a circular dependency, at
least in Java.
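For what it's worth, a user-side workaround in Python would be to feed
`--max_num_workers` into the hint themselves, which avoids the dependency
problem inside the connector. A minimal sketch, assuming the keyword added by
this change ends up being called `hint_num_workers` (the project ID is a
placeholder):

```python
from apache_beam.io.gcp.datastore.v1new.datastoreio import WriteToDatastore
from apache_beam.options.pipeline_options import PipelineOptions, WorkerOptions

# Derive the ramp-up hint from the standard Dataflow worker options instead
# of the default. `hint_num_workers` is assumed to be the keyword this PR adds.
options = PipelineOptions()  # e.g. parsed from sys.argv
max_workers = options.view_as(WorkerOptions).max_num_workers or 500

write = WriteToDatastore('my-project', hint_num_workers=max_workers)
```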
All of this should be fine, I think, as long as autoscaling scales to the
maximum in a reasonable amount of time, as it did in my experiment, but I
hadn't considered the possibility of overscaling when there is no limit. My
other experiment, the one that reported `throttling-msecs`, had both
`maxNumWorkers` and the hint set to 50, so the budget wasn't quite as large,
but I still had basically 2.5 hours of just one worker. Weirdly, though, as I
just found out, `throttling-msecs` actually started stagnating (as expected),
yet Dataflow still didn't scale up until 2 hours later. Does that metric have
a long memory, and is that something that could be fixed by reducing the
reporting frequency?
This is what it looked like in the "report throttling-msecs" experiment
(note the timestamps):

I'd be happy to run more experiments or jump on GVC to explain my
observations if you'd like.
Edit:
After another day of testing various settings, I have to slightly revise my
findings. It seems that reporting `throttling-msecs` doesn't actually affect
scaling that much; rather, autoscaling with Datastore sometimes simply doesn't
get enough signals, with or without ramp-up, likely because CPU usage stayed
relatively low. I now have experiments without ramp-up that stay at 1 worker,
and experiments that report `throttling-msecs` and scale up easily. It appears
to be more of a general autoscaling issue, so I think I would keep
`throttling-msecs` in this PR.
I also haven't observed any issues when starting with 2 or more workers;
apparently that gives enough signals for autoscaling to behave normally.
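In case anyone wants to reproduce this, that workaround just means setting
the usual Dataflow worker flags, e.g.:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Start the job with two workers so autoscaling gets enough signals; it can
# still scale up (or down) from there. The values here are examples only.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--num_workers=2',
    '--max_num_workers=50',
])
```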
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]