Hi Beam Team,

Can somebody help me understand which factors drive SDK Harness memory
usage? My first guess is that it depends on:

1. User code (i.e. DoFns)
2. Bundle size

Basically, the maximum memory an SDK Harness needs is however much memory
the user DoFn takes to process the largest bundle. Since the bundle size is
determined by the Runner, limiting SDK Harness memory usage means ensuring
that our Runner selects small bundle sizes.

However, looking through some design docs and the code, it seems like:

   - sdk_worker.py
   <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/sdk_worker.py#L385>
   seems to have multiple active bundle processors at the same time
   - The "Fn API: How to send and receive data" design doc
   <https://docs.google.com/document/d/1IGduUqmhWDi_69l9nG8kw73HZ5WI5wOps9Tshl5wpQA/edit#heading=h.u78ozd9rrlsf>
   seems to describe multiplexing multiple logical streams over a gRPC
   connection

Does this mean that an SDK Harness processes multiple bundles at the same
time? If so, how is the number of concurrent bundles limited?
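
To make the question concrete, here is a rough back-of-envelope sketch of the
mental model above. It is not based on any Beam API: the function name, its
parameters, and the worst-case assumption that a DoFn keeps every element of
a bundle in memory are all hypothetical and only illustrate why the number of
concurrent bundles would matter.

```python
def estimate_peak_memory_bytes(per_element_bytes, max_bundle_elements,
                               concurrent_bundles=1, fixed_overhead_bytes=0):
    """Hypothetical upper-bound estimate of SDK Harness memory usage.

    Assumes (worst case) that the user DoFn holds every element of a bundle
    in memory until the bundle finishes, and that each concurrently active
    bundle needs the same amount of memory.
    """
    # Memory needed to process one bundle of the largest size.
    per_bundle_bytes = per_element_bytes * max_bundle_elements
    # Each active bundle adds its own working set on top of a fixed baseline.
    return fixed_overhead_bytes + concurrent_bundles * per_bundle_bytes


# One bundle at a time vs. four bundles in flight (illustrative numbers only).
single = estimate_peak_memory_bytes(1024, 1000)
concurrent = estimate_peak_memory_bytes(1024, 1000, concurrent_bundles=4)
print(single, concurrent)
```

Under this model, bounding the bundle size alone is not enough: the estimate
scales linearly with however many bundles are active at once.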

More generally, what suggestions do you have for reducing the memory usage
of SDK Harnesses?

Thanks,

Arwin

