lostluck commented on issue #21817:
URL: https://github.com/apache/beam/issues/21817#issuecomment-1160834721

   While that will probably help in your specific case, the current behavior is working as intended, and changing it could adversely affect other user jobs unless such a change were benchmarked at scale. The SDK is supposed to cache the iterable part of a GBK to improve throughput in the more common cases.
   
   Beam does have a facility to page larger iterables in a smaller chunk at a time over the state API, but it needs to be supported by the runner in question. I know Dataflow supports it; I don't know about the open source runners. Paging the data in doesn't break the semantics, but it does reduce the memory peak when applicable.
   
   That's handled here:
   
https://sourcegraph.com/github.com/apache/beam@de5c56a5b8a8a030e7e67323a696d52495e37f7f/-/blob/sdks/go/pkg/beam/core/runtime/exec/datasource.go?L219
   
   I feel that the single-machine / small-scale case isn't one that Beam has put serious effort into, in either testing or production scenarios. Open source gets contributions driven by the needs of its users, so it's possible that need hasn't existed so far for the density of processing you're dealing with.
   
   Personally, I'm hacking together a replacement for the Go Direct runner, but for the time being it has test-oriented goals rather than performance ones.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]