lostluck commented on issue #21817:
URL: https://github.com/apache/beam/issues/21817#issuecomment-1160834721

   While that will probably help in your specific case, the current behavior is working as intended, and changing it could adversely affect other user jobs unless such a change were benchmarked at scale. The SDK is supposed to cache the iterable part of a GBK to improve throughput in the more common cases.
   
   Beam does have a facility to page larger iterables in a smaller chunk at a time over the state API, but it needs to be supported by the runner in question. I know Dataflow supports it; I don't know about the open source runners. Paging the data in doesn't break the semantics, but it does reduce the memory peak when applicable.
   
   That's handled here:
   
https://sourcegraph.com/github.com/apache/beam@de5c56a5b8a8a030e7e67323a696d52495e37f7f/-/blob/sdks/go/pkg/beam/core/runtime/exec/datasource.go?L219
   
   I feel that the single-machine / small-scale case isn't one that Beam has put serious effort into, in either testing or production scenarios. Open source gets contributions driven by the needs of its users, so it's possible that need hasn't existed so far for the density of processing you're dealing with.
   
   Personally, I'm hacking together a replacement for the Go Direct runner, but for the time being it has test-oriented goals rather than performance ones.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]