Github user mccheah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21366#discussion_r194574775

--- Diff: pom.xml ---
@@ -760,6 +760,12 @@
       <version>1.10.19</version>
       <scope>test</scope>
     </dependency>
+    <dependency>
--- End diff --

> One question is how about performance at scale when you get events from hundreds of executors at once which framework should work best? Should we worry about this?

We already run into this today, since we open a Watch that streams all events down anyway. In any implementation where events are processed at fixed intervals, there has to be some buffering, or else we choose to drop some events and only look at the most up-to-date snapshot at each interval. As discussed in https://github.com/apache/spark/pull/21366#discussion_r194181797, we really want to process as many of the events as possible, so we're stuck with buffering somewhere; and regardless of which observable or reactive-programming framework we pick, we still have to store `O(E)` items, where `E` is the number of events.

Aside from the buffering, we'd also need to consider the scale of the event stream flowing over the persistent HTTP connection backing the Watch. In this regard we are no different from other custom controllers in the Kubernetes ecosystem, which also have to manage large numbers of pods.
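To make the trade-off concrete, here is a minimal, hypothetical sketch (not code from this PR) contrasting the two strategies the comment describes: buffering every watch event until the next processing interval, which costs `O(E)` memory but loses nothing, versus keeping only the latest state per pod, which bounds memory by the pod count but drops intermediate transitions. The `BufferingProcessor`/`SnapshotProcessor` names and the dict-shaped events are illustrative assumptions, not anything from the Spark or Kubernetes client APIs.

```python
from queue import Empty, Queue


class BufferingProcessor:
    """Retain every event until the next interval: O(E) memory, no event lost."""

    def __init__(self):
        self._buffer = Queue()

    def on_event(self, event):
        # Called from the watch thread for each incoming event.
        self._buffer.put(event)

    def process_interval(self):
        # Drain everything buffered since the last interval.
        batch = []
        while True:
            try:
                batch.append(self._buffer.get_nowait())
            except Empty:
                break
        return batch  # downstream handlers see every event


class SnapshotProcessor:
    """Keep only the latest state per pod: O(pods) memory, but intermediate
    transitions between two intervals are silently dropped."""

    def __init__(self):
        self._latest = {}

    def on_event(self, event):
        # Later events for the same pod overwrite earlier ones.
        self._latest[event["pod"]] = event

    def process_interval(self):
        snapshot = list(self._latest.values())
        self._latest.clear()
        return snapshot
```

Under a burst of events from hundreds of executors, both variants ingest the same stream; the difference is only in what the periodic processing step can observe, which is why the comment concludes that `O(E)` buffering is unavoidable once we decide every event must be processed.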