Github user mccheah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21366#discussion_r194574775
  
    --- Diff: pom.xml ---
    @@ -760,6 +760,12 @@
             <version>1.10.19</version>
             <scope>test</scope>
           </dependency>
    +      <dependency>
    --- End diff ---
    
    > One question is how about performance at scale when you get events from 
hundreds of executors at once which framework should work best? Should we worry 
about this?
    
    We already run into this problem, since we open a Watch that streams all 
events down anyway. In any implementation where we want events to be processed 
at different intervals, there needs to be some buffering; otherwise we have to 
ignore some events and only look at the most up-to-date snapshot at each 
interval. As discussed in 
https://github.com/apache/spark/pull/21366#discussion_r194181797 we really want 
to process as many of the events as possible, so we're stuck with buffering 
somewhere, and regardless of the observable or reactive-programming framework 
we pick, we still have to store `O(E)` items, `E` being the number of events. 
Aside from the buffering, we'd also need to consider the scale of the stream of 
events flowing over the persistent HTTP connection backing the Watch. In this 
regard we are no different from other custom controllers in the Kubernetes 
ecosystem, which have to manage large numbers of pods.
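
    To make the buffering argument concrete, here is a minimal sketch of the 
scheme described above: the Watch callback enqueues every event, and a poller 
drains the whole buffer at each interval so no event is dropped. The names 
`EventBufferSketch` and `PodEvent` are hypothetical stand-ins, not the actual 
fabric8 client or Spark types; storage is `O(E)` in the number of events 
buffered between drains, matching the argument above.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.LinkedBlockingQueue;

    public class EventBufferSketch {
        // Hypothetical event record; the real code would carry the Pod object
        // delivered by the Watch rather than plain strings.
        static final class PodEvent {
            final String podName;
            final String action; // e.g. "ADDED", "MODIFIED", "DELETED"
            PodEvent(String podName, String action) {
                this.podName = podName;
                this.action = action;
            }
        }

        // Unbounded thread-safe queue: holds O(E) items between drains.
        private final LinkedBlockingQueue<PodEvent> buffer = new LinkedBlockingQueue<>();

        // Called from the Watch's event-callback thread for every event.
        public void onEvent(PodEvent event) {
            buffer.add(event);
        }

        // Called on the polling interval: drain and return every buffered
        // event, so all events are processed rather than only the latest
        // snapshot.
        public List<PodEvent> drain() {
            List<PodEvent> snapshot = new ArrayList<>();
            buffer.drainTo(snapshot);
            return snapshot;
        }

        public static void main(String[] args) {
            EventBufferSketch sketch = new EventBufferSketch();
            for (int i = 0; i < 5; i++) {
                sketch.onEvent(new PodEvent("executor-" + i, "ADDED"));
            }
            System.out.println("drained " + sketch.drain().size() + " events");
            System.out.println("remaining " + sketch.drain().size());
        }
    }
    ```

    The point of the sketch is only that whichever framework sits in front of 
this (observables or otherwise), something equivalent to that queue has to 
exist between the Watch thread and the interval-based consumer.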

