Github user squito commented on the pull request: https://github.com/apache/spark/pull/11105#issuecomment-193976996

Hi @holdenk, thanks for working on this, and sorry for such a long delay on my end. I think this looks really promising, and it would fill a big hole. The API seems a reasonable compromise given all the tradeoffs. I'm a little concerned about how this will work with concurrent jobs that share RDDs -- it seems like it will work, but I want to think it over a bit more. My take on the higher-level stuff:

> 1) What do we call consistent accumulators? Data property accumulators? Something else?

Yeah, I'm as bad at naming as the next person ... I do like to think of these as being a property of the data, as opposed to the computation, so I'd like something with that in the name. Going a step further -- I also think that this is actually what Spark users want in almost all cases, and I'd almost prefer this were the default, rather than requiring a special flag. I think if you are actually trying to measure something about the computation, you're an advanced user / writing a framework / know the internals. That said, I am still a little nervous about changing something like this with 2.0; just mentioning it in case everyone else feels similarly.

> 2) Right now they only work inside of MapPartitionRDDs (which covers the places where user code would want to put data property accumulators) - but do we want something like this for internal use and if so what situations do we want them in?

What about `reduceByKey` and friends? And `treeReduce`? User-defined RDDs? Should this really be put into *all* RDDs, e.g., by adding a new `private[spark] doCompute()` which wraps the RDD's `compute()` or something like that?

> 3) Do we want to allow the user to retrieve the value per RDD for consistent accumulators?

Even more than that, I wish accumulators made it easy to tell when they were "ready".
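As a rough illustration of both points -- "property of the data" semantics from (1), and per-RDD "readiness" -- here is a minimal sketch with no Spark machinery at all. This is not the PR's API; `DataPropertyCounter`, `isReady`, and the parameter names are made up for illustration:

```scala
// Sketch only (not Spark's API): a counter that dedupes per-partition
// updates, so a retried or recomputed partition counts once, and that
// can report when every partition of one RDD has been included.
class DataPropertyCounter(numPartitions: Int) {
  private var total = 0L
  private val seen = scala.collection.mutable.Set.empty[Int]

  def add(partitionId: Int, delta: Long): Unit = {
    // "Property of the data": a second update from the same partition
    // (e.g. a speculative retry) is ignored rather than double-counted.
    if (seen.add(partitionId)) total += delta
  }

  // "Ready" here means: updates from all partitions have arrived.
  def isReady: Boolean = seen.size == numPartitions
  def value: Long = total
}

val acc = new DataPropertyCounter(numPartitions = 2)
acc.add(partitionId = 0, delta = 3)
acc.add(partitionId = 0, delta = 3) // retry of partition 0: ignored
acc.add(partitionId = 1, delta = 2)
// acc.value == 5, acc.isReady == true
```

The dedup-by-partition behavior is what distinguishes this from a plain accumulator, where the retry above would yield 8 instead of 5.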
That doesn't mean anything in the general case, since there could always be more updates coming in future actions -- but you could at least tell if updates from all partitions to one RDD had been included. That said, I'm ok if folks would prefer to keep this out of the initial version to keep things small. Without that feature, I don't really see the point in getting the value per RDD. You can always make a separate accumulator per RDD (and probably should if you want to), so I don't see much benefit to letting users query per RDD.

> 4) Do we want to expose a Java API right away as well?

I know folks that want this in Java & Python too; it's such an important feature that I think we need to provide them all :) But that part of the implementation should be more straightforward; imo you can wait on it till we iron out the larger design issues.

> 5) Right now the user facing API just allows users to create accumulators rather than general accumulables - do we want to offer both?

I don't see any reason why we shouldn't offer both. The one compelling use I can think of is if you have an array of counters, rather than a bunch of individually named counters (users bucketed into categories, and you want to count how many are in each category). You could make an array of accumulators instead of an accumulator of arrays, but it gets a little trickier if you have custom types or something. Again, I think that can wait on the design discussion, and it's also ok if it's added separately. (Less important than Java & Python, imo.)
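The "accumulator of arrays" case in (5) can be sketched without any Spark machinery: each partition builds a local histogram and the driver merges them element-wise. Everything here is illustrative -- `numCategories`, `categoryOf`, and the age thresholds are invented for the example:

```scala
// Sketch: one accumulable whose value is an Array[Long] of per-category
// counts, merged element-wise, instead of N individually named counters.
val numCategories = 3
def categoryOf(age: Int): Int =          // made-up bucketing rule
  if (age < 18) 0 else if (age < 65) 1 else 2

// The "merge" an accumulable of arrays would need on the driver side.
def merge(a: Array[Long], b: Array[Long]): Array[Long] =
  a.zip(b).map { case (x, y) => x + y }

// Each "partition" produces a local histogram.
def localCounts(ages: Seq[Int]): Array[Long] = {
  val counts = Array.fill(numCategories)(0L)
  ages.foreach(a => counts(categoryOf(a)) += 1L)
  counts
}

val partitions = Seq(Seq(10, 30, 70), Seq(25, 40))
val total = partitions.map(localCounts).reduce(merge)
// total: Array(1, 3, 1) -- one minor, three adults, one senior
```

With custom element types you would also need a user-supplied zero and merge function, which is exactly where a general accumulable API is easier than hand-rolling an array of separate accumulators.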