Github user squito commented on the pull request:

    https://github.com/apache/spark/pull/11105#issuecomment-193976996
  
    Hi @holdenk, thanks for working on this, and sorry for such a long delay on 
my end.  I think this looks really promising, and it would fill a big hole.  The 
API seems like a reasonable compromise given all the tradeoffs.  I'm a little 
concerned about how this will work with concurrent jobs that share RDDs -- it 
seems like it will work, but I want to think it over a bit more.
    
    My take on the higher level stuff:
    
    > 1) What do we call consistent accumulators? Data property accumulators? 
Something else?
    
    Yeah, I'm as bad at naming as the next person ... I do like to think of 
these as being a property of the data, as opposed to the computation, so I'd 
like something with that in the name.
    
    Going a step further -- I also think that this is actually what Spark users 
want in almost all cases, and I'd almost prefer this were the default, rather 
than requiring a special flag.  I think if you are actually trying to measure 
something about the computation, you're an advanced user / writing a framework 
/ know the internals.  That said, I am still a little nervous about changing 
something like this with 2.0 -- just mentioning it in case everyone else feels 
similarly.
    
    > 2) Right now they only work inside of MapPartitionRDDs (which covers the 
places where user code would want to put data property accumulators) - but do 
we want something like this for internal use and if so what situations do we 
want them in?
    
    What about `reduceByKey` and friends?  And `treeReduce`?  User-defined 
RDDs?  Should this really be put into *all* RDDs, e.g., by adding a new 
`private[spark] doCompute()` which wraps the RDD's `compute()`, or something 
like that?
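    
    Not real Spark code -- just a sketch of the shape I have in mind, with a 
made-up name for the bookkeeping hook:
    
    ```scala
    import org.apache.spark.{Partition, TaskContext}
    
    abstract class SketchRDD[T](val id: Int) {
      // subclasses keep overriding compute() exactly as they do today
      protected def compute(split: Partition, context: TaskContext): Iterator[T]
    
      // everything else in Spark would call doCompute() instead, so every
      // RDD -- built-in or user-defined -- gets the bookkeeping for free
      private[spark] final def doCompute(
          split: Partition, context: TaskContext): Iterator[T] = {
        // hypothetical hook: record that updates for this (rddId, partition)
        // pair are being produced, so they get applied at most once
        registerDataPropertyUpdates(id, split.index)
        compute(split, context)
      }
    
      private def registerDataPropertyUpdates(rddId: Int, partition: Int): Unit = ()
    }
    ```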
    
    > 3) Do we want to allow the user to retrieve the value per RDD for 
consistent accumulators?
    
    Even more than that, I wish accumulators made it easy to tell when they 
were "ready".  That doesn't mean anything in the general case, since there 
could always be more updates coming in future actions -- but you could at least 
tell whether updates from all partitions of one RDD had been included.
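    
    Purely hypothetical API -- `isCompleteFor` doesn't exist, it's just to show 
what I mean by "ready":
    
    ```scala
    val badLines = sc.accumulator(0L)
    val cleaned = sc.textFile("input.txt").filter { line =>
      val ok = line.nonEmpty
      if (!ok) badLines += 1L
      ok
    }
    cleaned.count()
    // hypothetical: true once updates from every partition of `cleaned`
    // have been folded into the accumulator
    if (badLines.isCompleteFor(cleaned)) println(s"bad lines: ${badLines.value}")
    ```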
    
    That said, I'm OK if folks would prefer to keep this out of the initial 
version to keep things small.
    
    Without that feature, I don't really see the point in getting the value 
per-RDD.  You can always make a separate accumulator per RDD (and probably 
should if you want per-RDD values), so I don't see much benefit to letting 
users query per-RDD.
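    
    I.e., something like this already covers the per-RDD question (again 
assuming a `SparkContext` named `sc`, with illustrative names):
    
    ```scala
    // one accumulator per stage of the pipeline, one question each
    val emptyLines = sc.accumulator(0L)
    val shortLines = sc.accumulator(0L)
    
    val nonEmpty = sc.textFile("input.txt").filter { line =>
      val keep = line.nonEmpty
      if (!keep) emptyLines += 1L
      keep
    }
    val longEnough = nonEmpty.filter { line =>
      val keep = line.length >= 10
      if (!keep) shortLines += 1L
      keep
    }
    longEnough.count()
    // each accumulator already answers a question about exactly one RDD
    ```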
    
    > 4) Do we want to expose a Java API right away as well?
    
    I know folks who want this in Java & Python too -- it's such an important 
feature that I think we need to provide them all :)  But that part of the 
implementation should be more straightforward; IMO you can wait on that till we 
iron out the larger design issues.
    
    > 5) Right now the user facing API just allows users to create accumulators 
rather than general accumulables - do we want to offer both?
    
    I don't see any reason why we shouldn't offer both.  The one compelling use 
I can think of is if you have an array of counters rather than a bunch of 
individually named counters (users bucketed into categories, and you want to 
count how many are in each category).  You could make an array of accumulators 
instead of an accumulator of arrays, but that gets a little trickier if you 
have custom types or something.  Again, I think that can wait on the design 
discussion, and it's also OK if it's added separately.  (Less important than 
Java & Python, IMO.)
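    
    A sketch of that use case both ways -- the array-valued version uses the 
existing `AccumulableParam` interface, with a made-up param class:
    
    ```scala
    import org.apache.spark.AccumulableParam
    
    // (a) an array of accumulators: fine for a fixed, small set of buckets
    val buckets = Array.fill(4)(sc.accumulator(0L))
    sc.parallelize(1 to 100).foreach { x => buckets(x % 4) += 1L }
    
    // (b) one accumulable whose value is the whole array of counts
    class BucketCountsParam(size: Int) extends AccumulableParam[Array[Long], Int] {
      def addAccumulator(counts: Array[Long], bucket: Int): Array[Long] = {
        counts(bucket) += 1; counts
      }
      def addInPlace(a: Array[Long], b: Array[Long]): Array[Long] = {
        for (i <- a.indices) a(i) += b(i); a
      }
      def zero(initial: Array[Long]): Array[Long] = new Array[Long](size)
    }
    val counts = sc.accumulable(new Array[Long](4))(new BucketCountsParam(4))
    sc.parallelize(1 to 100).foreach { x => counts += x % 4 }
    ```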

