[ https://issues.apache.org/jira/browse/SPARK-603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212017#comment-14212017 ]
Andrew Ash commented on SPARK-603:
----------------------------------

Hi Anonymous (not sure how you can read this, but...),

I'm not sure how a counters API could solve the delayed-evaluation problem while being fundamentally different from the existing Accumulator feature. You'd have to make accessing a counter force evaluation of every RDD it's accessed from, and you'd have issues when the RDD is recomputed later, once it's actually needed, causing a duplicate access.

I can see some kind of API that offers a hook into the RDD evaluation DAG, but one that operates at the stage level rather than the operation level (multiple maps, for example, are pipelined together into a single stage), so the available hook points wouldn't map 1:1 to RDD operations, which would be quite tricky.

I don't see a way to implement what you propose with Counters without compromising major parts of the Spark API contract (RDD laziness), so I propose to close, especially given that I have no way to contact you: your information doesn't appear on the prior Atlassian Jira either: https://spark-project.atlassian.net/browse/SPARK-603

[~rxin] are you ok with closing this?

> add simple Counter API
> ----------------------
>
>                 Key: SPARK-603
>                 URL: https://issues.apache.org/jira/browse/SPARK-603
>             Project: Spark
>          Issue Type: New Feature
>            Priority: Minor
>
> Users need a very simple way to create counters in their jobs. Accumulators provide a way to do this, but they are a little clunky, for two reasons:
> 1) the setup is a nuisance
> 2) with delayed evaluation, you don't know when the job will actually run, so it's hard to know when to look at the values
> Consider this code:
> {code}
> def filterBogus(rdd: RDD[MyCustomClass], sc: SparkContext) = {
>   val filterCount = sc.accumulator(0)
>   val filtered = rdd.filter { r =>
>     if (isOK(r)) true else { filterCount += 1; false }
>   }
>   println("removed " + filterCount.value + " records")
>   filtered
> }
> {code}
> The println will always say 0 records were filtered, because it's printed before anything has actually run. I could print the value out later, but that would destroy the modularity of the method -- it's kind of ugly to return the accumulator just so it can get printed later on. (And of course, the caller in turn might not know when the filter is going to get applied, and would have to pass the accumulator up even further...)
> I'd like to have Counters which just automatically get printed out whenever a stage has been run, plus some API to read them back. I realize this is tricky because a stage can get re-computed, so maybe the counters should only be incremented once.
> Maybe a more general way to do this is to provide a callback for whenever an RDD is computed -- by default it would just print the counters, but the user could replace it with a custom handler.
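A minimal sketch of the usual workaround for the example above, against the old (pre-2.0) accumulator API: cache the filtered RDD so the filter only runs once (re-computation is exactly the duplicate-counting problem the comment mentions), then force evaluation with an action before reading the accumulator. MyCustomClass and isOK are hypothetical stand-ins for definitions the original example assumes:

{code}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // implicit AccumulatorParam instances on old versions
import org.apache.spark.rdd.RDD

object FilterBogusExample {
  case class MyCustomClass(value: Int)               // stand-in for the issue's class
  def isOK(r: MyCustomClass): Boolean = r.value >= 0 // stand-in predicate

  def filterBogus(rdd: RDD[MyCustomClass], sc: SparkContext): RDD[MyCustomClass] = {
    val filterCount = sc.accumulator(0)
    val filtered = rdd.filter { r =>
      if (isOK(r)) true else { filterCount += 1; false }
    }
    filtered.cache() // keep the result so later actions don't re-run the filter and double-count
    filtered.count() // an action: forces the filter to actually run now
    println("removed " + filterCount.value + " records") // now reflects the real count
    filtered
  }
}
{code}

The stage-level hook the comment describes roughly corresponds to Spark's SparkListener developer API, which fires per stage rather than per RDD operation. A rough sketch, assuming an existing SparkContext named sc, that just prints the name of each completed stage:

{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// Fires once per completed stage; pipelined operations (e.g. several
// consecutive maps) share one stage, so there is no per-operation hook here.
sc.addSparkListener(new SparkListener {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    println("stage finished: " + stageCompleted.stageInfo.name)
  }
})
{code}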