[ https://issues.apache.org/jira/browse/SPARK-603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-603:
----------------------------
    Component/s: Spark Core

> add simple Counter API
> ----------------------
>
>                 Key: SPARK-603
>                 URL: https://issues.apache.org/jira/browse/SPARK-603
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Priority: Minor
>
> Users need a very simple way to create counters in their jobs. Accumulators
> provide a way to do this, but are a little clunky, for two reasons:
> 1) the setup is a nuisance
> 2) with delayed evaluation, you don't know when the job will actually run, so
> it's hard to look at the values
> Consider this code:
> {code}
> def filterBogus(rdd: RDD[MyCustomClass], sc: SparkContext) = {
>   val filterCount = sc.accumulator(0)
>   val filtered = rdd.filter { r =>
>     if (isOK(r)) true else { filterCount += 1; false }
>   }
>   println("removed " + filterCount.value + " records")
>   filtered
> }
> {code}
> The println will always say 0 records were filtered, because it runs before
> the filter has actually been evaluated. I could print out the value later on,
> but that would destroy the modularity of the method -- it's ugly to return
> the accumulator just so it can get printed later. (And of course, the caller
> in turn might not know when the filter is going to get applied, and would
> have to pass the accumulator up even further ...)
> I'd like to have Counters which automatically get printed out whenever a
> stage has been run, along with some API to get them back. I realize this is
> tricky because a stage can get re-computed, so maybe you should only
> increment the counters once.
> Maybe a more general way to do this is to provide a callback for whenever an
> RDD is computed -- by default it would just print the counters, but the user
> could replace it with a custom handler.
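
The timing problem described above has a workaround that keeps the method self-contained: force an action before reading the accumulator. A minimal sketch, assuming Spark 2.x's longAccumulator API (the issue predates it; sc.accumulator plays the same role on older versions) and the isOK/MyCustomClass definitions from the snippet:

{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def filterBogus(rdd: RDD[MyCustomClass], sc: SparkContext): RDD[MyCustomClass] = {
  val filterCount = sc.longAccumulator("removed")
  val filtered = rdd.filter { r =>
    if (isOK(r)) true else { filterCount.add(1); false }
  }.cache()          // keep the result so later use doesn't re-run the filter
  filtered.count()   // action: the accumulator stays 0 until something runs
  println("removed " + filterCount.value + " records")
  filtered
}
{code}

The cost is an extra job for the count(), which is exactly the kind of ceremony the issue is complaining about.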
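
The re-computation concern is easy to reproduce. A short sketch, again assuming Spark 2.x accumulators: each action re-runs the uncached map stage, and accumulator updates made inside transformations are applied once per task execution, not once per record.

{code}
val acc = sc.longAccumulator("seen")
val mapped = sc.parallelize(1 to 100).map { x => acc.add(1); x }

mapped.count()   // acc.value is now 100
mapped.count()   // mapped is not cached, so the map re-runs: acc.value is 200
{code}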
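
As for the proposed callback, today's SparkListener API is the closest existing hook: it fires on stage completion rather than per computed RDD, but it supports exactly the "print by default, let the user swap in a handler" shape. A user-land sketch, assuming Spark 2.x; the Counters class and counter() method are invented names for illustration, not anything Spark ships:

{code}
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}
import org.apache.spark.util.LongAccumulator

// Hypothetical helper: named counters that are reported automatically after
// every stage, via a replaceable handler (println by default).
class Counters(
    sc: SparkContext,
    handler: (String, Long) => Unit = (name, v) => println(s"$name = $v")) {

  private var registry = Map.empty[String, LongAccumulator]

  sc.addSparkListener(new SparkListener {
    override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
      registry.foreach { case (name, acc) => handler(name, acc.value) }
  })

  // One-call setup, per the issue: returns a live accumulator.
  def counter(name: String): LongAccumulator = synchronized {
    val acc = sc.longAccumulator(name)
    registry += name -> acc
    acc
  }
}
{code}

Usage, with the hypothetical names above:

{code}
val counters = new Counters(sc)
val removed = counters.counter("removed")
val filtered = rdd.filter { r => if (isOK(r)) true else { removed.add(1); false } }
{code}

Note this still reports cumulative values and would report again for a re-computed stage; the "only increment once" semantics the issue asks for would need support inside the scheduler itself.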