[ https://issues.apache.org/jira/browse/SPARK-603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212017#comment-14212017 ]
Andrew Ash commented on SPARK-603:
----------------------------------

Hi Anonymous (not sure how you can read this, but...),

I'm not sure how a counters API could solve the delayed-evaluation problem while being fundamentally different from the existing Accumulator feature. You'd have to make accessing a counter force evaluation of every RDD it's accessed from, and you'd have issues when the RDD is recomputed later, once it's actually needed, causing a duplicate access.

I can see some kind of API that offers a hook into the RDD evaluation DAG, but one that operates at the stage level rather than the operation level (multiple maps, for example, are pipelined together into a single stage), so the available hook points wouldn't map 1:1 to RDD operations, which would be quite tricky.

I don't see a way to implement what you propose with Counters without compromising major parts of the Spark API contract (RDD laziness), so I propose to close, especially given that I have no way to contact you: your information doesn't appear on the prior Atlassian Jira either: https://spark-project.atlassian.net/browse/SPARK-603

[~rxin] are you ok with closing this?

> add simple Counter API
> ----------------------
>
>                 Key: SPARK-603
>                 URL: https://issues.apache.org/jira/browse/SPARK-603
>             Project: Spark
>          Issue Type: New Feature
>            Priority: Minor
>
> Users need a very simple way to create counters in their jobs. Accumulators provide a way to do this, but they are a little clunky, for two reasons:
> 1) the setup is a nuisance
> 2) with delayed evaluation, you don't know when the job will actually run, so it's hard to know when to look at the values
> Consider this code:
> {code}
> def filterBogus(rdd: RDD[MyCustomClass], sc: SparkContext) = {
>   val filterCount = sc.accumulator(0)
>   val filtered = rdd.filter { r =>
>     if (isOK(r)) true else { filterCount += 1; false }
>   }
>   println("removed " + filterCount.value + " records")
>   filtered
> }
> {code}
> The println will always say 0 records were filtered, because it's printed before anything has actually run. I could print the value out later, but that would destroy the modularity of the method -- it's kind of ugly to return the accumulator just so it can get printed later on. (And of course, the caller in turn might not know when the filter is going to get applied, and would have to pass the accumulator up even further...)
> I'd like to have Counters which just automatically get printed out whenever a stage has been run, plus some API to read them back. I realize this is tricky because a stage can get re-computed, so maybe the counters should only be incremented once.
> Maybe a more general way to do this is to provide a callback for whenever an RDD is computed -- by default it would just print the counters, but the user could replace it with a custom handler.
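A minimal sketch of the usual workaround for the example above, against the old (pre-2.0) accumulator API: cache the filtered RDD so the filter only runs once (re-computation is exactly the duplicate-counting problem the comment mentions), then force evaluation with an action before reading the accumulator. MyCustomClass and isOK are hypothetical stand-ins for definitions the original example assumes:

{code}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // implicit AccumulatorParam instances on old versions
import org.apache.spark.rdd.RDD

object FilterBogusExample {
  case class MyCustomClass(value: Int)               // stand-in for the issue's class
  def isOK(r: MyCustomClass): Boolean = r.value >= 0 // stand-in predicate

  def filterBogus(rdd: RDD[MyCustomClass], sc: SparkContext): RDD[MyCustomClass] = {
    val filterCount = sc.accumulator(0)
    val filtered = rdd.filter { r =>
      if (isOK(r)) true else { filterCount += 1; false }
    }
    filtered.cache() // keep the result so later actions don't re-run the filter and double-count
    filtered.count() // an action: forces the filter to actually run now
    println("removed " + filterCount.value + " records") // now reflects the real count
    filtered
  }
}
{code}

The stage-level hook the comment describes roughly corresponds to Spark's SparkListener developer API, which fires per stage rather than per RDD operation. A rough sketch, assuming an existing SparkContext named sc, that just prints the name of each completed stage:

{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// Fires once per completed stage; pipelined operations (e.g. several
// consecutive maps) share one stage, so there is no per-operation hook here.
sc.addSparkListener(new SparkListener {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    println("stage finished: " + stageCompleted.stageInfo.name)
  }
})
{code}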