[ 
https://issues.apache.org/jira/browse/SPARK-10801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gianmario Spacagna updated SPARK-10801:
---------------------------------------
    Description: 
The current implementation of org.apache.spark.util.StatCounter is mutable and 
not thread-safe.
The API for creating it is also limiting, since it only exposes a constructor 
that takes a TraversableOnce[Double].
Moreover, the current implementation does not define equality.

My proposal is to use a case class that stores the minimum set of fields 
necessary to compute the statistics, so that the Monoid pattern can easily be 
applied to reduce an RDD or a Scala collection of StatCounter instances into a 
single StatCounter.

I re-implemented and tested StatCounter at work after I found a bug when 
trying to merge multiple StatCounter instances in parallel using a Scalaz 
Monoid. I would like to submit a pull request with that functional, clean, and 
concise re-implementation.

This would be the declaration of the class:

case class StatCounter(n: Long, sum: Double, sos: Double, min: Double, max: Double)

That would also reduce the implementation of variance to a single line:
def variance = (sos - n * mean * mean) / (n - 1)
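
As a minimal sketch of how the declaration above could support Monoid-style 
merging, the following assumes that sos is the sum of squares and that mean is 
derived as sum / n. The names merge, empty, of and fromValues are hypothetical 
and do not come from the proposed pull request or from the existing Spark API.

```scala
// Illustrative sketch only: everything except the five fields and the
// variance formula is an assumption made for this example.
final case class StatCounter(n: Long, sum: Double, sos: Double, min: Double, max: Double) {

  // Mean of the observed values; NaN for an empty counter (0.0 / 0).
  def mean: Double = sum / n

  // The one-line formula from the description. With the (n - 1) denominator
  // this is the sample variance.
  def variance: Double = (sos - n * mean * mean) / (n - 1)

  // Associative combine: counts and sums add, extremes take min/max.
  def merge(other: StatCounter): StatCounter =
    StatCounter(
      n + other.n,
      sum + other.sum,
      sos + other.sos,
      math.min(min, other.min),
      math.max(max, other.max))
}

object StatCounter {
  // Identity element: merging with `empty` leaves a counter unchanged.
  val empty: StatCounter =
    StatCounter(0L, 0.0, 0.0, Double.PositiveInfinity, Double.NegativeInfinity)

  // Counter for a single observation.
  def of(x: Double): StatCounter = StatCounter(1L, x, x * x, x, x)

  // Build a counter from any collection of values.
  def fromValues(values: Iterable[Double]): StatCounter =
    values.foldLeft(empty)((acc, x) => acc.merge(of(x)))
}
```

Under these assumptions, a collection of counters could be reduced with 
counters.foldLeft(StatCounter.empty)(_ merge _), and an RDD[Double] with 
something like rdd.map(StatCounter.of).fold(StatCounter.empty)(_ merge _). 
Because the instances are immutable, merging from several threads needs no 
locking, and the case class provides structural equality for free. Note that 
the sum-of-squares form can lose precision when the mean is large relative to 
the spread of the values, so it may need care for such data.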


> StatCounter uses mutability and is not thread-safe
> --------------------------------------------------
>
>                 Key: SPARK-10801
>                 URL: https://issues.apache.org/jira/browse/SPARK-10801
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Gianmario Spacagna


