[ https://issues.apache.org/jira/browse/SPARK-21057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-21057: ------------------------------------ Assignee: (was: Apache Spark) > Do not use a PascalDistribution in countApprox > ---------------------------------------------- > > Key: SPARK-21057 > URL: https://issues.apache.org/jira/browse/SPARK-21057 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.1.1 > Reporter: Lovasoa > > I was reading the source of Spark, and found this: > https://github.com/apache/spark/blob/v2.1.1/core/src/main/scala/org/apache/spark/partial/CountEvaluator.scala#L50-L72 > This is the function that estimates the probability distribution of the total > count of elements in an RDD given the count of only some partitions. > This function does a strange thing: when the number of elements counted so > far is less than 10 000, it models the total count with a negative binomial > (Pascal) law, else, it models it with a Poisson law. > Modeling our number of uncounted elements with a negative binomial law is > like saying that we ran over elements, counting only some, and stopping after > having counted a given number of elements. > But this does not model what really happened. Our counting was limited in > time, not in number of counted elements, and we can't count only some of the > elements in a partition. > I propose to use the Poisson distribution in every case, as it can be > justified under the hypothesis that the number of elements in each partition > is independent and follows a Poisson law. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org