[jira] [Commented] (SPARK-21057) Do not use a PascalDistribution in countApprox

2017-06-12 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-21057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16046402#comment-16046402 ]

Apache Spark commented on SPARK-21057:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/18276

> Do not use a PascalDistribution in countApprox
> --
>
> Key: SPARK-21057
> URL: https://issues.apache.org/jira/browse/SPARK-21057
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>  Affects Versions: 2.1.1
>  Reporter: Lovasoa
>
> I was reading the source of Spark and found this:
> https://github.com/apache/spark/blob/v2.1.1/core/src/main/scala/org/apache/spark/partial/CountEvaluator.scala#L50-L72
> This is the function that estimates the probability distribution of the total
> count of elements in an RDD given the counts of only some of its partitions.
> The function does something strange: when the number of elements counted so
> far is less than 10,000, it models the total count with a negative binomial
> (Pascal) distribution; otherwise, it models it with a Poisson distribution.
> Modeling the number of uncounted elements with a negative binomial
> distribution amounts to saying that we ran over the elements, counting only
> some of them, and stopped after having counted a given number of them.
> But that is not what actually happened: our counting was limited in time, not
> in the number of counted elements, and we cannot count only some of the
> elements in a partition.
> I propose to use the Poisson distribution in every case, as it can be
> justified under the hypothesis that the numbers of elements in the partitions
> are independent and Poisson-distributed.
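For concreteness, here is a minimal sketch of the proposed Poisson-only estimate, using the commons-math3 PoissonDistribution that the current code already relies on. The helper name and the numbers are illustrative assumptions, not the actual Spark code.

{code:scala}
import org.apache.commons.math3.distribution.PoissonDistribution

// Sketch only: c = elements counted so far, p = fraction of partitions
// counted so far. Under the Poisson hypothesis the total count has mean c/p.
def poissonBound(c: Long, p: Double, confidence: Double): (Int, Int) = {
  val dist = new PoissonDistribution(c / p)
  val low  = dist.inverseCumulativeProbability((1.0 - confidence) / 2)
  val high = dist.inverseCumulativeProbability((1.0 + confidence) / 2)
  (low, high)
}

// e.g. 5,000 elements counted in half of the partitions, at 95% confidence:
// poissonBound(5000L, 0.5, 0.95) gives roughly (9804, 10196)
{code}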






[jira] [Commented] (SPARK-21057) Do not use a PascalDistribution in countApprox

2017-06-11 Thread Sean Owen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-21057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16046150#comment-16046150 ]

Sean Owen commented on SPARK-21057:
---

I don't know if that's relevant to the analysis. Nothing is literally picking
elements with or without replacement, and the Poisson model doesn't literally
model what's actually taking place either. The argument isn't solely that the
expected value is correct, of course. Still, I don't see why the negative
binomial was introduced here, and it no longer looks useful or even valid. I
have a PR ready anyway, as I needed to test it.







[jira] [Commented] (SPARK-21057) Do not use a PascalDistribution in countApprox

2017-06-11 Thread Lovasoa (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-21057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16046145#comment-16046145 ]

Lovasoa commented on SPARK-21057:
-

{quote}
The hypothetical is that you pick elements from the total data set at random,
and a fraction p of the time you 'succeed' in picking from among the elements
already counted. The rest of the time you 'fail'. If this goes on long enough
for the number of successes to reach the observed count, then the number of
failures models the size of the rest of the data.
{quote}

The thing is, if you pick elements until you've picked all the ones that were
counted, and you don't want to pick the same element twice, then the
probability of picking a 'counted' element is not constant: the more counted
elements you have already picked, the lower the probability of picking another
one on the next draw.

If you don't mind picking the same element twice, then you might well pick the
same counted element every time and stop.

So this doesn't model the random process we are trying to describe. The fact
that it has the correct expected value (and thus doesn't give nonsensical
results) doesn't mean much: most probability distributions could be made to
fit our expected value.

I don't have much free time at the moment, as I'm finishing my master's
thesis, but I will make a pull request as soon as I find the time.
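For what it's worth, a toy simulation (with hypothetical sizes, not tied to the Spark code) makes the without-replacement point concrete: the number of draws needed to see every counted element differs systematically from what a constant success probability would predict.

{code:scala}
import scala.util.Random

val total   = 100  // toy data set size (illustrative)
val counted = 20   // elements that were actually counted (illustrative)
val rng = new Random(42)

// One run: draw without replacement until every counted element has been
// seen, and return how many draws that took (position of the last one).
def drawsNeeded(): Int = {
  val order = rng.shuffle((1 to total).toList)
  order.lastIndexWhere(_ <= counted) + 1
}

val runs = Seq.fill(100000)(drawsNeeded())
println(runs.sum.toDouble / runs.size)
// ~96.2 here, versus r/p = counted / (counted/total) = 100 draws on average
// if each draw succeeded with constant probability p = counted/total.
{code}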







[jira] [Commented] (SPARK-21057) Do not use a PascalDistribution in countApprox

2017-06-11 Thread Sean Owen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-21057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16046051#comment-16046051 ]

Sean Owen commented on SPARK-21057:
---

That's not the intended interpretation of the negative binomial. The
hypothetical is that you pick elements from the total data set at random, and a
fraction p of the time you 'succeed' in picking from among the elements already
counted. The rest of the time you 'fail'. If this goes on long enough for the
number of successes to reach the observed count, then the number of failures
models the size of the rest of the data.

Certainly, they both have the same expected value (good). I recall the negative
binomial had something to do with problems for very small counts, but that
doesn't appear to be a problem in this code now. If anything, the condition
should be a function of r and p. I also can't go back and find a clear
theoretical justification. I agree, the Poisson analysis seems fine even for
small p and r.
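A quick check of that expected-value agreement, using the same commons-math3 distributions the code imports; c and p here are arbitrary illustrative values:

{code:scala}
import org.apache.commons.math3.distribution.{PascalDistribution, PoissonDistribution}

val c = 1000  // elements counted so far (illustrative)
val p = 0.1   // fraction of partitions counted (illustrative)

// PascalDistribution(c, p) models the failures before the c-th success, so
// the implied total is c + failures, with mean c + c*(1-p)/p = c/p.
val pascal = new PascalDistribution(c, p)
println(c + pascal.getNumericalMean)  // 10000.0

// The Poisson model puts the mean of the total at c/p directly.
val poisson = new PoissonDistribution(c / p)
println(poisson.getNumericalMean)     // 10000.0
{code}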

Try the change and see how it affects the tests.



