Github user colorant commented on the pull request:

    https://github.com/apache/spark/pull/916#issuecomment-45981871
  
    @dorx Do you think this works for an extremely large data set with a really small sample size, e.g. n = 1.0x10^11 while sample = 1? In that case, the final adjusted fraction comes out to around 1.2x10^-9; in theory, that still gives a 99.99% chance of getting the sample. But since Double also has precision issues, do you think that is enough to guarantee the 99.99% chance under this extreme condition? I am asking because, in this very case, the original code, (3 x (1 + 1)) / total, gives a fraction of around 6x10^-10, which is just about half of the new value, and under that fraction it could keep looping forever and never get a chance to return that 1 sample.
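    
    For reference, a minimal sketch (mine, not from the PR) of the "by theory" check, assuming the sampler keeps each element independently with probability equal to the fraction. The n and the two fractions are just the numbers quoted above; log1p is one way to sidestep the Double precision loss when p is on the order of 1e-9.
    
    ```scala
    object SampleFractionCheck {
      // P(at least one element kept) when each of n elements is kept
      // independently with probability p: 1 - (1 - p)^n.
      // Computing via log1p keeps precision for tiny p, where
      // forming (1 - p) directly would throw away low-order bits.
      def pAtLeastOne(n: Double, p: Double): Double =
        1.0 - math.exp(n * math.log1p(-p))
    
      def main(args: Array[String]): Unit = {
        val n = 1.0e11          // hypothetical total count from the example above
        val adjusted = 1.2e-9   // adjusted fraction quoted above (new code)
        val original = 6.0e-10  // (3 x (1 + 1)) / total fraction quoted above (old code)
        println(s"adjusted fraction: P(>=1 sample) = ${pAtLeastOne(n, adjusted)}")
        println(s"original fraction: P(>=1 sample) = ${pAtLeastOne(n, original)}")
      }
    }
    ```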

