[ 
https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14708658#comment-14708658
 ] 

ASF GitHub Bot commented on FLINK-1901:
---------------------------------------

Github user ChengXiangLi commented on the pull request:

    https://github.com/apache/flink/pull/949#issuecomment-133990256
  
    Hi @sachingoel0101, when sampling with a fraction, it's not easy to verify 
whether the DataSet was sampled with the input fraction. In the test, I take 5 
samples, use their average size to compute the resulting fraction, and then 
verify that it differs from the input fraction by no more than 10%. The 
following case may still happen: the sampler samples the DataSet with the input 
fraction, but the sampled result is small or large enough to fall outside our 
verification condition. This happens only with very small probability, say less 
than 0.001 in this test. It should be OK if this failure happens very 
occasionally; please let me know if you find that it happens more often.
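
The check described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the actual test code from the pull request: the function names are hypothetical, a simple Bernoulli sampler stands in for the Flink sampler, and the 10% bound is assumed to be relative to the input fraction.

```python
import random


def bernoulli_sample(data, fraction, rng):
    """Keep each element independently with probability `fraction`.

    Stand-in sampler for illustration only (not the Flink implementation).
    """
    return [x for x in data if rng.random() < fraction]


def average_sampled_fraction(data, fraction, rounds=5, seed=None):
    """Sample `rounds` times and return average sample size / input size."""
    rng = random.Random(seed)
    total = sum(len(bernoulli_sample(data, fraction, rng)) for _ in range(rounds))
    return total / rounds / len(data)


def fraction_within_tolerance(data, fraction, rounds=5, tolerance=0.1, seed=None):
    """Return True if the averaged fraction is within `tolerance` of `fraction`.

    Assumption: the tolerance is relative to the input fraction.
    """
    observed = average_sampled_fraction(data, fraction, rounds, seed)
    return abs(observed - fraction) <= tolerance * fraction
```

Because the sample size is itself random, a check of this shape can fail even when the sampler is correct. Averaging over several rounds and using a larger input both make such a spurious failure less likely.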


> Create sample operator for Dataset
> ----------------------------------
>
>                 Key: FLINK-1901
>                 URL: https://issues.apache.org/jira/browse/FLINK-1901
>             Project: Flink
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Theodore Vasiloudis
>            Assignee: Chengxiang Li
>
> In order to be able to implement Stochastic Gradient Descent and a number of 
> other machine learning algorithms we need to have a way to take a random 
> sample from a Dataset.
> We need to be able to sample with or without replacement from the Dataset, 
> choose the relative or exact size of the sample, set a seed for 
> reproducibility, and support sampling within iterations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
