[
https://issues.apache.org/jira/browse/FLINK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14708658#comment-14708658
]
ASF GitHub Bot commented on FLINK-1901:
---------------------------------------
Github user ChengXiangLi commented on the pull request:
https://github.com/apache/flink/pull/949#issuecomment-133990256
Hi @sachingoel0101, when sampling with a fraction, it's not easy to verify
whether the DataSet was actually sampled with the input fraction. In the test, I
take 5 samples, use the average size to compute the result fraction, then
compare the result fraction with the input fraction and verify that their
difference is no more than 10%. The following case may still happen: the Sampler
samples the DataSet with the input fraction, but the sampled result size is so
small or so large that it falls outside our verification condition. This can
happen, but only with very low probability, say less than 0.001 in this test. It
should be OK if this failure happens very occasionally; please let me know if
you find that it happens more often than that.
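
The check described above can be sketched roughly as follows. This is a plain
Python illustration, not the test code from the pull request: the function names
are hypothetical, the sampler is a simple Bernoulli sampler standing in for the
real one, and the 10% bound is assumed to be relative to the input fraction.

```python
import random


def bernoulli_sample(data, fraction, rng):
    """Keep each element independently with probability `fraction`
    (stand-in for the sampler under test)."""
    return [x for x in data if rng.random() < fraction]


def average_result_fraction(data, fraction, rng, runs=5):
    """Sample `runs` times and return average sample size / input size."""
    sizes = [len(bernoulli_sample(data, fraction, rng)) for _ in range(runs)]
    return (sum(sizes) / runs) / len(data)


def fraction_within_tolerance(data, fraction, rng, runs=5, tolerance=0.1):
    """True if the averaged result fraction differs from the input fraction
    by at most `tolerance`, measured relative to the input fraction
    (an assumption; the comment does not say relative or absolute)."""
    result = average_result_fraction(data, fraction, rng, runs)
    return abs(result - fraction) / fraction <= tolerance


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the check is reproducible
    data = list(range(10000))
    print(fraction_within_tolerance(data, 0.5, rng))
```

Because the sample size is itself random, a check like this can fail even when
the sampler is correct. Averaging over several runs and using a larger input
both narrow the spread of the result fraction and make such failures rarer.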
> Create sample operator for Dataset
> ----------------------------------
>
> Key: FLINK-1901
> URL: https://issues.apache.org/jira/browse/FLINK-1901
> Project: Flink
> Issue Type: Improvement
> Components: Core
> Reporter: Theodore Vasiloudis
> Assignee: Chengxiang Li
>
> In order to be able to implement Stochastic Gradient Descent and a number of
> other machine learning algorithms we need to have a way to take a random
> sample from a Dataset.
> We need to be able to sample with or without replacement from the Dataset,
> choose the relative or exact size of the sample, set a seed for
> reproducibility, and support sampling within iterations.
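
As a rough illustration of the sampling modes listed in the issue description,
here is a minimal single-machine sketch in plain Python. It is not Flink's API
or implementation; all function names are hypothetical, and the fixed-size case
uses reservoir sampling (Algorithm R) as one possible technique.

```python
import random


def sample_fraction(data, fraction, seed=None):
    """Relative size, without replacement: keep each element with
    probability `fraction`. The result size is random."""
    rng = random.Random(seed)
    return [x for x in data if rng.random() < fraction]


def sample_fixed_size(data, k, seed=None):
    """Exact size, without replacement: reservoir sampling (Algorithm R),
    a single pass that does not need the input length up front."""
    rng = random.Random(seed)
    reservoir = []
    for i, x in enumerate(data):
        if i < k:
            reservoir.append(x)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = x
    return reservoir


def sample_fixed_size_with_replacement(data, k, seed=None):
    """Exact size, with replacement: the same element may appear twice.
    `data` must be a sequence here."""
    rng = random.Random(seed)
    return rng.choices(data, k=k)
```

Passing the same `seed` twice yields the same sample, which covers the
reproducibility requirement in this simplified setting.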