Hi all,

Thank you all for your answers. By the way, I also noticed that Flink doesn't support a "stratified sampling" function for DataSet (only simple random sampling). It would be nice if someone could create a Jira issue for it and assign it to me so that I can work on it.
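
To make the request concrete, here is a rough sketch of what I mean by stratified sampling. It is plain Java over in-memory lists, not Flink code: the class StratifiedSampleSketch, the method sample, and its parameters are names I made up for illustration, and none of them exist in DataSetUtils. The sketch assumes a fixed sampling fraction per stratum, without replacement; other allocation schemes are possible.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Function;

// Hypothetical illustration only; this is NOT an existing Flink class.
class StratifiedSampleSketch {

    // Groups the input by a stratum key, then draws ceil(fraction * size)
    // elements without replacement from every stratum.
    static <T, K> Map<K, List<T>> sample(
            List<T> data, Function<T, K> keyOf, double fraction, long seed) {
        if (fraction < 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in [0, 1]");
        }

        // Step 1: split the data into strata.
        Map<K, List<T>> strata = new HashMap<>();
        for (T item : data) {
            strata.computeIfAbsent(keyOf.apply(item), k -> new ArrayList<>()).add(item);
        }

        // Step 2: simple random sampling inside each stratum.
        Random rnd = new Random(seed);
        Map<K, List<T>> result = new HashMap<>();
        for (Map.Entry<K, List<T>> entry : strata.entrySet()) {
            List<T> group = entry.getValue();
            Collections.shuffle(group, rnd);
            int n = (int) Math.ceil(fraction * group.size());
            result.put(entry.getKey(), new ArrayList<>(group.subList(0, n)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 40; i++) {
            data.add(i);
        }
        // Four strata (value modulo 4), half of each stratum is kept.
        Map<Integer, List<Integer>> sampled = sample(data, x -> x % 4, 0.5, 42L);
        sampled.forEach((k, v) -> System.out.println("stratum " + k + ": " + v));
    }
}
```

The point is that every stratum is represented in the sample in proportion to its size, which simple random sampling over the whole DataSet does not guarantee for small groups.
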

Thank you,
Do

On Mon, Jul 11, 2016 at 11:44 AM, Vasiliki Kalavri <vasilikikala...@gmail.com> wrote:

> Hi Do,
>
> Paris and Martha worked on sampling techniques for data streams on Flink
> last year. If you want to implement your own samplers, you might find
> Martha's master thesis helpful [1].
>
> -Vasia.
>
> [1]: http://kth.diva-portal.org/smash/get/diva2:910695/FULLTEXT01.pdf
>
> On 11 July 2016 at 11:31, Kostas Kloudas <k.klou...@data-artisans.com> wrote:
>
> > Hi Do,
> >
> > In DataStream you can always implement your own
> > sampling function, hopefully without too much effort.
> >
> > Adding such functionality it to the API could be a good idea.
> > But given that in sampling there is no “one-size-fits-all”
> > solution (as not every use case needs random sampling and not
> > all random samplers fit to all workloads), I am not sure if we
> > should start adding different sampling operators.
> >
> > Thanks,
> > Kostas
> >
> > > On Jul 9, 2016, at 5:43 PM, Greg Hogan <c...@greghogan.com> wrote:
> > >
> > > Hi Do,
> > >
> > > DataSet provides a stable @Public interface. DataSetUtils is marked
> > > @PublicEvolving which is intended for public use, has stable behavior, but
> > > method signatures may change. It's also good to limit DataSet to common
> > > methods whereas the utility methods tend to be used for specific
> > > applications.
> > >
> > > I don't have the pulse of streaming but this sounds like a useful feature
> > > that could be added.
> > >
> > > Greg
> > >
> > > On Sat, Jul 9, 2016 at 10:47 AM, Le Quoc Do <lequo...@gmail.com> wrote:
> > >
> > >> Hi all,
> > >>
> > >> I'm working on approximate computing using sampling techniques. I
> > >> recognized that Flink supports the sample function for Dataset
> > >> (org/apache/flink/api/java/utils/DataSetUtils.java). I'm just wondering why
> > >> you didn't merge the function to org/apache/flink/api/java/DataSet.java
> > >> since the sample function works as a transformation operator?
> > >>
> > >> The second question is that are you planning to support the sample
> > >> function for DataStream (within windows) since I did not see it in
> > >> DataStream code ?
> > >>
> > >> Thank you,
> > >> Do