Stratified sampling would also be beneficial for the DataSet API. I think
it would be best if this method is also added to DataSetUtils or made
available via the flink-contrib module. Furthermore, I think that it would
be easiest if you created the JIRA for this feature, because you know what
you want to add. For that you have to register at
https://issues.apache.org/jira, if you haven't done this, and then we can
add you as a contributor. Based on the JIRA description we can
discuss possible implementations then.
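
As a starting point for that discussion, here is a minimal plain-Java sketch
of what stratified sampling without replacement means. It is not Flink code
and not a proposed API: the class StratifiedSampler, the method sample and
its parameters are hypothetical names chosen for illustration, and a
DataSet-based version would need a distributed, per-group implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Function;

// Hypothetical helper for illustration only; not part of Flink.
public class StratifiedSampler {

    /**
     * Groups the items into strata by key and draws
     * round(fraction * stratumSize) items from each stratum,
     * without replacement.
     */
    public static <T, K> List<T> sample(
            List<T> items,
            Function<? super T, ? extends K> keyOf,
            double fraction,
            long seed) {
        if (fraction < 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in [0, 1]");
        }

        // Build the strata, preserving first-seen key order.
        Map<K, List<T>> strata = new LinkedHashMap<>();
        for (T item : items) {
            strata.computeIfAbsent(keyOf.apply(item), k -> new ArrayList<>()).add(item);
        }

        // Shuffle each stratum and keep a prefix of the requested size.
        Random rnd = new Random(seed);
        List<T> result = new ArrayList<>();
        for (List<T> stratum : strata.values()) {
            List<T> shuffled = new ArrayList<>(stratum);
            Collections.shuffle(shuffled, rnd);
            int n = (int) Math.round(fraction * shuffled.size());
            result.addAll(shuffled.subList(0, n));
        }
        return result;
    }
}
```

Unlike simple random sampling over the whole data set, every stratum
contributes the same fraction of its elements, so small groups are not
dropped by chance.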

Cheers,
Till

On Tue, Jul 12, 2016 at 12:11 PM, Paris Carbone <par...@kth.se> wrote:

> Hey Do,
>
> I think that more sophisticated samplers could make a better fit in the ML
> library and not in the core API but I am not very familiar with the
> milestones there.
> Maybe the maintainers of the batch ML library could check whether sampling
> techniques would be useful there.
>
> Paris
>
> > On 11 Jul 2016, at 16:15, Le Quoc Do <lequo...@gmail.com> wrote:
> >
> > Hi all,
> >
> > Thank you all for your answers.
> > By the way, I also noticed that Flink doesn't support a "stratified
> > sampling" function (only simple random sampling) for DataSet.
> > It would be nice if someone could create a Jira for it and assign the task
> > to me so that I can work on it.
> >
> > Thank you,
> > Do
> >
> > On Mon, Jul 11, 2016 at 11:44 AM, Vasiliki Kalavri <
> > vasilikikala...@gmail.com> wrote:
> >
> >> Hi Do,
> >>
> >> Paris and Martha worked on sampling techniques for data streams on Flink
> >> last year. If you want to implement your own samplers, you might find
> >> Martha's master thesis helpful [1].
> >>
> >> -Vasia.
> >>
> >> [1]: http://kth.diva-portal.org/smash/get/diva2:910695/FULLTEXT01.pdf
> >>
> >> On 11 July 2016 at 11:31, Kostas Kloudas <k.klou...@data-artisans.com>
> >> wrote:
> >>
> >>> Hi Do,
> >>>
> >>> In DataStream you can always implement your own
> >>> sampling function, hopefully without too much effort.
> >>>
> >>> Adding such functionality to the API could be a good idea.
> >>> But given that in sampling there is no “one-size-fits-all”
> >>> solution (as not every use case needs random sampling and not
> >>> all random samplers fit all workloads), I am not sure if we
> >>> should start adding different sampling operators.
> >>>
> >>> Thanks,
> >>> Kostas
> >>>
> >>>> On Jul 9, 2016, at 5:43 PM, Greg Hogan <c...@greghogan.com> wrote:
> >>>>
> >>>> Hi Do,
> >>>>
> >>>> DataSet provides a stable @Public interface. DataSetUtils is marked
> >>>> @PublicEvolving, which is intended for public use and has stable
> >>>> behavior, but method signatures may change. It's also good to limit
> >>>> DataSet to common methods, whereas the utility methods tend to be used
> >>>> for specific applications.
> >>>>
> >>>> I don't have the pulse of streaming but this sounds like a useful
> >>>> feature that could be added.
> >>>>
> >>>> Greg
> >>>>
> >>>> On Sat, Jul 9, 2016 at 10:47 AM, Le Quoc Do <lequo...@gmail.com> wrote:
> >>>>
> >>>>> Hi all,
> >>>>>
> >>>>> I'm working on approximate computing using sampling techniques. I
> >>>>> noticed that Flink supports the sample function for DataSet
> >>>>> (org/apache/flink/api/java/utils/DataSetUtils.java). I'm just wondering
> >>>>> why you didn't merge the function into
> >>>>> org/apache/flink/api/java/DataSet.java, since the sample function
> >>>>> works as a transformation operator?
> >>>>>
> >>>>> My second question: are you planning to support the sample
> >>>>> function for DataStream (within windows)? I did not see it in the
> >>>>> DataStream code.
> >>>>>
> >>>>> Thank you,
> >>>>> Do
> >>>>>
> >>>
> >>>
> >>
>
>
