Re: sampling function

2016-07-13 Thread Le Quoc Do
Hi Till,

I have created the JIRA: https://issues.apache.org/jira/browse/FLINK-4205

Thank you,
Do

On Tue, Jul 12, 2016 at 6:05 PM, Till Rohrmann <trohrm...@apache.org> wrote:

> Stratified sampling would also be beneficial for the DataSet API. I think
> it would be best if this method is also added to DataSetUtils or made
> available via the flink-contrib module. Furthermore, I think that it would
> be easiest if you created the JIRA for this feature, because you know what
> you want to add. For that you have to register at
> https://issues.apache.org/jira, if you haven't done this, and then we can
> add you as a contributor. Based on the JIRA description we can
> discuss possible implementations then.
>
> Cheers,
> Till
>
> On Tue, Jul 12, 2016 at 12:11 PM, Paris Carbone <par...@kth.se> wrote:
>
> > Hey Do,
> >
> > I think that more sophisticated samplers would be a better fit for the ML
> > library than for the core API, but I am not very familiar with the
> > milestones there.
> > Maybe the maintainers of the batch ML library could check whether sampling
> > techniques would be useful there.
> >
> > Paris
> >
> > > On 11 Jul 2016, at 16:15, Le Quoc Do <lequo...@gmail.com> wrote:
> > >
> > > Hi all,
> > >
> > > Thank you all for your answers.
> > > By the way, I also noticed that Flink doesn't support a "stratified
> > > sampling" function (only simple random sampling) for DataSet.
> > > It would be nice if someone could create a JIRA for it and assign the
> > > task to me so that I can work on it.
> > >
> > > Thank you,
> > > Do
> > >
> > > On Mon, Jul 11, 2016 at 11:44 AM, Vasiliki Kalavri <
> > > vasilikikala...@gmail.com> wrote:
> > >
> > >> Hi Do,
> > >>
> > >> Paris and Martha worked on sampling techniques for data streams on
> > >> Flink last year. If you want to implement your own samplers, you might
> > >> find Martha's master thesis helpful [1].
> > >>
> > >> -Vasia.
> > >>
> > >> [1]: http://kth.diva-portal.org/smash/get/diva2:910695/FULLTEXT01.pdf
> > >>
> > >> On 11 July 2016 at 11:31, Kostas Kloudas <k.klou...@data-artisans.com>
> > >> wrote:
> > >>
> > >>> Hi Do,
> > >>>
> > >>> In DataStream you can always implement your own
> > >>> sampling function, hopefully without too much effort.
> > >>>
> > >>> Adding such functionality to the API could be a good idea.
> > >>> But given that in sampling there is no “one-size-fits-all”
> > >>> solution (as not every use case needs random sampling and not
> > >>> all random samplers fit all workloads), I am not sure if we
> > >>> should start adding different sampling operators.
> > >>>
> > >>> Thanks,
> > >>> Kostas
> > >>>
> > >>>> On Jul 9, 2016, at 5:43 PM, Greg Hogan <c...@greghogan.com> wrote:
> > >>>>
> > >>>> Hi Do,
> > >>>>
> > >>>> DataSet provides a stable @Public interface. DataSetUtils is marked
> > >>>> @PublicEvolving, which is intended for public use and has stable
> > >>>> behavior, but its method signatures may change. It's also good to
> > >>>> limit DataSet to common methods, whereas the utility methods tend to
> > >>>> be used for specific applications.
> > >>>>
> > >>>> I don't have the pulse of streaming but this sounds like a useful
> > >>>> feature that could be added.
> > >>>>
> > >>>> Greg
> > >>>>
> > >>>> On Sat, Jul 9, 2016 at 10:47 AM, Le Quoc Do <lequo...@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> Hi all,
> > >>>>>
> > >>>>> I'm working on approximate computing using sampling techniques. I
> > >>>>> noticed that Flink supports the sample function for DataSet
> > >>>>> (org/apache/flink/api/java/utils/DataSetUtils.java). I'm just
> > >>>>> wondering why you didn't merge the function into
> > >>>>> org/apache/flink/api/java/DataSet.java, since the sample function
> > >>>>> works as a transformation operator?
> > >>>>>
> > >>>>> My second question: are you planning to support the sample
> > >>>>> function for DataStream (within windows)? I did not see it in the
> > >>>>> DataStream code.
> > >>>>>
> > >>>>> Thank you,
> > >>>>> Do
> > >>>>>
> > >>>
> > >>>
> > >>
> >
> >
>


Re: sampling function

2016-07-11 Thread Le Quoc Do
Hi all,

Thank you all for your answers.
By the way, I also noticed that Flink doesn't support a "stratified
sampling" function (only simple random sampling) for DataSet.
It would be nice if someone could create a JIRA for it and assign the task
to me so that I can work on it.
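
To make the request concrete, here is a rough sketch of what I mean by
stratified sampling, written in plain Java rather than against the DataSet
API. The class name StratifiedSampleSketch, the method signature, and the
rule of keeping at least one element per stratum are assumptions made for
illustration only; nothing like this exists in Flink as far as I know.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Function;

public class StratifiedSampleSketch {

    /**
     * Groups items by a stratum key, then draws a simple random sample
     * (without replacement) of roughly `fraction` of each group.
     * Hypothetical helper, not part of any Flink API.
     */
    public static <T, K> List<T> sample(
            List<T> items, Function<T, K> keyOf, double fraction, long seed) {
        // Partition the input into strata, keeping first-seen key order.
        Map<K, List<T>> strata = new LinkedHashMap<>();
        for (T item : items) {
            strata.computeIfAbsent(keyOf.apply(item), k -> new ArrayList<>()).add(item);
        }

        Random rnd = new Random(seed);
        List<T> result = new ArrayList<>();
        for (List<T> stratum : strata.values()) {
            // Assumption: keep at least one element so small strata stay represented,
            // and never ask for more elements than the stratum holds.
            int wanted = (int) Math.round(stratum.size() * fraction);
            int n = Math.min(stratum.size(), Math.max(1, wanted));
            List<T> shuffled = new ArrayList<>(stratum);
            Collections.shuffle(shuffled, rnd);
            result.addAll(shuffled.subList(0, n));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> data = new ArrayList<>();
        for (int i = 0; i < 90; i++) data.add("a" + i);
        for (int i = 0; i < 10; i++) data.add("b" + i);
        // Stratify by the first character of each item.
        List<String> s = sample(data, x -> x.charAt(0), 0.1, 42L);
        System.out.println(s.size() + " elements sampled: " + s);
    }
}
```

With 90 "a" items, 10 "b" items and a fraction of 0.1, this keeps 9 "a" items
and 1 "b" item, whereas simple random sampling could miss the small group
entirely.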

Thank you,
Do

On Mon, Jul 11, 2016 at 11:44 AM, Vasiliki Kalavri <
vasilikikala...@gmail.com> wrote:

> Hi Do,
>
> Paris and Martha worked on sampling techniques for data streams on Flink
> last year. If you want to implement your own samplers, you might find
> Martha's master thesis helpful [1].
>
> -Vasia.
>
> [1]: http://kth.diva-portal.org/smash/get/diva2:910695/FULLTEXT01.pdf
>
> On 11 July 2016 at 11:31, Kostas Kloudas <k.klou...@data-artisans.com>
> wrote:
>
> > Hi Do,
> >
> > In DataStream you can always implement your own
> > sampling function, hopefully without too much effort.
> >
> > Adding such functionality to the API could be a good idea.
> > But given that in sampling there is no “one-size-fits-all”
> > solution (as not every use case needs random sampling and not
> > all random samplers fit all workloads), I am not sure if we
> > should start adding different sampling operators.
> >
> > Thanks,
> > Kostas
> >
> > > On Jul 9, 2016, at 5:43 PM, Greg Hogan <c...@greghogan.com> wrote:
> > >
> > > Hi Do,
> > >
> > > DataSet provides a stable @Public interface. DataSetUtils is marked
> > > @PublicEvolving, which is intended for public use and has stable
> > > behavior, but its method signatures may change. It's also good to limit
> > > DataSet to common methods, whereas the utility methods tend to be used
> > > for specific applications.
> > >
> > > I don't have the pulse of streaming but this sounds like a useful
> > > feature that could be added.
> > >
> > > Greg
> > >
> > > On Sat, Jul 9, 2016 at 10:47 AM, Le Quoc Do <lequo...@gmail.com>
> > > wrote:
> > >
> > >> Hi all,
> > >>
> > >> I'm working on approximate computing using sampling techniques. I
> > >> noticed that Flink supports the sample function for DataSet
> > >> (org/apache/flink/api/java/utils/DataSetUtils.java). I'm just
> > >> wondering why you didn't merge the function into
> > >> org/apache/flink/api/java/DataSet.java, since the sample function
> > >> works as a transformation operator?
> > >>
> > >> My second question: are you planning to support the sample
> > >> function for DataStream (within windows)? I did not see it in the
> > >> DataStream code.
> > >>
> > >> Thank you,
> > >> Do
> > >>
> >
> >
>


sampling function

2016-07-09 Thread Le Quoc Do
Hi all,

I'm working on approximate computing using sampling techniques. I
noticed that Flink supports the sample function for DataSet
(org/apache/flink/api/java/utils/DataSetUtils.java). I'm just wondering why
you didn't merge the function into org/apache/flink/api/java/DataSet.java,
since the sample function works as a transformation operator?

My second question: are you planning to support the sample function
for DataStream (within windows)? I did not see it in the DataStream code.
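
For illustration, sampling within a window could be as simple as reservoir
sampling over the window's elements. The sketch below is plain Java and does
not use any Flink API; the class name ReservoirSampleSketch and its signature
are hypothetical, and how it would be wired into a window function is left
open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ReservoirSampleSketch {

    /**
     * Keeps a uniform random sample of at most k elements from a single
     * pass over the input (reservoir sampling, "Algorithm R").
     * Hypothetical helper, not part of any Flink API.
     */
    public static <T> List<T> sample(Iterable<T> window, int k, long seed) {
        Random rnd = new Random(seed);
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        for (T element : window) {
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k elements.
                reservoir.add(element);
            } else {
                // Pick a slot in [0, seen); the new element replaces an old
                // one only if the slot falls inside the reservoir, which
                // happens with probability k / seen.
                long j = (long) (rnd.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, element);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> window = new ArrayList<>();
        for (int i = 0; i < 1000; i++) window.add(i);
        System.out.println(sample(window, 10, 7L));
    }
}
```

The sample size stays fixed at k regardless of how many elements the window
holds, so the memory needed per window is bounded.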

Thank you,
Do