Re: sampling function

2016-07-13 Thread Le Quoc Do
Hi Till,

I have created the JIRA: https://issues.apache.org/jira/browse/FLINK-4205

Thank you,
Do

On Tue, Jul 12, 2016 at 6:05 PM, Till Rohrmann  wrote:

> Stratified sampling would also be beneficial for the DataSet API. I think
> it would be best if this method is also added to DataSetUtils or made
> available via the flink-contrib module. Furthermore, I think that it would
> be easiest if you created the JIRA for this feature, because you know what
> you want to add. For that you have to register at
> https://issues.apache.org/jira, if you haven't done this, and then we can
> add you as a contributor. Based on the JIRA description we can
> discuss possible implementations then.
>
> Cheers,
> Till
>
> On Tue, Jul 12, 2016 at 12:11 PM, Paris Carbone  wrote:
>
> > Hey Do,
> >
> > I think that more sophisticated samplers could make a better fit in the ML
> > library and not in the core API but I am not very familiar with the
> > milestones there.
> > Maybe the maintainers of the batch ML library could check if sampling
> > techniques could be useful there I guess.
> >
> > Paris
> >
> > > On 11 Jul 2016, at 16:15, Le Quoc Do  wrote:
> > >
> > > Hi all,
> > >
> > > Thank you all for your answers.
> > > By the way, I also recognized that Flink doesn't support  "stratified
> > > sampling" function (only simple random sampling) for DataSet.
> > > It would be nice if someone can create a Jira for it, and assign the task
> > > to me so that I can work for it.
> > >
> > > Thank you,
> > > Do
> > >
> > > On Mon, Jul 11, 2016 at 11:44 AM, Vasiliki Kalavri <
> > > vasilikikala...@gmail.com> wrote:
> > >
> > >> Hi Do,
> > >>
> > >> Paris and Martha worked on sampling techniques for data streams on Flink
> > >> last year. If you want to implement your own samplers, you might find
> > >> Martha's master thesis helpful [1].
> > >>
> > >> -Vasia.
> > >>
> > >> [1]: http://kth.diva-portal.org/smash/get/diva2:910695/FULLTEXT01.pdf
> > >>
> > >> On 11 July 2016 at 11:31, Kostas Kloudas wrote:
> > >>
> > >>> Hi Do,
> > >>>
> > >>> In DataStream you can always implement your own
> > >>> sampling function, hopefully without too much effort.
> > >>>
> > >>> Adding such functionality to the API could be a good idea.
> > >>> But given that in sampling there is no “one-size-fits-all”
> > >>> solution (as not every use case needs random sampling and not
> > >>> all random samplers fit to all workloads), I am not sure if we
> > >>> should start adding different sampling operators.
> > >>>
> > >>> Thanks,
> > >>> Kostas
> > >>>
> > >>>> On Jul 9, 2016, at 5:43 PM, Greg Hogan wrote:
> > >>>>
> > >>>> Hi Do,
> > >>>>
> > >>>> DataSet provides a stable @Public interface. DataSetUtils is marked
> > >>>> @PublicEvolving which is intended for public use, has stable behavior, but
> > >>>> method signatures may change. It's also good to limit DataSet to common
> > >>>> methods whereas the utility methods tend to be used for specific
> > >>>> applications.
> > >>>>
> > >>>> I don't have the pulse of streaming but this sounds like a useful feature
> > >>>> that could be added.
> > >>>>
> > >>>> Greg
> > >>>>
> > >>>> On Sat, Jul 9, 2016 at 10:47 AM, Le Quoc Do wrote:
> > >>>>
> > >>>>> Hi all,
> > >>>>>
> > >>>>> I'm working on approximate computing using sampling techniques. I
> > >>>>> recognized that Flink supports the sample function for Dataset
> > >>>>> (org/apache/flink/api/java/utils/DataSetUtils.java). I'm just wondering why
> > >>>>> you didn't merge the function to org/apache/flink/api/java/DataSet.java
> > >>>>> since the sample function works as a transformation operator?
> > >>>>>
> > >>>>> The second question is that are you planning to support the sample
> > >>>>> function for DataStream (within windows) since I did not see it in
> > >>>>> DataStream code ?
> > >>>>>
> > >>>>> Thank you,
> > >>>>> Do
> > >>>>>
> > >>>
> > >>>
> > >>
> >
> >
>


Re: sampling function

2016-07-12 Thread Till Rohrmann
Stratified sampling would also be beneficial for the DataSet API. I think
it would be best if this method is also added to DataSetUtils or made
available via the flink-contrib module. Furthermore, I think that it would
be easiest if you created the JIRA for this feature, because you know what
you want to add. For that you have to register at
https://issues.apache.org/jira, if you haven't done so already, and then we can
add you as a contributor. We can then discuss possible implementations
based on the JIRA description.

Cheers,
Till

On Tue, Jul 12, 2016 at 12:11 PM, Paris Carbone  wrote:

> Hey Do,
>
> I think that more sophisticated samplers could make a better fit in the ML
> library and not in the core API but I am not very familiar with the
> milestones there.
> Maybe the maintainers of the batch ML library could check if sampling
> techniques could be useful there I guess.
>
> Paris
>
> > On 11 Jul 2016, at 16:15, Le Quoc Do  wrote:
> >
> > Hi all,
> >
> > Thank you all for your answers.
> > By the way, I also recognized that Flink doesn't support  "stratified
> > sampling" function (only simple random sampling) for DataSet.
> > It would be nice if someone can create a Jira for it, and assign the task
> > to me so that I can work for it.
> >
> > Thank you,
> > Do
> >
> > On Mon, Jul 11, 2016 at 11:44 AM, Vasiliki Kalavri <
> > vasilikikala...@gmail.com> wrote:
> >
> >> Hi Do,
> >>
> >> Paris and Martha worked on sampling techniques for data streams on Flink
> >> last year. If you want to implement your own samplers, you might find
> >> Martha's master thesis helpful [1].
> >>
> >> -Vasia.
> >>
> >> [1]: http://kth.diva-portal.org/smash/get/diva2:910695/FULLTEXT01.pdf
> >>
> >> On 11 July 2016 at 11:31, Kostas Kloudas 
> >> wrote:
> >>
> >>> Hi Do,
> >>>
> >>> In DataStream you can always implement your own
> >>> sampling function, hopefully without too much effort.
> >>>
> >>> Adding such functionality to the API could be a good idea.
> >>> But given that in sampling there is no “one-size-fits-all”
> >>> solution (as not every use case needs random sampling and not
> >>> all random samplers fit to all workloads), I am not sure if we
> >>> should start adding different sampling operators.
> >>>
> >>> Thanks,
> >>> Kostas
> >>>
> >>>> On Jul 9, 2016, at 5:43 PM, Greg Hogan wrote:
> >>>>
> >>>> Hi Do,
> >>>>
> >>>> DataSet provides a stable @Public interface. DataSetUtils is marked
> >>>> @PublicEvolving which is intended for public use, has stable behavior, but
> >>>> method signatures may change. It's also good to limit DataSet to common
> >>>> methods whereas the utility methods tend to be used for specific
> >>>> applications.
> >>>>
> >>>> I don't have the pulse of streaming but this sounds like a useful feature
> >>>> that could be added.
> >>>>
> >>>> Greg
> >>>>
> >>>> On Sat, Jul 9, 2016 at 10:47 AM, Le Quoc Do wrote:
> >>>>
> >>>>> Hi all,
> >>>>>
> >>>>> I'm working on approximate computing using sampling techniques. I
> >>>>> recognized that Flink supports the sample function for Dataset
> >>>>> (org/apache/flink/api/java/utils/DataSetUtils.java). I'm just wondering why
> >>>>> you didn't merge the function to org/apache/flink/api/java/DataSet.java
> >>>>> since the sample function works as a transformation operator?
> >>>>>
> >>>>> The second question is that are you planning to support the sample
> >>>>> function for DataStream (within windows) since I did not see it in
> >>>>> DataStream code ?
> >>>>>
> >>>>> Thank you,
> >>>>> Do
> >>>>>
> >>>
> >>>
> >>
>
>


Re: sampling function

2016-07-12 Thread Paris Carbone
Hey Do,

I think more sophisticated samplers would be a better fit for the ML
library than for the core API, but I am not very familiar with the milestones
there.
Maybe the maintainers of the batch ML library could check whether sampling
techniques would be useful there.

Paris

> On 11 Jul 2016, at 16:15, Le Quoc Do  wrote:
> 
> Hi all,
> 
> Thank you all for your answers.
> By the way, I also recognized that Flink doesn't support  "stratified
> sampling" function (only simple random sampling) for DataSet.
> It would be nice if someone can create a Jira for it, and assign the task
> to me so that I can work for it.
> 
> Thank you,
> Do
> 
> On Mon, Jul 11, 2016 at 11:44 AM, Vasiliki Kalavri <
> vasilikikala...@gmail.com> wrote:
> 
>> Hi Do,
>> 
>> Paris and Martha worked on sampling techniques for data streams on Flink
>> last year. If you want to implement your own samplers, you might find
>> Martha's master thesis helpful [1].
>> 
>> -Vasia.
>> 
>> [1]: http://kth.diva-portal.org/smash/get/diva2:910695/FULLTEXT01.pdf
>> 
>> On 11 July 2016 at 11:31, Kostas Kloudas 
>> wrote:
>> 
>>> Hi Do,
>>> 
>>> In DataStream you can always implement your own
>>> sampling function, hopefully without too much effort.
>>> 
>>> Adding such functionality to the API could be a good idea.
>>> But given that in sampling there is no “one-size-fits-all”
>>> solution (as not every use case needs random sampling and not
>>> all random samplers fit to all workloads), I am not sure if we
>>> should start adding different sampling operators.
>>> 
>>> Thanks,
>>> Kostas
>>> 
>>>> On Jul 9, 2016, at 5:43 PM, Greg Hogan wrote:
>>>>
>>>> Hi Do,
>>>>
>>>> DataSet provides a stable @Public interface. DataSetUtils is marked
>>>> @PublicEvolving which is intended for public use, has stable behavior, but
>>>> method signatures may change. It's also good to limit DataSet to common
>>>> methods whereas the utility methods tend to be used for specific
>>>> applications.
>>>>
>>>> I don't have the pulse of streaming but this sounds like a useful feature
>>>> that could be added.
>>>>
>>>> Greg
>>>>
>>>> On Sat, Jul 9, 2016 at 10:47 AM, Le Quoc Do wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm working on approximate computing using sampling techniques. I
>>>>> recognized that Flink supports the sample function for Dataset
>>>>> (org/apache/flink/api/java/utils/DataSetUtils.java). I'm just wondering why
>>>>> you didn't merge the function to org/apache/flink/api/java/DataSet.java
>>>>> since the sample function works as a transformation operator?
>>>>>
>>>>> The second question is that are you planning to support the sample
>>>>> function for DataStream (within windows) since I did not see it in
>>>>> DataStream code ?
>>>>>
>>>>> Thank you,
>>>>> Do
>>>>>
>>> 
>>> 
>> 



Re: sampling function

2016-07-11 Thread Le Quoc Do
Hi all,

Thank you all for your answers.
By the way, I also noticed that Flink doesn't support a "stratified
sampling" function (only simple random sampling) for DataSet.
It would be nice if someone could create a Jira for it and assign the task
to me so that I can work on it.
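
To make the distinction concrete: simple random sampling draws from the whole data set at once, whereas stratified sampling first groups the elements by a key and then samples each group separately, so that small groups are still represented. Below is a minimal plain-Java sketch of that idea. It does not use any Flink API and is not a proposed implementation; the class name StratifiedSampling, the stratumOf key function, and the choice to round each stratum's sample size up are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Function;

/** Hypothetical helper, plain Java only (no Flink types involved). */
final class StratifiedSampling {
    private StratifiedSampling() {}

    /**
     * Returns roughly `fraction` of every stratum, sampled without replacement.
     * The per-stratum size is rounded up (an assumption of this sketch), so
     * any non-empty stratum contributes at least one element when fraction > 0.
     */
    static <T, K> List<T> sample(List<T> input, Function<T, K> stratumOf,
                                 double fraction, long seed) {
        if (fraction < 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in [0, 1]");
        }
        // Step 1: partition the input into strata by key.
        Map<K, List<T>> strata = new LinkedHashMap<>();
        for (T element : input) {
            strata.computeIfAbsent(stratumOf.apply(element), k -> new ArrayList<>())
                  .add(element);
        }
        // Step 2: take a simple random sample inside each stratum.
        Random random = new Random(seed);
        List<T> result = new ArrayList<>();
        for (List<T> stratum : strata.values()) {
            int take = (int) Math.ceil(fraction * stratum.size());
            Collections.shuffle(stratum, random); // shuffles the copy, not the input
            result.addAll(stratum.subList(0, take));
        }
        return result;
    }
}
```

With 90 elements in one stratum, 10 in another, and a fraction of 0.5, this returns 45 and 5 elements respectively, whereas a simple random sample of the same overall size gives no such per-group guarantee.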

Thank you,
Do

On Mon, Jul 11, 2016 at 11:44 AM, Vasiliki Kalavri <
vasilikikala...@gmail.com> wrote:

> Hi Do,
>
> Paris and Martha worked on sampling techniques for data streams on Flink
> last year. If you want to implement your own samplers, you might find
> Martha's master thesis helpful [1].
>
> -Vasia.
>
> [1]: http://kth.diva-portal.org/smash/get/diva2:910695/FULLTEXT01.pdf
>
> On 11 July 2016 at 11:31, Kostas Kloudas 
> wrote:
>
> > Hi Do,
> >
> > In DataStream you can always implement your own
> > sampling function, hopefully without too much effort.
> >
> > Adding such functionality to the API could be a good idea.
> > But given that in sampling there is no “one-size-fits-all”
> > solution (as not every use case needs random sampling and not
> > all random samplers fit to all workloads), I am not sure if we
> > should start adding different sampling operators.
> >
> > Thanks,
> > Kostas
> >
> > > On Jul 9, 2016, at 5:43 PM, Greg Hogan  wrote:
> > >
> > > Hi Do,
> > >
> > > DataSet provides a stable @Public interface. DataSetUtils is marked
> > > @PublicEvolving which is intended for public use, has stable behavior, but
> > > method signatures may change. It's also good to limit DataSet to common
> > > methods whereas the utility methods tend to be used for specific
> > > applications.
> > >
> > > I don't have the pulse of streaming but this sounds like a useful feature
> > > that could be added.
> > >
> > > Greg
> > >
> > > On Sat, Jul 9, 2016 at 10:47 AM, Le Quoc Do wrote:
> > >
> > >> Hi all,
> > >>
> > >> I'm working on approximate computing using sampling techniques. I
> > >> recognized that Flink supports the sample function for Dataset
> > >> (org/apache/flink/api/java/utils/DataSetUtils.java). I'm just wondering why
> > >> you didn't merge the function to org/apache/flink/api/java/DataSet.java
> > >> since the sample function works as a transformation operator?
> > >>
> > >> The second question is that are you planning to support the sample
> > >> function for DataStream (within windows) since I did not see it in
> > >> DataStream code ?
> > >>
> > >> Thank you,
> > >> Do
> > >>
> >
> >
>


Re: sampling function

2016-07-11 Thread Vasiliki Kalavri
Hi Do,

Paris and Martha worked on sampling techniques for data streams on Flink
last year. If you want to implement your own samplers, you might find
Martha's master's thesis helpful [1].

-Vasia.

[1]: http://kth.diva-portal.org/smash/get/diva2:910695/FULLTEXT01.pdf

On 11 July 2016 at 11:31, Kostas Kloudas 
wrote:

> Hi Do,
>
> In DataStream you can always implement your own
> sampling function, hopefully without too much effort.
>
> Adding such functionality to the API could be a good idea.
> But given that in sampling there is no “one-size-fits-all”
> solution (as not every use case needs random sampling and not
> all random samplers fit to all workloads), I am not sure if we
> should start adding different sampling operators.
>
> Thanks,
> Kostas
>
> > On Jul 9, 2016, at 5:43 PM, Greg Hogan  wrote:
> >
> > Hi Do,
> >
> > DataSet provides a stable @Public interface. DataSetUtils is marked
> > @PublicEvolving which is intended for public use, has stable behavior, but
> > method signatures may change. It's also good to limit DataSet to common
> > methods whereas the utility methods tend to be used for specific
> > applications.
> >
> > I don't have the pulse of streaming but this sounds like a useful feature
> > that could be added.
> >
> > Greg
> >
> > On Sat, Jul 9, 2016 at 10:47 AM, Le Quoc Do  wrote:
> >
> >> Hi all,
> >>
> >> I'm working on approximate computing using sampling techniques. I
> >> recognized that Flink supports the sample function for Dataset
> >> (org/apache/flink/api/java/utils/DataSetUtils.java). I'm just wondering why
> >> you didn't merge the function to org/apache/flink/api/java/DataSet.java
> >> since the sample function works as a transformation operator?
> >>
> >> The second question is that are you planning to support the sample
> >> function for DataStream (within windows) since I did not see it in
> >> DataStream code ?
> >>
> >> Thank you,
> >> Do
> >>
>
>


Re: sampling function

2016-07-11 Thread Kostas Kloudas
Hi Do,

In DataStream you can always implement your own 
sampling function, hopefully without too much effort. 

Adding such functionality to the API could be a good idea.
But given that there is no “one-size-fits-all” solution in
sampling (not every use case needs random sampling, and not
all random samplers fit all workloads), I am not sure if we
should start adding different sampling operators.
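
As an illustration of how little code a hand-written sampler can need, here is a minimal sketch of reservoir sampling (Algorithm R), which keeps a fixed-size uniform sample of a stream whose length is not known in advance. This is only one possible technique, chosen as an example. It is plain Java with no Flink types; the class name ReservoirSampler and its methods are hypothetical, and wiring it into a user-defined function that runs per window is left as an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical single-pass sampler, plain Java only (not a Flink class). */
final class ReservoirSampler<T> {
    private final int capacity;
    private final Random random;
    private final List<T> reservoir;
    private long seen = 0;

    ReservoirSampler(int capacity, long seed) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.random = new Random(seed);
        this.reservoir = new ArrayList<>(capacity);
    }

    /** Offers the next stream element to the sampler. */
    void add(T element) {
        seen++;
        if (reservoir.size() < capacity) {
            // Fill phase: the first `capacity` elements are always kept.
            reservoir.add(element);
        } else {
            // Replace phase: pick a slot uniformly in [0, seen); the new
            // element is kept with probability capacity / seen.
            long slot = (long) (random.nextDouble() * seen);
            if (slot < capacity) {
                reservoir.set((int) slot, element);
            }
        }
    }

    /** Returns a copy of the current sample, e.g. when a window fires. */
    List<T> sample() {
        return new ArrayList<>(reservoir);
    }
}
```

A sampler like this suits workloads that need a bounded-size uniform sample; other workloads (for example, a fixed sampling rate or a biased sample) would need a different sampler.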

Thanks,
Kostas

> On Jul 9, 2016, at 5:43 PM, Greg Hogan  wrote:
> 
> Hi Do,
> 
> DataSet provides a stable @Public interface. DataSetUtils is marked
> @PublicEvolving which is intended for public use, has stable behavior, but
> method signatures may change. It's also good to limit DataSet to common
> methods whereas the utility methods tend to be used for specific
> applications.
> 
> I don't have the pulse of streaming but this sounds like a useful feature
> that could be added.
> 
> Greg
> 
> On Sat, Jul 9, 2016 at 10:47 AM, Le Quoc Do  wrote:
> 
>> Hi all,
>> 
>> I'm working on approximate computing using sampling techniques. I
>> recognized that Flink supports the sample function for Dataset
>> (org/apache/flink/api/java/utils/DataSetUtils.java). I'm just wondering why
>> you didn't merge the function to org/apache/flink/api/java/DataSet.java
>> since the sample function works as a transformation operator?
>> 
>> The second question is that are you planning to support the sample
>> function for DataStream (within windows) since I did not see it in
>> DataStream code ?
>> 
>> Thank you,
>> Do
>> 



Re: sampling function

2016-07-09 Thread Greg Hogan
Hi Do,

DataSet provides a stable @Public interface. DataSetUtils is marked
@PublicEvolving, which means it is intended for public use and has stable
behavior, but its method signatures may change. It's also good to limit
DataSet to common methods, whereas the utility methods tend to be used for
specific applications.
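
The split described above can be sketched in a few lines of plain Java. Everything in this sketch is a hypothetical stand-in made up for illustration: StableApi and EvolvingApi play the role of the two stability markers, Records plays the role of a small core type, and RecordsUtils plays the role of the utility class. None of it is Flink code.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical marker: signatures and behavior are stable. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface StableApi {}

/** Hypothetical marker: public and usable, but signatures may still change. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface EvolvingApi {}

/** Core type: a deliberately small surface with only common operations. */
@StableApi
final class Records<T> {
    private final List<T> items;

    Records(List<T> items) {
        this.items = new ArrayList<>(items);
    }

    List<T> collect() {
        return new ArrayList<>(items);
    }

    long count() {
        return items.size();
    }
}

/** Utility class: application-specific helpers live outside the core type. */
@EvolvingApi
final class RecordsUtils {
    private RecordsUtils() {}

    /** Bernoulli sampling: keeps each element independently with probability `fraction`. */
    static <T> Records<T> sample(Records<T> input, double fraction, long seed) {
        Random random = new Random(seed);
        List<T> kept = new ArrayList<>();
        for (T item : input.collect()) {
            if (random.nextDouble() < fraction) {
                kept.add(item);
            }
        }
        return new Records<>(kept);
    }
}
```

Because the helper is a static method on a separate class, its signature can be revised without touching the stable core type.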

I don't have the pulse of streaming but this sounds like a useful feature
that could be added.

Greg

On Sat, Jul 9, 2016 at 10:47 AM, Le Quoc Do  wrote:

> Hi all,
>
> I'm working on approximate computing using sampling techniques. I
> noticed that Flink supports the sample function for DataSet
> (org/apache/flink/api/java/utils/DataSetUtils.java). I'm just wondering why
> you didn't merge the function into org/apache/flink/api/java/DataSet.java,
> since the sample function works as a transformation operator?
>
> My second question: are you planning to support the sample
> function for DataStream (within windows)? I did not see it in the
> DataStream code.
>
> Thank you,
> Do
>