Re: Incremental checkpoint branch

2017-03-03 Thread SHI Xiaogang
Hi Vishnu,

We have drafted an initial design for incremental checkpointing [1] and
will start working on it next week. You can watch the issue FLINK-5053 [2]
to get timely notifications of updates. All suggestions are welcome.

[1]
https://docs.google.com/document/d/1VvvPp09gGdVb9D2wHx6NX99yQK0jSUMWHQPrQ_mn520/edit#heading=h.2gshh42txc4z
[2] https://issues.apache.org/jira/browse/FLINK-5053

Regards,
Xiaogang



2017-03-04 15:43 GMT+08:00 Shaoxuan Wang :

> Vishnu,
> You can find the latest design discussion for incremental checkpointing at
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Support-Incremental-Checkpointing-in-Flink-td15931.html
> @Stefan Richter and @Xiaogang Shi might be the right people to provide you
> with its on-going status.
>
> cheers,
> Shaoxuan
>
> On Sat, Mar 4, 2017 at 5:14 AM, Vishnu Viswanath <
> vishnu.viswanat...@gmail.com> wrote:
>
> > Hi,
> >
> > Can someone point me to the branch where the ongoing work on incremental
> > checkpointing is happening? I would like to try it out even if the work
> > is not complete.
> >
> > I have a use case where the state size increases by about 1 GB every
> > 5 minutes.
> >
> > Thanks,
> > Vishnu
> >
>


Re: Machine Learning on Flink - Next steps

2017-03-03 Thread Roberto Bentivoglio
Hi All,

I'd like to start working on:
 - Offline learning with Streaming API
 - Online learning

I also think that using a new organisation on GitHub, as Theodore proposed,
to keep some initial independence and speed up the prototyping and
development phases, is really interesting.

I totally agree with Katherin that we need offline learning, but in my
opinion it will be more straightforward to fix the streaming issues than the
batch issues, because we will have more support from the Flink community there.

Thanks and have a nice weekend,
Roberto

On 3 March 2017 at 20:20, amir bahmanyari 
wrote:

> Great points to start:
>   - Online learning
>   - Offline learning with the streaming API
>
> Thanks + have a great weekend.
>
>   From: Katherin Eri 
>  To: dev@flink.apache.org
>  Sent: Friday, March 3, 2017 7:41 AM
>  Subject: Re: Machine Learning on Flink - Next steps
>
> Thank you, Theodore.
>
> In short, I vote for:
> 1) Online learning
> 2) Low-latency prediction serving -> Offline learning with the batch API
>
> In detail:
> 1) If streaming is the strong side of Flink, let’s use it and try to support
> some online learning or lightweight in-memory learning algorithms, and try
> to build a pipeline for them.
>
> 2) I think that Flink should be part of the production ecosystem, and if
> production systems now require ML support, deployment of multiple models and
> so on, we should serve this. But in my opinion we shouldn’t compete with
> projects like PredictionIO; we should serve them and be an execution core.
> But that means a lot:
>
> a. Offline training should be supported, because most ML algorithms are
> designed for offline training.
> b. Model lifecycle should be supported:
> ETL+transformation+training+scoring+exploitation quality monitoring
>
> I understand that the batch world is full of competitors, but for me that
> doesn’t mean that batch should be ignored. I think that separate
> streaming/batch applications cause additional deployment and operational
> overhead, which one typically tries to avoid. That means, in my opinion,
> that we should draw the community’s attention to this problem.
>
>
> пт, 3 мар. 2017 г. в 15:34, Theodore Vasiloudis <
> theodoros.vasilou...@gmail.com>:
>
> Hello all,
>
> From our previous discussion started by Stavros, we decided to start a
> planning document [1]
> to figure out possible next steps for ML on Flink.
>
> Our concerns were mainly ensuring active development while satisfying the
> needs of
> the community.
>
> We have listed a number of proposals for future work in the document. In
> short they are:
>
>   - Offline learning with the batch API
>   - Online learning
>   - Offline learning with the streaming API
>   - Low-latency prediction serving
>
> I saw there are a number of people willing to work on ML for Flink, but the
> truth is that we cannot
> cover all of these suggestions without fragmenting the development too
> much.
>
> So my recommendation is to pick out 2 of these options, create design
> documents and build prototypes for each library.
> We can then assess their viability and together with the community decide
> if we should try
> to include one (or both) of them in the main Flink distribution.
>
> So I invite people to express their opinion about which task they would be
> willing to contribute to,
> and hopefully we can settle on two of these options.
>
> Once that is done we can decide how we do the actual work. Since this is
> highly experimental
> I would suggest we work on repositories where we have complete control.
>
> For that purpose I have created an organization [2] on Github which we can
> use to create repositories and teams that work on them in an organized
> manner.
> Once enough work has accumulated we can start discussing contributing the
> code
> to the main distribution.
>
> Regards,
> Theodore
>
> [1]
> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/
> [2] https://github.com/flinkml
>
> --
>
> *Yours faithfully, *
>
> *Kate Eri.*
>
>
>



-- 
Roberto Bentivoglio
e. roberto.bentivog...@radicalbit.io
radicalbit.io


Re: Incremental checkpoint branch

2017-03-03 Thread Shaoxuan Wang
Vishnu,
You can find the latest design discussion for incremental checkpointing at
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Support-Incremental-Checkpointing-in-Flink-td15931.html
@Stefan Richter and @Xiaogang Shi might be the right people to provide you
with its on-going status.

cheers,
Shaoxuan

On Sat, Mar 4, 2017 at 5:14 AM, Vishnu Viswanath <
vishnu.viswanat...@gmail.com> wrote:

> Hi,
>
> Can someone point me to the branch where the ongoing work on incremental
> checkpointing is happening? I would like to try it out even if the work is
> not complete.
>
> I have a use case where the state size increases by about 1 GB every 5 minutes.
>
> Thanks,
> Vishnu
>


[jira] [Created] (FLINK-5963) Remove preparation mapper of DataSetAggregate

2017-03-03 Thread Fabian Hueske (JIRA)
Fabian Hueske created FLINK-5963:


 Summary: Remove preparation mapper of DataSetAggregate
 Key: FLINK-5963
 URL: https://issues.apache.org/jira/browse/FLINK-5963
 Project: Flink
  Issue Type: Sub-task
  Components: Table API & SQL
Affects Versions: 1.3.0
Reporter: Fabian Hueske
Priority: Minor


With the new UDAGG interface we do not need the preparation mapper anymore. It 
adds overhead because:

- it is another operator, and
- it prevents using {{AggregateFunction.accumulate()}} in a combiner or 
reducer.

Hence, it should be removed.
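As a sketch of why this works, consider a simplified UDAGG-style interface
(the names below are illustrative, not Flink's actual classes): a combiner can
feed raw input records straight into the accumulator via accumulate(), so no
preparation mapper has to pre-shape the records first.

// Simplified, illustrative UDAGG-style interface; not Flink's actual API.
interface AggFunction<IN, ACC> {
    ACC createAccumulator();
    void accumulate(ACC acc, IN value);
}

class SumAgg implements AggFunction<Long, long[]> {
    public long[] createAccumulator() { return new long[1]; }
    public void accumulate(long[] acc, Long value) { acc[0] += value; }
}

class CombineWithAccumulate {
    // The combiner consumes raw input rows directly, which removes the
    // need for a separate preparation map operator in the plan.
    long[] combine(Iterable<Long> values, AggFunction<Long, long[]> agg) {
        long[] acc = agg.createAccumulator();
        for (Long v : values) {
            agg.accumulate(acc, v);
        }
        return acc;
    }
}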



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Incremental checkpoint branch

2017-03-03 Thread Vishnu Viswanath
Hi,

Can someone point me to the branch where the ongoing work on incremental
checkpointing is happening? I would like to try it out even if the work is not
complete.

I have a use case where the state size increases by about 1 GB every 5 minutes.

Thanks,
Vishnu


Re: Machine Learning on Flink - Next steps

2017-03-03 Thread amir bahmanyari
Great points to start:
  - Online learning
  - Offline learning with the streaming API

Thanks + have a great weekend.

  From: Katherin Eri 
 To: dev@flink.apache.org 
 Sent: Friday, March 3, 2017 7:41 AM
 Subject: Re: Machine Learning on Flink - Next steps

Thank you, Theodore.

In short, I vote for:
1) Online learning
2) Low-latency prediction serving -> Offline learning with the batch API

In detail:
1) If streaming is the strong side of Flink, let’s use it and try to support
some online learning or lightweight in-memory learning algorithms, and try to
build a pipeline for them.

2) I think that Flink should be part of the production ecosystem, and if
production systems now require ML support, deployment of multiple models and
so on, we should serve this. But in my opinion we shouldn’t compete with
projects like PredictionIO; we should serve them and be an execution core. But
that means a lot:

a. Offline training should be supported, because most ML algorithms are
designed for offline training.
b. Model lifecycle should be supported:
ETL+transformation+training+scoring+exploitation quality monitoring

I understand that the batch world is full of competitors, but for me that
doesn’t mean that batch should be ignored. I think that separate
streaming/batch applications cause additional deployment and operational
overhead, which one typically tries to avoid. That means, in my opinion, that
we should draw the community’s attention to this problem.


пт, 3 мар. 2017 г. в 15:34, Theodore Vasiloudis <
theodoros.vasilou...@gmail.com>:

Hello all,

From our previous discussion started by Stavros, we decided to start a
planning document [1]
to figure out possible next steps for ML on Flink.

Our concerns were mainly ensuring active development while satisfying the
needs of
the community.

We have listed a number of proposals for future work in the document. In
short they are:

  - Offline learning with the batch API
  - Online learning
  - Offline learning with the streaming API
  - Low-latency prediction serving

I saw there are a number of people willing to work on ML for Flink, but the
truth is that we cannot
cover all of these suggestions without fragmenting the development too much.

So my recommendation is to pick out 2 of these options, create design
documents and build prototypes for each library.
We can then assess their viability and together with the community decide
if we should try
to include one (or both) of them in the main Flink distribution.

So I invite people to express their opinion about which task they would be
willing to contribute to,
and hopefully we can settle on two of these options.

Once that is done we can decide how we do the actual work. Since this is
highly experimental
I would suggest we work on repositories where we have complete control.

For that purpose I have created an organization [2] on Github which we can
use to create repositories and teams that work on them in an organized
manner.
Once enough work has accumulated we can start discussing contributing the
code
to the main distribution.

Regards,
Theodore

[1]
https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/
[2] https://github.com/flinkml

-- 

*Yours faithfully, *

*Kate Eri.*

[jira] [Created] (FLINK-5962) Cancel checkpoint canceller tasks in CheckpointCoordinator

2017-03-03 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-5962:


 Summary: Cancel checkpoint canceller tasks in CheckpointCoordinator
 Key: FLINK-5962
 URL: https://issues.apache.org/jira/browse/FLINK-5962
 Project: Flink
  Issue Type: Bug
  Components: State Backends, Checkpointing
Affects Versions: 1.2.0, 1.3.0
Reporter: Till Rohrmann
Priority: Critical


The {{CheckpointCoordinator}} registers a canceller task for each running 
checkpoint. The canceller task's responsibility is to cancel a checkpoint if it 
takes too long to complete. We should cancel this task as soon as the 
checkpoint has been completed, because otherwise we will keep many canceller 
tasks around. This can eventually lead to an OOM exception.
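A minimal sketch of the proposed fix, using plain java.util.concurrent types
rather than the coordinator's actual internals (all names here are
illustrative): keep the ScheduledFuture of the timeout canceller and cancel it
as soon as the checkpoint completes, so completed checkpoints do not
accumulate pending canceller tasks.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

class CheckpointTimeoutExample {
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    ScheduledFuture<?> scheduleCanceller(long checkpointId, long timeoutMillis) {
        // Cancels the checkpoint if it runs longer than the timeout.
        return timer.schedule(
                () -> cancelCheckpoint(checkpointId),
                timeoutMillis, TimeUnit.MILLISECONDS);
    }

    void onCheckpointComplete(ScheduledFuture<?> cancellerTask) {
        // Without this, every completed checkpoint leaves a pending
        // canceller task behind until its timeout fires.
        cancellerTask.cancel(false);
    }

    private void cancelCheckpoint(long checkpointId) {
        System.out.println("Cancelling expired checkpoint " + checkpointId);
    }
}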



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-5961) Queryable State is broken for HeapKeyedStateBackend

2017-03-03 Thread Stefan Richter (JIRA)
Stefan Richter created FLINK-5961:
-

 Summary: Queryable State is broken for HeapKeyedStateBackend
 Key: FLINK-5961
 URL: https://issues.apache.org/jira/browse/FLINK-5961
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Stefan Richter


The current implementation of queryable state on `HeapKeyedStateBackend` 
attempts to handle concurrency by using `ConcurrentHashMap`s as data structures.

However, the implementation has at least two issues:

1) Concurrent modifications of state objects: state can be modified 
concurrently with a query, e.g. an element being removed from a list. This can 
result in exceptions or incorrect results.

2) The StateDescriptor indicates whether a `ConcurrentHashMap` is required 
because queryable state is active. On restore, this information is unknown at 
first and the implementation always uses plain hash maps. When the state is 
then finally registered, all previously existing maps are not thread-safe.
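A minimal sketch of issue 2 and one possible remedy, under simplified
assumptions (plain Java maps standing in for the backend's state tables; none
of the names below are Flink's): state restored before registration lives in a
plain HashMap, so registering it as queryable later has to migrate it into a
thread-safe map.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class RestoredStateExample {
    // On restore, the queryable flag is not yet known, so a plain
    // (non-thread-safe) map is used to hold the restored entries.
    private Map<String, Long> stateTable = new HashMap<>();

    void restore(Map<String, Long> snapshot) {
        stateTable.putAll(snapshot);
    }

    // When the state is later registered as queryable, the existing map is
    // still a plain HashMap, so concurrent queries against it are unsafe.
    // One remedy: copy it into a ConcurrentHashMap at registration time.
    void registerQueryable() {
        stateTable = new ConcurrentHashMap<>(stateTable);
    }
}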



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-5960) Make CheckpointCoordinator less blocking

2017-03-03 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-5960:


 Summary: Make CheckpointCoordinator less blocking
 Key: FLINK-5960
 URL: https://issues.apache.org/jira/browse/FLINK-5960
 Project: Flink
  Issue Type: Improvement
  Components: State Backends, Checkpointing
Affects Versions: 1.2.0, 1.3.0
Reporter: Till Rohrmann


Currently the {{CheckpointCoordinator}} performs its operations under a global 
lock. This also includes writing checkpoint data out to a state storage. If 
this operation blocks, then the whole checkpointing process stands still. I 
think we should rework the {{CheckpointCoordinator}} to make fewer assumptions 
about external systems and to tolerate write failures and timeouts. 
Furthermore, we should try to limit the scope of locking and the execution of 
potentially blocking operations under the lock. This will improve the runtime 
behaviour of the {{CheckpointCoordinator}}.
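A minimal sketch of the locking pattern being proposed, with illustrative
names (this is not the coordinator's actual code): prepare the checkpoint data
under the lock, but perform the potentially blocking write to external storage
outside of it.

class CoordinatorLockingExample {
    private final Object lock = new Object();

    // Anti-pattern: blocking I/O while holding the global lock stalls
    // every other coordinator operation for the duration of the write.
    void checkpointBlocking(ExternalStore store) {
        synchronized (lock) {
            byte[] metadata = collectMetadata();
            store.write(metadata);  // may block indefinitely
        }
    }

    // Preferred: only capture state under the lock; write outside of it.
    void checkpointNonBlocking(ExternalStore store) {
        byte[] metadata;
        synchronized (lock) {
            metadata = collectMetadata();
        }
        store.write(metadata);  // blocking here no longer stalls the coordinator
    }

    private byte[] collectMetadata() { return new byte[0]; }

    interface ExternalStore {
        void write(byte[] data);
    }
}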



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-5959) Verify that mesos-appmaster.sh respects env.java.opts(.jobmanager)

2017-03-03 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-5959:


 Summary: Verify that mesos-appmaster.sh respects 
env.java.opts(.jobmanager)
 Key: FLINK-5959
 URL: https://issues.apache.org/jira/browse/FLINK-5959
 Project: Flink
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.2.0, 1.3.0
Reporter: Till Rohrmann
Priority: Minor


The {{mesos-appmaster.sh}} script should respect Flink's configuration option 
{{env.java.opts(.jobmanager)}} since it hosts the job manager. We should test 
whether this is the case and if not, then change it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Machine Learning on Flink - Next steps

2017-03-03 Thread Theodore Vasiloudis
Hello all,

From our previous discussion started by Stavros, we decided to start a
planning document [1]
to figure out possible next steps for ML on Flink.

Our concerns were mainly ensuring active development while satisfying the
needs of
the community.

We have listed a number of proposals for future work in the document. In
short they are:

   - Offline learning with the batch API
   - Online learning
   - Offline learning with the streaming API
   - Low-latency prediction serving

I saw there are a number of people willing to work on ML for Flink, but the
truth is that we cannot
cover all of these suggestions without fragmenting the development too much.

So my recommendation is to pick out 2 of these options, create design
documents and build prototypes for each library.
We can then assess their viability and together with the community decide
if we should try
to include one (or both) of them in the main Flink distribution.

So I invite people to express their opinion about which task they would be
willing to contribute to,
and hopefully we can settle on two of these options.

Once that is done we can decide how we do the actual work. Since this is
highly experimental
I would suggest we work on repositories where we have complete control.

For that purpose I have created an organization [2] on Github which we can
use to create repositories and teams that work on them in an organized
manner.
Once enough work has accumulated we can start discussing contributing the
code
to the main distribution.

Regards,
Theodore

[1]
https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/
[2] https://github.com/flinkml


Re: [DISCUSS] Flink ML roadmap

2017-03-03 Thread Theodore Vasiloudis
It seems like a relatively new project, backed by Intel.

My impression from the doc Roberto linked is that they might switch to
using Beam instead of Spark (?)

I'm cc'ing Soila, who is a developer of TAP and has worked on FlinkML in the
past, perhaps she has some input on how they plan to work with streaming
and ML in TAP.

Repos:
[1] https://github.com/tapanalyticstoolkit/

On Fri, Mar 3, 2017 at 12:24 PM, Stavros Kontopoulos <
st.kontopou...@gmail.com> wrote:

> Interesting, thanks @Roberto. I see that only the TAP Analytics Toolkit
> supports streaming. I am not aware of its market share; does anyone know?
>
> Best,
> Stavros
>
> On Fri, Mar 3, 2017 at 11:50 AM, Theodore Vasiloudis <
> theodoros.vasilou...@gmail.com> wrote:
>
> > Thank you for the links, Roberto. I did not know that Beam was working
> > on an ML abstraction as well. I'm sure we can learn from that.
> >
> > I'll start another thread today where we can discuss next steps and
> > action points now that we have a few different paths to follow listed on
> > the shared doc, since our deadline was today. We welcome further
> > discussions of course.
> >
> > Regards,
> > Theodore
> >
> > On Thu, Mar 2, 2017 at 10:52 AM, Roberto Bentivoglio <
> > roberto.bentivog...@radicalbit.io> wrote:
> >
> > > Hi All,
> > >
> > > First of all I'd like to introduce myself: my name is Roberto
> > > Bentivoglio and I'm currently working for Radicalbit, like Andrea Spina
> > > (he already wrote in this thread).
> > > I haven't had the chance to contribute directly to Flink up to now, but
> > > some colleagues of mine have been doing so for at least a year (they
> > > have also contributed to the machine learning library).
> > >
> > > I hope I'm not jumping into the discussion too late; it's really
> > > interesting, and the analysis document depicts the currently available
> > > scenarios really well. Many thanks for your effort!
> > >
> > > If I can add my two cents to the discussion, I'd like to add the
> > > following:
> > >  - it's clear that the Flink community is currently more deeply focused
> > > on streaming features than on batch features. For this reason I think
> > > that implementing "Offline learning with Streaming API" is really a
> > > great idea.
> > >  - I think that the "Online learning" option is really a good fit for
> > > Flink, but maybe we could initially give a higher priority to the
> > > "Offline learning with Streaming API" option. However, I think that this
> > > option will be the main goal for the mid/long term.
> > >  - we implemented a library based on jpmml-evaluator [1] and Flink
> > > called "flink-jpmml". Using this library you can train the models on
> > > external systems and use those models, after you've exported them in the
> > > PMML standard format, to run evaluations on top of the DataStream API.
> > > We haven't open sourced this library yet, but we're planning to do so in
> > > the coming weeks. We'd like to complete the documentation and the final
> > > code reviews before sharing it. I hope it will be helpful for the
> > > community to enhance the ML support in Flink.
> > >  - I'd like also to mention that the Apache Beam community is thinking
> > > about an ML DSL. There is a design document and a couple of Jira tasks
> > > for that [2][3]
> > >
> > > At Radicalbit we're really keen to focus our effort on improving the
> > > ML support in Flink, and our team will contribute to this effort on a
> > > regular basis.
> > >
> > > Looking forward to work with you!
> > >
> > > Many thanks,
> > > Roberto
> > >
> > > [1] - https://github.com/jpmml/jpmml-evaluator
> > > [2] -
> > > https://docs.google.com/document/d/17cRZk_yqHm3C0fljivjN66MbLkeKS1yjo4PBECHb-xA
> > > [3] - https://issues.apache.org/jira/browse/BEAM-303
> > >
> > > On 28 February 2017 at 19:35, Gábor Hermann 
> > wrote:
> > >
> > > > Hi Philipp,
> > > >
> > > > It's great to hear you are interested in Flink ML!
> > > >
> > > > Based on your description, your prototype seems like an interesting
> > > > approach for combining online+offline learning. If you're interested,
> > we
> > > > might find a way to integrate your work, or at least your ideas, into
> > > Flink
> > > > ML if we decide on a direction that fits your approach. I think your
> > work
> > > > could be relevant for almost all the directions listed there (if I
> > > > understand correctly you'd even like to serve predictions on
> unlabeled
> > > > data).
> > > >
> > > > Feel free to join the discussion in the docs you've mentioned :)
> > > >
> > > > Cheers,
> > > > Gabor
> > > >
> > > >
> > > > On 2017-02-27 18:39, Philipp Zehnder wrote:
> > > >
> > > > Hello all,
> > > >>
> > > >> I’m new to this mailing list and I wanted to introduce myself. My
> > > >> name is Philipp Zehnder and I’m a master’s student in Computer
> > > >> Science at the Karlsruhe Institute of Technology in Germany,
> > > >> currently writing my master’s thesis with the main 

Re: [DISCUSS] Flink ML roadmap

2017-03-03 Thread Stavros Kontopoulos
Interesting, thanks @Roberto. I see that only the TAP Analytics Toolkit
supports streaming. I am not aware of its market share; does anyone know?

Best,
Stavros

On Fri, Mar 3, 2017 at 11:50 AM, Theodore Vasiloudis <
theodoros.vasilou...@gmail.com> wrote:

> Thank you for the links, Roberto. I did not know that Beam was working on an
> ML abstraction as well. I'm sure we can learn from that.
>
> I'll start another thread today where we can discuss next steps and action
> points now that we have a few different paths to follow listed on the
> shared doc,
> since our deadline was today. We welcome further discussions of course.
>
> Regards,
> Theodore
>
> On Thu, Mar 2, 2017 at 10:52 AM, Roberto Bentivoglio <
> roberto.bentivog...@radicalbit.io> wrote:
>
> > Hi All,
> >
> > First of all I'd like to introduce myself: my name is Roberto Bentivoglio
> > and I'm currently working for Radicalbit, like Andrea Spina (he already
> > wrote in this thread).
> > I haven't had the chance to contribute directly to Flink up to now, but
> > some colleagues of mine have been doing so for at least a year (they have
> > also contributed to the machine learning library).
> >
> > I hope I'm not jumping into the discussion too late; it's really
> > interesting, and the analysis document depicts the currently available
> > scenarios really well. Many thanks for your effort!
> >
> > If I can add my two cents to the discussion, I'd like to add the
> > following:
> >  - it's clear that the Flink community is currently more deeply focused
> > on streaming features than on batch features. For this reason I think
> > that implementing "Offline learning with Streaming API" is really a
> > great idea.
> >  - I think that the "Online learning" option is really a good fit for
> > Flink, but maybe we could initially give a higher priority to the
> > "Offline learning with Streaming API" option. However, I think that this
> > option will be the main goal for the mid/long term.
> >  - we implemented a library based on jpmml-evaluator [1] and Flink called
> > "flink-jpmml". Using this library you can train the models on external
> > systems and use those models, after you've exported them in the PMML
> > standard format, to run evaluations on top of the DataStream API. We
> > haven't open sourced this library yet, but we're planning to do so in the
> > coming weeks. We'd like to complete the documentation and the final code
> > reviews before sharing it. I hope it will be helpful for the community to
> > enhance the ML support in Flink.
> >  - I'd like also to mention that the Apache Beam community is thinking
> > about an ML DSL. There is a design document and a couple of Jira tasks
> > for that [2][3]
> >
> > At Radicalbit we're really keen to focus our effort on improving the ML
> > support in Flink, and our team will contribute to this effort on a
> > regular basis.
> >
> > Looking forward to work with you!
> >
> > Many thanks,
> > Roberto
> >
> > [1] - https://github.com/jpmml/jpmml-evaluator
> > [2] -
> > https://docs.google.com/document/d/17cRZk_yqHm3C0fljivjN66MbLkeKS1yjo4PBECHb-xA
> > [3] - https://issues.apache.org/jira/browse/BEAM-303
> >
> > On 28 February 2017 at 19:35, Gábor Hermann 
> wrote:
> >
> > > Hi Philipp,
> > >
> > > It's great to hear you are interested in Flink ML!
> > >
> > > Based on your description, your prototype seems like an interesting
> > > approach for combining online+offline learning. If you're interested,
> we
> > > might find a way to integrate your work, or at least your ideas, into
> > Flink
> > > ML if we decide on a direction that fits your approach. I think your
> work
> > > could be relevant for almost all the directions listed there (if I
> > > understand correctly you'd even like to serve predictions on unlabeled
> > > data).
> > >
> > > Feel free to join the discussion in the docs you've mentioned :)
> > >
> > > Cheers,
> > > Gabor
> > >
> > >
> > > On 2017-02-27 18:39, Philipp Zehnder wrote:
> > >
> > > Hello all,
> > >>
> > >> I’m new to this mailing list and I wanted to introduce myself. My name
> > >> is Philipp Zehnder and I’m a master’s student in Computer Science at
> > >> the Karlsruhe Institute of Technology in Germany, currently writing my
> > >> master’s thesis with the main goal of integrating reusable machine
> > >> learning components into a stream processing network. One part of my
> > >> thesis is to create an API for distributed online machine learning.
> > >>
> > >> I saw that there are some recent discussions about how to continue the
> > >> development of Flink ML [1] and I want to share some of my experiences
> > >> and maybe get some feedback from the community on my ideas.
> > >>
> > >> As I am new to open source projects, I hope this is the right place
> > >> for this.
> > >>
> > >> In the beginning, I had a look at different already existing
> > >> frameworks, like Apache SAMOA for example, which is great and has a
> > >> lot of useful resources. However, 

Re: Dataset and select/split functionality

2017-03-03 Thread CPC
Hi Fabian,

Thank you for your explanation. Also, can you give an example of how the
optimizer behaves under the assumption that the outputs of a function are
replicated?

Thank you...

On 3 March 2017 at 13:52, Fabian Hueske  wrote:

> Hi CPC,
>
> we had several requests in the past to add this feature. However, adding
> select/split for DataSet is much(!) more work than you would expect.
> As you pointed out, we have to go through the optimizer, which assumes that
> the outputs of a function are replicated.
> This is pretty much wired in and you would have to touch a lot of code.
>
> I'm sorry, but I am not comfortable making such a big change.
> IMO, the potential gains are not worth the effort of implementation and
> verification and the risk of breaking something.
>
> Best, Fabian
>
>
>
> 2017-03-02 16:31 GMT+01:00 CPC :
>
> > Hi all,
> >
> > We will try to implement select/split functionality for the batch API. We
> > looked at the streaming side and understand how it works, but since the
> > streaming side does not include an optimizer, it was easier there. Since
> > adding such a runtime operator will affect the optimizer layer as well, is
> > there a part you would like us to pay particular attention to?
> >
> > Thanks...
> >
>


[jira] [Created] (FLINK-5958) Asynchronous snapshots for heap-based keyed state backends

2017-03-03 Thread Stefan Richter (JIRA)
Stefan Richter created FLINK-5958:
-

 Summary: Asynchronous snapshots for heap-based keyed state backends
 Key: FLINK-5958
 URL: https://issues.apache.org/jira/browse/FLINK-5958
 Project: Flink
  Issue Type: Improvement
  Components: State Backends, Checkpointing
Affects Versions: 1.3.0
Reporter: Stefan Richter
Assignee: Stefan Richter


The synchronous checkpointing mechanism of all heap-based keyed state backends 
is often painful for users because it stops processing for the duration of the 
checkpoint.

We could implement an option for heap-based keyed state backends that allows 
for asynchronous checkpoints, using copy-on-write.
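A heavily simplified sketch of the copy-on-write idea (an atomic whole-map
swap standing in for Flink's state tables; a real implementation would copy at
a much finer granularity): the snapshot thread serializes a frozen version
while the processing thread keeps mutating fresh copies, so processing is not
stopped for the duration of the checkpoint.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

class CopyOnWriteSnapshotExample {
    // The live state; mutated only via copy-and-swap.
    private final AtomicReference<Map<String, Long>> state =
            new AtomicReference<>(new HashMap<>());

    // Processing thread: copy-on-write updates leave old versions intact.
    void update(String key, long value) {
        Map<String, Long> copy = new HashMap<>(state.get());
        copy.put(key, value);
        state.set(copy);
    }

    // Snapshot thread: grab the current version and serialize it in the
    // background; concurrent updates no longer block on the snapshot.
    Map<String, Long> snapshot() {
        return state.get();  // frozen: updates only ever swap the reference
    }
}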



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: Dataset and select/split functionality

2017-03-03 Thread Fabian Hueske
Hi CPC,

we had several requests in the past to add this feature. However, adding
select/split for DataSet is much(!) more work than you would expect.
As you pointed out, we have to go through the optimizer, which assumes that
the outputs of a function are replicated.
This is pretty much wired in and you would have to touch a lot of code.

I'm sorry, but I am not comfortable making such a big change.
IMO, the potential gains are not worth the effort of implementation and
verification and the risk of breaking something.

Best, Fabian



2017-03-02 16:31 GMT+01:00 CPC :

> Hi all,
>
> We will try to implement select/split functionality for the batch API. We
> looked at the streaming side and understand how it works, but since the
> streaming side does not include an optimizer, it was easier there. Since
> adding such a runtime operator will affect the optimizer layer as well, is
> there a part you would like us to pay particular attention to?
>
> Thanks...
>
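To illustrate the replication assumption Fabian describes, here is a short
sketch against the DataSet API as it stands: when one dataset has two
consumers, each consumer observes the complete output, which is why per-record
routing (split/select) currently has to be emulated with filters.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class SplitEmulationExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Integer> numbers = env.fromElements(1, 2, 3, 4, 5, 6);

        // Both filters consume the SAME dataset: the full output of
        // `numbers` is replicated to each consumer, so routing individual
        // records to one side or the other does not fit the current model.
        DataSet<Integer> evens = numbers.filter(n -> n % 2 == 0);
        DataSet<Integer> odds  = numbers.filter(n -> n % 2 != 0);

        evens.print();
        odds.print();
    }
}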


[jira] [Created] (FLINK-5957) Remove `getAccumulatorType` method from `AggregateFunction`

2017-03-03 Thread sunjincheng (JIRA)
sunjincheng created FLINK-5957:
--

 Summary: Remove `getAccumulatorType` method from 
`AggregateFunction`
 Key: FLINK-5957
 URL: https://issues.apache.org/jira/browse/FLINK-5957
 Project: Flink
  Issue Type: Sub-task
Reporter: sunjincheng
Assignee: sunjincheng


Built-in aggregate functions need not implement the `getAccumulatorType` 
method. We can get the TypeInformation via `TypeInformation.of()` or 
`TypeInformation.of(new TypeHint[AGG.type](){})`. 
What do you think? [~fhueske] [~shaoxuan] 
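For reference, a short sketch of extracting an accumulator's type with Flink's
TypeInformation utilities in Java; the accumulator class here is hypothetical:

import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;

public class AccumulatorTypeExample {
    // Hypothetical accumulator for a sum-with-count aggregate.
    public static class SumAccumulator {
        public long sum;
        public long count;
    }

    public static void main(String[] args) {
        // Non-generic classes can use the Class-based variant ...
        TypeInformation<SumAccumulator> direct =
                TypeInformation.of(SumAccumulator.class);

        // ... while the TypeHint variant also captures generic parameters.
        TypeInformation<SumAccumulator> viaHint =
                TypeInformation.of(new TypeHint<SumAccumulator>() {});

        System.out.println(direct + " / " + viaHint);
    }
}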



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: [DISCUSS] Flink ML roadmap

2017-03-03 Thread Theodore Vasiloudis
Thank you for the links, Roberto. I did not know that Beam was working on an
ML abstraction as well. I'm sure we can learn from that.

I'll start another thread today where we can discuss next steps and action
points now that we have a few different paths to follow listed on the
shared doc,
since our deadline was today. We welcome further discussions of course.

Regards,
Theodore

On Thu, Mar 2, 2017 at 10:52 AM, Roberto Bentivoglio <
roberto.bentivog...@radicalbit.io> wrote:

> Hi All,
>
> First of all I'd like to introduce myself: my name is Roberto Bentivoglio
> and I'm currently working for Radicalbit, like Andrea Spina (he already
> wrote in this thread).
> I haven't had the chance to contribute directly to Flink up to now, but
> some colleagues of mine have been doing so for at least a year (they have
> also contributed to the machine learning library).
>
> I hope I'm not jumping into the discussion too late; it's really
> interesting, and the analysis document depicts the currently available
> scenarios really well. Many thanks for your effort!
>
> If I can add my two cents to the discussion, I'd like to add the following:
>  - it's clear that the Flink community is currently more deeply focused on
> streaming features than on batch features. For this reason I think that
> implementing "Offline learning with Streaming API" is really a great idea.
>  - I think that the "Online learning" option is really a good fit for
> Flink, but maybe we could initially give a higher priority to the
> "Offline learning with Streaming API" option. However, I think that this
> option will be the main goal for the mid/long term.
>  - we implemented a library based on jpmml-evaluator [1] and Flink called
> "flink-jpmml". Using this library you can train the models on external
> systems and use those models, after you've exported them in the PMML
> standard format, to run evaluations on top of the DataStream API. We
> haven't open sourced this library yet, but we're planning to do so in the
> coming weeks. We'd like to complete the documentation and the final code
> reviews before sharing it. I hope it will be helpful for the community to
> enhance the ML support in Flink.
>  - I'd like also to mention that the Apache Beam community is thinking
> about an ML DSL. There is a design document and a couple of Jira tasks for
> that [2][3]
>
> At Radicalbit we're really keen to focus our effort on improving the ML
> support in Flink, and our team will contribute to this effort on a regular
> basis.
>
> Looking forward to work with you!
>
> Many thanks,
> Roberto
>
> [1] - https://github.com/jpmml/jpmml-evaluator
> [2] -
> https://docs.google.com/document/d/17cRZk_yqHm3C0fljivjN66MbLkeKS1yjo4PBECHb-xA
> [3] - https://issues.apache.org/jira/browse/BEAM-303
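The flink-jpmml workflow Roberto describes (train externally, export to PMML,
evaluate inside a DataStream job) can be sketched roughly as below. The scorer
interface and the model loading are assumptions, since the library was not yet
open sourced at the time; only RichMapFunction is Flink's actual API. A stream
would then apply it with something like
stream.map(new PmmlScoringFunction("/models/model.pmml")), where the path is
hypothetical.

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// Hypothetical PMML scorer: stands in for whatever flink-jpmml /
// jpmml-evaluator exposes; loading and scoring details are assumptions.
interface PmmlScorer {
    double score(double[] features);
}

class PmmlScoringFunction extends RichMapFunction<double[], Double> {
    private final String modelPath;
    private transient PmmlScorer scorer;

    PmmlScoringFunction(String modelPath) {
        this.modelPath = modelPath;
    }

    @Override
    public void open(Configuration parameters) {
        // Load the externally trained, PMML-exported model once per task,
        // not once per record.
        this.scorer = loadModel(modelPath);
    }

    @Override
    public Double map(double[] features) {
        return scorer.score(features);
    }

    private static PmmlScorer loadModel(String path) {
        // Placeholder: a real implementation would parse the PMML file here.
        return features -> 0.0;
    }
}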
>
> On 28 February 2017 at 19:35, Gábor Hermann  wrote:
>
> > Hi Philipp,
> >
> > It's great to hear you are interested in Flink ML!
> >
> > Based on your description, your prototype seems like an interesting
> > approach for combining online+offline learning. If you're interested, we
> > might find a way to integrate your work, or at least your ideas, into
> Flink
> > ML if we decide on a direction that fits your approach. I think your work
> > could be relevant for almost all the directions listed there (if I
> > understand correctly you'd even like to serve predictions on unlabeled
> > data).
> >
> > Feel free to join the discussion in the docs you've mentioned :)
> >
> > Cheers,
> > Gabor
> >
> >
> > On 2017-02-27 18:39, Philipp Zehnder wrote:
> >
> > Hello all,
> >>
> >> I’m new to this mailing list and I wanted to introduce myself. My name
> >> is Philipp Zehnder and I’m a master’s student in Computer Science at the
> >> Karlsruhe Institute of Technology in Germany, currently writing my
> >> master’s thesis with the main goal of integrating reusable machine
> >> learning components into a stream processing network. One part of my
> >> thesis is to create an API for distributed online machine learning.
> >>
> >> I saw that there are some recent discussions about how to continue the
> >> development of Flink ML [1] and I want to share some of my experiences
> >> and maybe get some feedback from the community on my ideas.
> >>
> >> As I am new to open source projects, I hope this is the right place for
> >> this.
> >>
> >> In the beginning, I had a look at different already existing frameworks,
> >> like Apache SAMOA for example, which is great and has a lot of useful
> >> resources. However, as Flink is currently focusing on streaming, from my
> >> point of view it makes sense to also have a streaming machine learning
> >> API as part of the Flink ecosystem.
> >>
> >> I’m currently working on building a prototype for a distributed streaming
> >> machine learning library based on Flink that can be used for online and
> >> “classical” offline learning.
> >>
> >> The machine learning algorithm takes labeled and non-labeled data. On a
> >> labeled data point 

[jira] [Created] (FLINK-5956) Add retract method into the aggregateFunction

2017-03-03 Thread Shaoxuan Wang (JIRA)
Shaoxuan Wang created FLINK-5956:


 Summary: Add retract method into the aggregateFunction
 Key: FLINK-5956
 URL: https://issues.apache.org/jira/browse/FLINK-5956
 Project: Flink
  Issue Type: Sub-task
  Components: Table API & SQL
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang


The retraction method helps with processing updated messages. It will also be 
very helpful for window aggregations. This PR will first add retraction 
methods to the aggregate functions, so that on-going over-window aggregations 
can benefit from them.
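A minimal sketch of what an accumulate/retract pair can look like for a simple
aggregate, using illustrative names rather than the actual Table API contract:
retract undoes the effect of an earlier accumulate, which is what lets an over
window drop expired rows (or handle an updated message) without recomputing
the aggregate from scratch.

// Illustrative count-and-sum accumulator with a retract method; not the
// actual Table API AggregateFunction contract.
class SumWithRetract {
    static class Acc {
        long sum;
        long count;
    }

    Acc createAccumulator() {
        return new Acc();
    }

    void accumulate(Acc acc, long value) {
        acc.sum += value;
        acc.count++;
    }

    // Undoes a previous accumulate(acc, value), e.g. when a row leaves the
    // over window or an update retracts an earlier message.
    void retract(Acc acc, long value) {
        acc.sum -= value;
        acc.count--;
    }

    long getValue(Acc acc) {
        return acc.sum;
    }
}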



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)