Re: Opening a discussion on FlinkML

2016-02-14 Thread Martin Neumann
I think the focus of this discussion should be how we proceed, not what to
do. The "what" comes from the committers anyway.

There are several people who would like to contribute, including people from
the Streamline project. Having pull requests that are older than 6 months is
not good for any project.
The main question is how we can develop the library further with high
standards but without creating a bottleneck that holds things back too much.

In my opinion it would be best if we find enough resources to keep things
inside Flink. However, if we have to depend on people who are
already stretched for time, splitting it out might be the better option
(path 1 from Theo's original mail).

cheers Martin




On Fri, Feb 12, 2016 at 3:54 PM, Suneel Marthi  wrote:

> On Fri, Feb 12, 2016 at 9:40 AM, Simone Robutti <
> simone.robu...@radicalbit.io> wrote:
>
> > @Suneel
> >
> > 1) Totally agree, as I wrote before.
> >
> > 2) I agree that support for PMML is premature, but we shouldn't
> > underestimate the variety and complexity of the uses of ML models in
> > industry. The adoption of Flink will hopefully grow and reach less
> > innovative settings where Random Forests and SVMs are still the main
> > algorithms in use. In those settings there are legacy systems that
> > justify the use of PMML to port models. Still, FlinkML is at an early
> > stage, so as you said, it doesn't make sense to spend time on such a
> > feature right now.
> >
>
> +1, as I mentioned earlier the PMML spec only supports classification and
> clustering (I last checked this in Aug 2015, pretty sure it would not have
> changed since then); hence 'Yes' it has some limited uses; 'No' - it's too
> premature to even talk about it given the present state of FlinkML.
>
> >
> > 3) This would be really interesting. How do you imagine that the
> > integration with a distributed processing engine would work?
> >
>
> I am not sure yet; we are still exploring this on the Mahout project as an
> addition to Mahout-Samsara. Most of the statistics and probabilistic
> modeling would then be supported by Figaro (Bayesian, MCMC etc.) and hence
> can be external to FlinkML.
>
> Figaro is Scala based. See https://github.com/p2t2/figaro
>
> I believe there are a few other similar DSLs out there; I need to dig up my
> old emails.
>
> (Not sure if it's under the ASLv2 license; this needs verification.)
>
>
> >
> > 5) Agree on this one too. To my knowledge it would be the best option
> > together with SAMOA (for the streaming part).
> >
>
> There's already Flink - Samoa integration in place IIRC.
>
>
> >
> > 2016-02-12 15:25 GMT+01:00 Suneel Marthi :
> >
> > > My 2 cents as someone who's done ML over the years - having worked on
> > > Oryx 2.0 and Mahout and having used Spark MLlib (read as "had no choice
> > > due to strict workplace enforcement") and who understands their
> > > limitations well.
> > >
> > > 1. FlinkML in its present form seems like "do it like how Spark did it".
> > >
> > > 2. The recent discussion about PMML support in Flink to my mind is a
> > > clear example of putting the cart before the horse. Why are we even
> > > talking PMML when there aren't many ML algorithms in FlinkML?
> > >
> > > For a really good implementation of PMML and how it's being used (with
> > > jPMML), I suggest looking at the Oryx 2.0 project. The PMML
> > > implementation in Oryx 2.0 predates Spark and is a clean example of
> > > separating PMML from the underlying framework (Spark or Flink).
> > >
> > > We have had PMML discussions on the Mahout project in the past, but the
> > > idea never gained any traction in large part due to PMML spec
> > > limitations (mostly for clustering and classification algorithms) and
> > > the lack of adoption within the community.
> > >
> > > See the discussion here, and specifically Ted Dunning's comment on PMML:
> > >
> > > http://mail-archives.apache.org/mod_mbox/mahout-dev/201503.mbox/%3CCAJwFCa1%3DAw%2B3G54FgkYdTH%3DoNQBRqfeU-SS19iCFKMWbAfWzOQ%40mail.gmail.com%3E
> > >
> > > Most of the ML in practice (deployed in production) today is
> > > Recommenders and Deep Learning, neither of which is supported by the
> > > PMML spec.
> > >
> > > 3. Leveraging a probabilistic programming language like Figaro might be
> > > a good way to go (just my thought) - that way most of the ML groundwork
> > > would be external to Flink.
> > >
> > > 4. Within the Mahout community, we have been talking about (and are
> > > working on) redoing the Samsara distributed linear algebra framework to
> > > support Flink (in large part we realized that Flink is a better platform
> > > than the more popular one out there that Slim wouldn't wanna talk
> > > about :) ).
> > >
> > > We should be having a release out in the next few weeks (depending on
> > > committers' availability). It would be great if FlinkML had something
> > > like it.
> > >
> > > There was a good audience for Sebastian's talk on this subject at #FF15
> > > in October.
> > >
> > > 5. Its a good idea to add Flink su

[jira] [Created] (FLINK-3397) Failed streaming jobs should fall back to the most recent checkpoint/savepoint

2016-02-14 Thread Gyula Fora (JIRA)
Gyula Fora created FLINK-3397:
-

 Summary: Failed streaming jobs should fall back to the most recent 
checkpoint/savepoint
 Key: FLINK-3397
 URL: https://issues.apache.org/jira/browse/FLINK-3397
 Project: Flink
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.0
Reporter: Gyula Fora
Priority: Minor


The current fallback behaviour in case of a streaming job failure is slightly 
counterintuitive:

If a job fails, it will fall back to the most recent checkpoint (if any), even if 
a more recent savepoint was taken. This means that the system does not regard 
savepoints as checkpoints, only as points from which a job can be manually 
restarted.

I suggest changing this so that savepoints are also regarded as checkpoints in 
case of a failure and are also used to automatically restore the streaming job.
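
The difference between the two behaviours can be illustrated with a small, self-contained sketch. This is not Flink code: the `Snapshot` type and both function names are hypothetical and only model the selection rule described above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical model of a completed state snapshot (not a Flink class).
    kind: str          # "checkpoint" (automatic) or "savepoint" (user-triggered)
    completed_at: int  # completion time, e.g. epoch milliseconds


def restore_point_current(history: List[Snapshot]) -> Optional[Snapshot]:
    """Behaviour described in the issue: only checkpoints are candidates."""
    checkpoints = [s for s in history if s.kind == "checkpoint"]
    return max(checkpoints, key=lambda s: s.completed_at, default=None)


def restore_point_proposed(history: List[Snapshot]) -> Optional[Snapshot]:
    """Proposed behaviour: the most recent snapshot of either kind wins."""
    return max(history, key=lambda s: s.completed_at, default=None)


# A checkpoint followed by a more recent savepoint.
history = [Snapshot("checkpoint", 1000), Snapshot("savepoint", 2000)]
```

With this history, the current rule selects the older checkpoint, while the proposed rule selects the newer savepoint.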



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [ANNOUNCE] Flink 0.10.2 Released

2016-02-14 Thread Henry Saputra
Yep, great work, Ufuk

On Friday, February 12, 2016, Kostas Kloudas wrote:

> Yes thanks a lot Ufuk!
>
> > On Feb 12, 2016, at 3:09 PM, Till Rohrmann wrote:
> >
> > Thanks for being our release manager Ufuk :-) Great work!
> >
> > On Fri, Feb 12, 2016 at 2:15 PM, Robert Metzger wrote:
> >
> >> Thank you for doing a release Ufuk!
> >>
> >> I just tweeted about it:
> >> https://twitter.com/ApacheFlink/status/698130110709428224
> >>
> >>
> >> On Fri, Feb 12, 2016 at 2:13 PM, Maximilian Michels wrote:
> >>
> >>> Bravo! Thank you Ufuk for managing the release!
> >>>
> >>> On Fri, Feb 12, 2016 at 2:02 PM, Fabian Hueske wrote:
> >>>> Thanks Ufuk!
> >>>>
> >>>> 2016-02-12 12:57 GMT+01:00 Ufuk Celebi:
> >>>>
> >>>>> The Flink PMC is pleased to announce the availability of Flink 0.10.2.
> >>>>>
> >>>>> On behalf of the Flink PMC, I would like to thank everybody who
> >>>>> contributed to the release.
> >>>>>
> >>>>> The official release announcement:
> >>>>> http://flink.apache.org/news/2016/02/11/release-0.10.2.html
> >>>>>
> >>>>> Release binaries:
> >>>>> http://apache.openmirror.de/flink/flink-0.10.2/
> >>>>>
> >>>>> Please update your Maven dependencies to the new 0.10.2 version and
> >>>>> update your binaries.
> >>>>>
> >>>
> >>
>
>


[jira] [Created] (FLINK-3398) Flink Kafka consumer should support auto-commit opt-outs

2016-02-14 Thread Shikhar Bhushan (JIRA)
Shikhar Bhushan created FLINK-3398:
--

 Summary: Flink Kafka consumer should support auto-commit opt-outs
 Key: FLINK-3398
 URL: https://issues.apache.org/jira/browse/FLINK-3398
 Project: Flink
  Issue Type: Bug
Reporter: Shikhar Bhushan


Currently the Kafka source will commit consumer offsets to ZooKeeper, either 
upon a checkpoint if checkpointing is enabled, or otherwise periodically based on 
{{auto.commit.interval.ms}}.

It should be possible to opt out of committing consumer offsets to ZooKeeper. 
Kafka has this config as {{auto.commit.enable}} (0.8) and {{enable.auto.commit}} 
(0.9).
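
As a rough illustration of the requested opt-out, the sketch below decides from a plain properties dictionary whether offsets should be committed. The two key names come from the Kafka configs mentioned above; the function name, the precedence of the 0.9 key over the 0.8 key, and the default of committing when neither key is set are assumptions made for this example, not the Flink consumer's actual behaviour.

```python
from typing import Mapping

# Kafka config keys named in the issue: 0.9 first, then 0.8.
# Checking the 0.9 key first is an assumption of this sketch.
AUTO_COMMIT_KEYS = ("enable.auto.commit", "auto.commit.enable")


def should_commit_offsets(props: Mapping[str, str]) -> bool:
    """Return False only if the user explicitly opted out of offset commits."""
    for key in AUTO_COMMIT_KEYS:
        if key in props:
            # Property values are strings; treat anything but "true" as opt-out.
            return str(props[key]).strip().lower() == "true"
    # Assumed default: keep committing, which matches today's behaviour.
    return True
```

A source honouring this check would skip the offset commit both on checkpoints and on the periodic timer whenever `should_commit_offsets` returns `False`.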





[jira] [Created] (FLINK-3399) Count with timeout trigger

2016-02-14 Thread Shikhar Bhushan (JIRA)
Shikhar Bhushan created FLINK-3399:
--

 Summary: Count with timeout trigger
 Key: FLINK-3399
 URL: https://issues.apache.org/jira/browse/FLINK-3399
 Project: Flink
  Issue Type: Improvement
Reporter: Shikhar Bhushan
Priority: Minor


I created an implementation of a trigger that I'd like to contribute: 
https://gist.github.com/shikhar/2cb9f1b792be31b7c16e

An example application: if a sink function operates more efficiently when it 
writes in batches, then the windowing mechanism plus this trigger can be used. 
The count gives an upper bound on batch size and better control over memory 
usage, and the timeout ensures timeliness of the outputs.
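
The count-or-timeout idea can be sketched independently of Flink's windowing API. The class below is a hypothetical illustration, not the implementation from the gist: it flushes a batch when `max_count` elements have accumulated or when `timeout_s` seconds have passed since the first buffered element, whichever comes first. The clock is injectable so the behaviour can be checked deterministically, and `poll` stands in for the timer callback a real trigger would register.

```python
import time
from typing import Any, Callable, List, Optional


class CountOrTimeoutBatcher:
    """Hypothetical batcher: emit on max_count elements or after timeout_s."""

    def __init__(self, max_count: int, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_count = max_count
        self.timeout_s = timeout_s
        self.clock = clock
        self.buffer: List[Any] = []
        self.first_arrival = 0.0

    def add(self, item: Any) -> Optional[List[Any]]:
        """Buffer an item; return the batch if the count bound is reached."""
        if not self.buffer:
            # The timeout is measured from the first element of the batch.
            self.first_arrival = self.clock()
        self.buffer.append(item)
        if len(self.buffer) >= self.max_count:
            return self._flush()
        return None

    def poll(self) -> Optional[List[Any]]:
        """Return the batch if the timeout expired; call this periodically."""
        if self.buffer and self.clock() - self.first_arrival >= self.timeout_s:
            return self._flush()
        return None

    def _flush(self) -> List[Any]:
        batch, self.buffer = self.buffer, []
        return batch
```

The count bound caps memory per batch, while the timeout bounds how long a partially filled batch can wait before it reaches the sink.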


