Re: [ml] Convergence Criteria

2015-07-07 Thread Till Rohrmann
I think Sachin wants to provide something similar to the LossFunction but
for the convergence criterion. This would mean that the user can specify a
convergence calculator, for example to the optimization framework, which is
used from within an iterateWithTermination call.

I think this is a good idea and it would be a nice addition to the
optimization framework, to begin with. I think that with (data,
previousModel, currentModel) one can already model a lot of different
convergence criteria. @Sachin, do you want to take the lead?

However, I don’t think that a convergence criterion makes sense for the
Predictor, because there are also algorithms which are non-iterative, e.g.
some linear regression implementations. Those which are iterative, though,
can take a convergence calculator as a parameter value.
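A minimal sketch of what such a pluggable criterion could look like (the interface name, the double[] model representation, and the example criterion are assumptions for illustration, not existing Flink ML API):

```java
// Hypothetical sketch of a pluggable convergence criterion, modeled on the
// (previousModel, currentModel) comparison discussed above. Nothing here is
// actual Flink ML API; a plain double[] stands in for a weight vector.
interface ConvergenceCriterion {
    boolean isConverged(double[] previousModel, double[] currentModel);
}

// Example criterion: converged once the relative change of the weight
// vector drops below a threshold.
class RelativeWeightChange implements ConvergenceCriterion {
    private final double threshold;

    RelativeWeightChange(double threshold) {
        this.threshold = threshold;
    }

    private static double norm(double[] v) {
        double s = 0;
        for (double x : v) s += x * x;
        return Math.sqrt(s);
    }

    @Override
    public boolean isConverged(double[] prev, double[] curr) {
        double[] delta = new double[prev.length];
        for (int i = 0; i < prev.length; i++) delta[i] = prev[i] - curr[i];
        // Guard against division by zero for an all-zero previous model.
        return norm(delta) / Math.max(norm(prev), 1e-12) < threshold;
    }
}
```

An iterative solver could then accept any such criterion as a parameter value and evaluate it against the weight vectors before and after each iteration.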

Cheers,
Till

On Mon, Jul 6, 2015 at 6:17 PM, Theodore Vasiloudis 
theodoros.vasilou...@gmail.com wrote:

 
  The point is to provide user with the solution before an iteration and
 

 Am I correct to assume that by user you mean library developers here?
 Regular users who just use the API are unlikely to write their own
 convergence
 criterion function, yes? They would just set a value, for example the
 relative
 error change in gradient descent, perhaps after choosing the criterion from
 a few available options.

 We can very well employ the iterateWithTermination
  semantics even under this by setting the second term in the return value
 to
  originalSolution.filter(x => !converged)


 Yes, we use this approach in the GradientDescent code, where we check for
 convergence using the relative loss between iterations.

 So assuming that this is aimed at developers and checking for convergence
 can be done quite efficiently using the above technique, what extra
 functionality
 would these proposed functions provide?

 I expect any kind of syntactic sugar aimed at developers will still have to
 use
 iterateWithTermination underneath.
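For reference, the termination semantics under discussion can be sketched without Flink at all. In the toy loop below (the names are ours, and plain Lists stand in for distributed DataSets), the step function returns the next solution plus a termination set, and iteration stops as soon as that set is empty, mirroring the originalSolution.filter(x => !converged) trick:

```java
import java.util.List;
import java.util.function.Function;

// Flink-free sketch of iterateWithTermination semantics: iterate until the
// termination set comes back empty or maxIterations is reached. The real
// operator runs distributed; this is only a local illustration.
class IterateSketch {
    static final class Step<T> {
        final List<T> next;
        final List<T> termination;

        Step(List<T> next, List<T> termination) {
            this.next = next;
            this.termination = termination;
        }
    }

    static <T> List<T> iterateWithTermination(
            List<T> initial, int maxIterations, Function<List<T>, Step<T>> step) {
        List<T> solution = initial;
        for (int i = 0; i < maxIterations; i++) {
            Step<T> s = step.apply(solution);
            solution = s.next;
            if (s.termination.isEmpty()) break; // empty termination set => converged
        }
        return solution;
    }
}
```

For example, halving a value until it drops below 0.1 terminates once the filter that keeps not-yet-converged elements produces nothing.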

 On Mon, Jul 6, 2015 at 4:47 PM, Sachin Goel sachingoel0...@gmail.com
 wrote:

  Sure.
  Usually, the convergence criterion can be user defined. For example, for
 a
  linear regression problem, the user might want to run the training until the
  relative change in squared error falls below a specific threshold, or until
  the weights fail to shift by a relative or absolute percentage.
  Similarly, for example, in the kmeans problem, we again have several
  different convergence criteria based on the change in wcss value, or the
  relative change in centroids.
 
  The point is to provide user with the solution before an iteration and
  solution after an iteration and let them decide whether it's time to just
  be done with iterating. We can very well employ the
 iterateWithTermination
  semantics even under this by setting the second term in the return value
 to
  originalSolution.filter(x => !converged)
  where converged is determined by the  user defined convergence criteria.
 Of
  course, we're free to use our own convergence criteria in case the user
  doesn't specify any.
 
  This achieves the desired effect.
 
  This way the user has more fine-grained control over the training phase.
  Of course, to aid the user in defining their own convergence criteria, we
  can provide some generic functions in the Predictor itself, for example,
 to
  calculate the current value of the objective function. After this, the
  rest is up to the imagination of the user.
 
  Thinking more about this, I'd actually like to drop the idea of providing
  an iteration state to the user. That only makes it more complicated and
  further requires the user to know what exactly goes on in the algorithm.
 Usually,
  the before and after solutions should suffice. I got too hung up on my
  decision tree implementation and wanted to incorporate the convergence
  criteria used there too.
 
  Cheers!
  Sachin
 
  [Written from a mobile device. Might contain some typos or grammatical
  errors]
  On Jul 6, 2015 1:31 PM, Theodore Vasiloudis 
  theodoros.vasilou...@gmail.com wrote:
 
   Hello Sachin,
  
   could you share the motivation behind this? The iterateWithTermination
   function provides us with a means of checking for convergence during
   iterations, and checking for convergence depends highly on the
 algorithm
   being implemented. It could be the relative change in error, it could
   depend on the state (error+weights) history, or relative or absolute
  change
   in the model etc.
  
   Could you provide an example where having this function makes
 development
   easier? My concern is that this is a hard problem to generalize
 properly,
   given the dependence on the specific algorithm, model, and data.
  
   Regards,
   Theodore
  
   On Wed, Jul 1, 2015 at 9:23 PM, Sachin Goel sachingoel0...@gmail.com
   wrote:
  
Hi all
I'm trying to work out a general convergence framework for Machine
   Learning
Algorithms which utilize iterations for optimization. For now, I can
   think
of three kinds of convergence 

[jira] [Created] (FLINK-2324) Rework partitioned state storage

2015-07-07 Thread Gyula Fora (JIRA)
Gyula Fora created FLINK-2324:
-

 Summary: Rework partitioned state storage
 Key: FLINK-2324
 URL: https://issues.apache.org/jira/browse/FLINK-2324
 Project: Flink
  Issue Type: Improvement
Reporter: Gyula Fora
Assignee: Gyula Fora


Partitioned states are currently stored per-key in statehandles. This is 
alright for in-memory storage but is very inefficient for HDFS. 

The logic behind the current mechanism is that this approach provides a way to 
repartition a state without fetching the data from the external storage and 
only manipulating handles.

We should come up with a solution that achieves both: efficient external 
storage and repartitioning that only manipulates handles.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2323) Rename OperatorState methods to .value() and .update(..)

2015-07-07 Thread Gyula Fora (JIRA)
Gyula Fora created FLINK-2323:
-

 Summary: Rename OperatorState methods to .value() and .update(..)
 Key: FLINK-2323
 URL: https://issues.apache.org/jira/browse/FLINK-2323
 Project: Flink
  Issue Type: Improvement
Reporter: Gyula Fora
Assignee: Gyula Fora
Priority: Minor


We should rename the OperatorState methods getState and updateState to 
.value() and .update(..) to make the API clearer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Flink on Wikipedia

2015-07-07 Thread Stephan Ewen
It was just a suggestion.

@Matthias: You wrote the article, you decide. If you want to keep it,
that's fine!

On Tue, Jul 7, 2015 at 4:57 PM, Matthias J. Sax 
mj...@informatik.hu-berlin.de wrote:

 I agree with Kostas and don't see much danger that people get confused.
 Nevertheless, I will update the history section accordingly.


 On 07/07/2015 04:48 PM, Kostas Tzoumas wrote:
  I think it is clear to most people that the only official and (hopefully)
  up-to-date description of an Apache project is its Apache website, and
 any
  paper can get outdated. Perhaps we can change the link to a more
 up-to-date
  paper when we have one.
 
  I like the article, thanks Matthias!
 
  Kostas
 
  On Tue, Jul 7, 2015 at 4:43 PM, Stephan Ewen se...@apache.org wrote:
 
  Okay, I wrote a lot there
 
  tl;dr: Let's make sure people understand that the Stratosphere paper
 does
  not describe Flink.
 
  On Tue, Jul 7, 2015 at 4:33 PM, Matthias J. Sax 
  mj...@informatik.hu-berlin.de wrote:
 
  I can't follow. Stratosphere is only mentioned in the History part.
 Of
  course, we can strike out Stratosphere II and make clear that Flink
 is
  a fork of Stratosphere. But that is minor.
 
  And adding the Stratosphere papers as a reference, was the requirement
  to get the article accepted in the first place. Thus, it would not make
  sense to remove them. Currently, there are no other reliable and
  notable sources (in terms of Wikipedia guidelines) that can be used as
  references.
 
  Of course, the article is super short and needs to be extended. But the
  project is moving fast and only stable stuff should be on Wikipedia.
  It is hard enough to keep the project web page up to date. ;) Maybe, we
  can discuss what should be included and what not.
 
 
  -Matthias
 
 
 
  On 07/07/2015 04:17 PM, Stephan Ewen wrote:
  Thanks, Matthias, for starting this.
 
  It looks a bit like the article talks more about the Stratosphere
  project
  than Flink right now.
  I think we need to make a few things clear, to not confuse people:
 
  1) Flink != Stratosphere. When looking at the Stratosphere Paper and
  when
  looking at Flink, you look at two completely different things. It is
  not
  just that a name changed.
 
  2) It is not Stratosphere that became an Apache project. The project
  that
  became Apache Flink was forked from a subset of the Stratosphere
  project
  (core runtime only). That subset covered 1.5 out of the 5 areas of
  Stratosphere.
 
  When Flink became an Apache project, the only features it contained
  that
  were developed as part of Stratosphere, were the iterations, and the
  optimizer. Everything else had been re-developed.
 
  3) Flink has nothing to do with Stratosphere II. Even though that
  research
  project talks about streaming, they have a very different angle to
  that.
 
 
  I am writing this, because we are working hard on making it easier
 for
  people to grasp what Flink is, and what it is not.
  A system with so many features and so much different technology is
  bound
  to
  be hard to grasp.
 
  The article gives people sources that seemingly describe what Flink
 is,
  but
  have little to do with what it is.
 
 
  The relationship between Flink and Stratosphere is actually the
  following:
  Flink is based on some research outcomes of Stratosphere (I), and was
  bootstrapped from a fork of the Stratosphere code base of late 2013.
  Ever
  since, these two have been proceeding independently.
 
  I would suggest changing the article to clearly reflect that.
  Otherwise,
  we will not do Flink a favor, but confuse people.
 
  Greetings,
  Stephan
 
 
 
  On Tue, Jul 7, 2015 at 10:40 AM, Henry Saputra 
  henry.sapu...@gmail.com
  wrote:
 
  Nice work indeed!
 
  - Henry
 
  On Tue, Jul 7, 2015 at 1:25 AM, Chiwan Park chiwanp...@apache.org
  wrote:
  Great! Nice start. :)
  The logo is shown now.
 
  Regards,
  Chiwan Park
 
  On Jul 7, 2015, at 5:06 PM, Maximilian Michels m...@apache.org
  wrote:
 
  Cool. Nice work, Matthias, and thanks for starting it off.
 
  On Mon, Jul 6, 2015 at 11:45 PM, Matthias J. Sax 
  mj...@informatik.hu-berlin.de wrote:
 
  Hi squirrels,
 
  I am happy to announce Flink on Wikipedia:
  https://en.wikipedia.org/wiki/Apache_Flink
 
  The Logo Request is still pending, but should be online soon.
 
 
  -Matthias
 
 
 
 
 
 
 
 
 
 
 
 




Re: Read 727 gz files ()

2015-07-07 Thread Felix Neutatz
Yes, that may be the problem. The user max is set to 100,000 open files.

2015-07-06 15:55 GMT+02:00 Stephan Ewen se...@apache.org:

 4 mio file handles should be enough ;-)

 Is that the system global max, or the user's max? If the user's max is
 lower, this may be the issue...

 On Mon, Jul 6, 2015 at 3:50 PM, Felix Neutatz neut...@googlemail.com
 wrote:

  So do you know how to solve this issue apart from increasing the current
  file-max (4748198)?
 
  2015-07-06 15:35 GMT+02:00 Stephan Ewen se...@apache.org:
 
   I think the error is pretty much exactly in the stack trace:
  
   Caused by: java.io.FileNotFoundException:
   /data/4/hadoop/tmp/flink-io-0e2460bf-964b-4883-8eee-12869b9476ab/
   995a38a2c92536383d0057e3482999a9.000329.channel
   (Too many open files in system)
  
  
  
  
   On Mon, Jul 6, 2015 at 3:31 PM, Felix Neutatz neut...@googlemail.com
   wrote:
  
Hi,
   
I want to do some simple aggregations on 727 gz files (68 GB total)
  from
HDFS. See code here:
   
   
   
  
 
 https://github.com/FelixNeutatz/wikiTrends/blob/master/extraction/src/main/scala/io/sanfran/wikiTrends/extraction/flink/Stats.scala
   
We are using a Flink-0.9 SNAPSHOT.
   
I get the following error:
   
Caused by: java.lang.Exception: The data preparation for task 'Reduce(Reduce at org.apache.flink.api.scala.GroupedDataSet.reduce(GroupedDataSet.scala:293))', caused an error: Error obtaining the sorted input: Thread 'SortMerger spilling thread' terminated due to an exception: Channel to path '/data/4/hadoop/tmp/flink-io-0e2460bf-964b-4883-8eee-12869b9476ab/995a38a2c92536383d0057e3482999a9.000329.channel' could not be opened.
    at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:471)
    at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
    at java.lang.Thread.run(Thread.java:853)
Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger spilling thread' terminated due to an exception: Channel to path '/data/4/hadoop/tmp/flink-io-0e2460bf-964b-4883-8eee-12869b9476ab/995a38a2c92536383d0057e3482999a9.000329.channel' could not be opened.
    at org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:607)
    at org.apache.flink.runtime.operators.RegularPactTask.getInput(RegularPactTask.java:1145)
    at org.apache.flink.runtime.operators.ReduceDriver.prepare(ReduceDriver.java:93)
    at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:466)
    ... 3 more
Caused by: java.io.IOException: Thread 'SortMerger spilling thread' terminated due to an exception: Channel to path '/data/4/hadoop/tmp/flink-io-0e2460bf-964b-4883-8eee-12869b9476ab/995a38a2c92536383d0057e3482999a9.000329.channel' could not be opened.
    at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:784)
Caused by: java.io.IOException: Channel to path '/data/4/hadoop/tmp/flink-io-0e2460bf-964b-4883-8eee-12869b9476ab/995a38a2c92536383d0057e3482999a9.000329.channel' could not be opened.
    at org.apache.flink.runtime.io.disk.iomanager.AbstractFileIOChannel.<init>(AbstractFileIOChannel.java:61)
    at org.apache.flink.runtime.io.disk.iomanager.AsynchronousFileIOChannel.<init>(AsynchronousFileIOChannel.java:86)
    at org.apache.flink.runtime.io.disk.iomanager.AsynchronousBlockWriterWithCallback.<init>(AsynchronousBlockWriterWithCallback.java:42)
    at org.apache.flink.runtime.io.disk.iomanager.AsynchronousBlockWriter.<init>(AsynchronousBlockWriter.java:44)
    at org.apache.flink.runtime.io.disk.iomanager.IOManagerAsync.createBlockChannelWriter(IOManagerAsync.java:195)
    at org.apache.flink.runtime.io.disk.iomanager.IOManager.createBlockChannelWriter(IOManager.java:218)
    at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$SpillingThread.go(UnilateralSortMerger.java:1318)
    at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:781)
Caused by: java.io.FileNotFoundException: /data/4/hadoop/tmp/flink-io-0e2460bf-964b-4883-8eee-12869b9476ab/995a38a2c92536383d0057e3482999a9.000329.channel (Too many open files in system)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:252)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:133)
    at ...

Re: Flink on Wikipedia

2015-07-07 Thread Matthias J. Sax
Well. It is not my article. It is on Wikipedia. Anyone can (and
should) improve it!

On 07/07/2015 05:08 PM, Stephan Ewen wrote:
 It was just a suggestion.
 
 @Matthias: You wrote the article, you decide. If you want to keep it,
 that's fine!
 


Building several models in parallel

2015-07-07 Thread Felix Neutatz
Hi,

at the moment I have a dataset which looks like this:

DataSet[model_ID, DataVector] data

So what I want to do is group by the model_ID and build one regression
model for each model_ID.

in pseudo code:
data.groupBy(model_ID)
-- MultipleLinearRegression().fit(data_grouped)

Is there any way, besides an iteration, to do this at the moment?

Thanks for your help,

Felix Neutatz
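As a Flink-free illustration of the computation Felix describes, the sketch below groups records by model id and fits an independent closed-form simple linear regression per group. The class and field names are invented for the example and are not Flink API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Group (modelId, x, y) records by modelId and fit one ordinary
// least-squares line y = a*x + b per group. Local illustration only.
class PerGroupRegression {
    static final class Point {
        final int modelId; final double x; final double y;

        Point(int modelId, double x, double y) {
            this.modelId = modelId; this.x = x; this.y = y;
        }
    }

    // Closed-form least squares over one group's points.
    static double[] fit(List<Point> pts) {
        double n = pts.size(), sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (Point p : pts) {
            sx += p.x; sy += p.y; sxx += p.x * p.x; sxy += p.x * p.y;
        }
        double a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        return new double[]{a, (sy - a * sx) / n}; // {slope, intercept}
    }

    static Map<Integer, double[]> fitPerModel(List<Point> data) {
        Map<Integer, List<Point>> groups = new HashMap<>();
        for (Point p : data) {
            groups.computeIfAbsent(p.modelId, k -> new ArrayList<>()).add(p);
        }
        Map<Integer, double[]> models = new HashMap<>();
        for (Map.Entry<Integer, List<Point>> e : groups.entrySet()) {
            models.put(e.getKey(), fit(e.getValue()));
        }
        return models;
    }
}
```

In Flink itself, something along these lines could presumably run inside a group-wise reduce, whereas the existing MultipleLinearRegression fit is iteration-based, which is exactly why the question arises.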


Re: Redesigned Features page

2015-07-07 Thread Gyula Fóra
I think the content is pretty good, much better than before. But the page
structure could be better (and this is very important in my opinion).
Now it just looks like a long list of features without any ways to navigate
between them. We should probably have something at the top that summarizes
the page.

And maybe at the specific points we should add links to other pages (for
instance internals pages).

Stephan Ewen se...@apache.org ezt írta (időpont: 2015. júl. 7., K, 13:22):

 I actually put quite some thought into the structure of the points. They
 reflect pretty much what I observed (meetups and talks) where people get
 excited and what they are missing.

 The structure follows the line of thought of a stream processor that also
 does batch very well, and then separates the features as they contribute to
 that.

 Intermixing features across their relation to streaming and batch has not
 worked out for my presentations at all. It came across that Flink is a
 system with a multitude of aspects, but no one in the audience could really
 say what Flink is good at in the end, where it stands out.

 The proposed structure makes this very clear:

   - Flink is a kick-ass stream processor (Strongest differentiation
 from all other technology)
   - Flink can also do batch very well  (It is broadly applicable
 and strong)
   - It surfaces nice APIs and rich libraries  (Low entry hurdle)
   - It plays well with existing technology


 Stephan


 On Tue, Jul 7, 2015 at 12:22 PM, Fabian Hueske fhue...@gmail.com wrote:

  +1 for the clear and brief feature descriptions!
 
  I am not so sure about the structure of the points, especially separating
  Streaming and Batch and Streaming in One System does not support the
  message of a unified system, IMO.
 
  How about categorizing the points into three sections (Internals, API,
  Integration) and structuring the points for example as follows:
 
  -- High Performance, Robust Execution, and Fault-tolerance
  High performance
  Continuous Streaming Model with Flow Control
  Fault-tolerance via Lightweight Distributed Snapshots
  Memory Management
  Iterations and Delta Iterations
 
  -- Ease of use, APIs, and Clear Semantics
  One runtime for Streaming and Batch Processing
  Expressive APIs (Batch + Streaming)
  Exactly-once Semantics for Stateful Computations
  Program Optimizer
  Library Ecosystem
 
  -- Integration
  Broad Integration
 
  Cheers, Fabian
 
  2015-07-07 10:39 GMT+02:00 Till Rohrmann trohrm...@apache.org:
 
   I also like the new feature page. It better conveys the strong points of
   Flink, since it's more to the point.
  
   On Mon, Jul 6, 2015 at 6:09 PM, Stephan Ewen se...@apache.org wrote:
  
Thanks Max!
   
Did not even know we had a github mirror of the flink-web repo...
   
On Mon, Jul 6, 2015 at 6:05 PM, Maximilian Michels m...@apache.org
   wrote:
   
 Hi Stephan,

 Thanks for the feature page update. I think it is much more
  informative
and
 better structured now.

 By the way, you could also open a pull request for your changes on
 https://github.com/apache/flink-web/pulls

 Cheers,
 Max


 On Mon, Jul 6, 2015 at 3:28 PM, Fabian Hueske fhue...@gmail.com
   wrote:

  I'll be happy to help, eh draw ;-)
 
  2015-07-06 15:22 GMT+02:00 Stephan Ewen se...@apache.org:
 
   Hi all!
  
   I think that the Features page of the website is a bit out of
   date.
  
   I made an effort to stub a new one. It is committed under 
  features_new.md
   
   and not yet built as an HTML page.
  
   If you want to take a look and help building this, pull the
   flink-web
 git
   repository and build the website locally. You can then look at
 it
 under 
   http://localhost:4000/index_new.html
  
  
   So far, I am happy with the new contents. Three things remain
 to
  be
 done:
  
   1) The page misses a figure on Streaming throughput and
 latency.
  Currently,
   I am using a figure from some preliminary measurements made by
Marton.
   IIRC, Robert is running some performance tests right now and we
  can
use
  his
   results to create a new figure.
  
   2) The figures could use some love. The concept is good, but
 the
style
  is a
   bit sterile. I was hoping that Fabian would help out with his
   amazing
  style
   of drawing figures ;-)
  
   3) More people should have a look and say if we should replace
  the
  current
   Features page with this new one
  
  
   Greetings,
   Stephan
  
 

   
  
 



Re: Redesigned Features page

2015-07-07 Thread Stephan Ewen
I actually put quite some thought into the structure of the points. They
reflect pretty much what I observed (meetups and talks) where people get
excited and what they are missing.

The structure follows the line of thought of a stream processor that also
does batch very well, and then separates the features as they contribute to
that.

Intermixing features across their relation to streaming and batch has not
worked out for my presentations at all. It came across that Flink is a
system with a multitude of aspects, but no one in the audience could really
say what Flink is good at in the end, where it stands out.

The proposed structure makes this very clear:

  - Flink is a kick-ass stream processor (Strongest differentiation
from all other technology)
  - Flink can also do batch very well  (It is broadly applicable
and strong)
  - It surfaces nice APIs and rich libraries  (Low entry hurdle)
  - It plays well with existing technology


Stephan


On Tue, Jul 7, 2015 at 12:22 PM, Fabian Hueske fhue...@gmail.com wrote:

 +1 for the clear and brief feature descriptions!

 I am not so sure about the structure of the points, especially separating
 Streaming and Batch and Streaming in One System does not support the
 message of a unified system, IMO.

 How about categorizing the points into three sections (Internals, API,
 Integration) and structuring the points for example as follows:

 -- High Performance, Robust Execution, and Fault-tolerance
 High performance
 Continuous Streaming Model with Flow Control
 Fault-tolerance via Lightweight Distributed Snapshots
 Memory Management
 Iterations and Delta Iterations

 -- Ease of use, APIs, and Clear Semantics
 One runtime for Streaming and Batch Processing
 Expressive APIs (Batch + Streaming)
 Exactly-once Semantics for Stateful Computations
 Program Optimizer
 Library Ecosystem

 -- Integration
 Broad Integration

 Cheers, Fabian

 2015-07-07 10:39 GMT+02:00 Till Rohrmann trohrm...@apache.org:

  I also like the new feature page. It better conveys the strong points of
  Flink, since it's more to the point.
 
  On Mon, Jul 6, 2015 at 6:09 PM, Stephan Ewen se...@apache.org wrote:
 
   Thanks Max!
  
   Did not even know we had a github mirror of the flink-web repo...
  
   On Mon, Jul 6, 2015 at 6:05 PM, Maximilian Michels m...@apache.org
  wrote:
  
Hi Stephan,
   
Thanks for the feature page update. I think it is much more
 informative
   and
better structured now.
   
By the way, you could also open a pull request for your changes on
https://github.com/apache/flink-web/pulls
   
Cheers,
Max
   
   
On Mon, Jul 6, 2015 at 3:28 PM, Fabian Hueske fhue...@gmail.com
  wrote:
   
 I'll be happy to help, eh draw ;-)

 2015-07-06 15:22 GMT+02:00 Stephan Ewen se...@apache.org:

  Hi all!
 
  I think that the Features page of the website is a bit out of
  date.
 
  I made an effort to stub a new one. It is committed under 
 features_new.md
  
  and not yet built as an HTML page.
 
  If you want to take a look and help building this, pull the
  flink-web
git
  repository and build the website locally. You can then look at it
under 
  http://localhost:4000/index_new.html
 
 
  So far, I am happy with the new contents. Three things remain to
 be
done:
 
  1) The page misses a figure on Streaming throughput and latency.
 Currently,
  I am using a figure from some preliminary measurements made by
   Marton.
  IIRC, Robert is running some performance tests right now and we
 can
   use
 his
  results to create a new figure.
 
  2) The figures could use some love. The concept is good, but the
   style
 is a
  bit sterile. I was hoping that Fabian would help out with his
  amazing
 style
  of drawing figures ;-)
 
  3) More people should have a look and say if we should replace
 the
 current
  Features page with this new one
 
 
  Greetings,
  Stephan
 

   
  
 



Re: Rework of streaming iteration API

2015-07-07 Thread Gyula Fóra
Sorry Stephan, I meant it slightly differently. I see your point:

DataStream source = ...
SingleInputOperator mapper = source.map(...)
mapper.addInput()

So the add input would be a method of the operator not the stream.

Aljoscha Krettek aljos...@apache.org ezt írta (időpont: 2015. júl. 7., K,
16:12):

 I think this would be good yes. I was just about to open an Issue for
 changing the Streaming Iteration API. :D

 Then we should also make the implementation very straightforward and
 simple, right now, the implementation of the iterations is all over the
 place.

 On Tue, 7 Jul 2015 at 15:57 Gyula Fóra gyf...@apache.org wrote:

  Hey,
 
  Along with the suggested changes to the streaming API structure I think
 we
  should also rework the iteration API. Currently the iteration API tries
  to mimic the syntax of the batch API while the runtime behaviour is quite
  different.
 
  What we create instead of iterations is really just cyclic streams (loops
  in the streaming job), so the API should somehow be intuitive about this
  behaviour.
 
  I suggest removing the explicit iterate call and instead adding a method to
  the StreamOperators that allows connecting feedback inputs (creating
 loops).
  It would look like this:
 
  A mapper that does nothing but iterates over some filtered input:
 
  *Current API :*
  DataStream source = ..
  IterativeDataStream it = source.iterate()
  DataStream mapper = it.map(noOpMapper)
  DataStream feedback = mapper.filter(...)
  it.closeWith(feedback)
 
  *Suggested API :*
  DataStream source = ..
  DataStream mapper = source.map(noOpMapper)
  DataStream feedback = mapper.filter(...)
  mapper.addInput(feedback)
 
  The suggested approach would let us define inputs to operators after they
  are created and implicitly union them with the normal input. This is I
  think a much clearer approach than what we have now.
 
  What do you think?
 
  Gyula
 

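The proposed addInput semantics can be illustrated with a toy, single-threaded simulation in which any output that passes the feedback filter is unioned back into the operator's own input. This only sketches the cyclic-stream behaviour; the real job runs unbounded and distributed, and all names here are invented:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.IntPredicate;
import java.util.function.IntUnaryOperator;

// Toy simulation of mapper.addInput(feedback): mapper output that passes
// the feedback predicate re-enters the mapper's input queue, forming a loop
// that drains once nothing qualifies for feedback anymore.
class FeedbackLoopSketch {
    static List<Integer> run(List<Integer> source,
                             IntUnaryOperator mapper,
                             IntPredicate feedback) {
        Deque<Integer> input = new ArrayDeque<>(source);
        List<Integer> emitted = new ArrayList<>();
        while (!input.isEmpty()) {
            int out = mapper.applyAsInt(input.poll());
            emitted.add(out);
            if (feedback.test(out)) input.add(out); // the feedback edge
        }
        return emitted;
    }
}
```

For example, feeding 8 through a halving mapper with a "greater than 1" feedback filter emits 4, 2, 1 and then terminates, matching the noOpMapper/filter example above in spirit.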


Re: Rework of streaming iteration API

2015-07-07 Thread Gyula Fóra
@Kostas:
This new API is, I believe, equivalent in expressivity to the current one.
We can define nested loops now as well.
And I also don't see nested loops much worse generally than simple loops.

Gyula Fóra gyula.f...@gmail.com ezt írta (időpont: 2015. júl. 7., K,
16:14):

 Sorry Stephan I meant it slightly differently, I see your point:

 DataStream source = ...
 SingleInputOperator mapper = source.map(...)
 mapper.addInput()

 So the add input would be a method of the operator not the stream.

 Aljoscha Krettek aljos...@apache.org ezt írta (időpont: 2015. júl. 7.,
 K, 16:12):

 I think this would be good yes. I was just about to open an Issue for
 changing the Streaming Iteration API. :D

 Then we should also make the implementation very straightforward and
 simple, right now, the implementation of the iterations is all over the
 place.




Re: Flink on Wikipedia

2015-07-07 Thread Stephan Ewen
Thanks, Matthias, for starting this.

It looks a bit like the article talks more about the Stratosphere project
than Flink right now.
I think we need to make a few things clear, to not confuse people:

1) Flink != Stratosphere. When looking at the Stratosphere Paper and when
looking at Flink, you look at two completely different things. It is not
just that a name changed.

2) It is not Stratosphere that became an Apache project. The project that
became Apache Flink was forked from a subset of the Stratosphere project
(core runtime only). That subset covered 1.5 out of the 5 areas of
Stratosphere.

When Flink became an Apache project, the only features it contained that
were developed as part of Stratosphere, were the iterations, and the
optimizer. Everything else had been re-developed.

3) Flink has nothing to do with Stratosphere II. Even though that research
project talks about streaming, they have a very different angle to that.


I am writing this because we are working hard on making it easier for
people to grasp what Flink is, and what it is not.
A system with so many features and so much different technology is bound to
be hard to grasp.

The article gives people sources that seemingly describe what Flink is, but
have little to do with what it is.


The relationship between Flink and Stratosphere is actually the following:
Flink is based on some research outcomes of Stratosphere (I), and was
bootstrapped from a fork of the Stratosphere code base of late 2013. Ever
since, these two have been proceeding independently.

I would suggest to change the article to clearly reflect that. Otherwise,
we will not do Flink a favor, but confuse people.

Greetings,
Stephan



On Tue, Jul 7, 2015 at 10:40 AM, Henry Saputra henry.sapu...@gmail.com
wrote:

 Nice work indeed!

 - Henry

 On Tue, Jul 7, 2015 at 1:25 AM, Chiwan Park chiwanp...@apache.org wrote:
  Great! Nice start. :)
  The logo is shown now.
 
  Regards,
  Chiwan Park
 
  On Jul 7, 2015, at 5:06 PM, Maximilian Michels m...@apache.org wrote:
 
  Cool. Nice work, Matthias, and thanks for starting it off.
 
  On Mon, Jul 6, 2015 at 11:45 PM, Matthias J. Sax 
  mj...@informatik.hu-berlin.de wrote:
 
  Hi squirrels,
 
  I am happy to announce Flink on Wikipedia:
  https://en.wikipedia.org/wiki/Apache_Flink
 
  The Logo Request is still pending, but should be online soon.
 
 
  -Matthias
 
 
 
 
 
 



Re: Rework of streaming iteration API

2015-07-07 Thread Gyula Fóra
@Aljoscha:
Yes, that's basically my point as well. This is what happens now too, but we
give this mutable datastream a special name: IterativeDataStream.

This can be handled in very different ways through the API; the goal would
be to make something easy to use. I am fine with what we have now because I
know how it works, but it might confuse people to call it iterate.

Aljoscha Krettek aljos...@apache.org ezt írta (időpont: 2015. júl. 7., K,
16:18):

 I think it could work if we allowed a DataStream to be unioned after
 creation. For example:

 DataStream source = ..
 DataStream mapper = source.map(noOpMapper)
 DataStream feedback = mapper.filter(...)
 source.union(feedback)

 This would basically mean that a DataStream is mutable and can be extended
 after creation with more streams.




Re: Design documents for consolidated DataStream API

2015-07-07 Thread Aljoscha Krettek
Hi,
I just noticed that we don't have anything about how iterations and
timestamps/watermarks should interact.

Cheers,
Aljoscha

On Mon, 6 Jul 2015 at 23:56 Stephan Ewen se...@apache.org wrote:

 Hi all!

 As many of you know, there are ongoing efforts to consolidate the
 streaming API for the next release, and then graduate it (from beta
 status).

 In the process of this consolidation, we want to achieve the following
 goals.

  - Make the code more robust and simplify it in parts

  - Clearly define the semantics of the constructs.

  - Prepare it for support of more advanced concepts, like partitionable
 state, and event time.

  - Cut support for certain corner cases that were prototyped, but turned
 out to be not efficiently doable


 Based on prior discussions on the mailing list, Aljoscha and I drafted the
 design documents below, which outline what the consolidated API would look like.
 We focused on constructs, time, and window semantics.


 Design document on how to restructure the Streaming API:

 https://cwiki.apache.org/confluence/display/FLINK/Streams+and+Operations+on+Streams

 Design document on definitions of time, order, and the resulting semantics:
 https://cwiki.apache.org/confluence/display/FLINK/Time+and+Order+in+Streams



 Note: The design of the interfaces and concepts for advanced state in
 functions is not in here. That is part of a separate design discussion and
 orthogonal to the designs drafted here.


 Please have a look and voice questions and concerns. Since we should not
 break the streaming API more than once, we should make sure this
 consolidation brings it into the shape we want it to be in.


 Greetings,
 Stephan



[Gelly] Help with GSA compiler tests

2015-07-07 Thread Vasiliki Kalavri
Hello to my squirrels,

I've started looking into FLINK-1943
https://issues.apache.org/jira/browse/FLINK-1943 and I need some help to
understand what to test and how to do it properly.

In the corresponding Spargel compiler test, the following functionality is
checked:

1. sink: the ship strategy is FORWARD and the parallelism is correct
2. iteration: degree of parallelism
3. solution set join: parallelism and input1 ship strategy is PARTITION_HASH
4. workset join: parallelism, input1 (edges) ship strategy is
PARTITION_HASH and cached, input2 (workset) ship strategy is FORWARD
5. check that the initial partitioning is pushed out of the loop
6. check that the initial workset sort is outside the loop

I have been able to verify 1-4 of the above for the GSA iteration plan, but
I'm not sure how to check (5) and (6) or whether they are expected to hold
in the GSA case.

In [1] you can see what the GSA iteration operators look like, and in [2]
you can see what the visualizer tool generates for GSA connected
components.

Any pointers would be greatly appreciated!

Cheers,
Vasia.

[1]:
https://docs.google.com/drawings/d/1tiNQeOphWtkNXTGlnDJ3Ipanh0Tm2R8sHe8XNyTnf98/edit?usp=sharing
[2]: http://imgur.com/GQZ48ZI


Rework of streaming iteration API

2015-07-07 Thread Gyula Fóra
Hey,

Along with the suggested changes to the streaming API structure, I think we
should also rework the iteration API. Currently the iteration API tries
to mimic the syntax of the batch API while the runtime behaviour is quite
different.

What we create instead of iterations is really just cyclic streams (loops
in the streaming job), so the API should somehow be intuitive about this
behaviour.

I suggest to remove the explicit iterate call and instead add a method to
the StreamOperators that allows to connect feedback inputs (create loops).
It would look like this:

A mapper that does nothing but iterate over some filtered input:

*Current API :*
DataStream source = ..
IterativeDataStream it = source.iterate()
DataStream mapper = it.map(noOpMapper)
DataStream feedback = mapper.filter(...)
it.closeWith(feedback)

*Suggested API :*
DataStream source = ..
DataStream mapper = source.map(noOpMapper)
DataStream feedback = mapper.filter(...)
mapper.addInput(feedback)

The suggested approach would let us define inputs to operators after they
are created and implicitly union them with the normal input. This is, I
think, a much clearer approach than what we have now.
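The cyclic-stream semantics can be sketched in plain Java, with no Flink dependencies (all names here are illustrative, not the proposed API): elements entering the loop are unioned with the feedback edge and reprocessed until they no longer pass the feedback filter.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class FeedbackLoopSketch {

    // Models mapper.addInput(feedback): the mapper consumes the union of
    // the source and the feedback edge. Elements greater than 3 are halved
    // and fed back; the rest leave the loop.
    static List<Integer> runLoop(List<Integer> source) {
        Deque<Integer> input = new ArrayDeque<>(source); // union of source and feedback
        List<Integer> output = new ArrayList<>();
        while (!input.isEmpty()) {
            int element = input.poll();    // noOpMapper: pass-through
            if (element > 3) {
                input.add(element / 2);    // feedback = mapper.filter(...)
            } else {
                output.add(element);       // exits the cycle
            }
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(runLoop(List.of(5, 1, 8))); // prints [1, 2, 2]
    }
}
```

Note that a real cyclic stream need not terminate at all; this sketch terminates only because the feedback filter guarantees progress toward the exit condition.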

What do you think?

Gyula


Re: Building several models in parallel

2015-07-07 Thread Felix Schüler
Hi Felix!

We had a similar use case: I trained multiple models on partitions of
my data with mapPartition, with the model parameters (weights) as a
broadcast variable. If I understood broadcast variables in Flink
correctly, you should end up with one model on each TaskManager.
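As a plain-Java illustration of the idea (hypothetical names, no Flink or FlinkML calls, and a closed-form least-squares fit rather than FlinkML's iterative MultipleLinearRegression): group records by model id and fit an independent line per group, which is roughly what each partition would compute in a mapPartition-based approach.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerGroupFit {

    /** One (modelId, x, y) training record. */
    static final class Point {
        final int modelId; final double x; final double y;
        Point(int modelId, double x, double y) {
            this.modelId = modelId; this.x = x; this.y = y;
        }
    }

    /**
     * Fits y = slope * x + intercept per model id via ordinary least
     * squares. Returns modelId -> {slope, intercept}.
     */
    static Map<Integer, double[]> fit(List<Point> data) {
        // Per-group running sums: {n, sumX, sumY, sumXY, sumXX}
        Map<Integer, double[]> sums = new HashMap<>();
        for (Point p : data) {
            double[] s = sums.computeIfAbsent(p.modelId, k -> new double[5]);
            s[0] += 1; s[1] += p.x; s[2] += p.y;
            s[3] += p.x * p.y; s[4] += p.x * p.x;
        }
        Map<Integer, double[]> models = new HashMap<>();
        sums.forEach((id, s) -> {
            double slope = (s[0] * s[3] - s[1] * s[2]) / (s[0] * s[4] - s[1] * s[1]);
            double intercept = (s[2] - slope * s[1]) / s[0];
            models.put(id, new double[] {slope, intercept});
        });
        return models;
    }
}
```

The running sums are exactly the kind of per-group state a mapPartition function would accumulate locally before emitting one model per group.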

Does that work?

Felix

Am 07.07.2015 um 17:32 schrieb Felix Neutatz:
 Hi,
 
 at the moment I have a dataset which looks like this:
 
 DataSet[model_ID, DataVector] data
 
 So what I want to do is group by the model_ID and build for each model_ID
 one regression model
 
 in pseudo code:
 data.groupBy(model_ID)
 -- MultipleLinearRegression().fit(data_grouped)
 
 Is there any way, besides an iteration, to do this at the moment?
 
 Thanks for your help,
 
 Felix Neutatz
 


Re: Rework of streaming iteration API

2015-07-07 Thread Gyula Fóra
Okay, I am fine with this approach as well; I see the advantages. Then we
just need to find a suitable name for marking a FeedbackPoint :)

Stephan Ewen se...@apache.org ezt írta (időpont: 2015. júl. 7., K, 16:28):

 In Aljoscha's approach, we would need a special mutable stream. We could do
 it like this:

 DataStream source = ...

 FeedbackPoint pt = source.createFeedbackPoint();

 DataStream mapper = pt.map(noOpMapper)
 DataStream feedback = mapper.filter(...)
 pt.addFeedback(feedback)


 It is basically like the current approach, with different names.

 I actually like the current approach, because it is explicit where streams
 could be altered in hindsight (after their definition).





[jira] [Created] (FLINK-2326) Multitenancy on YARN

2015-07-07 Thread LINTE (JIRA)
LINTE created FLINK-2326:


 Summary: Multitenancy on YARN
 Key: FLINK-2326
 URL: https://issues.apache.org/jira/browse/FLINK-2326
 Project: Flink
  Issue Type: Improvement
  Components: YARN Client
Affects Versions: 0.9
 Environment: Centos 6.6
Hadoop 2.7 secured with kerberos
Flink 0.9
Reporter: LINTE


When a user launches a Flink cluster on YARN, a .yarn-properties file is created
in the conf directory.

In a multiuser environment the configuration directory is read-only.
Even with write permission on the conf directory for all users, only one
user can run a Flink cluster.

Regards






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Flink on Wikipedia

2015-07-07 Thread Stephan Ewen
Okay, I wrote a lot there.

tl;dr: Let's make sure people understand that the Stratosphere paper does
not describe Flink.

On Tue, Jul 7, 2015 at 4:33 PM, Matthias J. Sax 
mj...@informatik.hu-berlin.de wrote:

 I can't follow. Stratosphere is only mentioned in the History part. Of
 course, we can strike out Stratosphere II and make clear that Flink is
 a fork of Stratosphere. But that is minor.

 And adding the Stratosphere papers as a reference, was the requirement
 to get the article accepted in the first place. Thus, it would not make
 sense to remove them. Currently, there are no other reliable and
 notable sources (in terms of Wikipedia guidelines) that can be used as
 references.

 Of course, the article is super short and needs to be extended. But the
 project is moving fast and only stable stuff should be on Wikipedia.
 It is hard enough to keep the project web page up to date. ;) Maybe, we
 can discuss what should be included and what not.


 -Matthias







Re: Flink on Wikipedia

2015-07-07 Thread Kostas Tzoumas
I think it is clear to most people that the only official and (hopefully)
up-to-date description of an Apache project is its Apache website, and any
paper can get outdated. Perhaps we can change the link to a more up-to-date
paper when we have one.

I like the article, thanks Matthias!

Kostas




Re: Rework of streaming iteration API

2015-07-07 Thread Paris Carbone
Good points. If we want to support structured loops on streaming, we will need to inject
iteration counters. The question is whether we really need structured iterations on
plain data streams. Window iterations are a must-have, on the other hand...

Paris

 On 07 Jul 2015, at 16:43, Kostas Tzoumas ktzou...@apache.org wrote:
 
 I see. Perhaps more important IMO is defining the semantics of stream loops
 with event time.
 
 The reason I asked about nested is that Naiad and other designs used a
 multidimensional timestamp to capture loops: (outer loop counter, inner
 loop counter, timestamp). I assume that currently making sense of which
 iteration an element comes from is left to the user. Should we aim to
 change that with the API redesign?
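 The Naiad-style timestamps mentioned above can be sketched as lexicographically
 ordered coordinate vectors (illustrative only, not an existing Flink type):
 comparing the outer loop counter first means progress in an enclosing loop
 dominates any inner counter.

```java
import java.util.Arrays;

public class LoopTimestamp implements Comparable<LoopTimestamp> {

    // e.g. {outerLoopCounter, innerLoopCounter, eventTimestamp}
    final long[] coords;

    LoopTimestamp(long... coords) {
        this.coords = coords;
    }

    // Lexicographic order: the first differing coordinate decides.
    @Override
    public int compareTo(LoopTimestamp other) {
        for (int i = 0; i < coords.length; i++) {
            int c = Long.compare(coords[i], other.coords[i]);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }

    @Override
    public String toString() {
        return Arrays.toString(coords);
    }
}
```

 Under this ordering an element stamped (0, 99, t) still precedes one stamped
 (1, 0, t'), however far its inner loop has progressed, which is what lets a
 system reason about which iteration an element belongs to without user code.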
 
 



Re: Flink on Wikipedia

2015-07-07 Thread Matthias J. Sax
I agree with Kostas and don't see much danger that people get confused.
Nevertheless, I will update the history section accordingly.

