Re: [ml] Convergence Criteria
I think Sachin wants to provide something similar to the LossFunction but for the convergence criterion. This would mean that the user can specify a convergence calculator, for example to the optimization framework, which is used from within an iterateWithTermination call. I think this is a good idea and it would be a nice addition to the optimization framework, to begin with. I think that with (data, previousModel, currentModel) one can already model a lot of different convergence criteria. @Sachin, do you want to take the lead? However, I don't think that a convergence criterion makes sense for the Predictor, because there are also algorithms which are non-iterative, e.g. some linear regression implementations. However, those which are iterative can take a convergence calculator as a parameter value. Cheers, Till On Mon, Jul 6, 2015 at 6:17 PM, Theodore Vasiloudis theodoros.vasilou...@gmail.com wrote: The point is to provide the user with the solution before an iteration and Am I correct to assume that by user you mean library developers here? Regular users who just use the API are unlikely to write their own convergence criterion function, yes? They would just set a value, for example the relative error change in gradient descent, perhaps after choosing the criterion from a few available options. We can very well employ the iterateWithTermination semantics even under this by setting the second term in the return value to originalSolution.filter(x => !converged) Yes, we use this approach in the GradientDescent code, where we check for convergence using the relative loss between iterations. So assuming that this is aimed at developers and checking for convergence can be done quite efficiently using the above technique, what extra functionality would these proposed functions provide? I expect any kind of syntactic sugar aimed at developers will still have to use iterateWithTermination underneath. On Mon, Jul 6, 2015 at 4:47 PM, Sachin Goel sachingoel0...@gmail.com wrote: Sure. Usually, the convergence criterion can be user defined. For example, for a linear regression problem, the user might want to run the training until the relative change in squared error falls below a specific threshold, or the weights fail to shift by more than a relative or absolute amount. Similarly, for example, in the k-means problem, we again have several different convergence criteria based on the change in the WCSS value, or the relative change in the centroids. The point is to provide the user with the solution before an iteration and the solution after an iteration, and let them decide whether it's time to stop iterating. We can very well employ the iterateWithTermination semantics even under this by setting the second term in the return value to originalSolution.filter(x => !converged), where converged is determined by the user-defined convergence criteria. Of course, we're free to use our own convergence criteria in case the user doesn't specify any. This achieves the desired effect. This way the user has more fine-grained control over the training phase. Of course, to aid the user in defining their own convergence criteria, we can provide some generic functions in the Predictor itself, for example, to calculate the current value of the objective function. After this, the rest is up to the imagination of the user. Thinking more about this, I'd actually like to drop the idea of providing an iteration state to the user. That only makes it more complicated and further requires the user to know what exactly goes on in the algorithm.
Usually, the before and after solutions should suffice. I got too hung up on my decision tree implementation and wanted to incorporate the convergence criteria used there too. Cheers! Sachin [Written from a mobile device. Might contain some typos or grammatical errors] On Jul 6, 2015 1:31 PM, Theodore Vasiloudis theodoros.vasilou...@gmail.com wrote: Hello Sachin, could you share the motivation behind this? The iterateWithTermination function provides us with a means of checking for convergence during iterations, and checking for convergence depends highly on the algorithm being implemented. It could be the relative change in error, it could depend on the state (error+weights) history, or relative or absolute change in the model etc. Could you provide an example where having this function makes development easier? My concern is that this is a hard problem to generalize properly, given the dependence on the specific algorithm, model, and data. Regards, Theodore On Wed, Jul 1, 2015 at 9:23 PM, Sachin Goel sachingoel0...@gmail.com wrote: Hi all I'm trying to work out a general convergence framework for Machine Learning Algorithms which utilize iterations for optimization. For now, I can think of three kinds of convergence
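To make the pattern above concrete, here is a minimal Scala sketch of a convergence check plugged into iterateWithTermination: the termination criterion DataSet plays the role of originalSolution.filter(x => !converged), so the iteration stops once no element moved by more than a threshold. The one-element toy "model" and the halving map that stands in for a training step are assumptions for illustration only, not FlinkML code.

import org.apache.flink.api.scala._

object ConvergenceSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Toy one-element "model"; a real optimizer would carry a weight vector here.
    val initial: DataSet[Double] = env.fromElements(10.0)

    val result = initial.iterateWithTermination(100) { previous =>
      val current = previous.map(_ / 2.0) // stand-in for one training step

      // Termination criterion: the (previous, current) pairs that still moved by
      // more than epsilon. The iteration stops as soon as this set is empty.
      val notConverged = previous
        .crossWithTiny(current)
        .filter { pair => math.abs(pair._1 - pair._2) > 1e-3 }

      (current, notConverged)
    }

    result.print()
  }
}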
[jira] [Created] (FLINK-2324) Rework partitioned state storage
Gyula Fora created FLINK-2324: - Summary: Rework partitioned state storage Key: FLINK-2324 URL: https://issues.apache.org/jira/browse/FLINK-2324 Project: Flink Issue Type: Improvement Reporter: Gyula Fora Assignee: Gyula Fora Partitioned states are currently stored per-key in statehandles. This is alright for in-memory storage but is very inefficient for HDFS. The logic behind the current mechanism is that this approach provides a way to repartition a state without fetching the data from the external storage and only manipulating handles. We should come up with a solution that can achieve both. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
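To make the handle-based idea concrete, here is a toy Scala illustration (all types are hypothetical, not Flink code) of why per-key handles allow cheap repartitioning: re-sharding only reassigns handle references and never calls retrieve(), so the externally stored bytes are not fetched.

// Hypothetical handle that points to state kept in external storage (e.g. HDFS).
trait StateHandle[T] {
  def retrieve(): T // the only call that actually reads from external storage
}

case class KeyedHandles[K, T](handles: Map[K, StateHandle[T]]) {
  // Repartitioning moves handle references only; no retrieve() calls, hence no I/O.
  def repartition(numPartitions: Int): Map[Int, Map[K, StateHandle[T]]] =
    handles.groupBy { case (key, _) => math.abs(key.hashCode) % numPartitions }
}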
[jira] [Created] (FLINK-2323) Rename OperatorState methods to .value() and .update(..)
Gyula Fora created FLINK-2323: - Summary: Rename OperatorState methods to .value() and .update(..) Key: FLINK-2323 URL: https://issues.apache.org/jira/browse/FLINK-2323 Project: Flink Issue Type: Improvement Reporter: Gyula Fora Assignee: Gyula Fora Priority: Minor We should rename the OperatorState methods getState and updateState to .value() and .update(..) to make them clearer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
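For illustration, a hedged Scala sketch of what the renamed interface and a typical read-modify-update could look like; the real Flink interface, generics, and exception signatures may differ.

trait OperatorState[T] {
  def value(): T                // formerly getState()
  def update(newValue: T): Unit // formerly updateState(..)
}

object OperatorStateRenameSketch {
  // Illustrative read-modify-write with the proposed names.
  def incrementCounter(count: OperatorState[Long]): Unit =
    count.update(count.value() + 1)
}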
Re: Flink on Wikipedia
It was just a suggestion. @Matthias: You wrote the article, you decide. If you want to keep it, that's fine! On Tue, Jul 7, 2015 at 4:57 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: I agree with Kostas and don't see much danger that people get confused. Nevertheless, I will update the history section accordingly. On 07/07/2015 04:48 PM, Kostas Tzoumas wrote: I think it is clear to most people that the only official and (hopefully) up-to-date description of an Apache project is its Apache website, and any paper can get outdated. Perhaps we can change the link to a more up-to-date paper when we have one. I like the article, thanks Matthias! Kostas On Tue, Jul 7, 2015 at 4:43 PM, Stephan Ewen se...@apache.org wrote: Okay, I wrote a lot there tl:dr = Let's make sure people understand that the Stratosphere paper does not describe Flink. On Tue, Jul 7, 2015 at 4:33 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: I can't follow. Stratosphere is only mentioned in the History part. Of course, we can strike out Stratosphere II and make clear that Flink is a fork on Stratosphere. But that is minor. And adding the Stratosphere papers as a reference, was the requirement to get the article accepted in the first place. Thus, it would not make sense to remove them. Currently, there are no other reliable and notable source (in term of Wikipedia guidelines) that can be used as references. Of course, the article is super short and needs to be extended. But the project is moving fast and only stable stuff should be on Wikipedia. It is hard enough to keep the project web page up to data. ;) Maybe, we can discuss what should be included and what not. -Matthias On 07/07/2015 04:17 PM, Stephan Ewen wrote: Thanks, Matthias, for starting this. It looks a bit like the article talks more about the Stratosphere project than Flink right now. I think we need to make a few things clear, to not confuse people: 1) Flink != Stratosphere. When looking at the Stratosphere Paper and when looking at Flink, you look at two completely different things. It is not just that a name changed. 2) It is not Stratosphere that became an Apache project. The project that became Apache Flink was forked from a subset of the Stratosphere project (core runtime only). That subset covered 1.5 out of the 5 areas of Stratosphere. When Flink became an Apache project, the only features it contained that were developed as part of Stratosphere, were the iterations, and the optimizer. Everything else had been re-developed. 3) Flink has nothing to do with Stratosphere II. Even though that research project talks about streaming, they have a very different angle to that. I am writing this, because we are working heavy on making it easier for people to grasp what Flink is, and what it is not. A system with so many features and so much different technology is bound to be hard to grasp. The article gives people sources that seemingly describe what Flink is, but have little to do with what it is. The relationship between Flink and Stratosphere is actually the following: Flink is based on some research outcomes of Stratosphere (I), and was bootstrapped from a fork of the Stratosphere code base of late 2013. Ever since, these two have been proceeding independently. I would suggest to change the article to clearly reflect that. Otherwise, we will not do Flink a favor, but confuse people. Greetings, Stephan On Tue, Jul 7, 2015 at 10:40 AM, Henry Saputra henry.sapu...@gmail.com wrote: Nice work indeed! 
- Henry On Tue, Jul 7, 2015 at 1:25 AM, Chiwan Park chiwanp...@apache.org wrote: Great! Nice start. :) The logo is shown now. Regards, Chiwan Park On Jul 7, 2015, at 5:06 PM, Maximilian Michels m...@apache.org wrote: Cool. Nice work, Matthias, and thanks for starting it off. On Mon, Jul 6, 2015 at 11:45 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: Hi squirrels, I am happy to announce Flink on Wikipedia: https://en.wikipedia.org/wiki/Apache_Flink The Logo Request is still pending, but should be online soon. -Matthias
Re: Read 727 gz files ()
Yes, that's maybe the problem. The user max is set to 100.000 open files. 2015-07-06 15:55 GMT+02:00 Stephan Ewen se...@apache.org: 4 mio file handles should be enough ;-) Is that the system global max, or the user's max? If the user's max is lower, this may be the issue... On Mon, Jul 6, 2015 at 3:50 PM, Felix Neutatz neut...@googlemail.com wrote: So do you know how to solve this issue apart from increasing the current file-max (4748198)? 2015-07-06 15:35 GMT+02:00 Stephan Ewen se...@apache.org: I think the error is pretty much exactly in the stack trace: Caused by: java.io.FileNotFoundException: /data/4/hadoop/tmp/flink-io-0e2460bf-964b-4883-8eee-12869b9476ab/995a38a2c92536383d0057e3482999a9.000329.channel (Too many open files in system) On Mon, Jul 6, 2015 at 3:31 PM, Felix Neutatz neut...@googlemail.com wrote: Hi, I want to do some simple aggregations on 727 gz files (68 GB total) from HDFS. See code here: https://github.com/FelixNeutatz/wikiTrends/blob/master/extraction/src/main/scala/io/sanfran/wikiTrends/extraction/flink/Stats.scala We are using a Flink-0.9 SNAPSHOT. I get the following error: Caused by: java.lang.Exception: The data preparation for task 'Reduce(Reduce at org.apache.flink.api.scala.GroupedDataSet.reduce(GroupedDataSet.scala:293))', caused an error: Error obtaining the sorted input: Thread 'SortMerger spilling thread' terminated due to an exception: Channel to path '/data/4/hadoop/tmp/flink-io-0e2460bf-964b-4883-8eee-12869b9476ab/995a38a2c92536383d0057e3482999a9.000329.channel' could not be opened. at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:471) at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559) at java.lang.Thread.run(Thread.java:853) Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger spilling thread' terminated due to an exception: Channel to path '/data/4/hadoop/tmp/flink-io-0e2460bf-964b-4883-8eee-12869b9476ab/995a38a2c92536383d0057e3482999a9.000329.channel' could not be opened. at org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:607) at org.apache.flink.runtime.operators.RegularPactTask.getInput(RegularPactTask.java:1145) at org.apache.flink.runtime.operators.ReduceDriver.prepare(ReduceDriver.java:93) at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:466) ... 3 more Caused by: java.io.IOException: Thread 'SortMerger spilling thread' terminated due to an exception: Channel to path '/data/4/hadoop/tmp/flink-io-0e2460bf-964b-4883-8eee-12869b9476ab/995a38a2c92536383d0057e3482999a9.000329.channel' could not be opened. at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:784) Caused by: java.io.IOException: Channel to path '/data/4/hadoop/tmp/flink-io-0e2460bf-964b-4883-8eee-12869b9476ab/995a38a2c92536383d0057e3482999a9.000329.channel' could not be opened.
at org.apache.flink.runtime.io.disk.iomanager.AbstractFileIOChannel.init(AbstractFileIOChannel.java:61) at org.apache.flink.runtime.io.disk.iomanager.AsynchronousFileIOChannel.init(AsynchronousFileIOChannel.java:86) at org.apache.flink.runtime.io.disk.iomanager.AsynchronousBlockWriterWithCallback.init(AsynchronousBlockWriterWithCallback.java:42) at org.apache.flink.runtime.io.disk.iomanager.AsynchronousBlockWriter.init(AsynchronousBlockWriter.java:44) at org.apache.flink.runtime.io.disk.iomanager.IOManagerAsync.createBlockChannelWriter(IOManagerAsync.java:195) at org.apache.flink.runtime.io.disk.iomanager.IOManager.createBlockChannelWriter(IOManager.java:218) at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$SpillingThread.go(UnilateralSortMerger.java:1318) at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:781) Caused by: java.io.FileNotFoundException: /data/4/hadoop/tmp/flink-io-0e2460bf-964b-4883-8eee-12869b9476ab/995a38a2c92536383d0057e3482999a9.000329.channel (Too many open files in system) at java.io.RandomAccessFile.init(RandomAccessFile.java:252) at java.io.RandomAccessFile.init(RandomAccessFile.java:133) at
Re: Flink on Wikipedia
Well. It is not my article. It is on Wikipedia. Anyone can (and should) improve it! On 07/07/2015 05:08 PM, Stephan Ewen wrote: It was just a suggestion. @Matthias: You wrote the article, you decide. If you want to keep it, that's fine! On Tue, Jul 7, 2015 at 4:57 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: I agree with Kostas and don't see much danger that people get confused. Nevertheless, I will update the history section accordingly. On 07/07/2015 04:48 PM, Kostas Tzoumas wrote: I think it is clear to most people that the only official and (hopefully) up-to-date description of an Apache project is its Apache website, and any paper can get outdated. Perhaps we can change the link to a more up-to-date paper when we have one. I like the article, thanks Matthias! Kostas On Tue, Jul 7, 2015 at 4:43 PM, Stephan Ewen se...@apache.org wrote: Okay, I wrote a lot there tl:dr = Let's make sure people understand that the Stratosphere paper does not describe Flink. On Tue, Jul 7, 2015 at 4:33 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: I can't follow. Stratosphere is only mentioned in the History part. Of course, we can strike out Stratosphere II and make clear that Flink is a fork on Stratosphere. But that is minor. And adding the Stratosphere papers as a reference, was the requirement to get the article accepted in the first place. Thus, it would not make sense to remove them. Currently, there are no other reliable and notable source (in term of Wikipedia guidelines) that can be used as references. Of course, the article is super short and needs to be extended. But the project is moving fast and only stable stuff should be on Wikipedia. It is hard enough to keep the project web page up to data. ;) Maybe, we can discuss what should be included and what not. -Matthias On 07/07/2015 04:17 PM, Stephan Ewen wrote: Thanks, Matthias, for starting this. It looks a bit like the article talks more about the Stratosphere project than Flink right now. I think we need to make a few things clear, to not confuse people: 1) Flink != Stratosphere. When looking at the Stratosphere Paper and when looking at Flink, you look at two completely different things. It is not just that a name changed. 2) It is not Stratosphere that became an Apache project. The project that became Apache Flink was forked from a subset of the Stratosphere project (core runtime only). That subset covered 1.5 out of the 5 areas of Stratosphere. When Flink became an Apache project, the only features it contained that were developed as part of Stratosphere, were the iterations, and the optimizer. Everything else had been re-developed. 3) Flink has nothing to do with Stratosphere II. Even though that research project talks about streaming, they have a very different angle to that. I am writing this, because we are working heavy on making it easier for people to grasp what Flink is, and what it is not. A system with so many features and so much different technology is bound to be hard to grasp. The article gives people sources that seemingly describe what Flink is, but have little to do with what it is. The relationship between Flink and Stratosphere is actually the following: Flink is based on some research outcomes of Stratosphere (I), and was bootstrapped from a fork of the Stratosphere code base of late 2013. Ever since, these two have been proceeding independently. I would suggest to change the article to clearly reflect that. Otherwise, we will not do Flink a favor, but confuse people. 
Greetings, Stephan On Tue, Jul 7, 2015 at 10:40 AM, Henry Saputra henry.sapu...@gmail.com wrote: Nice work indeed! - Henry On Tue, Jul 7, 2015 at 1:25 AM, Chiwan Park chiwanp...@apache.org wrote: Great! Nice start. :) The logo is shown now. Regards, Chiwan Park On Jul 7, 2015, at 5:06 PM, Maximilian Michels m...@apache.org wrote: Cool. Nice work, Matthias, and thanks for starting it off. On Mon, Jul 6, 2015 at 11:45 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: Hi squirrels, I am happy to announce Flink on Wikipedia: https://en.wikipedia.org/wiki/Apache_Flink The Logo Request is still pending, but should be online soon. -Matthias
Building several models in parallel
Hi, at the moment I have a dataset which looks like this: DataSet[model_ID, DataVector] data So what I want to do is group by the model_ID and build one regression model for each model_ID, in pseudo code: data.groupBy(model_ID) -- MultipleLinearRegression().fit(data_grouped) Is there any way to do this at the moment besides an iteration? Thanks for your help, Felix Neutatz
Re: Redesigned Features page
I think the content is pretty good, much better than before. But the page structure could be better (and this is very important in my opinion). Now it just looks like a long list of features without any ways to navigate between them. We should probably have something at the top that summarizes the page. And maybe at the specific points we should add links to other pages (for instance internals pages). Stephan Ewen se...@apache.org ezt írta (időpont: 2015. júl. 7., K, 13:22): I actually put quite some thought into the structure of the points. They reflect pretty much what I observed (meetups and talks) where people get excited and what they are missing. The structure follows the line of through of stream processor that also does batch very well. And then separate the features as they contribute to that. Intermixing features across their relation to streaming and batch has not worked out for my presentations at all. It came across that Flink is a system with a multitude of aspects, but no one in the audience could really say what Flink is good at in the end, where it stands out. The proposed structure makes this very clear: - Flink is a kick-ass stream processor (Strongest differentiation from all other technology) - Flink can also do batch very well (It is broadly applicable and strong) - It surfaces nice APIs and rich libraries (Low entry hurdle) - It plays well with existing technology Stephan On Tue, Jul 7, 2015 at 12:22 PM, Fabian Hueske fhue...@gmail.com wrote: +1 for the clear and brief feature descriptions! I am not so sure about the structure of the points, especially separating Streaming and Batch and Streaming in One System does not support the message of a unified system, IMO. How about categorizing the points into three sections (Internals, API, Integration) and structuring the points for example as follows: -- High Performance, Robust Execution, and Fault-tolerance High performance Continuous Streaming Model with Flow Control Fault-tolerance via Lightweight Distributed Snapshots Memory Management Iterations and Delta Iterations -- Ease of use, APIs, and Clear Semantics One runtime for Streaming and Batch Processing Expressive APIs (Batch + Streaming) Exactly-once Semantics for Stateful Computations Program Optimizer Library Ecosystem -- Integration Broad Integration Cheers, Fabian 2015-07-07 10:39 GMT+02:00 Till Rohrmann trohrm...@apache.org: I also like the new feature page. I better conveys the strong points of Flink, since it's more to the point. On Mon, Jul 6, 2015 at 6:09 PM, Stephan Ewen se...@apache.org wrote: Thanks Max! Did not even know we had a github mirror of the flink-web repo... On Mon, Jul 6, 2015 at 6:05 PM, Maximilian Michels m...@apache.org wrote: Hi Stephan, Thanks for the feature page update. I think it is much more informative and better structured now. By the way, you could also open a pull request for your changes on https://github.com/apache/flink-web/pulls Cheers, Max On Mon, Jul 6, 2015 at 3:28 PM, Fabian Hueske fhue...@gmail.com wrote: I'll be happy to help, eh draw ;-) 2015-07-06 15:22 GMT+02:00 Stephan Ewen se...@apache.org: Hi all! I think that the Features page of the website is a bit out of date. I made an effort to stub a new one. It is committed under features_new.md and not yet built as an HTML page. If you want to take a look and help building this, pull the flink-web git repository and build the website locally. You can then look at it under http://localhost:4000/index_new.html; So far, I am happy with the new contents. 
Three things remain to be done: 1) The page misses a figure on Streaming throughput and latency. Currently, I am using a figure from some preliminary measurements made by Marton. IIRC, Robert is running some performance tests right now and we can use his results to create a new figure. 2) The figures could use some love. The concept is good, but the style is a bit sterile. I was hoping that Fabian would help out with his amazing style of drawing figures ;-) 3) More people should have a look and say if we should replace the current Features page with this new one Greetings, Stephan
Re: Redesigned Features page
I actually put quite some thought into the structure of the points. They reflect pretty much what I observed (meetups and talks) where people get excited and what they are missing. The structure follows the line of thought of a stream processor that also does batch very well. It then separates the features by how they contribute to that. Intermixing features across their relation to streaming and batch has not worked out for my presentations at all. It came across that Flink is a system with a multitude of aspects, but no one in the audience could really say what Flink is good at in the end, where it stands out. The proposed structure makes this very clear: - Flink is a kick-ass stream processor (Strongest differentiation from all other technology) - Flink can also do batch very well (It is broadly applicable and strong) - It surfaces nice APIs and rich libraries (Low entry hurdle) - It plays well with existing technology Stephan On Tue, Jul 7, 2015 at 12:22 PM, Fabian Hueske fhue...@gmail.com wrote: +1 for the clear and brief feature descriptions! I am not so sure about the structure of the points; especially separating "Streaming" and "Batch and Streaming in One System" does not support the message of a unified system, IMO. How about categorizing the points into three sections (Internals, API, Integration) and structuring the points for example as follows: -- High Performance, Robust Execution, and Fault-tolerance High performance Continuous Streaming Model with Flow Control Fault-tolerance via Lightweight Distributed Snapshots Memory Management Iterations and Delta Iterations -- Ease of use, APIs, and Clear Semantics One runtime for Streaming and Batch Processing Expressive APIs (Batch + Streaming) Exactly-once Semantics for Stateful Computations Program Optimizer Library Ecosystem -- Integration Broad Integration Cheers, Fabian 2015-07-07 10:39 GMT+02:00 Till Rohrmann trohrm...@apache.org: I also like the new feature page. It better conveys the strong points of Flink, since it's more to the point. On Mon, Jul 6, 2015 at 6:09 PM, Stephan Ewen se...@apache.org wrote: Thanks Max! Did not even know we had a github mirror of the flink-web repo... On Mon, Jul 6, 2015 at 6:05 PM, Maximilian Michels m...@apache.org wrote: Hi Stephan, Thanks for the feature page update. I think it is much more informative and better structured now. By the way, you could also open a pull request for your changes on https://github.com/apache/flink-web/pulls Cheers, Max On Mon, Jul 6, 2015 at 3:28 PM, Fabian Hueske fhue...@gmail.com wrote: I'll be happy to help, eh draw ;-) 2015-07-06 15:22 GMT+02:00 Stephan Ewen se...@apache.org: Hi all! I think that the Features page of the website is a bit out of date. I made an effort to stub a new one. It is committed under features_new.md and not yet built as an HTML page. If you want to take a look and help build this, pull the flink-web git repository and build the website locally. You can then look at it under http://localhost:4000/index_new.html So far, I am happy with the new contents.
I was hoping that Fabian would help out with his amazing style of drawing figures ;-) 3) More people should have a look and say if we should replace the current Features page with this new one Greetings, Stephan
Re: Rework of streaming iteration API
Sorry Stephan I meant it slightly differently, I see your point: DataStream source = ... SingleInputOperator mapper = source.map(...) mapper.addInput() So the add input would be a method of the operator not the stream. Aljoscha Krettek aljos...@apache.org ezt írta (időpont: 2015. júl. 7., K, 16:12): I think this would be good yes. I was just about to open an Issue for changing the Streaming Iteration API. :D Then we should also make the implementation very straightforward and simple, right now, the implementation of the iterations is all over the place. On Tue, 7 Jul 2015 at 15:57 Gyula Fóra gyf...@apache.org wrote: Hey, Along with the suggested changes to the streaming API structure I think we should also rework the iteration api. Currently the iteration api tries to mimic the syntax of the batch API while the runtime behaviour is quite different. What we create instead of iterations is really just cyclic streams (loops in the streaming job), so the API should somehow be intuitive about this behaviour. I suggest to remove the explicit iterate call and instead add a method to the StreamOperators that allows to connect feedback inputs (create loops). It would look like this: A mapper that does nothing but iterates over some filtered input: *Current API :* DataStream source = .. IterativeDataStream it = source.iterate() DataStream mapper = it.map(noOpMapper) DataStream feedback = mapper.filter(...) it.closeWith(feedback) *Suggested API :* DataStream source = .. DataStream mapper = source.map(noOpMapper) DataStream feedback = mapper.filter(...) mapper.addInput(feedback) The suggested approach would let us define inputs to operators after they are created and implicitly union them with the normal input. This is I think a much clearer approach than what we have now. What do you think? Gyula
Re: Rework of streaming iteration API
@Kostas: This new API is I believe equivalent in expressivity with the current one. We can define nested loops now as well. And I also don't see nested loops much worse generally than simple loops. Gyula Fóra gyula.f...@gmail.com ezt írta (időpont: 2015. júl. 7., K, 16:14): Sorry Stephan I meant it slightly differently, I see your point: DataStream source = ... SingleInputOperator mapper = source.map(...) mapper.addInput() So the add input would be a method of the operator not the stream. Aljoscha Krettek aljos...@apache.org ezt írta (időpont: 2015. júl. 7., K, 16:12): I think this would be good yes. I was just about to open an Issue for changing the Streaming Iteration API. :D Then we should also make the implementation very straightforward and simple, right now, the implementation of the iterations is all over the place. On Tue, 7 Jul 2015 at 15:57 Gyula Fóra gyf...@apache.org wrote: Hey, Along with the suggested changes to the streaming API structure I think we should also rework the iteration api. Currently the iteration api tries to mimic the syntax of the batch API while the runtime behaviour is quite different. What we create instead of iterations is really just cyclic streams (loops in the streaming job), so the API should somehow be intuitive about this behaviour. I suggest to remove the explicit iterate call and instead add a method to the StreamOperators that allows to connect feedback inputs (create loops). It would look like this: A mapper that does nothing but iterates over some filtered input: *Current API :* DataStream source = .. IterativeDataStream it = source.iterate() DataStream mapper = it.map(noOpMapper) DataStream feedback = mapper.filter(...) it.closeWith(feedback) *Suggested API :* DataStream source = .. DataStream mapper = source.map(noOpMapper) DataStream feedback = mapper.filter(...) mapper.addInput(feedback) The suggested approach would let us define inputs to operators after they are created and implicitly union them with the normal input. This is I think a much clearer approach than what we have now. What do you think? Gyula
Re: Flink on Wikipedia
Thanks, Matthias, for starting this. It looks a bit like the article talks more about the Stratosphere project than Flink right now. I think we need to make a few things clear, to not confuse people: 1) Flink != Stratosphere. When looking at the Stratosphere Paper and when looking at Flink, you look at two completely different things. It is not just that a name changed. 2) It is not Stratosphere that became an Apache project. The project that became Apache Flink was forked from a subset of the Stratosphere project (core runtime only). That subset covered 1.5 out of the 5 areas of Stratosphere. When Flink became an Apache project, the only features it contained that were developed as part of Stratosphere, were the iterations, and the optimizer. Everything else had been re-developed. 3) Flink has nothing to do with Stratosphere II. Even though that research project talks about streaming, they have a very different angle to that. I am writing this, because we are working heavy on making it easier for people to grasp what Flink is, and what it is not. A system with so many features and so much different technology is bound to be hard to grasp. The article gives people sources that seemingly describe what Flink is, but have little to do with what it is. The relationship between Flink and Stratosphere is actually the following: Flink is based on some research outcomes of Stratosphere (I), and was bootstrapped from a fork of the Stratosphere code base of late 2013. Ever since, these two have been proceeding independently. I would suggest to change the article to clearly reflect that. Otherwise, we will not do Flink a favor, but confuse people. Greetings, Stephan On Tue, Jul 7, 2015 at 10:40 AM, Henry Saputra henry.sapu...@gmail.com wrote: Nice work indeed! - Henry On Tue, Jul 7, 2015 at 1:25 AM, Chiwan Park chiwanp...@apache.org wrote: Great! Nice start. :) The logo is shown now. Regards, Chiwan Park On Jul 7, 2015, at 5:06 PM, Maximilian Michels m...@apache.org wrote: Cool. Nice work, Matthias, and thanks for starting it off. On Mon, Jul 6, 2015 at 11:45 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: Hi squirrels, I am happy to announce Flink on Wikipedia: https://en.wikipedia.org/wiki/Apache_Flink The Logo Request is still pending, but should be online soon. -Matthias
Re: Rework of streaming iteration API
@Aljoscha: Yes, thats basically my point as well. This is what happens now too but we give this mutable datastream a special name : IterativeDataStream This can be handled in very different ways through the api, the goal would be to make something easy to use. I am fine with what we have now because I know how it works but it might confuse people to call it iterate. Aljoscha Krettek aljos...@apache.org ezt írta (időpont: 2015. júl. 7., K, 16:18): I think it could work if we allowed a DataStream to be unioned after creation. For example: DataStream source = .. DataStream mapper = source.map(noOpMapper) DataStream feedback = mapper.filter(...) source.union(feedback) This would basically mean that a DataStream is mutable and can be extended after creation with more streams. On Tue, 7 Jul 2015 at 16:12 Aljoscha Krettek aljos...@apache.org wrote: I think this would be good yes. I was just about to open an Issue for changing the Streaming Iteration API. :D Then we should also make the implementation very straightforward and simple, right now, the implementation of the iterations is all over the place. On Tue, 7 Jul 2015 at 15:57 Gyula Fóra gyf...@apache.org wrote: Hey, Along with the suggested changes to the streaming API structure I think we should also rework the iteration api. Currently the iteration api tries to mimic the syntax of the batch API while the runtime behaviour is quite different. What we create instead of iterations is really just cyclic streams (loops in the streaming job), so the API should somehow be intuitive about this behaviour. I suggest to remove the explicit iterate call and instead add a method to the StreamOperators that allows to connect feedback inputs (create loops). It would look like this: A mapper that does nothing but iterates over some filtered input: *Current API :* DataStream source = .. IterativeDataStream it = source.iterate() DataStream mapper = it.map(noOpMapper) DataStream feedback = mapper.filter(...) it.closeWith(feedback) *Suggested API :* DataStream source = .. DataStream mapper = source.map(noOpMapper) DataStream feedback = mapper.filter(...) mapper.addInput(feedback) The suggested approach would let us define inputs to operators after they are created and implicitly union them with the normal input. This is I think a much clearer approach than what we have now. What do you think? Gyula
Re: Design documents for consolidated DataStream API
Hi, I just noticed that we don't have anything about how iterations and timestamps/watermarks should interact. Cheers, Aljoscha On Mon, 6 Jul 2015 at 23:56 Stephan Ewen se...@apache.org wrote: Hi all! As many of you know, there are ongoing efforts to consolidate the streaming API for the next release, and then graduate it (from beta status). In the process of this consolidation, we want to achieve the following goals. - Make the code more robust and simplify it in parts - Clearly define the semantics of the constructs. - Prepare it for support of more advanced concepts, like partitionable state, and event time. - Cut support for certain corner cases that were prototyped, but turned out not to be efficiently doable Based on prior discussions on the mailing list, Aljoscha and I drafted the design documents below, which outline how the consolidated API would look. We focused on constructs, time, and window semantics. Design document on how to restructure the Streaming API: https://cwiki.apache.org/confluence/display/FLINK/Streams+and+Operations+on+Streams Design document on definitions of time, order, and the resulting semantics: https://cwiki.apache.org/confluence/display/FLINK/Time+and+Order+in+Streams Note: The design of the interfaces and concepts for advanced state in functions is not in here. That is part of a separate design discussion and orthogonal to the designs drafted here. Please have a look and voice questions and concerns. Since we should not break the streaming API more than once, we should make sure this consolidation brings it into the shape we want it to be in. Greetings, Stephan
[Gelly] Help with GSA compiler tests
Hello to my squirrels, I've started looking into FLINK-1943 https://issues.apache.org/jira/browse/FLINK-1943 and I need some help to understand what to test and how to do it properly. In the corresponding Spargel compiler test, the following functionality is checked: 1. sink: the ship strategy is FORWARD and the parallelism is correct 2. iteration: degree of parallelism 3. solution set join: parallelism and input1 ship strategy is PARTITION_HASH 4. workset join: parallelism, input1 (edges) ship strategy is PARTITION_HASH and cached, input2 (workset) ship strategy is FORWARD 5. check that the initial partitioning is pushed out of the loop 6. check that the initial workset sort is outside the loop I have been able to verify 1-4 of the above for the GSA iteration plan, but I'm not sure how to check (5) and (6) or whether they are expected to hold in the GSA case. In [1] you can see what the GSA iteration operators look like, and in [2] you can see what the visualizer tool generates for the GSA connected components example. Any pointers would be greatly appreciated! Cheers, Vasia. [1]: https://docs.google.com/drawings/d/1tiNQeOphWtkNXTGlnDJ3Ipanh0Tm2R8sHe8XNyTnf98/edit?usp=sharing [2]: http://imgur.com/GQZ48ZI
Rework of streaming iteration API
Hey, Along with the suggested changes to the streaming API structure I think we should also rework the iteration api. Currently the iteration api tries to mimic the syntax of the batch API while the runtime behaviour is quite different. What we create instead of iterations is really just cyclic streams (loops in the streaming job), so the API should somehow be intuitive about this behaviour. I suggest to remove the explicit iterate call and instead add a method to the StreamOperators that allows to connect feedback inputs (create loops). It would look like this: A mapper that does nothing but iterates over some filtered input: *Current API :* DataStream source = .. IterativeDataStream it = source.iterate() DataStream mapper = it.map(noOpMapper) DataStream feedback = mapper.filter(...) it.closeWith(feedback) *Suggested API :* DataStream source = .. DataStream mapper = source.map(noOpMapper) DataStream feedback = mapper.filter(...) mapper.addInput(feedback) The suggested approach would let us define inputs to operators after they are created and implicitly union them with the normal input. This is I think a much clearer approach than what we have now. What do you think? Gyula
Re: Building several models in parallel
Hi Felix! We had a similar usecase and I trained multiple models on partitions of my data with mapPartition and the model-parameters (weights) as broadcast variable. If I understood broadcast variables in Flink correctly, you should end up with one model on each TaskManager. Does that work? Felix Am 07.07.2015 um 17:32 schrieb Felix Neutatz: Hi, at the moment I have a dataset which looks like this: DataSet[model_ID, DataVector] data So what I want to do is group by the model_ID and build for each model_ID one regression model in pseudo code: data.groupBy(model_ID) -- MultipleLinearRegression().fit(data_grouped) Is there anyway besides an iteration how to do this at the moment? Thanks for your help, Felix Neutatz
Re: Rework of streaming iteration API
Okay, I am fine with this approach as well; I see the advantages. Then we just need to find a suitable name for marking a FeedbackPoint :) Stephan Ewen se...@apache.org wrote (on Tue, 7 Jul 2015, 16:28): In Aljoscha's approach, we would need a special mutable stream. We could do it like this: DataStream source = ... FeedbackPoint pt = source.createFeedbackPoint(); DataStream mapper = pt.map(noOpMapper) DataStream feedback = mapper.filter(...) pt.addFeedback(feedback) It is basically like the current approach, with different names. I actually like the current approach, because it is explicit where streams could be altered in hindsight (after their definition). On Tue, Jul 7, 2015 at 4:20 PM, Gyula Fóra gyula.f...@gmail.com wrote: @Aljoscha: Yes, thats basically my point as well. This is what happens now too but we give this mutable datastream a special name : IterativeDataStream This can be handled in very different ways through the api, the goal would be to make something easy to use. I am fine with what we have now because I know how it works but it might confuse people to call it iterate. Aljoscha Krettek aljos...@apache.org wrote (on Tue, 7 Jul 2015, 16:18): I think it could work if we allowed a DataStream to be unioned after creation. For example: DataStream source = .. DataStream mapper = source.map(noOpMapper) DataStream feedback = mapper.filter(...) source.union(feedback) This would basically mean that a DataStream is mutable and can be extended after creation with more streams. On Tue, 7 Jul 2015 at 16:12 Aljoscha Krettek aljos...@apache.org wrote: I think this would be good yes. I was just about to open an Issue for changing the Streaming Iteration API. :D Then we should also make the implementation very straightforward and simple, right now, the implementation of the iterations is all over the place. On Tue, 7 Jul 2015 at 15:57 Gyula Fóra gyf...@apache.org wrote: Hey, Along with the suggested changes to the streaming API structure I think we should also rework the iteration api. Currently the iteration api tries to mimic the syntax of the batch API while the runtime behaviour is quite different. What we create instead of iterations is really just cyclic streams (loops in the streaming job), so the API should somehow be intuitive about this behaviour. I suggest to remove the explicit iterate call and instead add a method to the StreamOperators that allows to connect feedback inputs (create loops). It would look like this: A mapper that does nothing but iterates over some filtered input: *Current API :* DataStream source = .. IterativeDataStream it = source.iterate() DataStream mapper = it.map(noOpMapper) DataStream feedback = mapper.filter(...) it.closeWith(feedback) *Suggested API :* DataStream source = .. DataStream mapper = source.map(noOpMapper) DataStream feedback = mapper.filter(...) mapper.addInput(feedback) The suggested approach would let us define inputs to operators after they are created and implicitly union them with the normal input. This is I think a much clearer approach than what we have now. What do you think? Gyula
[jira] [Created] (FLINK-2326) Multitenancy on YARN
LINTE created FLINK-2326: Summary: Multitenancy on YARN Key: FLINK-2326 URL: https://issues.apache.org/jira/browse/FLINK-2326 Project: Flink Issue Type: Improvement Components: YARN Client Affects Versions: 0.9 Environment: CentOS 6.6, Hadoop 2.7 secured with Kerberos, Flink 0.9 Reporter: LINTE When a user launches a Flink cluster on YARN, a .yarn-properties file is created in the conf directory. In a multi-user environment the configuration directory is read-only. Even with write permission for all users on the conf directory, just one user can run a Flink cluster. Regards -- This message was sent by Atlassian JIRA (v6.3.4#6332)
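One possible direction, sketched in Scala purely as an assumption about a fix (not Flink's current behaviour): derive a per-user properties file location outside the shared, read-only conf directory so concurrent users do not collide. The file name and location below are illustrative only.

import java.io.File

object YarnPropertiesLocation {
  // Hypothetical helper: one properties file per user under the JVM temp directory.
  def yarnPropertiesFile(): File = {
    val user = System.getProperty("user.name")
    val tmp  = System.getProperty("java.io.tmpdir")
    new File(tmp, s".yarn-properties-$user")
  }

  def main(args: Array[String]): Unit =
    println(yarnPropertiesFile().getAbsolutePath)
}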
Re: Flink on Wikipedia
Okay, I wrote a lot there tl:dr = Let's make sure people understand that the Stratosphere paper does not describe Flink. On Tue, Jul 7, 2015 at 4:33 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: I can't follow. Stratosphere is only mentioned in the History part. Of course, we can strike out Stratosphere II and make clear that Flink is a fork on Stratosphere. But that is minor. And adding the Stratosphere papers as a reference, was the requirement to get the article accepted in the first place. Thus, it would not make sense to remove them. Currently, there are no other reliable and notable source (in term of Wikipedia guidelines) that can be used as references. Of course, the article is super short and needs to be extended. But the project is moving fast and only stable stuff should be on Wikipedia. It is hard enough to keep the project web page up to data. ;) Maybe, we can discuss what should be included and what not. -Matthias On 07/07/2015 04:17 PM, Stephan Ewen wrote: Thanks, Matthias, for starting this. It looks a bit like the article talks more about the Stratosphere project than Flink right now. I think we need to make a few things clear, to not confuse people: 1) Flink != Stratosphere. When looking at the Stratosphere Paper and when looking at Flink, you look at two completely different things. It is not just that a name changed. 2) It is not Stratosphere that became an Apache project. The project that became Apache Flink was forked from a subset of the Stratosphere project (core runtime only). That subset covered 1.5 out of the 5 areas of Stratosphere. When Flink became an Apache project, the only features it contained that were developed as part of Stratosphere, were the iterations, and the optimizer. Everything else had been re-developed. 3) Flink has nothing to do with Stratosphere II. Even though that research project talks about streaming, they have a very different angle to that. I am writing this, because we are working heavy on making it easier for people to grasp what Flink is, and what it is not. A system with so many features and so much different technology is bound to be hard to grasp. The article gives people sources that seemingly describe what Flink is, but have little to do with what it is. The relationship between Flink and Stratosphere is actually the following: Flink is based on some research outcomes of Stratosphere (I), and was bootstrapped from a fork of the Stratosphere code base of late 2013. Ever since, these two have been proceeding independently. I would suggest to change the article to clearly reflect that. Otherwise, we will not do Flink a favor, but confuse people. Greetings, Stephan On Tue, Jul 7, 2015 at 10:40 AM, Henry Saputra henry.sapu...@gmail.com wrote: Nice work indeed! - Henry On Tue, Jul 7, 2015 at 1:25 AM, Chiwan Park chiwanp...@apache.org wrote: Great! Nice start. :) The logo is shown now. Regards, Chiwan Park On Jul 7, 2015, at 5:06 PM, Maximilian Michels m...@apache.org wrote: Cool. Nice work, Matthias, and thanks for starting it off. On Mon, Jul 6, 2015 at 11:45 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: Hi squirrels, I am happy to announce Flink on Wikipedia: https://en.wikipedia.org/wiki/Apache_Flink The Logo Request is still pending, but should be online soon. -Matthias
Re: Flink on Wikipedia
I think it is clear to most people that the only official and (hopefully) up-to-date description of an Apache project is its Apache website, and any paper can get outdated. Perhaps we can change the link to a more up-to-date paper when we have one. I like the article, thanks Matthias! Kostas On Tue, Jul 7, 2015 at 4:43 PM, Stephan Ewen se...@apache.org wrote: Okay, I wrote a lot there tl:dr = Let's make sure people understand that the Stratosphere paper does not describe Flink. On Tue, Jul 7, 2015 at 4:33 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: I can't follow. Stratosphere is only mentioned in the History part. Of course, we can strike out Stratosphere II and make clear that Flink is a fork on Stratosphere. But that is minor. And adding the Stratosphere papers as a reference, was the requirement to get the article accepted in the first place. Thus, it would not make sense to remove them. Currently, there are no other reliable and notable source (in term of Wikipedia guidelines) that can be used as references. Of course, the article is super short and needs to be extended. But the project is moving fast and only stable stuff should be on Wikipedia. It is hard enough to keep the project web page up to data. ;) Maybe, we can discuss what should be included and what not. -Matthias On 07/07/2015 04:17 PM, Stephan Ewen wrote: Thanks, Matthias, for starting this. It looks a bit like the article talks more about the Stratosphere project than Flink right now. I think we need to make a few things clear, to not confuse people: 1) Flink != Stratosphere. When looking at the Stratosphere Paper and when looking at Flink, you look at two completely different things. It is not just that a name changed. 2) It is not Stratosphere that became an Apache project. The project that became Apache Flink was forked from a subset of the Stratosphere project (core runtime only). That subset covered 1.5 out of the 5 areas of Stratosphere. When Flink became an Apache project, the only features it contained that were developed as part of Stratosphere, were the iterations, and the optimizer. Everything else had been re-developed. 3) Flink has nothing to do with Stratosphere II. Even though that research project talks about streaming, they have a very different angle to that. I am writing this, because we are working heavy on making it easier for people to grasp what Flink is, and what it is not. A system with so many features and so much different technology is bound to be hard to grasp. The article gives people sources that seemingly describe what Flink is, but have little to do with what it is. The relationship between Flink and Stratosphere is actually the following: Flink is based on some research outcomes of Stratosphere (I), and was bootstrapped from a fork of the Stratosphere code base of late 2013. Ever since, these two have been proceeding independently. I would suggest to change the article to clearly reflect that. Otherwise, we will not do Flink a favor, but confuse people. Greetings, Stephan On Tue, Jul 7, 2015 at 10:40 AM, Henry Saputra henry.sapu...@gmail.com wrote: Nice work indeed! - Henry On Tue, Jul 7, 2015 at 1:25 AM, Chiwan Park chiwanp...@apache.org wrote: Great! Nice start. :) The logo is shown now. Regards, Chiwan Park On Jul 7, 2015, at 5:06 PM, Maximilian Michels m...@apache.org wrote: Cool. Nice work, Matthias, and thanks for starting it off. On Mon, Jul 6, 2015 at 11:45 PM, Matthias J. 
Sax mj...@informatik.hu-berlin.de wrote: Hi squirrels, I am happy to announce Flink on Wikipedia: https://en.wikipedia.org/wiki/Apache_Flink The Logo Request is still pending, but should be online soon. -Matthias
Re: Rework of streaming iteration API
Good points. If we want to structured loops on streaming we will need to inject iteration counters. The question is if we really need structured iterations on plain data streams. Window iterations are must-have on the other hand... Paris On 07 Jul 2015, at 16:43, Kostas Tzoumas ktzou...@apache.org wrote: I see. Perhaps more important IMO is defining the semantics of stream loops with event time. The reason I asked about nested is that Naiad and other designs used a multidimensional timestamp to capture loops: (outer loop counter, inner loop counter, timestamp). I assume that currently making sense of which iteration an element comes from is left to the user. Should we aim to change that with the API redesign? On Tue, Jul 7, 2015 at 4:30 PM, Gyula Fóra gyula.f...@gmail.com wrote: Okay, I am fine with this approach as well I see the advantages. Then we just need to find a suitable name for marking a FeedbackPoint :) Stephan Ewen se...@apache.org ezt írta (időpont: 2015. júl. 7., K, 16:28): In Aljoscha's approach, we would need a special mutable stream. We could do it like this: DataStream source = ... FeedbackPoint pt = source.createFeedbackPoint(); DataStream mapper = pt .map(noOpMapper) DataStream feedback = mapper.filter(...) pt .addFeedbacl(feedback) It is basically like the current approach, with different names. I actually like the current approach, because it is explicit where streams could be altered in hind-sight (after their definition). On Tue, Jul 7, 2015 at 4:20 PM, Gyula Fóra gyula.f...@gmail.com wrote: @Aljoscha: Yes, thats basically my point as well. This is what happens now too but we give this mutable datastream a special name : IterativeDataStream This can be handled in very different ways through the api, the goal would be to make something easy to use. I am fine with what we have now because I know how it works but it might confuse people to call it iterate. Aljoscha Krettek aljos...@apache.org ezt írta (időpont: 2015. júl. 7., K, 16:18): I think it could work if we allowed a DataStream to be unioned after creation. For example: DataStream source = .. DataStream mapper = source.map(noOpMapper) DataStream feedback = mapper.filter(...) source.union(feedback) This would basically mean that a DataStream is mutable and can be extended after creation with more streams. On Tue, 7 Jul 2015 at 16:12 Aljoscha Krettek aljos...@apache.org wrote: I think this would be good yes. I was just about to open an Issue for changing the Streaming Iteration API. :D Then we should also make the implementation very straightforward and simple, right now, the implementation of the iterations is all over the place. On Tue, 7 Jul 2015 at 15:57 Gyula Fóra gyf...@apache.org wrote: Hey, Along with the suggested changes to the streaming API structure I think we should also rework the iteration api. Currently the iteration api tries to mimic the syntax of the batch API while the runtime behaviour is quite different. What we create instead of iterations is really just cyclic streams (loops in the streaming job), so the API should somehow be intuitive about this behaviour. I suggest to remove the explicit iterate call and instead add a method to the StreamOperators that allows to connect feedback inputs (create loops). It would look like this: A mapper that does nothing but iterates over some filtered input: *Current API :* DataStream source = .. IterativeDataStream it = source.iterate() DataStream mapper = it.map(noOpMapper) DataStream feedback = mapper.filter(...) 
it.closeWith(feedback) *Suggested API :* DataStream source = .. DataStream mapper = source.map(noOpMapper) DataStream feedback = mapper.filter(...) mapper.addInput(feedback) The suggested approach would let us define inputs to operators after they are created and implicitly union them with the normal input. This is I think a much clearer approach than what we have now. What do you think? Gyula
Re: Flink on Wikipedia
I agree with Kostas and don't see much danger that people get confused. Nevertheless, I will update the history section accordingly. On 07/07/2015 04:48 PM, Kostas Tzoumas wrote: I think it is clear to most people that the only official and (hopefully) up-to-date description of an Apache project is its Apache website, and any paper can get outdated. Perhaps we can change the link to a more up-to-date paper when we have one. I like the article, thanks Matthias! Kostas On Tue, Jul 7, 2015 at 4:43 PM, Stephan Ewen se...@apache.org wrote: Okay, I wrote a lot there tl:dr = Let's make sure people understand that the Stratosphere paper does not describe Flink. On Tue, Jul 7, 2015 at 4:33 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: I can't follow. Stratosphere is only mentioned in the History part. Of course, we can strike out Stratosphere II and make clear that Flink is a fork on Stratosphere. But that is minor. And adding the Stratosphere papers as a reference, was the requirement to get the article accepted in the first place. Thus, it would not make sense to remove them. Currently, there are no other reliable and notable source (in term of Wikipedia guidelines) that can be used as references. Of course, the article is super short and needs to be extended. But the project is moving fast and only stable stuff should be on Wikipedia. It is hard enough to keep the project web page up to data. ;) Maybe, we can discuss what should be included and what not. -Matthias On 07/07/2015 04:17 PM, Stephan Ewen wrote: Thanks, Matthias, for starting this. It looks a bit like the article talks more about the Stratosphere project than Flink right now. I think we need to make a few things clear, to not confuse people: 1) Flink != Stratosphere. When looking at the Stratosphere Paper and when looking at Flink, you look at two completely different things. It is not just that a name changed. 2) It is not Stratosphere that became an Apache project. The project that became Apache Flink was forked from a subset of the Stratosphere project (core runtime only). That subset covered 1.5 out of the 5 areas of Stratosphere. When Flink became an Apache project, the only features it contained that were developed as part of Stratosphere, were the iterations, and the optimizer. Everything else had been re-developed. 3) Flink has nothing to do with Stratosphere II. Even though that research project talks about streaming, they have a very different angle to that. I am writing this, because we are working heavy on making it easier for people to grasp what Flink is, and what it is not. A system with so many features and so much different technology is bound to be hard to grasp. The article gives people sources that seemingly describe what Flink is, but have little to do with what it is. The relationship between Flink and Stratosphere is actually the following: Flink is based on some research outcomes of Stratosphere (I), and was bootstrapped from a fork of the Stratosphere code base of late 2013. Ever since, these two have been proceeding independently. I would suggest to change the article to clearly reflect that. Otherwise, we will not do Flink a favor, but confuse people. Greetings, Stephan On Tue, Jul 7, 2015 at 10:40 AM, Henry Saputra henry.sapu...@gmail.com wrote: Nice work indeed! - Henry On Tue, Jul 7, 2015 at 1:25 AM, Chiwan Park chiwanp...@apache.org wrote: Great! Nice start. :) The logo is shown now. Regards, Chiwan Park On Jul 7, 2015, at 5:06 PM, Maximilian Michels m...@apache.org wrote: Cool. 
Nice work, Matthias, and thanks for starting it off. On Mon, Jul 6, 2015 at 11:45 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote: Hi squirrels, I am happy to announce Flink on Wikipedia: https://en.wikipedia.org/wiki/Apache_Flink The Logo Request is still pending, but should be online soon. -Matthias