Re: Flink on Wikipedia

2015-07-07 Thread Maximilian Michels
Cool. Nice work, Matthias, and thanks for starting it off. On Mon, Jul 6, 2015 at 11:45 PM, Matthias J. Sax < mj...@informatik.hu-berlin.de> wrote: > Hi squirrels, > > I am happy to announce Flink on Wikipedia: > https://en.wikipedia.org/wiki/Apache_Flink > > The Logo Request is still pending, bu

Re: Flink on Wikipedia

2015-07-07 Thread Chiwan Park
Great! Nice start. :) The logo is shown now. Regards, Chiwan Park > On Jul 7, 2015, at 5:06 PM, Maximilian Michels wrote: > > Cool. Nice work, Matthias, and thanks for starting it off. > > On Mon, Jul 6, 2015 at 11:45 PM, Matthias J. Sax < > mj...@informatik.hu-berlin.de> wrote: > >> Hi squir

Re: [ml] Convergence Criterias

2015-07-07 Thread Till Rohrmann
I think Sachin wants to provide something similar to the LossFunction but for the convergence criterion. This would mean that the user can specify a convergence calculator, for example to the optimization framework, which is used from within an iterateWithTermination call. I think this is a good id
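
[Illustration] The pluggable convergence criterion discussed in this message could look roughly like the following. This is a minimal plain-Java sketch, not Flink ML code: ConvergenceCriterion, RelativeChange and GradientDescentSketch are hypothetical names, and an ordinary loop stands in for the iterateWithTermination call.

```java
// Hypothetical names throughout; this is not the Flink ML API.
interface ConvergenceCriterion {
    // Returns true when the iteration should stop.
    boolean converged(double previousLoss, double currentLoss);
}

// Stops once the relative change in loss drops below a threshold.
class RelativeChange implements ConvergenceCriterion {
    private final double threshold;

    RelativeChange(double threshold) {
        this.threshold = threshold;
    }

    @Override
    public boolean converged(double previousLoss, double currentLoss) {
        if (previousLoss == 0.0) {
            return currentLoss == 0.0;
        }
        return Math.abs(previousLoss - currentLoss) / Math.abs(previousLoss) < threshold;
    }
}

class GradientDescentSketch {
    // Minimizes f(x) = (x - 3)^2 from x = 0 with a fixed step size.
    // Returns the number of iterations executed before the criterion fired
    // (or maxIterations if it never did).
    static int run(ConvergenceCriterion criterion, double step, int maxIterations) {
        double x = 0.0;
        double previousLoss = (x - 3) * (x - 3);
        for (int i = 1; i <= maxIterations; i++) {
            x -= step * 2 * (x - 3);          // gradient of (x - 3)^2 is 2 (x - 3)
            double loss = (x - 3) * (x - 3);
            if (criterion.converged(previousLoss, loss)) {
                return i;
            }
            previousLoss = loss;
        }
        return maxIterations;
    }
}
```

Because the interface has a single method, a caller could also pass a lambda such as (prev, cur) -> cur < 1e-6, so the optimizer stays unaware of which criterion is in use.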

[jira] [Created] (FLINK-2323) Rename OperatorState methods to .value() and .update(..)

2015-07-07 Thread Gyula Fora (JIRA)
Gyula Fora created FLINK-2323: - Summary: Rename OperatorState methods to .value() and .update(..) Key: FLINK-2323 URL: https://issues.apache.org/jira/browse/FLINK-2323 Project: Flink Issue Type:

[jira] [Created] (FLINK-2324) Rework partitioned state storage

2015-07-07 Thread Gyula Fora (JIRA)
Gyula Fora created FLINK-2324: - Summary: Rework partitioned state storage Key: FLINK-2324 URL: https://issues.apache.org/jira/browse/FLINK-2324 Project: Flink Issue Type: Improvement

Re: Redesigned "Features" page

2015-07-07 Thread Till Rohrmann
I also like the new feature page. It better conveys the strong points of Flink, since it's more to the point. On Mon, Jul 6, 2015 at 6:09 PM, Stephan Ewen wrote: > Thanks Max! > > Did not even know we had a github mirror of the flink-web repo... > > On Mon, Jul 6, 2015 at 6:05 PM, Maximilian Mich

Re: Flink on Wikipedia

2015-07-07 Thread Henry Saputra
Nice work indeed! - Henry On Tue, Jul 7, 2015 at 1:25 AM, Chiwan Park wrote: > Great! Nice start. :) > The logo is shown now. > > Regards, > Chiwan Park > >> On Jul 7, 2015, at 5:06 PM, Maximilian Michels wrote: >> >> Cool. Nice work, Matthias, and thanks for starting it off. >> >> On Mon, Jul

Re: [ml] Convergence Criterias

2015-07-07 Thread Sachin Goel
> > Am I correct to assume that by "user" you mean library developers here? > Regular users who just use the API are unlikely to write their own > convergence > criterion function, yes? They would just set a value, for example the > relative > error change in gradient descent, perhaps after choosin

Re: [ml] Convergence Criterias

2015-07-07 Thread Sachin Goel
> > I think Sachin wants to provide something similar to the LossFunction but > for the convergence criterion. This would mean that the user can specify a > convergence calculator, for example to the optimization framework, which is > used from within an iterateWithTermination call > @Till, yes. Th

[jira] [Created] (FLINK-2325) PersistentKafkaSource throws ArrayIndexOutOfBoundsException if reading from a topic that is created after starting the Source

2015-07-07 Thread Rico Bergmann (JIRA)
Rico Bergmann created FLINK-2325: Summary: PersistentKafkaSource throws ArrayIndexOutOfBoundsException if reading from a topic that is created after starting the Source Key: FLINK-2325 URL: https://issues.apache.

Re: Redesigned "Features" page

2015-07-07 Thread Fabian Hueske
+1 for the clear and brief feature descriptions! I am not so sure about the structure of the points, especially separating "Streaming" and "Batch and Streaming in One System" does not support the message of a unified system, IMO. How about categorizing the points into three sections (Internals, A

Re: Redesigned "Features" page

2015-07-07 Thread Stephan Ewen
I actually put quite some thought into the structure of the points. They reflect pretty much what I observed (meetups and talks) where people get excited and what they are missing. The structure follows the line of thought of "stream processor that also does batch very well". And then separate the

Re: Redesigned "Features" page

2015-07-07 Thread Gyula Fóra
I think the content is pretty good, much better than before. But the page structure could be better (and this is very important in my opinion). Now it just looks like a long list of features without any ways to navigate between them. We should probably have something at the top that summarizes the

Re: Redesigned "Features" page

2015-07-07 Thread Stephan Ewen
+1 to adding links In fact, all points should link to some documentation part. On Tue, Jul 7, 2015 at 1:33 PM, Gyula Fóra wrote: > I think the content is pretty good, much better than before. But the page > structure could be better (and this is very important in my opinion). > Now it just lo

Re: Flink 0.9 built with Scala 2.11

2015-07-07 Thread Chiwan Park
I cannot find any discussion of a pure/non-pure Java distinction in the documentation. I defined the artifact ID rules so that the suffix is applied to modules based only on the Scala version, not on whether they are pure Java. Modules without the `_2.11` suffix are linked against the Scala 2.10 binary. If I misunderstood your sente

[Gelly] Help with GSA compiler tests

2015-07-07 Thread Vasiliki Kalavri
Hello to my squirrels, I've started looking into FLINK-1943 and I need some help to understand what to test and how to do it properly. In the corresponding Spargel compiler test, the following functionality is checked: 1. sink: the ship strategy

Re: Design documents for consolidated DataStream API

2015-07-07 Thread Aljoscha Krettek
Hi, I just noticed that we don't have anything about how iterations and timestamps/watermarks should interact. Cheers, Aljoscha On Mon, 6 Jul 2015 at 23:56 Stephan Ewen wrote: > Hi all! > > As many of you know, there are ongoing efforts to consolidate the > streaming API for the next release,

Re: Design documents for consolidated DataStream API

2015-07-07 Thread Gyula Fóra
You are right, that's an important issue. And I think we should also do some renaming with the "iterations" because they are not really iterations like in the batch case and it might confuse some users. Maybe we can call them loops or cycles and rename the API calls to make it more intuitive what ha

Rework of streaming iteration API

2015-07-07 Thread Gyula Fóra
Hey, Along with the suggested changes to the streaming API structure, I think we should also rework the "iteration" API. Currently the iteration API tries to mimic the syntax of the batch API while the runtime behaviour is quite different. What we create instead of iterations is really just cyclic

Re: Rework of streaming iteration API

2015-07-07 Thread Stephan Ewen
I see that the newly proposed API makes some things easier to define. There is one source of confusion, though, in my opinion: The new API suggests that the data stream actually refers to the operator that created it. The "DataStream mapper = source.map(noOpMapper)" here refers to the map operato

Re: Rework of streaming iteration API

2015-07-07 Thread Aljoscha Krettek
I think this would be good, yes. I was just about to open an issue for changing the Streaming Iteration API. :D Then we should also make the implementation very straightforward and simple; right now, the implementation of the iterations is all over the place. On Tue, 7 Jul 2015 at 15:57 Gyula Fóra

Re: Rework of streaming iteration API

2015-07-07 Thread Kostas Tzoumas
+1 for rethinking the iterations in DataStream However, wouldn't this proposal allow the definition of arbitrary loops (e.g., nested loops) that are not well behaved afaik? On Tue, Jul 7, 2015 at 4:12 PM, Stephan Ewen wrote: > I see that the newly proposed API makes some things easier to define

Re: Rework of streaming iteration API

2015-07-07 Thread Gyula Fóra
Sorry Stephan, I meant it slightly differently. I see your point: DataStream source = ... SingleInputOperator mapper = source.map(...) mapper.addInput() So addInput would be a method of the operator, not the stream. Aljoscha Krettek wrote (on Tue, Jul 7, 2015, 16:12): > I think thi

Re: Rework of streaming iteration API

2015-07-07 Thread Gyula Fóra
@Kostas: This new API is, I believe, equivalent in expressivity to the current one. We can define nested loops now as well. And I also don't see nested loops as much worse in general than simple loops. Gyula Fóra wrote (on Tue, Jul 7, 2015, 16:14): > Sorry Stephan I meant it slightly differ

Re: Rework of streaming iteration API

2015-07-07 Thread Aljoscha Krettek
I think it could work if we allowed a DataStream to be unioned after creation. For example: DataStream source = .. DataStream mapper = source.map(noOpMapper) DataStream feedback = mapper.filter(...) source.union(feedback) This would basically mean that a DataStream is mutable and can be extended

Re: Flink on Wikipedia

2015-07-07 Thread Stephan Ewen
Thanks, Matthias, for starting this. It looks a bit like the article talks more about the Stratosphere project than Flink right now. I think we need to make a few things clear, to not confuse people: 1) Flink != Stratosphere. When looking at the Stratosphere Paper and when looking at Flink, you l

Re: Rework of streaming iteration API

2015-07-07 Thread Gyula Fóra
@Aljoscha: Yes, that's basically my point as well. This is what happens now too, but we give this mutable datastream a special name: IterativeDataStream. This can be handled in very different ways through the API; the goal would be to make something easy to use. I am fine with what we have now becau

Re: Rework of streaming iteration API

2015-07-07 Thread Stephan Ewen
In Aljoscha's approach, we would need a special mutable stream. We could do it like this: DataStream source = ... FeedbackPoint pt = source.createFeedbackPoint(); DataStream mapper = pt.map(noOpMapper) DataStream feedback = mapper.filter(...) pt.addFeedback(feedback) It is basically like the
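
[Illustration] The feedback-point idea sketched in this message can be modelled without Flink as a queue that accepts both source records and fed-back records. The following is a toy, single-threaded plain-Java sketch; FeedbackLoopSketch and its parameters are hypothetical names and nothing here is Flink API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Toy model of a cyclic dataflow: hypothetical names, not Flink API.
class FeedbackLoopSketch {
    // Every record taken from the feedback point is mapped and emitted downstream;
    // mapped records that pass feedbackFilter additionally re-enter the loop.
    static List<Integer> run(List<Integer> source,
                             Function<Integer, Integer> mapper,
                             Predicate<Integer> feedbackFilter) {
        // The feedback point: source records enter here, fed-back records join later.
        Deque<Integer> feedbackPoint = new ArrayDeque<>(source);
        List<Integer> mapperOutput = new ArrayList<>();
        while (!feedbackPoint.isEmpty()) {
            int mapped = mapper.apply(feedbackPoint.poll());
            mapperOutput.add(mapped);          // the "mapper" stream sees every record
            if (feedbackFilter.test(mapped)) {
                feedbackPoint.add(mapped);     // the "feedback" stream closes the cycle
            }
        }
        return mapperOutput;
    }
}
```

For example, run(List.of(1, 5), x -> x * 2, x -> x < 10) yields [2, 10, 4, 8, 16]. The loop only ends if the filter eventually rejects every record, which is the termination question a real streaming runtime would also have to answer.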

Re: Rework of streaming iteration API

2015-07-07 Thread Gyula Fóra
Okay, I am fine with this approach as well; I see the advantages. Then we just need to find a suitable name for marking a "FeedbackPoint" :) Stephan Ewen wrote (on Tue, Jul 7, 2015, 16:28): > In Aljoscha's approach, we would need a special mutable stream. We could do > it like this: > >

[jira] [Created] (FLINK-2326) Mutitenancy on Yarn

2015-07-07 Thread LINTE (JIRA)
LINTE created FLINK-2326: Summary: Mutitenancy on Yarn Key: FLINK-2326 URL: https://issues.apache.org/jira/browse/FLINK-2326 Project: Flink Issue Type: Improvement Components: YARN Client

Re: Flink on Wikipedia

2015-07-07 Thread Matthias J. Sax
I can't follow. Stratosphere is only mentioned in the "History" part. Of course, we can strike out "Stratosphere II" and make clear that Flink is a fork of Stratosphere. But that is minor. And adding the Stratosphere papers as a reference was the requirement to get the article accepted in the fir

Re: Rework of streaming iteration API

2015-07-07 Thread Kostas Tzoumas
I see. Perhaps more important IMO is defining the semantics of stream loops with event time. The reason I asked about nested is that Naiad and other designs used a multidimensional timestamp to capture loops: (outer loop counter, inner loop counter, timestamp). I assume that currently making sense

Re: Flink on Wikipedia

2015-07-07 Thread Stephan Ewen
Okay, I wrote a lot there. tl;dr: Let's make sure people understand that the Stratosphere paper does not describe Flink. On Tue, Jul 7, 2015 at 4:33 PM, Matthias J. Sax < mj...@informatik.hu-berlin.de> wrote: > I can't follow. Stratosphere is only mentioned in the "History" part. Of > course, we

Re: Flink on Wikipedia

2015-07-07 Thread Kostas Tzoumas
I think it is clear to most people that the only official and (hopefully) up-to-date description of an Apache project is its Apache website, and any paper can get outdated. Perhaps we can change the link to a more up-to-date paper when we have one. I like the article, thanks Matthias! Kostas On

Re: Rework of streaming iteration API

2015-07-07 Thread Paris Carbone
Good points. If we want structured loops on streaming, we will need to inject iteration counters. The question is whether we really need structured iterations on plain data streams. Window iterations are a must-have, on the other hand... Paris > On 07 Jul 2015, at 16:43, Kostas Tzoumas wrote: > > I

Re: Flink on Wikipedia

2015-07-07 Thread Matthias J. Sax
I agree with Kostas and don't see much danger that people get confused. Nevertheless, I will update the history section accordingly. On 07/07/2015 04:48 PM, Kostas Tzoumas wrote: > I think it is clear to most people that the only official and (hopefully) > up-to-date description of an Apache proj

Re: Flink on Wikipedia

2015-07-07 Thread Stephan Ewen
It was just a suggestion. @Matthias: You wrote the article, you decide. If you want to keep it, that's fine! On Tue, Jul 7, 2015 at 4:57 PM, Matthias J. Sax < mj...@informatik.hu-berlin.de> wrote: > I agree with Kostas and don't see much danger that people get confused. > Nevertheless, I will up

Re: Read 727 gz files ()

2015-07-07 Thread Felix Neutatz
Yes, that's maybe the problem. The user max is set to 100.000 open files. 2015-07-06 15:55 GMT+02:00 Stephan Ewen : > 4 mio file handles should be enough ;-) > > Is that the system global max, or the user's max? If the user's max is > lower, this may be the issue... > > On Mon, Jul 6, 2015 at 3:5

Re: Flink on Wikipedia

2015-07-07 Thread Matthias J. Sax
Well. It is not "my" article. It is on Wikipedia. Anyone can (and should) improve it! On 07/07/2015 05:08 PM, Stephan Ewen wrote: > It was just a suggestion. > > @Matthias: You wrote the article, you decide. If you want to keep it, > that's fine! > > On Tue, Jul 7, 2015 at 4:57 PM, Matthias J. S

Building several models in parallel

2015-07-07 Thread Felix Neutatz
Hi, at the moment I have a dataset which looks like this: DataSet[model_ID, DataVector] data So what I want to do is group by the model_ID and build for each model_ID one regression model in pseudo code: data.groupBy(model_ID) --> MultipleLinearRegression().fit(data_grouped) Is there a
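
[Illustration] The group-then-fit pattern asked about here can be sketched in plain Java as follows. This is only an illustration under stated assumptions: PerModelRegressionSketch and Sample are hypothetical names, a one-feature least-squares fit stands in for MultipleLinearRegression, and an in-memory map stands in for the grouping that a DataSet groupBy followed by a group-wise function would presumably perform.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for "one regression model per model_ID"; not Flink ML code.
class PerModelRegressionSketch {
    // One labelled observation tagged with the model it belongs to.
    static final class Sample {
        final int modelId;
        final double x;
        final double y;

        Sample(int modelId, double x, double y) {
            this.modelId = modelId;
            this.x = x;
            this.y = y;
        }
    }

    // Ordinary least squares for y = slope * x + intercept; returns {slope, intercept}.
    // Assumes at least two samples with distinct x values (otherwise the result is NaN).
    static double[] fit(List<Sample> samples) {
        int n = samples.size();
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (Sample s : samples) {
            sx += s.x;
            sy += s.y;
            sxx += s.x * s.x;
            sxy += s.x * s.y;
        }
        double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double intercept = (sy - slope * sx) / n;
        return new double[] {slope, intercept};
    }

    // Groups the samples by model ID and fits one independent model per group.
    static Map<Integer, double[]> fitPerModel(List<Sample> data) {
        Map<Integer, List<Sample>> groups = new TreeMap<>();
        for (Sample s : data) {
            groups.computeIfAbsent(s.modelId, k -> new ArrayList<>()).add(s);
        }
        Map<Integer, double[]> models = new TreeMap<>();
        for (Map.Entry<Integer, List<Sample>> e : groups.entrySet()) {
            models.put(e.getKey(), fit(e.getValue()));
        }
        return models;
    }
}
```

Each group is fitted independently of the others, which is what makes the per-model_ID work parallelizable in principle.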

Re: Building several models in parallel

2015-07-07 Thread Felix Schüler
Hi Felix! We had a similar use case and I trained multiple models on partitions of my data with mapPartition and the model parameters (weights) as a broadcast variable. If I understood broadcast variables in Flink correctly, you should end up with one model on each TaskManager. Does that work? Feli