Re: [GSoC][flink-streaming] Interested in pursuing FLINK-1617 and FLINK-1534
Hey Dixit,

Sorry for the delay, I had to discuss this in more detail with some of our other core developers. The consensus seems to be that we would like to push this project in a direction where the changes can be quickly included in the next releases. For this it is essential that we implement features that are complete (and clean) from the user's perspective. This does not necessarily mean that we would like to have everything at once, but rather that it is preferable to start with something clean and simple (for instance the naive chained-filter approach) and progressively build more complex logic. This also means that we would like to keep researchy code out of the codebase as much as possible. Of course, once we have a stable API for this functionality, we can work towards the optimizations that you have mentioned, like operator sharing and so on.

The ideal proposal would give a clear sketch of the pattern matching API that you would like to implement, which might be some operators added to the current API at first, and possibly a DSL with more advanced functionality later (this would probably go in a separate library until it is very stable). So please include in the proposal a preview of what the pattern matching syntax would look like integrated with the current operators, how it would interact with other parts of the system, etc. These are the things we need to figure out before we consider the optimizations, I think, because it usually turns out that the API semantics you would like to provide can hugely affect (and probably limit) the possibilities you have afterwards in terms of optimizations.

Let me know if you have further questions regarding this :)

Gyula
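To illustrate the kind of syntax preview Gyula is asking for, here is a purely hypothetical sketch of a pattern operator layered on the existing DataStream type. None of these names (matchPattern, Pattern, Time, Event, Alert, EventSource) exist in Flink; they are invented for this example only.

    // Hypothetical sketch only: matchPattern, Pattern, Time, Event, Alert and
    // EventSource are invented names, not existing Flink API. The idea is a
    // fluent pattern definition that compiles down to chained operators.
    DataStream<Event> events = env.addSource(new EventSource());

    // "Two login failures within 10 seconds", with an expiration timeout.
    DataStream<Alert> alerts = events
        .matchPattern(
            Pattern.begin("first", e -> e.isLoginFailure())
                   .followedBy("second", e -> e.isLoginFailure())
                   .within(Time.seconds(10)))
        .select(match -> new Alert(match.get("first"), match.get("second")));

Starting from something like this keeps the naive chained-filter implementation viable underneath, while leaving room to swap in an NFA-based evaluator later without touching user code.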
Re: [GSoC][flink-streaming] Interested in pursuing FLINK-1617 and FLINK-1534
Hi,

It'd really help if I got a reply soon. It'll be helpful in writing the proposal since the deadline is on the 27th.

Thanks
Regards,
Akshay Dixit

On Sun, Mar 22, 2015 at 1:17 AM, Akshay Dixit akshayd...@gmail.com wrote:

Thanks for the explanation, Marton. I've decided to try out for FLINK-1534. After reading through the thesis [4] and a few other papers [1][2][3], I believe I've gathered a little context to ask more questions, but I'm still not sure how Flink's internals work, so please bear with me. That said, the ongoing effort to document the architecture and internals is really helpful for newbies like me and should greatly decrease the ramp-up time.

Detecting a pattern of events would comprise a pipeline that accepts the pattern query and source DataStreams, and outputs detected matches of that pattern to a sink or forwards them along to another stream for further computation. As you said, a simple filter-join-aggregate query system could be developed using the existing streaming windowing API. But matching over complex events and decoding their pattern queries would require implementing a DSL that transforms queries into an evaluation model. For example, in [1] the authors implement an NFA with a shared versioned buffer that models the queries. In [4], the authors propose a new language that is much more expressive and compiles into a topology graph for Storm. So in Flink's case, I believe the proposed DSL would generate operator graphs for the Flink compiler to schedule as JobGraphs over TaskManagers. If we don't depend on the windowing API, would we need to create new operators such as the Projection, Conjunction and Union operators defined in [4]?

Also, I would like to hear your thoughts on how to approach scaling a pattern matching query. Note that all these techniques talk about scaling a single query. I've read about various approaches:

1. Merging equivalent runs [1]: This seems a good way to squash multiple instances of pattern matching forks into a single one if they have the same state. But I'm not sure how we would implement this in Flink, since this is a runtime optimization.
2. Implementing a shared versioned match buffer [1]: This would involve sharing the state of a buffer data structure across multiple candidate match instances for the pattern.
3. Splitting complex composite patterns into simpler sub-patterns [4] and executing separate queries to detect those sub-patterns. This might translate into different tasks and duplicating the source DataStreams to all the newly generated tasks.

Also, since I don't know how the Flink compiler behaves, would some of the optimizations involve making changes to it too?

Regards,
Akshay Dixit

[1] Efficient Pattern Matching over Event Streams: http://people.cs.umass.edu/~yanlei/publications/sase-sigmod08-long.pdf
[2] On Supporting Kleene Closure over Event Streams: http://people.cs.umass.edu/~yanlei/publications/sase-icde08.pdf
[3] Processing Flows of Information: From Data Stream to Complex Event Processing: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.396.1785&rep=rep1&type=pdf
[4] Distributing Complex Event Detection: http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf

On Mon, Mar 16, 2015 at 3:22 PM, Márton Balassi balassi.mar...@gmail.com wrote:

Dear Akshay,

Thanks again for your interest and for the recent contribution to streaming. Both of the projects mentioned would be largely appreciated by the community, and you can also propose other project suggestions here for discussion.
Regarding FLINK-1534, the thesis I mentioned serves as a starting point, and indeed the basic solution can be implemented with filtering and windowing/mapping, with some state storing whether the cause of an event has already been seen. Relying solely on the existing windowing API, however, might cause performance issues if the events also have an expiration timeout; some optimization there would be included. The further challenge is to exploit Flink's parallel job execution to scale a pattern matching query.

Best,
Marton

On Sun, Mar 15, 2015 at 3:22 PM, Akshay Dixit akshayd...@gmail.com wrote:

Hi,

I'm Akshay Dixit[1], a 4th year undergrad at VIT Vellore, India. I'm currently interested in distributed systems and stream processing, am looking to delve deeper into the subject, and hope to get some insight by contributing to Apache Flink. I've gathered some idea of the flink-streaming codebase by recently working on a PR for FLINK-1450[2]. Both FLINK-1617[3] and FLINK-1534[4] are interesting projects that I would love to work on over the summer. I was wondering which amongst these would be more appreciated by the community, so I can start working towards a proposal for either one. Regarding FLINK-1534, I was wondering why simply merging and filtering the existing
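As a companion to Marton's description above, here is a minimal sketch of the basic solution: a stateful flatMap that records whether the cause event has been seen and emits a match when the effect arrives. The Event type and its accessors are assumptions for illustration; partitioning by key and expiring stale state are deliberately ignored.

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.util.Collector;

    // Minimal sketch: local operator state remembers whether the "cause"
    // event (A) was seen; a following "effect" event (B) produces a match.
    // The Event type is hypothetical; real code would also partition the
    // stream by key and expire state when events time out.
    public class CausePrecedesEffect extends RichFlatMapFunction<Event, String> {

        private boolean causeSeen = false; // state: has A been seen already?
        private long causeTimestamp;

        @Override
        public void flatMap(Event e, Collector<String> out) {
            if (e.isCause()) {
                causeSeen = true;
                causeTimestamp = e.getTimestamp();
            } else if (e.isEffect() && causeSeen) {
                out.collect("match: cause@" + causeTimestamp
                        + " -> effect@" + e.getTimestamp());
                causeSeen = false; // reset for the next occurrence
            }
        }
    }

This also shows where the performance concern about expiration timeouts presumably arises: a pure windowing solution would have to buffer and rescan events, whereas explicit state only needs a flag and a timestamp.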
Affinity propagation for Gelly
Hello everyone,

I am working on an affinity propagation implementation for Gelly (FLINK-1707, https://issues.apache.org/jira/browse/FLINK-1707). The algorithm passes messages between every pair of vertices (NOT just every pair of connected vertices) in each iteration, with computational complexity O(N² * Iter), and it also has a memory complexity of O(N²). So I believe the algorithm will suffer from large communication cost, no matter how we distribute the graph across different machines. The simple fact is that the algorithm passes two kinds of messages over the complete graph, including vertex pairs that are not connected in the input. I see some similar discussion in Spark: https://issues.apache.org/jira/browse/SPARK-5832

I found an adaptive implementation on Hadoop that runs affinity propagation on subgraphs and then merges these clusters into larger ones: http://www.ijeee.net/uploadfile/2014/0807/20140807114023665.pdf. It is not equivalent to the original algorithm. So, does anyone know another distributed version, or have any suggestions?

ZHOU Yi
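For reference, the quoted complexity follows directly from the standard update rules of Frey and Dueck's affinity propagation (Science, 2007), which recompute two dense N x N matrices, the responsibilities r and the availabilities a, from the similarity matrix s in every iteration:

    % Affinity propagation message updates (Frey & Dueck, 2007)
    r(i,k) \leftarrow s(i,k) - \max_{k' \neq k} \left\{ a(i,k') + s(i,k') \right\}
    a(i,k) \leftarrow \min\left\{ 0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\{0, r(i',k)\} \right\} \quad (i \neq k)
    a(k,k) \leftarrow \sum_{i' \neq k} \max\{0, r(i',k)\}

Every entry of both matrices is updated once per iteration, which gives the O(N² * Iter) time and O(N²) memory mentioned above, regardless of how the vertices are partitioned.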
Re: [GSoC][flink-streaming] Interested in pursuing FLINK-1617 and FLINK-1534
Hey,

Give me an hour or so as I am in a meeting currently, but I will get back to you afterwards.

Regards,
Gyula
Re: Travis-CI builds queuing up
If we cannot get more capacity, we could set up ASF Jenkins instead. It is already used to power CI for many ASF projects, like Hadoop, so it should not be too shabby. I have created a ticket for Flink to set up ASF Jenkins but have not found time to work on it.

- Henry
[jira] [Created] (FLINK-1777) Update Java 8 Lambdas with Eclipse documentation
Timo Walther created FLINK-1777:
---
Summary: Update Java 8 Lambdas with Eclipse documentation
Key: FLINK-1777
URL: https://issues.apache.org/jira/browse/FLINK-1777
Project: Flink
Issue Type: Bug
Components: Documentation
Reporter: Timo Walther
Assignee: Timo Walther
Priority: Minor

The Eclipse JDT compiler team has introduced a compiler flag for us, which is not covered in the Flink documentation yet.
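For anyone hitting this before the documentation is updated: the flag in question is, as far as I can tell, the JDT option that makes the Eclipse compiler emit generic type signatures for lambdas, which Flink's type extraction relies on. The exact key below is an assumption to be verified against the updated docs; it would go into the project's .settings/org.eclipse.jdt.core.prefs file:

    # Assumed JDT setting (verify against the updated Flink docs):
    org.eclipse.jdt.core.compiler.codegen.lambda.genericSignature=generate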
Travis-CI builds queuing up
Hi guys,

the build queue on Travis is getting very, very long. It seems that it now takes 4 days until commits to master are built. The nightly builds from the website and the Maven snapshots are also delayed by that. Right now, there are 33 pull request builds scheduled (https://travis-ci.org/apache/flink/pull_requests) and 8 builds on master: https://travis-ci.org/apache/flink/builds.

The problem is that Travis accounts are per GitHub user. In our case, the user is apache, so all ASF projects that have Travis enabled share 5 concurrent builders.

I would actually like to continue using Travis. The easiest option is probably asking Travis if they can give the apache user more build capacity. If that's not possible, we have to look into other options. I'm going to ask Travis if they can do anything about it.

Robert
Re: Travis-CI builds queuing up
Let's see what Travis replies to Robert, but in general I agree with Max. Travis helped a lot to discover certain race conditions in the last weeks... I would like to not ditch it completely, as Max suggested.

On 24 Mar 2015, at 16:03, Maximilian Michels m...@apache.org wrote:

I would also like to continue using Travis, but the current situation is not acceptable, because we practically can't use Travis anymore for pull requests or the current master. If it cannot be resolved, then I think we should move on. The builds service team [1] at Apache offers Jenkins [2] for continuous integration. I think it should be fairly simple to set up. We could still use Travis in our forked repositories but have a reliable CI solution for master and pull requests.
ApacheCon 2015 is coming to Austin, Texas, USA
Dear Apache Flink enthusiast,

In just a few weeks, we'll be holding ApacheCon in Austin, Texas, and we'd love to have you in attendance. You can save $300 on admission by registering NOW, since the early bird price ends on the 21st. Register at http://s.apache.org/acna2015-reg

ApacheCon this year celebrates the 20th birthday of the Apache HTTP Server, and we'll have Brian Behlendorf, who started this whole thing, keynoting for us, and you'll have a chance to meet some of the original Apache Group, who will be there to celebrate with us.

We've got 7 tracks of great talks, as well as BOFs, the Apache BarCamp, project-specific hack events, and evening events where you can deepen your connection with the larger Apache community. See the full schedule at http://apacheconna2015.sched.org/

And if you have any questions, comments, or just want to hang out with us before and during the event, follow us on Twitter - @apachecon - or drop by #apachecon on the Freenode IRC network.

Hope to see you in Austin!

- Henry
Re: Question about Flink Streaming
Hi Gyula,

thanks a lot. I still don't understand why setup() and open() cannot be unified? I also don't know what the difference between RuntimeContext and StreamTaskContext is (or, to be more precise, why not use a single context class that unifies both)?

About the renaming of the Timestamp class: Marton told me that he thinks the name is fine and should not be changed. Before opening a JIRA for it, you two should get in sync and decide what to do.

-Matthias
[jira] [Created] (FLINK-1780) Rename FlatCombineFunction to GroupCombineFunction
Fabian Hueske created FLINK-1780:
---
Summary: Rename FlatCombineFunction to GroupCombineFunction
Key: FLINK-1780
URL: https://issues.apache.org/jira/browse/FLINK-1780
Project: Flink
Issue Type: Improvement
Components: Java API, Scala API
Affects Versions: 0.9
Reporter: Fabian Hueske
Priority: Minor
Fix For: 0.9

The recently added {{GroupCombineOperator}} requires a {{FlatCombineFunction}}, whereas a {{GroupReduceOperator}} requires a {{GroupReduceFunction}}. {{FlatCombineFunction}} and {{GroupReduceFunction}} work on the same types of parameters (Iterable and Collector). Therefore, I propose to rename {{FlatCombineFunction}} to {{GroupCombineFunction}}. Since the function could not be used independently until recently, this API-breaking change should be OK.
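To make the proposal concrete: both interfaces consume a full group as an Iterable and emit through a Collector, so the change is purely a rename. A sketch of the renamed interface, assuming it keeps the current FlatCombineFunction shape:

    import java.io.Serializable;
    import org.apache.flink.api.common.functions.Function;
    import org.apache.flink.util.Collector;

    // Sketch of the proposed GroupCombineFunction, assuming it keeps the
    // exact shape of today's FlatCombineFunction: the whole group arrives
    // as an Iterable, pre-combined results are emitted via the Collector.
    public interface GroupCombineFunction<IN, OUT> extends Function, Serializable {
        void combine(Iterable<IN> values, Collector<OUT> out) throws Exception;
    }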
Re: Question about Flink Streaming
Hey Matthias,

Let's see if I can clear these things up for you :)

1) The difference between setup and open is that setup is called to set things like collectors, the runtime context, and everything else that will be used by the implemented invokable and by the rich functions. Open is called after setup, to actually open the execution of the UDF operator.

2) Close is always called, even when the task is cancelled. In addition, when a task is failing (maybe because other tasks are failing), the cancel method is called and the main thread is interrupted. The point of having a cancel method is that some invokables might require different shutdown logic in case of failure.

3) This I need to look into...

4) You are right about the unintuitive name here; if you could open a JIRA for this I would appreciate it :)

5) You are absolutely right on this point, we need to spend more effort on writing proper docs.

I hope I could clarify some stuff.

Cheers,
Gyula

On Tue, Mar 24, 2015 at 5:05 PM, Matthias J. Sax mj...@informatik.hu-berlin.de wrote:

Hi,

as I get more familiar with Flink streaming and do some coding, I hit a few points which I want to discuss, because I find them counter-intuitive. Please tell me what you think about them, or clarify what I misunderstood.

1) The class StreamInvokable has two methods, .setup(...) and .open(...) - what is the difference between the two? When exactly is each of them called? It seems that both are used to set up an operator. Why can't they be unified?

2) The same question about .close() and .cancel()?

3) There is a class/interface hierarchy for user-defined functions. The top-level interface is 'Function', and there is an interface 'RichFunction' and an abstract class 'AbstractRichFunction'. For each different type, there are user functions derived from these. So far so good. However, the StreamInvokable class only takes a constructor argument of type Function, indicating that RichFunctions are not supported. Internally, the given function is tested to be a RichFunction (using instanceof) at certain places. This is counter-intuitive from an API point of view. From my OO understanding, it would be better to replace Function by RichFunction everywhere. However, I was told that the (empty) Function interface is necessary for lambda expressions. Thus, I would suggest extending the API with methods taking a RichFunction as parameter, so it is clear that those are supported, too.

4) There is the interface Timestamp that is used to extract a timestamp from a record in order to create windows on a record attribute. I think the name Timestamp is misleading, because the class does not represent a timestamp. I would rather call the interface TimestampExtractor or something similar.

5) Stefan started the discussion about more tests for the streaming component. I would additionally suggest improving the Javadoc documentation. There are many classes and methods with missing or very brief documentation, and it is often hard to guess what they are used for. I would also suggest describing the interaction of components/classes and WHY some things are implemented in a certain way. As I have background knowledge from Stratosphere, I personally can work around this and make sense of it (at least most of the time). However, for new contributors it might be very hard to make sense of it and to get started implementing new features.

Cheers,
Matthias
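To make the ordering Gyula describes concrete, here is a rough lifecycle skeleton. It is illustrative only, not the actual StreamInvokable code:

    // Illustrative lifecycle skeleton, NOT the actual StreamInvokable code.
    // Order: setup() wires in machinery, open() starts the UDF, invoke()
    // processes records; close() always runs, cancel() additionally runs
    // on failure before the main thread is interrupted.
    public abstract class SketchedInvokable {

        public void setup(/* collector, runtime context, ... */) {
            // called first: inject collectors, contexts and everything the
            // invokable and the rich functions will need later
        }

        public void open() throws Exception {
            // called after setup: open execution of the UDF operator
            // (this is where a RichFunction's own open() would run)
        }

        public abstract void invoke() throws Exception; // main processing loop

        public void close() throws Exception {
            // always called, even when the task is cancelled
        }

        public void cancel() {
            // called when a task fails or is cancelled: a hook for shutdown
            // logic that differs from the regular close() path
        }
    }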
Re: Question about Flink Streaming
The setup and open methods could be called together, but they do different tasks, and therefore I don't see any reason why they should be in the same method. This is a critical part of the code, so better to keep things clean and separate.

The RuntimeContext refers to the operator, while the TaskContext refers to the task itself. We don't want to expose everything in the TaskContext to the user, because that could lead to serious problems.

Ok, I will talk with him.
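For readers following point 4 of this thread: judging from the description, the interface has roughly the shape below (the exact package and generics are a reconstruction and may differ), which is why TimestampExtractor reads as the more accurate name; the object extracts a timestamp rather than being one.

    // Rough shape of the discussed streaming Timestamp interface, as
    // reconstructed from the thread; details may differ in the codebase.
    public interface Timestamp<T> extends java.io.Serializable {
        long getTimestamp(T value); // extracts the timestamp from a record
    }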
Re: [GSoC][flink-streaming] Interested in pursuing FLINK-1617 and FLINK-1534
Thanks Gyula. I agree that simple and working implementations are preferable over hacky complex solutions. I'll start sketching out an initial straightforward API with only basic pattern matching features and base it on the existing windowing API. I'll post a draft of the proposal tomorrow, keeping the points you've made in mind, so you can look it over to see if it's all right.

Regards,
Akshay Dixit