Re: [GSoC][flink-streaming] Interested in pursuing FLINK-1617 and FLINK-1534

2015-03-24 Thread Gyula Fóra
Hey Dixit,

Sorry for the delay, I had to discuss this in more detail with some of our
other core developers.

The consensus seems to be that we would like to push this project in a
direction where the changes can be quickly included in the next releases.
For this it is essential that we implement features that are complete (and
clean) from the user's perspective. This does not necessarily mean that we
would like to have everything at once, but rather that it is preferable to
start with something clean and simple (for instance the naive chained
filter approach, sketched below) and progressively build more complex logic.
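
A minimal sketch of what that chained-filter style could look like on a
DataStream (all of Event, isTypeA(), isTypeB() and StatefulMatcher are
illustrative assumptions, not the actual Flink API; the boolean flag is
kept per parallel instance and is not fault tolerant):

    // assumes org.apache.flink.api.common.functions.FlatMapFunction
    // and org.apache.flink.util.Collector
    class StatefulMatcher implements FlatMapFunction<Event, Event> {
        private boolean seenA = false;

        @Override
        public void flatMap(Event e, Collector<Event> out) {
            if (e.isTypeA()) {
                seenA = true;                    // remember the cause event
            } else if (e.isTypeB() && seenA) {
                out.collect(e);                  // pattern "A followed by B" found
                seenA = false;
            }
        }
    }

    DataStream<Event> events = env.addSource(new EventSource());

    DataStream<Event> matches = events
        .filter(e -> e.isTypeA() || e.isTypeB()) // keep only relevant events
        .flatMap(new StatefulMatcher());         // emit B-events preceded by an A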

This also means that we would like to avoid research-style code in the
codebase as much as possible. Of course, once we have a stable API for this
functionality we can work towards making the optimizations that you have
mentioned, like operator sharing and so on.

The ideal proposal would give a clear sketch of the pattern matching API
that you would like to implement, which might be some operators added to
the current API at first, and possibly a DSL later with more advanced
functionality (this would probably go into a separate library until it is
very stable).

So please include in the proposal a preview of what the pattern matching
syntax would look like integrated with the current operators, how it would
interact with other parts of the system, etc.

These are the things we need to figure out before we consider the
optimizations, I think, because it usually turns out that the API semantics
you would like to provide can hugely affect (and probably limit) the
possibilities that you have afterwards in terms of optimizations.

Let me know if you have further questions regarding this :)

Gyula


Re: [GSoC][flink-streaming] Interested in pursuing FLINK-1617 and FLINK-1534

2015-03-24 Thread Akshay Dixit
Hi,
It'd really help if I got a reply soon. It'll be helpful in writing the
proposal, since the deadline is on the 27th. Thanks
Regards,
Akshay Dixit

On Sun, Mar 22, 2015 at 1:17 AM, Akshay Dixit akshayd...@gmail.com wrote:

 Thanks for the explanation, Marton. I've decided to try out for FLINK-1534.

 After reading through the thesis[4] and a few other papers[1][2][3], I
 believe I've gathered enough context to ask more questions. But I'm still
 not sure how Flink's internals work, so please bear with me. That said, the
 ongoing effort to document the architecture and internals is really helpful
 for newbies like me and should greatly decrease the ramp-up time.

 Detecting a pattern of events would comprise a pipeline that accepts the
 pattern query and the source DataStreams, and outputs detected matches of
 that pattern to a sink or forwards them along to another stream for
 further computation.

 As you said, a simple filter-join-aggregate query system could be
 developed using the existing streaming windowing API. But matching over
 complex events and decoding their pattern queries would require
 implementing a DSL that transforms queries into an evaluation model. For
 example, in [1] the authors implement an NFA with a shared versioned
 buffer that models the queries. In [4], the authors propose a new language
 that is much more expressive and compiles into a topology graph for Storm.
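
 A toy illustration of the NFA idea for a pattern SEQ(A, B) (not taken from
 [1] or from Flink; Event, isTypeA() and isTypeB() are assumed helpers):

    // Each run of the NFA tracks its current state plus the events matched
    // so far; [1] additionally shares these partial-match buffers between
    // runs via version numbers instead of copying them per run.
    enum State { START, SAW_A, MATCHED }

    class Run {
        State state = State.START;
        List<Event> matched = new ArrayList<>();   // java.util

        void advance(Event e) {
            if (state == State.START && e.isTypeA()) {
                matched.add(e);
                state = State.SAW_A;
            } else if (state == State.SAW_A && e.isTypeB()) {
                matched.add(e);
                state = State.MATCHED;             // full match found
            }
        }
    }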

 So in Flink's case, I believe the proposed DSL would generate operator
 graphs for the Flink compiler to schedule JobGraphs over TaskManagers.
 If we don't depend on the windowing API, would we need to create new
 operators such as the Projection, Conjunction and Union operators defined
 in [4]?
 I would also like to hear your thoughts on how to approach scaling the
 pattern matching query; note that all these techniques address scaling a
 single query. I've read about several approaches:

 1. Merging equivalent runs[1]: This seems a good way to squash multiple
 instances of pattern matching forks into a single one if they have the
 same state. But I'm not sure how we would implement this in Flink, since
 this is a runtime optimization.

 2. Implementing a shared versioned match buffer[1]: This would involve
 sharing the state of a buffer data structure across multiple candidate
 match instances for the pattern.

 3. Splitting complex composite patterns into simpler sub-patterns[4] and
 executing separate queries to detect those sub-patterns. This might
 translate into different tasks and duplicating the source DataStreams to
 all the newly generated tasks (see the sketch after this list).
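
 A rough sketch of approach 3 (illustrative only: SubPatternDetector, the
 pattern strings and the choice of broadcast partitioning are my
 assumptions, not existing Flink API beyond DataStream itself):

    // duplicate the full source stream to one detector per sub-pattern
    DataStream<Match> ab = source.broadcast()
        .flatMap(new SubPatternDetector("SEQ(A,B)"));
    DataStream<Match> cd = source.broadcast()
        .flatMap(new SubPatternDetector("SEQ(C,D)"));
    // a downstream operator would then combine partial matches (omitted)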

 Also since I don't know how the Flink compiler behaves, would some of the
 optimizations involve making changes to it too?

 Regards,
 Akshay Dixit

 [1] : Efficient Pattern Matching over Event Streams
 http://people.cs.umass.edu/~yanlei/publications/sase-sigmod08-long.pdf
 [2] : On Supporting Kleene Closure over Event Streams
 http://people.cs.umass.edu/~yanlei/publications/sase-icde08.pdf
 [3] : Processing Flows of Information: From Data Stream to Complex Event
 Processing
 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.396.1785&rep=rep1&type=pdf
 [4] : Distributing Complex Event Detection
 http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf

 On Mon, Mar 16, 2015 at 3:22 PM, Márton Balassi balassi.mar...@gmail.com
 wrote:

 Dear Akshay,

 Thanks again for your interest and for the recent contribution to
 streaming.

 Both of the projects mentioned would be greatly appreciated by the
 community, and you can also propose other project suggestions here for
 discussion.

 Regarding FLINK-1534, the thesis I mentioned serves as a starting point,
 and indeed the basic solution can be implemented with filtering and
 windowing/mapping, with some state storing whether the cause of an event
 has already been seen. Relying solely on the existing windowing API,
 however, might cause performance issues if the events also have an
 expiration timeout, so some optimization there would be included (see the
 sketch below). The further challenge is to try to further exploit Flink's
 parallel job execution to possibly scale a pattern matching query.
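
 A minimal sketch of the expiration concern, extending the earlier
 StatefulMatcher idea (purely illustrative; the timestamps, the timeout
 constant and the helper names are assumptions, not Flink API):

    class ExpiringMatcher implements FlatMapFunction<Event, Event> {
        private static final long TIMEOUT_MS = 60_000;  // assumed expiration
        private long causeSeenAt = -1;                  // -1 = no pending cause

        @Override
        public void flatMap(Event e, Collector<Event> out) {
            if (e.isTypeA()) {
                causeSeenAt = e.getTimestampMs();       // remember the cause
            } else if (e.isTypeB() && causeSeenAt >= 0
                    && e.getTimestampMs() - causeSeenAt <= TIMEOUT_MS) {
                out.collect(e);                         // cause still valid
                causeSeenAt = -1;
            }
        }
    }

 A window-based variant would instead keep all candidate events buffered
 for the window length, which is where the performance concern comes from.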

 Best,

 Marton

 On Sun, Mar 15, 2015 at 3:22 PM, Akshay Dixit akshayd...@gmail.com
 wrote:

  Hi,
  I'm Akshay Dixit[1], a 4th year undergrad at VIT Vellore, India. I'm
  currently interested in distributed systems and stream processing and am
  looking to delve deeper into the subject, and hope to get some insight
 by
  contributing to Apache Flink. I've gathered some understanding of the
  flink-streaming codebase by recently working on a PR for FLINK-1450[2].
 
  Both FLINK-1617[3] and FLINK-1534[4] are interesting projects that I
 would
  love to work on over the summer. I was wondering which amongst these
 would
  be more appreciated by the community, so I can start working towards a
  proposal for either one.
 
  Regarding FLINK-1534, I was wondering why would simply merging and
  filtering the existing 

Affinity propagation for Gelly

2015-03-24 Thread Yi ZHOU

Hello everyone,

I am working on an affinity propagation implementation for Gelly
(FLINK-1707, https://issues.apache.org/jira/browse/FLINK-1707). The
algorithm passes messages between every pair of vertices (NOT every pair
of connected vertices) in each iteration, giving a computation complexity
of O(N² * Iter), and it has a memory complexity of O(N²) as well. So I
believe the algorithm will suffer from a large communication cost no
matter how we distribute the graph across machines. The basic fact is that
the algorithm passes two kinds of messages on a complete graph. I have
seen some similar discussion for Spark:
https://issues.apache.org/jira/browse/SPARK-5832
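
For reference, the two kinds of messages are the responsibilities r(i,k)
and the availabilities a(i,k) of the standard formulation (Frey & Dueck,
2007); since every ordered pair (i,k) is updated from the similarities
s(i,k), each iteration touches O(N²) values:

    r(i,k) \leftarrow s(i,k) - \max_{k' \neq k} \{ a(i,k') + s(i,k') \}

    a(i,k) \leftarrow \min\Big( 0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max(0, r(i',k)) \Big), \quad i \neq k

    a(k,k) \leftarrow \sum_{i' \neq k} \max(0, r(i',k))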


I found an adaptive implementation on Hadoop that runs affinity
propagation on subgraphs and then merges the resulting clusters into
larger ones:
http://www.ijeee.net/uploadfile/2014/0807/20140807114023665.pdf . It is
not equivalent to the original algorithm. So, does anyone know of another
distributed version, or have any suggestions?



ZHOU Yi


Re: [GSoC][flink-streaming] Interested in pursuing FLINK-1617 and FLINK-1534

2015-03-24 Thread Gyula Fóra
Hey,

Give me an hour or so as I am in a meeting currently, but I will get back
to you afterwards.

Regards,
Gyula


Re: Travis-CI builds queuing up

2015-03-24 Thread Henry Saputra
If we cannot get more capacity, we could set up ASF Jenkins instead.
It already powers CI for many ASF projects, such as Hadoop, so it should
not be too shabby.

I have created a ticket for Flink to set up ASF Jenkins but have not found
time to work on it.

- Henry




[jira] [Created] (FLINK-1777) Update Java 8 Lambdas with Eclipse documentation

2015-03-24 Thread Timo Walther (JIRA)
Timo Walther created FLINK-1777:
---

 Summary: Update Java 8 Lambdas with Eclipse documentation
 Key: FLINK-1777
 URL: https://issues.apache.org/jira/browse/FLINK-1777
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Reporter: Timo Walther
Assignee: Timo Walther
Priority: Minor


The Eclipse JDT compiler team has introduced a compiler flag for us, which is 
not covered in the Flink documentation yet.
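
If I recall correctly, this is the flag that makes JDT generate generic
type signatures for lambdas, set in a project's
.settings/org.eclipse.jdt.core.prefs (treat the exact property name as an
assumption to verify):

{code}
org.eclipse.jdt.core.compiler.codegen.lambda.genericSignature=generate
{code}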



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Travis-CI builds queuing up

2015-03-24 Thread Robert Metzger
Hi guys,

the build queue on Travis is getting very, very long. It seems that it now
takes 4 days until commits to master are built. The nightly builds from the
website and the Maven snapshots are also delayed by that.
Right now, there are 33 pull request builds scheduled (
https://travis-ci.org/apache/flink/pull_requests), and 8 builds on master:
https://travis-ci.org/apache/flink/builds.

The problem is that Travis accounts are per GitHub user. In our case, the
user is apache, so all ASF projects that have Travis enabled share 5
concurrent builders.

I would actually like to continue using Travis.

The easiest option is probably asking Travis whether they can give the
apache user more build capacity.

If that's not possible, we have to look into other options.


I'm going to ask Travis if they can do anything about it.

Robert


Re: Travis-CI builds queuing up

2015-03-24 Thread Ufuk Celebi
Let's see what Travis replies to Robert, but in general I agree with Max.

Travis helped a lot to discover certain race conditions in the last weeks... I
would like to not ditch it completely, as Max suggested.

On 24 Mar 2015, at 16:03, Maximilian Michels m...@apache.org wrote:

 I would also like to continue using Travis but the current situation is not
 acceptable because we practically can't use Travis anymore for pull
 requests or the current master. If it cannot be resolved then I think we
 should move on.
 The builds service team [1] at Apache offers Jenkins [2] for continuous
 integration. I think it should be fairly simple to set up. We could still
 use Travis in our forked repositories but have a reliable CI solution for
 the master and pull requests.


ApacheCon 2015 is coming to Austin, Texas, USA

2015-03-24 Thread Henry Saputra
Dear Apache Flink enthusiast,

In just a few weeks, we'll be holding ApacheCon in Austin, Texas, and
we'd love to have you in attendance. You can save $300 on admission by
registering NOW, since the early bird price ends on the 21st.

Register at http://s.apache.org/acna2015-reg

ApacheCon this year celebrates the 20th birthday of the Apache HTTP
Server, and we'll have Brian Behlendorf, who started this whole thing,
keynoting for us, and you'll have a chance to meet some of the
original Apache Group, who will be there to celebrate with us.

We've got 7 tracks of great talks, as well as BOFs, the Apache
BarCamp, project-specific hack events, and evening events where you
can deepen your connection with the larger Apache community. See the
full schedule at http://apacheconna2015.sched.org/

And if you have any questions, comments, or just want to hang out with
us before and during the event, follow us on Twitter - @apachecon - or
drop by #apachecon on the Freenode IRC network.

Hope to see you in Austin!

- Henry


Re: Question about Flink Streaming

2015-03-24 Thread Matthias J. Sax
Hi Gyula,

thanks a lot. I still don't understand why setup() and open() cannot be
unified. I also don't know what the difference between RuntimeContext
and StreamTaskContext is (or, to be more precise, why not use a single
context class that unifies both)?

About the renaming of the Timestamp class: Marton told me that he thinks
the name is fine and should not be changed. Before opening a JIRA for
it, you two should get in sync and decide what to do.


-Matthias





[jira] [Created] (FLINK-1780) Rename FlatCombineFunction to GroupCombineFunction

2015-03-24 Thread Fabian Hueske (JIRA)
Fabian Hueske created FLINK-1780:


 Summary: Rename FlatCombineFunction to GroupCombineFunction
 Key: FLINK-1780
 URL: https://issues.apache.org/jira/browse/FLINK-1780
 Project: Flink
  Issue Type: Improvement
  Components: Java API, Scala API
Affects Versions: 0.9
Reporter: Fabian Hueske
Priority: Minor
 Fix For: 0.9


The recently added {{GroupCombineOperator}} requires a {{FlatCombineFunction}},
whereas a {{GroupReduceOperator}} requires a {{GroupReduceFunction}}.
{{FlatCombineFunction}} and {{GroupReduceFunction}} work on the same types of
parameters (an Iterable and a Collector).

Therefore, I propose to change the name {{FlatCombineFunction}} to 
{{GroupCombineFunction}}. Since the function could not be independently used 
until recently, this API breaking change should be OK.
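
For reference, a sketch of the parallel signatures after the proposed
rename (the generic parameter names are illustrative):

{code}
public interface GroupReduceFunction<IN, OUT> extends Function {
    void reduce(Iterable<IN> values, Collector<OUT> out) throws Exception;
}

// FlatCombineFunction, renamed:
public interface GroupCombineFunction<IN, OUT> extends Function {
    void combine(Iterable<IN> values, Collector<OUT> out) throws Exception;
}
{code}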



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Question about Flink Streaming

2015-03-24 Thread Gyula Fóra
Hey Matthias,

Let's see if I get these things for you :)

1) The difference between setup and open is that setup sets things like
collectors, the runtime context and everything that will be used by the
implemented invokable and also by the rich functions. Open is called after
setup, to actually start executing the UDF operator.

2) Close is always called, even when the task is cancelled. In addition,
when a task is failing (maybe because other tasks are failing) the cancel
method is called and the main thread is interrupted. The point of having a
cancel method is that some invokables might require different shutdown
logic in case of failure.
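
A sketch of the lifecycle order described in 1) and 2) (purely
illustrative, not the actual StreamInvokable code; method and variable
names are assumed):

    invokable.setup(taskContext);   // wire collectors, runtime context, ...
    invokable.open(parameters);     // then start executing the UDF operator
    try {
        invokable.invoke();         // process records
    } catch (Exception e) {
        invokable.cancel();         // failure-specific shutdown logic
    } finally {
        invokable.close();          // always called, even when cancelled
    }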

3) This I need to look into...

4) You are right about the unintuitive name here; if you could open a JIRA
for this, I would appreciate it :)

5) You are absolutely right on this point; we need to spend more effort on
writing proper docs.

I hope I could clarify some stuff.

Cheers,
Gyula

On Tue, Mar 24, 2015 at 5:05 PM, Matthias J. Sax 
mj...@informatik.hu-berlin.de wrote:

 Hi,

 as I get more familiar with Flink streaming and do some coding, I hit a
 few points which I want to discuss because I find them counter-intuitive.
 Please tell me what you think about them, or clarify what I misunderstood.

 1) The class StreamInvokable has two methods, .setup(...) and .open(...)
 - what is the difference between the two? When is each called exactly? It
 seems that both are used to set up an operator. Why can't they be unified?

 2) The same question about .close() and .cancel()?

 3) There is a class/interface hierarchy for user defined functions. The
 top level interface is 'Function', and there is an interface
 'RichFunction' and an abstract class 'AbstractRichFunction'. For each
 different type, there are user functions derived from these. So far so
 good. However, the StreamInvokable class only takes a constructor argument
 of type Function, indicating that RichFunctions are not supported.
 Internally, the given function is tested to be a RichFunction (using
 instanceof) at certain places. This is counter-intuitive from an API point
 of view. From my OO understanding it would be better to replace Function
 by RichFunction everywhere. However, I was told that the (empty) Function
 interface is necessary for lambda expressions. Thus, I would suggest
 extending the API with methods taking a RichFunction as a parameter, so
 it is clear that those are supported too (see the sketch below).
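
 A sketch of the instanceof pattern I mean (the surrounding names are
 assumed for illustration; RichFunction.open(Configuration) is the actual
 hook):

    // the constructor accepts the plain Function interface (needed for
    // lambdas); rich-function hooks are only reached via instanceof
    if (userFunction instanceof RichFunction) {
        ((RichFunction) userFunction).open(parameters);
    }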

 4) There is the interface Timestamp that is used to extract a timestamp
 from a record in order to create windows on a record attribute. I think
 the name Timestamp is misleading, because the class does not represent a
 timestamp. I would rather call the interface TimestampExtractor or
 something similar (see the sketch below).
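
 A sketch of the suggested rename (the signature is illustrative):

    public interface TimestampExtractor<T> {
        long extractTimestamp(T record);
    }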

 5) Stefan started the discussion about more tests for the streaming
 component. I would additionally suggest improving the Javadoc
 documentation. There are many classes and methods with missing or very
 brief documentation, and it is often hard to guess what they are used for.
 I would also suggest describing the interaction of components/classes and
 WHY some things are implemented in a certain way. As I have background
 knowledge from Stratosphere, I personally can work around this and make
 sense of it (at least most of the time). However, for new contributors it
 might be very hard to make sense of it and to get started implementing
 new features.


 Cheers,
   Matthias





Re: Question about Flink Streaming

2015-03-24 Thread Gyula Fóra
The setup and open methods could be called together, but they do different
tasks, and therefore I don't see any reason why they should be in the same
method. This is a critical part of the code, so better to keep things
clean and separate.

The RuntimeContext refers to the operator, while the TaskContext refers to
the task itself. We don't want to expose everything in the TaskContext to
the user, because that could lead to serious problems.

OK, I will talk with him.





Re: [GSoC][flink-streaming] Interested in pursuing FLINK-1617 and FLINK-1534

2015-03-24 Thread Akshay Dixit
Thanks Gyula.

I agree too that simple, working implementations are preferable to hacky,
complex solutions. I'll start sketching out an initial straightforward API
with only basic pattern matching features and base it on the existing
windowing API. I'll post a draft of the proposal tomorrow, keeping the
points you've made in mind, so you can look it over to see if it's all
right.
Regards,
Akshay Dixit
