Thanks for the explanation Marton. I've decided to try out for FLINK-1534.

After reading through the thesis[4] and a few other papers[1][2][3], I
believe I've gathered a little context to ask more questions. But I'm still
not sure how Flink's internals work
so please bear with me. Although the ongoing effort to document the
architecture and internal is really helpful for newbies like me and would
greatly decrease the ramping up time.

Detecting a pattern of events would comprise of a pipeline that accepts the
pattern query and
sources of DataStreams, and outputs detected matches of that pattern to a
sink or forwards it
along to another stream for further computation.

As you said, a simple filter-join-aggregate query system could be developed
implementing using the existing Streaming windowing API.
But matching over complex events and decoding their pattern queries would
require implementing a DSL that transforms queries into an evaluation
model. For e.g,
in [1], the authors have implemented an NFA automaton with a shared
versioned buffer that models the queries. In [4], the authors
propose a new language that is much more expressive and compiles into a
topology graph for Storm.

So in Flink's case, I believe the proposed DSL would generate operator
graphs for the Flink compiler to schedule Jobgraphs over TaskManagers.
If we don't depend on the Windowing API, would we need to create new
operators such as the Projection, Conjunction and Union operators defined
in [4] ?
Also I would like to hear your thoughts on how to approach scaling the
pattern matching query. Note all these techniques talk about scaling a
single query.
I've read various ways such as

1.  Merging equivalent runs[1] -: This seems a good way to squash multiple
instances of pattern matching forks into a single one if they have the same
state.
But I'm not sure how we would implement this in Flink since this is a
runtime optimization.

2.  Implementing a matched version buffer[1] -: This would involve sharing
state of a buffer datastructure across multiple candidate match instances
for the pattern.

3.  Splitting complex composite patterns into simpler sub-patterns[4] and
executing separate queries to detect those sub-patterns. This might
translate into different
tasks and duplicating the source datastreams to all the new generated tasks.

Also since I don't know how the Flink compiler behaves, would some of the
optimizations involve making changes to it too?

Regards,
Akshay Dixit

[1] : Efficient Pattern Matching over Event Streams
<http://people.cs.umass.edu/~yanlei/publications/sase-sigmod08-long.pdf>
[2] : On Supporting Kleene Closure over Event Streams
<http://people.cs.umass.edu/~yanlei/publications/sase-icde08.pdf>
[3] : Processing Flows of Information: From Data Stream to Complex Event
Processing
<http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.396.1785&rep=rep1&type=pdf>
[4] : Distributing Complex Event Detection
<http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf>

On Mon, Mar 16, 2015 at 3:22 PM, Márton Balassi <balassi.mar...@gmail.com>
wrote:

> Dear Akshay,
>
> Thanks again for your interest and for the recent contribution to
> streaming.
>
> Both of the projects mentioned wold be largely appreciated by the
> community, and you can also propose other project suggestions here for
> discussion.
>
> Regarding FLINK-1534, the thesis I mentioned serves as a starting point and
> indeed the basic solution can be implemented with filtering and
> windowing/mapping with some state storing whether the cause of an event has
> been already seen. Solely relying on the now existing windowing API this
> however might cause performance issues if the events also have an
> expiration timeout - some optimization there would be included. The further
> challenge is to try to further exploit the parallel job execution of Flink
> to possibly scale a pattern matching query.
>
> Best,
>
> Marton
>
> On Sun, Mar 15, 2015 at 3:22 PM, Akshay Dixit <akshayd...@gmail.com>
> wrote:
>
> > Hi,
> > I'm Akshay Dixit[1], a 4th year undergrad at VIT Vellore, India. I'm
> > currently interested in distributed systems and stream processing and am
> > looking to delve deeper into the subject, and hope to get some insight by
> > contributing to Apache Flink. I've gathered some idea of the
> > flink-streaming codebase by recently working on a PR for FLINK-1450[2].
> >
> > Both FLINK-1617[3] and FLINK-1534[4] are interesting projects that I
> would
> > love to work on over the summer. I was wondering which amongst these
> would
> > be more appreciated by the community, so I can start working towards a
> > proposal for either one.
> >
> > Regarding FLINK-1534, I was wondering why would simply merging and
> > filtering the existing streams for events we want to detect not work?
> Also
> > on going through the document mentioned by @mbalassi in the JIRA
> > comment[5], the authors specify some Runtime Event Detection concepts in
> > Section 5.2. I'm assuming the project entails on building a similar
> analogy
> > using Flink and the deliverables would include working pattern matching
> > operators over Flink DataStreams as described in the report. If so, then
> > shouldn't it be trivial to implement the described the Binary operator
> > using a WindowedStream and a Filter?
> > I hope my questions don't seem misplaced here and I would appreciate
> links
> > to literature where I can learn more on the topic.
> >
> > Regards,
> > Akshay Dixit
> >
> > [1] : http://akshaydixi.me
> > [2] : https://github.com/apache/flink/pull/481
> > [3] : https://issues.apache.org/jira/browse/FLINK-1617
> > [4] : https://issues.apache.org/jira/browse/FLINK-1534
> > [5] :
> > http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf
> >
>

Reply via email to