Brian Lehman created FLINK-33538:
------------------------------------
Summary: Create a "best match" option for CEP when using the
optional operator in the Pattern API
Key: FLINK-33538
URL: https://issues.apache.org/jira/browse/FLINK-33538
Project: Flink
Issue Type: Improvement
Components: Library / CEP
Reporter: Brian Lehman
I’ll try to detail my issue and the initial experimentation I’ve done in the
CEP source code to show the potential of making this nearly as fast as having
no optional events in the pattern.
Using Flink CEP I have implemented a pattern that detects a sequence of events.
Each event has internal attributes that are checked and inter-event time deltas
are also checked as part of the pattern.
When I require all events in my sequence (say 10 elements long) Flink CEP works
well and is super fast at detecting the “perfect” sequences. Should one or more
of the events not get recorded, I would still like to detect the partial
sequence. Since I don’t know which events might be missing I must make all of
the events optional. While this worked, it was significantly slower, to the
point that it is an unworkable solution.
CEP is set to alert as fast as possible – so when everything is optional, once
it passes a single event the path to final in ComputationState is immediately
found and the potentialMatch is emitted. If a perfect match is coming, I will
not find the pattern unless the skip strategy is set to noSkip(). Because all
are optional, in the instance of a perfect match, CEP wants to emit all 2^10
possible matches when noSkip() is used. This is what causes the extreme
performance drop off.
I would like to add an option to Flink CEP that allows the “best match” when
the optional() attribute is used in the Pattern API. The default can still be
“fastest match” which would operate as it does today. In the “best match”
scenario, the potentialMatches would be held back (not emitted) until the full
pattern time has passed.
I have made some modifications to the source code (primarily
org.apache.flink.cep.nfa.NFA) that show this can work and be nearly as fast as
the perfect match option. My basic strategy is the following:
* Once a longer match is found, discard all of the shorter length sub-matches
from partialMatches and potentialMatches.
* Eg – [E1, $endState$]
* [E1, E2, $endState$] - discard the [E1, $endState$] and [E2, $endState$] and
those partialMatch paths as well
* [E1, E2, E3, $endState$] – discard the [E1, E2, $endState$], etc.
* DeweyNumber allows a way to quickly assess match lengths, by comparing
DeweyNumber.lengths.
* DeweyNumber provides a way to assess matches, with its versioning – matches
tend to use a version of zero (eg. 1.0.0.0.0.0 is a perfect match to the first
6 events of the pattern)
* Delay emitting the “best match” until the options are exhausted, eliminating
the shorter matches as quickly as possible.
I’m at a place in my modifications where it would be very beneficial to work
with an expert that would understand how best to accomplish this without
impacting current functionality/performance. I’m happy to collaborate and share
my work if that helps. My current modifications are a combination of mods and
logs to allow me to see what is going on internally in the CEP processing.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)