Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data

Jerry Lam Tue, 01 Mar 2016 08:25:19 -0800

Hi Herman,

Thank you for your reply!
This functionality usually finds its place in financial services which use
CEP (complex event processing) for correlation and pattern matching. Many
commercial products have this including Oracle and Teradata Aster Data MR
Analytics. I do agree the syntax a bit awkward but after you understand it,
it is actually very compact for expressing something that is very complex.
Esper has this feature partially implemented (
http://www.espertech.com/esper/release-5.1.0/esper-reference/html/match-recognize.html
).

I found the Teradata Analytics documentation best to describe the usage of
it. For example (note npath is similar to match_recognize):

SELECT last_pageid, MAX( count_page80 )
 FROM nPath(
 ON ( SELECT * FROM clicks WHERE category >= 0 )
 PARTITION BY sessionid
 ORDER BY ts
 PATTERN ( 'A.(B|C)*' )
 MODE ( OVERLAPPING )
 SYMBOLS ( pageid = 50 AS A,
           pageid = 80 AS B,
           pageid <> 80 AND category IN (9,10) AS C )
 RESULT ( LAST ( pageid OF ANY ( A,B,C ) ) AS last_pageid,
          COUNT ( * OF B ) AS count_page80,
          COUNT ( * OF ANY ( A,B,C ) ) AS count_any )
 )
 WHERE count_any >= 5
 GROUP BY last_pageid
 ORDER BY MAX( count_page80 )

The above means:
Find user click-paths starting at pageid 50 and passing exclusively through
either pageid 80 or pages in category 9 or category 10. Find the pageid of
the last page in the path and count the number of times page 80 was
visited. Report the maximum count for each last page, and sort the output
by the latter. Restrict to paths containing at least 5 pages. Ignore pages
in the sequence with category < 0.

If this query is written in pure SQL (if possible at all), it requires
several self-joins. The interesting thing about this feature is that it
integrates SQL+Streaming+ML in one (perhaps potentially graph too).

Best Regards,

Jerry

On Tue, Mar 1, 2016 at 9:39 AM, Herman van Hövell tot Westerflier <
hvanhov...@questtec.nl> wrote:

> Hi Jerry,
>
> This is not on any roadmap. I (shortly) browsed through this; and this
> looks like some sort of a window function with very awkward syntax. I think
> spark provided better constructs for this using dataframes/datasets/nested
> data...
>
> Feel free to submit a PR.
>
> Kind regards,
>
> Herman van Hövell
>
> 2016-03-01 15:16 GMT+01:00 Jerry Lam <chiling...@gmail.com>:
>
>> Hi Spark developers,
>>
>> Will you consider to add support for implementing "Pattern matching in
>> sequences of rows"? More specifically, I'm referring to this:
>> http://web.cs.ucla.edu/classes/fall15/cs240A/notes/temporal/row-pattern-recogniton-11.pdf
>>
>> This is a very cool/useful feature to pattern matching over live
>> stream/archived data. It is sorted of related to machine learning because
>> this is usually used in clickstream analysis or path analysis. Also it is
>> related to streaming because of the nature of the processing (time series
>> data mostly). It is SQL because there is a good way to express and optimize
>> the query.
>>
>> Best Regards,
>>
>> Jerry
>>
>
>

Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data

Reply via email to