Hi Reynold,

You are right. It is about the audience. For instance, in many of my cases,
the SQL style is very attractive, if not mandatory, for people with minimal
programming knowledge. SQL has its place for communication. The last time I
showed someone the Spark DataFrame style, they immediately said it was too
difficult to use. When I changed it to SQL, they were suddenly happy and
asked how to do it. It sounds stupid, but that's how it is for now.

The following example will make some banks happy (copied from the Oracle
solution):

SELECT *
FROM Ticker MATCH_RECOGNIZE (
     PARTITION BY symbol
     ORDER BY tstamp
     MEASURES  STRT.tstamp AS start_tstamp,
               LAST(DOWN.tstamp) AS bottom_tstamp,
               LAST(UP.tstamp) AS end_tstamp
     ONE ROW PER MATCH
     AFTER MATCH SKIP TO LAST UP
     PATTERN (STRT DOWN+ UP+)
     DEFINE
        DOWN AS DOWN.price < PREV(DOWN.price),
        UP AS UP.price > PREV(UP.price)
     ) MR
ORDER BY MR.symbol, MR.start_tstamp;

Basically, this query finds all cases where a stock price dipped to a bottom
and then rose (the popular V-shape). It might be confusing at first, but it
is still readable for many users who know SQL. Note that the PATTERN clause
is interesting: it is a regular expression over the defined symbols (DOWN and
UP; STRT is not defined, so it matches any row).
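
For comparison, here is a rough sketch of what the same V-shape detection
looks like written by hand in plain Scala (not Spark; Tick is a made-up
type, and this assumes the ticks of a single symbol, already ordered by
tstamp):

case class Tick(tstamp: Long, price: Double)

// Returns (start, bottom, end) timestamps for every strict price decline
// followed by a strict rise, mimicking PATTERN (STRT DOWN+ UP+) and the
// three MEASURES of the SQL version.
def vShapes(ticks: Vector[Tick]): Vector[(Long, Long, Long)] = {
  val out = Vector.newBuilder[(Long, Long, Long)]
  var i = 0
  while (i < ticks.length - 1) {
    var j = i
    // DOWN+: extend while the price strictly falls
    while (j + 1 < ticks.length && ticks(j + 1).price < ticks(j).price) j += 1
    if (j > i) {
      val bottom = j
      // UP+: extend while the price strictly rises
      while (j + 1 < ticks.length && ticks(j + 1).price > ticks(j).price) j += 1
      if (j > bottom)
        out += ((ticks(i).tstamp, ticks(bottom).tstamp, ticks(j).tstamp))
    }
    // Like AFTER MATCH SKIP TO LAST UP: resume from the last row consumed
    i = if (j > i) j else i + 1
  }
  out.result()
}

It works, but you can see why the SQL version is an easier sell to people
with minimal programming knowledge.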

Most CEP solutions have a SQL-like interface.

Best Regards,

Jerry



On Tue, Mar 1, 2016 at 4:44 PM, Reynold Xin <r...@databricks.com> wrote:

> There are definitely pros and cons for Scala vs SQL-style CEP. Scala might
> be more powerful, but the target audience is very different.
>
> How much usage is there for CEP-style SQL syntax in practice? I've never
> seen it come up so far.
>
>
>
> On Tue, Mar 1, 2016 at 9:35 AM, Alex Kozlov <ale...@gmail.com> wrote:
>
>> I looked at the paper: while we can argue about the performance side, I
>> think the Scala pattern matching is semantically much more expressive.
>> Time will tell.
>>
>> On Tue, Mar 1, 2016 at 9:07 AM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>>> Hi Alex,
>>>
>>> We went down this path already :) This is the reason we tried other
>>> approaches. The recursion makes it very inefficient in some cases.
>>> For details, this paper describes it very well:
>>> https://people.cs.umass.edu/%7Eyanlei/publications/sase-sigmod08.pdf
>>> which is the same paper referenced in the Flink ticket.
>>>
>>> Please let me know if I overlooked something. Thank you for sharing this!
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>> On Tue, Mar 1, 2016 at 11:58 AM, Alex Kozlov <ale...@gmail.com> wrote:
>>>
>>>> For the purpose of full disclosure, I think Scala offers a much more
>>>> efficient pattern-matching paradigm. Using nPath is like using assembler
>>>> to program distributed systems. I cannot say much here today, but the
>>>> pattern would look like:
>>>>
>>>> case class PageView(ts: Long, path: String)
>>>> case class Session[A](id: String, pages: List[A])
>>>>
>>>> // Walk the click list; whenever a homepage view is followed by a
>>>> // products-landing view more than 600 time units later, emit a Session
>>>> // starting at that point, then keep scanning.
>>>> // (p is a List here so that the :: pattern applies.)
>>>> def matchSessions(h: Seq[Session[PageView]], id: String,
>>>>                   p: List[PageView]): Seq[Session[PageView]] =
>>>>   p match {
>>>>     case Nil => Nil
>>>>     case PageView(ts1, "company.com>homepage") ::
>>>>          PageView(ts2, "company.com>plus>products landing") :: tail
>>>>          if ts2 > ts1 + 600 =>
>>>>       matchSessions(h, id, tail).+:(new Session(id, p))
>>>>     case _ => matchSessions(h, id, p.tail)
>>>>   }
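>>>>
>>>> A quick usage example with made-up data, just to show the shape:
>>>>
>>>> val views = List(PageView(100L, "company.com>homepage"),
>>>>                  PageView(800L, "company.com>plus>products landing"))
>>>> matchSessions(Nil, "user-1", views)  // one Session, since 800 > 100 + 600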
>>>> Look up Scala case statements with guards, and watch for upcoming book
>>>> releases.
>>>>
>>>> http://docs.scala-lang.org/tutorials/tour/pattern-matching
>>>>
>>>> https://www.safaribooksonline.com/library/view/scala-cookbook/9781449340292/ch03s14.html
>>>>
>>>> On Tue, Mar 1, 2016 at 8:34 AM, Henri Dubois-Ferriere <
>>>> henr...@gmail.com> wrote:
>>>>
>>>>> fwiw Apache Flink just added CEP. Queries are constructed
>>>>> programmatically rather than in SQL, but the underlying functionality is
>>>>> similar.
>>>>>
>>>>> https://issues.apache.org/jira/browse/FLINK-3215
>>>>>
>>>>> On 1 March 2016 at 08:19, Jerry Lam <chiling...@gmail.com> wrote:
>>>>>
>>>>>> Hi Herman,
>>>>>>
>>>>>> Thank you for your reply!
>>>>>> This functionality usually finds its place in financial services,
>>>>>> which use CEP (complex event processing) for correlation and pattern
>>>>>> matching. Many commercial products have it, including Oracle and
>>>>>> Teradata Aster Data MR Analytics. I do agree the syntax is a bit
>>>>>> awkward, but once you understand it, it is actually very compact for
>>>>>> expressing something very complex. Esper has this feature partially
>>>>>> implemented (
>>>>>> http://www.espertech.com/esper/release-5.1.0/esper-reference/html/match-recognize.html
>>>>>> ).
>>>>>>
>>>>>> I found the Teradata Analytics documentation describes the usage
>>>>>> best. For example (note that nPath is similar to MATCH_RECOGNIZE):
>>>>>>
>>>>>> SELECT last_pageid, MAX( count_page80 )
>>>>>>  FROM nPath(
>>>>>>  ON ( SELECT * FROM clicks WHERE category >= 0 )
>>>>>>  PARTITION BY sessionid
>>>>>>  ORDER BY ts
>>>>>>  PATTERN ( 'A.(B|C)*' )
>>>>>>  MODE ( OVERLAPPING )
>>>>>>  SYMBOLS ( pageid = 50 AS A,
>>>>>>            pageid = 80 AS B,
>>>>>>            pageid <> 80 AND category IN (9,10) AS C )
>>>>>>  RESULT ( LAST ( pageid OF ANY ( A,B,C ) ) AS last_pageid,
>>>>>>           COUNT ( * OF B ) AS count_page80,
>>>>>>           COUNT ( * OF ANY ( A,B,C ) ) AS count_any )
>>>>>>  )
>>>>>>  WHERE count_any >= 5
>>>>>>  GROUP BY last_pageid
>>>>>>  ORDER BY MAX( count_page80 )
>>>>>>
>>>>>> The above means:
>>>>>> Find user click-paths starting at pageid 50 and passing exclusively
>>>>>> through either pageid 80 or pages in category 9 or category 10. Find
>>>>>> the pageid of the last page in the path and count the number of times
>>>>>> page 80 was visited. Report the maximum count for each last page, and
>>>>>> sort the output by the latter. Restrict to paths containing at least 5
>>>>>> pages. Ignore pages in the sequence with category < 0.
>>>>>>
>>>>>> If this query were written in pure SQL (if that is possible at all),
>>>>>> it would require several self-joins. The interesting thing about this
>>>>>> feature is that it integrates SQL+streaming+ML in one (and perhaps
>>>>>> graph processing too).
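>>>>>>
>>>>>> To make the SYMBOLS/PATTERN idea concrete, here is a rough sketch of
>>>>>> what the matcher checks per session, in plain Scala (not Spark; Click
>>>>>> is a made-up type, the clicks are assumed already grouped by sessionid
>>>>>> and ordered by ts, each click is assumed to map to a single symbol,
>>>>>> and the OVERLAPPING mode is ignored):
>>>>>>
>>>>>> case class Click(ts: Long, pageid: Int, category: Int)
>>>>>>
>>>>>> // Map each click to its nPath symbol; 'X' matches no symbol.
>>>>>> def symbol(c: Click): Char =
>>>>>>   if (c.pageid == 50) 'A'
>>>>>>   else if (c.pageid == 80) 'B'
>>>>>>   else if (c.pageid != 80 && Set(9, 10)(c.category)) 'C'
>>>>>>   else 'X'
>>>>>>
>>>>>> // In nPath, '.' separates symbols, so (if I read the syntax right)
>>>>>> // 'A.(B|C)*' is the regular expression A(B|C)* over the symbol string.
>>>>>> def matchesPath(session: Seq[Click]): Boolean =
>>>>>>   session.map(symbol).mkString.matches("A(B|C)*")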
>>>>>>
>>>>>> Best Regards,
>>>>>>
>>>>>> Jerry
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 1, 2016 at 9:39 AM, Herman van Hövell tot Westerflier <
>>>>>> hvanhov...@questtec.nl> wrote:
>>>>>>
>>>>>>> Hi Jerry,
>>>>>>>
>>>>>>> This is not on any roadmap. I briefly browsed through this, and it
>>>>>>> looks like some sort of window function with very awkward syntax. I
>>>>>>> think Spark provides better constructs for this using
>>>>>>> DataFrames/Datasets/nested data...
>>>>>>>
>>>>>>> Feel free to submit a PR.
>>>>>>>
>>>>>>> Kind regards,
>>>>>>>
>>>>>>> Herman van Hövell
>>>>>>>
>>>>>>> 2016-03-01 15:16 GMT+01:00 Jerry Lam <chiling...@gmail.com>:
>>>>>>>
>>>>>>>> Hi Spark developers,
>>>>>>>>
>>>>>>>> Will you consider adding support for "pattern matching in sequences
>>>>>>>> of rows"? More specifically, I'm referring to this:
>>>>>>>> http://web.cs.ucla.edu/classes/fall15/cs240A/notes/temporal/row-pattern-recogniton-11.pdf
>>>>>>>>
>>>>>>>> This is a very cool and useful feature for pattern matching over live
>>>>>>>> streams and archived data. It is sort of related to machine learning
>>>>>>>> because it is usually used in clickstream or path analysis. It is
>>>>>>>> also related to streaming because of the nature of the processing
>>>>>>>> (mostly time-series data). And it fits SQL because there is a good
>>>>>>>> way to express and optimize the query.
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>>
>>>>>>>> Jerry
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Alex Kozlov
>>>> (408) 507-4987
>>>> (650) 887-2135 efax
>>>> ale...@gmail.com
>>>>
>>>
>>>
>>
>>
>> --
>> Alex Kozlov
>> (408) 507-4987
>> (650) 887-2135 efax
>> ale...@gmail.com
>>
>
>
