Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data

2016-03-01 Thread Henri Dubois-Ferriere
fwiw Apache Flink just added CEP. Queries are constructed programmatically
rather than in SQL, but the underlying functionality is similar.

https://issues.apache.org/jira/browse/FLINK-3215

On 1 March 2016 at 08:19, Jerry Lam  wrote:

> Hi Herman,
>
> Thank you for your reply!
> This functionality usually finds its place in financial services, which use
> CEP (complex event processing) for correlation and pattern matching. Many
> commercial products offer it, including Oracle and Teradata Aster Data MR
> Analytics. I do agree the syntax is a bit awkward, but once you understand
> it, it is actually very compact for expressing something very complex.
> Esper has this feature partially implemented (
> http://www.espertech.com/esper/release-5.1.0/esper-reference/html/match-recognize.html
> ).
>
> I found that the Teradata Analytics documentation describes its usage best.
> For example (note that nPath is similar to match_recognize):
>
> SELECT last_pageid, MAX( count_page80 )
>  FROM nPath(
>  ON ( SELECT * FROM clicks WHERE category >= 0 )
>  PARTITION BY sessionid
>  ORDER BY ts
>  PATTERN ( 'A.(B|C)*' )
>  MODE ( OVERLAPPING )
>  SYMBOLS ( pageid = 50 AS A,
>            pageid = 80 AS B,
>            pageid <> 80 AND category IN (9,10) AS C )
>  RESULT ( LAST ( pageid OF ANY ( A,B,C ) ) AS last_pageid,
>           COUNT ( * OF B ) AS count_page80,
>           COUNT ( * OF ANY ( A,B,C ) ) AS count_any )
>  )
>  WHERE count_any >= 5
>  GROUP BY last_pageid
>  ORDER BY MAX( count_page80 )
>
> The above means:
> Find user click-paths starting at pageid 50 and passing exclusively
> through either pageid 80 or pages in category 9 or category 10. Find the
> pageid of the last page in the path and count the number of times page 80
> was visited. Report the maximum count for each last page, and sort the
> output by the latter. Restrict to paths containing at least 5 pages. Ignore
> pages in the sequence with category < 0.
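>
> For comparison, here is a rough sketch of the same query in
> match_recognize syntax (untested, loosely following the Oracle 12c style,
> so treat it as illustrative only):
>
> SELECT last_pageid, MAX( count_page80 )
>  FROM ( SELECT * FROM clicks WHERE category >= 0 )
>  MATCH_RECOGNIZE (
>    PARTITION BY sessionid
>    ORDER BY ts
>    MEASURES LAST ( pageid ) AS last_pageid,
>             COUNT ( B.* ) AS count_page80,
>             COUNT ( * ) AS count_any
>    ONE ROW PER MATCH
>    AFTER MATCH SKIP TO NEXT ROW  -- overlapping matches, as in MODE ( OVERLAPPING )
>    PATTERN ( A (B | C)* )
>    DEFINE A AS pageid = 50,
>           B AS pageid = 80,
>           C AS pageid <> 80 AND category IN (9,10)
>  )
>  WHERE count_any >= 5
>  GROUP BY last_pageid
>  ORDER BY MAX( count_page80 )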
>
> If this query is written in pure SQL (if possible at all), it requires
> several self-joins (see the sketch below). The interesting thing about this
> feature is that it integrates SQL+Streaming+ML in one (and potentially
> graph too).
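>
> To make the self-join point concrete: even a fixed two-step version of the
> path (page 50 followed later by page 80 in the same session) already needs
> one self-join, and every additional step or adjacency constraint adds more.
> The unbounded (B|C)* repetition has no direct equivalent without recursive
> SQL. A rough illustration against the same clicks table:
>
> -- fixed-length approximation only: pageid 50 followed later by pageid 80
> SELECT a.sessionid, b.pageid AS last_pageid
>  FROM clicks a
>  JOIN clicks b
>    ON b.sessionid = a.sessionid
>   AND b.ts > a.ts
>  WHERE a.pageid = 50
>    AND b.pageid = 80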
>
> Best Regards,
>
> Jerry
>
>
> On Tue, Mar 1, 2016 at 9:39 AM, Herman van Hövell tot Westerflier <
> hvanhov...@questtec.nl> wrote:
>
>> Hi Jerry,
>>
>> This is not on any roadmap. I (shortly) browsed through this; and this
>> looks like some sort of a window function with very awkward syntax. I think
>> spark provided better constructs for this using dataframes/datasets/nested
>> data...
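>>
>> For a fixed-length pattern, a window function does the job; a rough,
>> untested sketch against the clicks table from your example, flagging
>> pageid 50 immediately followed by pageid 80 in the same session:
>>
>> SELECT sessionid, ts, pageid
>>  FROM ( SELECT sessionid, ts, pageid,
>>         -- next_pageid: the following click in the same session
>>         LEAD ( pageid ) OVER ( PARTITION BY sessionid ORDER BY ts ) AS next_pageid
>>         FROM clicks ) t
>>  WHERE pageid = 50 AND next_pageid = 80
>>
>> The variable-length (B|C)* repetition is what plain window functions
>> cannot express.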
>>
>> Feel free to submit a PR.
>>
>> Kind regards,
>>
>> Herman van Hövell
>>
>> 2016-03-01 15:16 GMT+01:00 Jerry Lam :
>>
>>> Hi Spark developers,
>>>
>>> Will you consider adding support for "Pattern matching in
>>> sequences of rows"? More specifically, I'm referring to this:
>>> http://web.cs.ucla.edu/classes/fall15/cs240A/notes/temporal/row-pattern-recogniton-11.pdf
>>>
>>> This is a very cool/useful feature for pattern matching over live
>>> stream/archived data. It is sort of related to machine learning because
>>> it is usually used in clickstream analysis or path analysis. It is also
>>> related to streaming because of the nature of the processing (mostly time
>>> series data). And it is SQL because there is a good way to express and
>>> optimize the query.
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>
>>
>


Re: Dropping support for earlier Hadoop versions in Spark 2.0?

2015-11-19 Thread Henri Dubois-Ferriere
+1

On 19 November 2015 at 14:14, Reynold Xin  wrote:

> I proposed dropping support for Hadoop 1.x in the Spark 2.0 email, and I
> think everybody is for that.
>
> https://issues.apache.org/jira/browse/SPARK-11807
>
> Sean suggested also dropping support for Hadoop 2.2, 2.3, and 2.4. That is
> to say, keep only Hadoop 2.6 and greater.
>
> What are the community's thoughts on that?
>
>