[ 
https://issues.apache.org/jira/browse/SPARK-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14615911#comment-14615911
 ] 

Xiangrui Meng commented on SPARK-6487:
--------------------------------------

Had an offline discussion with [~Zhang JiaJin]:

1. SPADE vs. PrefixSpan. Both are popular methods for sequential pattern 
mining. Each has pros and cons. In some work, we saw PrefixSpan outperforms 
SPADE. However, it was on small dataset. We are gonna implement PrefixSpan. If 
SPADE is really needed for some important use cases, we can add it later.

2. Parallelizing PrefixSpan. The first step in PrefixSpan is to find frequent 
items, which is the same as the first step of FP-growth. Based on the number of 
frequent items, we can do one of the following:
  a) for each transaction and each frequent item in this transaction, emit 
(item, items following) as key-value pairs; generate frequent sequences locally 
for each key
  b) generate candidate sets of higher orders using Apriori-like algorithms 
until having enough number of candidates, and then a).

3. Support non-temporal sequences.

The first PR would implement 2a. 2b and 3 will be follow-up PRs.

> Add sequential pattern mining algorithm to Spark MLlib
> ------------------------------------------------------
>
>                 Key: SPARK-6487
>                 URL: https://issues.apache.org/jira/browse/SPARK-6487
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Zhang JiaJin
>            Assignee: Zhang JiaJin
>
> [~mengxr] [~zhangyouhua]
> Sequential pattern mining is an important branch in the pattern mining. In 
> the past the actual work, we use the sequence mining (mainly PrefixSpan 
> algorithm) to find the telecommunication signaling sequence pattern, achieved 
> good results. But once the data is too large, the operation time is too long, 
> even can not meet the the service requirements. We are ready to implement the 
> PrefixSpan algorithm in spark, and applied to our subsequent work. 
> The related Paper: 
> PrefixSpan: 
> Pei, Jian, et al. "Mining sequential patterns by pattern-growth: The 
> prefixspan approach." Knowledge and Data Engineering, IEEE Transactions on 
> 16.11 (2004): 1424-1440.
> Parallel Algorithm: 
> Cong, Shengnan, Jiawei Han, and David Padua. "Parallel mining of closed 
> sequential patterns." Proceedings of the eleventh ACM SIGKDD international 
> conference on Knowledge discovery in data mining. ACM, 2005.
> Distributed Algorithm: 
> Wei, Yong-qing, Dong Liu, and Lin-shan Duan. "Distributed PrefixSpan 
> algorithm based on MapReduce." Information Technology in Medicine and 
> Education (ITME), 2012 International Symposium on. Vol. 2. IEEE, 2012.
> Pattern mining and sequential mining Knowledge: 
> Han, Jiawei, et al. "Frequent pattern mining: current status and future 
> directions." Data Mining and Knowledge Discovery 15.1 (2007): 55-86.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to