[ 
https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-20114:
-------------------------------
    Description: 
Creating this JIRA to track feature parity for PrefixSpan and sequential 
pattern mining in spark.ml with the DataFrame API. 

First, a few design issues to discuss are listed below; subtasks for the 
Scala, Python and R APIs will be created afterwards.

# Wrapping the MLlib PrefixSpan and providing a generic fit() should be 
straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
which are not well suited to predicting directly on new records. Please 
read 
http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
 for some background. Thanks to Philippe Fournier-Viger for providing 
insights. If we want to keep using the Estimator/Transformer pattern, the 
options are:
     #*  Implement a dummy transform for PrefixSpanModel, which will not add a 
new column to the input Dataset. The PrefixSpanModel is then only used to 
provide access to the frequent sequential patterns.
     #*  Add a feature to extract sequential rules from the sequential 
patterns, then use those rules in transform, as FPGrowthModel does. The 
extracted rules are of the form X -> Y where X and Y are sequential patterns. 
In practice, however, these rules do not work very well because they are too 
precise and thus not noise tolerant.
#  Unlike the case of association rules and frequent itemsets, sequential 
rules can be extracted from the original dataset more efficiently using 
algorithms like RuleGrowth and ERMiner. These rules have the form X -> Y where 
X and Y are each unordered, but X must appear before Y; this form is more 
general and can work better in practice for prediction. 
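
To make the difference between the two rule types concrete, here is a rough 
sketch in plain Python (not Spark code, and not RuleGrowth or ERMiner 
themselves). It only evaluates a given rule by brute force on a toy dataset. 
The function names, the toy data and the representation of a sequence as a 
list of itemsets are all assumptions made for illustration.

```python
from typing import FrozenSet, List, Tuple

Itemset = FrozenSet[str]
Sequence_ = List[Itemset]  # one sequence = ordered list of itemsets


def contains_pattern(seq: Sequence_, pattern: List[Itemset]) -> bool:
    """Strict sequential-pattern match: the itemsets of `pattern` must be
    contained, in order, in distinct itemsets of `seq`."""
    pos = 0
    for itemset in seq:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)


def matches_rule(seq: Sequence_, x: Itemset, y: Itemset) -> bool:
    """Partially ordered rule X -> Y: all items of X occur (in any order)
    in some prefix of `seq`, and all items of Y occur strictly after it."""
    seen = set()
    for i, itemset in enumerate(seq):
        seen |= itemset
        if x <= seen:
            # Earliest prefix completing X leaves the largest suffix for Y.
            rest = set().union(*seq[i + 1:])
            return y <= rest
    return False


def support_and_confidence(
    db: List[Sequence_], x: Itemset, y: Itemset
) -> Tuple[float, float]:
    """Support = share of sequences matching X -> Y.
    Confidence = matching sequences / sequences containing all of X."""
    n_rule = sum(matches_rule(s, x, y) for s in db)
    n_x = sum(x <= set().union(*s) for s in db)
    return n_rule / len(db), (n_rule / n_x if n_x else 0.0)


def fs(*items: str) -> Itemset:
    return frozenset(items)


# Toy data (hypothetical): four sequences over items a, b, c.
DB = [
    [fs("a"), fs("b"), fs("c")],
    [fs("b"), fs("a"), fs("c")],
    [fs("a", "b"), fs("c")],
    [fs("c"), fs("a"), fs("b")],
]
X, Y = fs("a", "b"), fs("c")

n_strict = sum(contains_pattern(s, [fs("a"), fs("b"), fs("c")]) for s in DB)
print("strict pattern <a, b, c> matches:", n_strict, "of", len(DB))
print("rule {a, b} -> {c} (support, confidence):",
      support_and_confidence(DB, X, Y))
```

On this toy data the strict pattern <a, b, c> is found in only the first 
sequence, while the rule {a, b} -> {c} also covers the second and third, 
where a and b are swapped or occur together. That is the noise tolerance 
referred to above.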

I'd like to hear more from users about which kind of sequential rules is 
more practical. 


  was:
Creating this jira to track the feature parity for PrefixSpan and sequential 
pattern mining in Spark ml with DataFrame API. 

First list a few design issues to be discussed, then subtasks like Scala, 
Python and R API will be created.

# Wrapping the MLlib PrefixSpan and provide a generic fit() should be 
straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
which is not good to be used directly for predicting on new records. Please 
read  
http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
 for some background knowledge. Thanks Philippe Fournier-Viger for providing 
insights. If we want to keep using the Estimator/Transformer pattern, options 
are:
     #*  Implement a dummy transform for PrefixSpanModel, which will not add 
new column to the input DataSet. 
     #*  Adding the feature to extract sequential rules from sequential 
patterns. Then use the sequential rules in the transform as FPGrowthModel.  The 
rules extracted are of the form X–> Y where X and Y are sequential patterns. 
But in practice, these rules are not very good as they are too precise and thus 
not noise tolerant.
#  Different from association rules and frequent itemsets, sequential rules can 
be extracted from the original dataset more efficiently using algorithms like 
RuleGrowth, ERMiner. The rules are X–> Y where X is unordered and Y is 
unordered, but X must appear before Y, which is more general and can work 
better in practice for prediction. 

I'd like to hear more from the users to see which kind of Sequential rules are 
more practical. 



> spark.ml parity for sequential pattern mining - PrefixSpan
> ----------------------------------------------------------
>
>                 Key: SPARK-20114
>                 URL: https://issues.apache.org/jira/browse/SPARK-20114
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: yuhao yang



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

