[ 
https://issues.apache.org/jira/browse/SPARK-12163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12163:
------------------------------
      Priority: Minor  (was: Major)
    Issue Type: Improvement  (was: Bug)

> FPGrowth unusable on some datasets without extensive tweaking of the support 
> threshold
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-12163
>                 URL: https://issues.apache.org/jira/browse/SPARK-12163
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Jaroslav Kuchar
>            Priority: Minor
>
> This problem occurs on standard UCI machine learning datasets. 
> Details for the "audiology" dataset follow: it contains only 226 transactions 
> and 70 attributes. Mining frequent itemsets with a support threshold of 0.95 
> produces 73,162,705 itemsets; with support 0.94, 366,880,771 itemsets.
> More details about experiment: 
> https://gist.github.com/jaroslav-kuchar/edbcbe72c5a884136db1
> The number of generated itemsets grows rapidly with the number of unique 
> items per transaction. Given this combinatorial explosion, many settings of 
> the support threshold trigger CPU-intensive, long-running tasks, and the 
> extensive threshold tweaking needed to find a workable value makes the 
> FPGrowth implementation impractical even on small datasets.
> It would be useful to implement additional stopping criteria to control the 
> explosion of the itemset count in FPGrowth. We propose an optional limit on 
> the maximum number of generated itemsets or on the maximum number of items 
> per itemset.
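The scale of the explosion, and why a per-itemset length cap helps, can be sketched with plain combinatorics (a toy illustration only, not the Spark MLlib API; the itemset size 26 is a hypothetical choice that lands in the same order of magnitude as the counts reported above):

```python
from math import comb

# If a single maximal frequent itemset has k items, every non-empty subset
# is also frequent, so FPGrowth must emit at least 2**k - 1 itemsets.
def n_frequent_subsets(k):
    return 2 ** k - 1

# The proposed cap on items per itemset bounds the output instead by the
# number of subsets of size 1..max_len: sum of C(k, m) for m in that range.
def n_capped_subsets(k, max_len):
    return sum(comb(k, m) for m in range(1, max_len + 1))

print(n_frequent_subsets(26))      # 67,108,863 -- tens of millions, as observed
print(n_capped_subsets(26, 3))     # 2,951 -- a length cap of 3 tames the output
```

A cap on the total number of generated itemsets would bound the work similarly, trading completeness of the result for a predictable runtime.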



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
