[ https://issues.apache.org/jira/browse/SPARK-12163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-12163:
------------------------------
    Priority: Minor  (was: Major)
  Issue Type: Improvement  (was: Bug)

> FPGrowth unusable on some datasets without extensive tweaking of the support
> threshold
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-12163
>                 URL: https://issues.apache.org/jira/browse/SPARK-12163
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Jaroslav Kuchar
>            Priority: Minor
>
> This problem occurs on standard machine-learning UCI datasets.
> Details for the "audiology" dataset follow: it contains only 226 transactions
> and 70 attributes. Mining frequent itemsets with a support threshold of 0.95
> produces 73,162,705 itemsets; with support 0.94, 366,880,771 itemsets.
> More details about the experiment:
> https://gist.github.com/jaroslav-kuchar/edbcbe72c5a884136db1
> The number of generated itemsets grows rapidly with the number of unique
> items in the transactions. Given this combinatorial explosion, many settings
> of the support threshold trigger CPU-intensive, long-running tasks. The
> extensive tweaking required makes the FPGrowth implementation impractical
> even for small datasets.
> It would be useful to implement additional stopping criteria to control the
> explosion of the itemset count in FPGrowth. We propose adding an optional
> limit on the maximum number of generated itemsets or on the maximum number
> of items per itemset.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
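The combinatorial explosion described in the issue, and the effect of the proposed itemset-size cap, can be illustrated with a toy brute-force enumeration. This is a hypothetical sketch in plain Python, not Spark/MLlib code; `frequent_itemsets` and its `max_len` parameter are illustrative names, not existing API:

```python
# Toy illustration (not Spark code) of why frequent-itemset counts explode:
# brute-force enumeration over a tiny synthetic dataset, with an optional
# cap on itemset size -- the kind of stopping criterion proposed above.
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_len=None):
    """Return all itemsets with support >= min_support.

    max_len, if given, bounds the number of items per itemset
    (a hypothetical stopping criterion, not an MLlib parameter).
    """
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    top = max_len if max_len is not None else len(items)
    result = []
    for k in range(1, top + 1):
        found_any = False
        for combo in combinations(items, k):
            s = set(combo)
            support = sum(1 for t in transactions if s <= t) / n
            if support >= min_support:
                result.append(combo)
                found_any = True
        if not found_any:
            # Antimonotonicity: no frequent k-itemset means
            # no frequent (k+1)-itemset can exist either.
            break
    return result

# Dense transactions: 9 of 10 share the same 12 items, so every
# nonempty subset of those 12 items reaches support 0.9.
txns = [set(range(12)) for _ in range(9)] + [{0, 1, 2}]
print(len(frequent_itemsets(txns, 0.9)))              # 2**12 - 1 = 4095 itemsets
print(len(frequent_itemsets(txns, 0.9, max_len=2)))   # capped: 12 + C(12,2) = 78
```

With only 12 items the uncapped count already reaches 4095; at the 70 attributes of the "audiology" dataset the same effect yields the hundreds of millions of itemsets reported above, which is why a cap on itemset size or total count bounds the work regardless of how the support threshold is tuned.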