Jaroslav Kuchar created SPARK-12163:
---------------------------------------
             Summary: FPGrowth unusable on some datasets without extensive tweaking of the support threshold
                 Key: SPARK-12163
                 URL: https://issues.apache.org/jira/browse/SPARK-12163
             Project: Spark
          Issue Type: Bug
          Components: MLlib
            Reporter: Jaroslav Kuchar

This problem occurs on standard UCI machine learning datasets. Details for the "audiology" dataset follow: it contains only 226 transactions and 70 attributes. Mining frequent itemsets with a support threshold of 0.95 produces 73,162,705 itemsets; with a support of 0.94, 366,880,771 itemsets. More details about the experiment: https://gist.github.com/jaroslav-kuchar/edbcbe72c5a884136db1

The number of generated itemsets grows rapidly with the number of unique items in the transactions. Given this combinatorial explosion, FPGrowth can spawn CPU-intensive, long-running tasks across a wide range of support-threshold settings. The extensive tweaking of the support threshold this forces on the user makes the FPGrowth implementation unusable even for small datasets. It would be useful to implement additional stopping criteria to control the explosion of the itemset count in FPGrowth. We propose adding an optional limit on the maximum number of generated itemsets or on the maximum number of items per itemset (see the sketches below).
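For reference, a minimal sketch of the kind of run that produces these counts, against the current MLlib FPGrowth API. The dataset path and input format here are assumptions; the gist above has the actual experiment setup.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.FPGrowth

val conf = new SparkConf().setAppName("fpgrowth-audiology")
val sc = new SparkContext(conf)

// One transaction per line, items separated by spaces (assumed format).
val transactions = sc.textFile("audiology.txt").map(_.trim.split(' '))

val model = new FPGrowth()
  .setMinSupport(0.95)   // even this high threshold yields ~73M itemsets
  .setNumPartitions(10)
  .run(transactions)

// Merely counting the frequent itemsets is already a long-running job.
println(model.freqItemsets.count())
{code}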
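And a sketch of what the proposed optional limits could look like on the existing builder-style API. The two setters below are hypothetical and do not exist in MLlib; they only illustrate the proposal.

{code:scala}
// HYPOTHETICAL API -- illustrates the proposed stopping criteria only.
val model = new FPGrowth()
  .setMinSupport(0.95)
  .setMaxItemsetSize(5)        // hypothetical: do not grow itemsets beyond 5 items
  .setMaxNumItemsets(1000000L) // hypothetical: stop once 1M itemsets have been generated
  .run(transactions)
{code}

With defaults of "no limit", both criteria would be opt-in and leave existing behavior unchanged.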