Jaroslav Kuchar created SPARK-12163:
---------------------------------------

             Summary: FPGrowth unusable on some datasets without extensive 
tweaking of the support threshold
                 Key: SPARK-12163
                 URL: https://issues.apache.org/jira/browse/SPARK-12163
             Project: Spark
          Issue Type: Bug
          Components: MLlib
            Reporter: Jaroslav Kuchar


This problem occurs on standard UCI machine learning datasets. 
Details for the "audiology" dataset follow: it contains only 226 transactions and 
70 attributes. Mining frequent itemsets with a support threshold of 0.95 
produces 73,162,705 itemsets; with a support threshold of 0.94, 366,880,771 itemsets.
More details about experiment: 
https://gist.github.com/jaroslav-kuchar/edbcbe72c5a884136db1

The number of generated itemsets grows rapidly with the number of unique items 
in the transactions. Because of this combinatorial explosion, many settings of 
the support threshold trigger CPU-intensive, long-running tasks. The extensive 
tuning of the support threshold this requires makes the FPGrowth implementation 
impractical even for small datasets.
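The growth is easy to see from first principles: every frequent itemset of size n 
contributes all 2^n - 1 of its non-empty subsets to the output, so the result size 
is exponential in the width of the widest frequent itemset. A minimal 
back-of-the-envelope sketch (plain Python, no Spark required):

```python
# Each frequent itemset of size n contributes 2**n - 1 non-empty subsets
# to the output, so the itemset count is exponential in itemset width.
def subset_count(n):
    return 2 ** n - 1

for n in (10, 20, 70):
    print(n, subset_count(n))
# A dataset with transactions ~70 items wide (as in "audiology") can thus
# yield on the order of 2**70 itemsets in the worst case.
```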

It would be useful to implement additional stopping criteria in FPGrowth to 
control the explosion in the number of itemsets. We propose an optional limit on 
the maximum number of generated itemsets, or on the maximum number of items per 
itemset.



