[
https://issues.apache.org/jira/browse/SPARK-12163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-12163:
---------------------------------
Labels: bulk-closed (was: )
> FPGrowth unusable on some datasets without extensive tweaking of the support
> threshold
> --------------------------------------------------------------------------------------
>
> Key: SPARK-12163
> URL: https://issues.apache.org/jira/browse/SPARK-12163
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Jaroslav Kuchar
> Priority: Minor
> Labels: bulk-closed
>
> This problem occurs on standard machine learning UCI datasets.
> Details for the "audiology" dataset follow: it contains only 226 transactions
> and 70 attributes. Mining frequent itemsets with a support threshold of 0.95
> produces 73,162,705 itemsets; with a threshold of 0.94, 366,880,771 itemsets.
> More details about experiment:
> https://gist.github.com/jaroslav-kuchar/edbcbe72c5a884136db1
> The number of generated itemsets grows rapidly with the number of unique
> items per transaction. Given this combinatorial explosion, many settings of
> the support threshold trigger CPU-intensive, long-running tasks. The
> extensive tweaking of the support threshold this requires makes the FPGrowth
> implementation impractical even for small datasets.
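A minimal sketch (illustration only, not Spark code) of why the itemset count explodes: whenever m items all co-occur in at least minSupport of the transactions, every non-empty subset of those m items is itself frequent, so those items alone contribute 2^m - 1 frequent itemsets.

```python
# Sketch: combinatorial explosion of frequent itemsets.
# If m items all co-occur in at least min_support of the transactions,
# every non-empty subset of those m items is itself frequent,
# contributing 2**m - 1 itemsets on its own.

def itemsets_from(m: int) -> int:
    """Number of non-empty subsets of m mutually co-frequent items."""
    return 2 ** m - 1

for m in (10, 20, 27):
    print(f"{m} co-occurring items -> {itemsets_from(m):,} itemsets")
```

Around 27 mutually co-frequent items already yield over 134 million itemsets, the same order of magnitude as the counts reported above for the audiology dataset.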
> It would be useful to implement additional stopping criteria to control the
> explosion of the itemset count in FPGrowth. We propose adding an optional
> limit on the maximum number of generated itemsets or on the maximum number
> of items per itemset.
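A toy level-wise miner (illustration only, not Spark's FP-tree algorithm) showing how the proposed per-itemset size limit would bound the search; the `max_items` parameter here is the hypothetical stopping criterion from this report, not an existing FPGrowth option.

```python
def frequent_itemsets(transactions, min_support, max_items=None):
    """Toy level-wise frequent-itemset miner.

    `max_items` is the hypothetical cap proposed in this report: stop
    growing candidate itemsets once they reach that many items. Spark's
    FPGrowth has no such parameter; this is illustrative pseudocode.
    """
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    results = {}
    k, level = 1, [frozenset([i]) for i in items]
    while level and (max_items is None or k <= max_items):
        survivors = []
        for cand in level:
            support = sum(cand <= t for t in transactions) / n
            if support >= min_support:
                results[cand] = support
                survivors.append(cand)
        # Grow (k+1)-item candidates from the surviving k-itemsets.
        level = list({a | b for a in survivors for b in survivors
                      if len(a | b) == k + 1})
        k += 1
    return results

tx = [frozenset("abc"), frozenset("ab"), frozenset("abcd")]
print(frequent_itemsets(tx, 1.0))               # {a}, {b}, {a,b}
print(frequent_itemsets(tx, 1.0, max_items=1))  # {a}, {b} only
```

With `max_items` set, the while-loop exits before enumerating larger itemsets, so the output size is bounded regardless of how low the support threshold is.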
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]