[
https://issues.apache.org/jira/browse/SPARK-38037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17490654#comment-17490654
]
zhengruifeng commented on SPARK-38037:
--
I can reproduce it by:
{code:java}
import org.apache.spark.ml.fpm._
val seq = Seq.range(0, 40)
val dataset = spark.createDataset(Seq(seq, seq, seq)).toDF("items")
val fp = new FPGrowth().setItemsCol("items").setMinSupport(0.9)
val model = fp.fit(dataset)
model.freqItemsets.count
{code}
The result contains every combination of length 1 through 40 (2^40 - 1
itemsets), which is not computable in practice.
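To put a number on that, here is a small sketch of the count for n items that all co-occur in every transaction. The helper name itemsetCount is only illustrative and is not part of Spark:

```scala
// Number of non-empty subsets of n items: 2^n - 1.
// `itemsetCount` is a hypothetical helper, not a Spark API.
def itemsetCount(n: Int): BigInt = BigInt(2).pow(n) - 1

Seq(10, 25, 40).foreach { n =>
  println(s"n = $n -> ${itemsetCount(n)} frequent itemsets")
}
// n = 10 -> 1023 frequent itemsets
// n = 25 -> 33554431 frequent itemsets
// n = 40 -> 1099511627775 frequent itemsets
```

At roughly 1.1 * 10^12 itemsets for n = 40, the output alone is too large to enumerate, whatever the algorithm.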
According to jackylk's comments in
https://issues.apache.org/jira/browse/SPARK-4001: "But in case of smaller data
and if frequent itemset is more, Apriori is more efficient. This is because
FP-Growth need to construct a FP Tree out of the input data set, and it needs
some time."
I guess this is because FPGrowth needs to build an FP-tree with too many
frequent items.
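One way to check that the blow-up comes from the number of frequent itemsets rather than from the input size is to rerun the same reproduction with fewer items. This is only a sketch: the helper countItemsets and the values of n are my own choices, and it assumes a local SparkSession (in spark-shell, spark already exists):

```scala
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

// Only needed outside spark-shell, where `spark` is predefined.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("fpgrowth-small-repro")
  .getOrCreate()
import spark.implicits._

// Hypothetical helper: count the frequent itemsets found for
// 3 identical transactions of n items at minSupport 0.9.
def countItemsets(n: Int): Long = {
  val seq = Seq.range(0, n)
  val dataset = spark.createDataset(Seq(seq, seq, seq)).toDF("items")
  val model = new FPGrowth().setItemsCol("items").setMinSupport(0.9).fit(dataset)
  model.freqItemsets.count()
}

Seq(5, 10, 15).foreach { n =>
  println(s"n = $n -> ${countItemsets(n)} itemsets")
}
```

Each count should equal 2^n - 1, since every non-empty subset of the n items appears in all 3 transactions.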
I have one question:
is a frequent itemset this long (length >= 40) useful in practice?
> Spark MLlib FPGrowth not working with 40+ items in Frequent Item set
>
>
> Key: SPARK-38037
> URL: https://issues.apache.org/jira/browse/SPARK-38037
> Project: Spark
> Issue Type: Bug
> Components: ML
>Affects Versions: 3.2.0
> Environment: Standalone Linux server
> 32 GB RAM
> 4 core
>
>Reporter: RJ
>Priority: Major
>
> We have been using Spark FPGrowth and it works well with millions of
> transactions (records) when the number of items in a frequent itemset is
> less than 25. Beyond 25 it runs into a computational limit. For 40+ items in
> a frequent itemset the process never returns.
> To reproduce, you can create a simple data set of 3 transactions with
> identical items (40 of them) and run FPGrowth with 0.9 support; the process
> never completes. Below is the sample data I used to narrow down the problem:
> |I1|I2|I3|I4|I5|I6|I7|I8|I9|I10|I11|I12|I13|I14|I15|I16|I17|I18|I19|I20|I21|I22|I23|I24|I25|I26|I27|I28|I29|I30|I31|I32|I33|I34|I35|I36|I37|I38|I39|I40|
> |I1|I2|I3|I4|I5|I6|I7|I8|I9|I10|I11|I12|I13|I14|I15|I16|I17|I18|I19|I20|I21|I22|I23|I24|I25|I26|I27|I28|I29|I30|I31|I32|I33|I34|I35|I36|I37|I38|I39|I40|
> |I1|I2|I3|I4|I5|I6|I7|I8|I9|I10|I11|I12|I13|I14|I15|I16|I17|I18|I19|I20|I21|I22|I23|I24|I25|I26|I27|I28|I29|I30|I31|I32|I33|I34|I35|I36|I37|I38|I39|I40|
>
> While the computation grows exponentially (2^n - 1 itemsets for n items in a
> frequent itemset), it surely should be able to handle 40 or more items in a
> frequent itemset.
>
> Is this a limitation of the FPGrowth implementation, or are there any tuning
> parameters that I am missing? Thank you.