[jira] [Commented] (SPARK-38037) Spark MLlib FPGrowth not working with 40+ items in Frequent Item set

2022-02-10 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17490654#comment-17490654
 ] 

zhengruifeng commented on SPARK-38037:
--

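I can reproduce it with:
{code:scala}
import org.apache.spark.ml.fpm._

// Three identical transactions, each containing the same 40 items.
val seq = Seq.range(0, 40)
val dataset = spark.createDataset(Seq(seq, seq, seq)).toDF("items")

// Every item occurs in every transaction, so every subset meets minSupport = 0.9.
val fp = new FPGrowth().setItemsCol("items").setMinSupport(0.9)
val model = fp.fit(dataset)

// Counting the itemsets forces the full result to be computed.
model.freqItemsets.count
{code}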
 

The result contains every combination of the 40 items, from length 1 to 
length 40, which is not feasible to compute.
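
To put a number on that: a set of n items has 2^n - 1 non-empty subsets, and in this dataset every one of them is frequent. A minimal sketch of the arithmetic in plain Scala (the helper name nonEmptySubsets is made up for illustration):
{code:scala}
// Number of non-empty subsets of a set with n items: 2^n - 1.
// BigInt avoids overflow for larger n.
def nonEmptySubsets(n: Int): BigInt = BigInt(2).pow(n) - 1

println(nonEmptySubsets(25)) // 33554431
println(nonEmptySubsets(40)) // 1099511627775
{code}
Going from 25 to 40 items multiplies the number of itemsets by 2^15 = 32768.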

 

According to jackylk's comment in 
https://issues.apache.org/jira/browse/SPARK-4001: "But in case of smaller data 
and if frequent itemset is more, Apriori is more efficient. This is because 
FP-Growth need to construct a FP Tree out of the input data set, and it needs 
some time."

 

I guess this is because FPGrowth needs to build an FP-tree with too many 
frequent items.
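
One way to probe this guess is to rerun the same reproduction with fewer items and watch how the result size grows. The sketch below assumes Spark is on the classpath (for example in spark-shell); the countItemsets helper, the app name, and the sizes 5, 10 and 15 are illustrative and not part of the original report. With identical transactions every non-empty subset of the n items should be frequent, so the count is expected to match 2^n - 1.
{code:scala}
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

// Reuses the existing session in spark-shell, or starts a local one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("fpgrowth-blowup") // hypothetical app name
  .getOrCreate()
import spark.implicits._

// Same setup as the reproduction, with n items instead of 40.
def countItemsets(n: Int): Long = {
  val items = Seq.range(0, n)
  val dataset = spark.createDataset(Seq(items, items, items)).toDF("items")
  val model = new FPGrowth().setItemsCol("items").setMinSupport(0.9).fit(dataset)
  model.freqItemsets.count()
}

// Compare the observed count with 2^n - 1 for a few small sizes.
Seq(5, 10, 15).foreach { n =>
  val expected = (1L << n) - 1
  println(s"n=$n itemsets=${countItemsets(n)} expected=$expected")
}
{code}
If the counts track 2^n - 1, the size of the output alone is enough to explain why the 40-item case does not finish, whatever the cost of building the tree.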

 

I have one question:

If a frequent itemset is long (length >= 40), is it useful in practice?

 

 

> Spark MLlib FPGrowth not working with 40+ items in Frequent Item set
> 
>
> Key: SPARK-38037
> URL: https://issues.apache.org/jira/browse/SPARK-38037
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 3.2.0
> Environment: Standalone Linux server
> 32 GB RAM
> 4 core
>
> Reporter: RJ
> Priority: Major
>
> We have been using Spark FPGrowth and it works well with millions of 
> transactions (records) when the number of frequent items in the Frequent 
> Itemset is less than 25. Beyond 25 it runs into a computational limit. For 
> 40+ items in the Frequent Itemset the process never returns.
> To reproduce, you can create a simple data set of 3 transactions with 
> identical items (40 of them) and run FPGrowth with 0.9 support; the process 
> never completes. Below is the sample data I used to narrow down the problem:
> |I1|I2|I3|I4|I5|I6|I7|I8|I9|I10|I11|I12|I13|I14|I15|I16|I17|I18|I19|I20|I21|I22|I23|I24|I25|I26|I27|I28|I29|I30|I31|I32|I33|I34|I35|I36|I37|I38|I39|I40|
> |I1|I2|I3|I4|I5|I6|I7|I8|I9|I10|I11|I12|I13|I14|I15|I16|I17|I18|I19|I20|I21|I22|I23|I24|I25|I26|I27|I28|I29|I30|I31|I32|I33|I34|I35|I36|I37|I38|I39|I40|
> |I1|I2|I3|I4|I5|I6|I7|I8|I9|I10|I11|I12|I13|I14|I15|I16|I17|I18|I19|I20|I21|I22|I23|I24|I25|I26|I27|I28|I29|I30|I31|I32|I33|I34|I35|I36|I37|I38|I39|I40|
>  
> While the computation grows as 2^n - 1 in the number of items n in the 
> Frequent Itemset, it surely should be able to handle 40 or more items in a 
> Frequent Itemset.
>  
> Is this an FPGrowth implementation limitation, or are there any tuning 
> parameters that I am missing? Thank you.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38037) Spark MLlib FPGrowth not working with 40+ items in Frequent Item set

2022-02-08 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489225#comment-17489225
 ] 

zhengruifeng commented on SPARK-38037:
--

Could you please provide a simple script to reproduce this issue?
