[jira] [Updated] (SPARK-38037) Spark MLlib FPGrowth not working with 40+ items in Frequent Item set

RJ (Jira) Sat, 29 Jan 2022 06:39:04 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-38037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


RJ updated SPARK-38037:
-----------------------
    Description: 
We have been using Spark FPGrowth, and it works well with millions of 
transactions (records) when a frequent itemset contains fewer than 25 items. 
Beyond 25 items it runs into a computational limit, and for 40+ items in a 
frequent itemset the process never returns.

To reproduce, create a simple data set of 3 transactions with identical 
items (40 of them) and run FPGrowth with a minimum support of 0.9; the process 
never completes. Below is the sample data I used to narrow down the problem:
|I1|I2|I3|I4|I5|I6|I7|I8|I9|I10|I11|I12|I13|I14|I15|I16|I17|I18|I19|I20|I21|I22|I23|I24|I25|I26|I27|I28|I29|I30|I31|I32|I33|I34|I35|I36|I37|I38|I39|I40|
|I1|I2|I3|I4|I5|I6|I7|I8|I9|I10|I11|I12|I13|I14|I15|I16|I17|I18|I19|I20|I21|I22|I23|I24|I25|I26|I27|I28|I29|I30|I31|I32|I33|I34|I35|I36|I37|I38|I39|I40|
|I1|I2|I3|I4|I5|I6|I7|I8|I9|I10|I11|I12|I13|I14|I15|I16|I17|I18|I19|I20|I21|I22|I23|I24|I25|I26|I27|I28|I29|I30|I31|I32|I33|I34|I35|I36|I37|I38|I39|I40|
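
The reproduction steps above can be sketched in PySpark. This is a minimal sketch under assumptions, not the reporter's actual code: the report does not say which API or language was used, so the DataFrame-based `pyspark.ml.fpm.FPGrowth` is assumed, and the helper names `build_transactions` and `run_fpgrowth` are hypothetical.

```python
def build_transactions(n_items=40, n_transactions=3):
    """Return n_transactions identical transactions holding items I1..I<n_items>."""
    items = [f"I{i}" for i in range(1, n_items + 1)]
    return [list(items) for _ in range(n_transactions)]


def run_fpgrowth(transactions, min_support=0.9):
    """Fit FPGrowth on the transactions and return the number of frequent itemsets.

    Assumption: the DataFrame-based API (pyspark.ml.fpm) is used; the original
    report does not state which API was involved.
    """
    # Imported lazily so build_transactions works without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.ml.fpm import FPGrowth

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("SPARK-38037-repro")
        .getOrCreate()
    )
    try:
        # One row per transaction, with a single array<string> column.
        df = spark.createDataFrame([(t,) for t in transactions], ["items"])
        model = FPGrowth(itemsCol="items", minSupport=min_support).fit(df)
        return model.freqItemsets.count()
    finally:
        spark.stop()
```

Calling `run_fpgrowth(build_transactions(n_items=10))` keeps the run short; raising `n_items` toward 40 approaches the case described in this report and may not finish.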

 

While the number of itemsets to compute grows as 2^n - 1 for n items in a 
frequent itemset, FPGrowth surely should be able to handle 40 or more items 
in a frequent itemset.
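
To put the 2^n - 1 figure in concrete terms: in the sample data every item occurs in every transaction, so every non-empty subset of the items has support 1.0 and clears the 0.9 threshold. The sketch below uses only the Python standard library; the function names are hypothetical. It checks the formula by brute force on a 4-item version of the data and then evaluates it for 25 and 40 items.

```python
from itertools import combinations


def frequent_itemset_count(n_items):
    """Number of non-empty subsets of n_items items, i.e. 2^n - 1."""
    return 2 ** n_items - 1


def brute_force_count(transactions, min_support):
    """Count frequent itemsets by enumerating every candidate subset."""
    items = sorted({item for t in transactions for item in t})
    baskets = [set(t) for t in transactions]
    count = 0
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            hits = sum(1 for basket in baskets if set(candidate) <= basket)
            if hits / len(baskets) >= min_support:
                count += 1
    return count


# Small version of the sample data: 3 identical transactions of 4 items.
small = [[f"I{i}" for i in range(1, 5)] for _ in range(3)]
assert brute_force_count(small, 0.9) == frequent_itemset_count(4)  # 15

for n in (25, 40):
    print(n, frequent_itemset_count(n))
# 25 33554431
# 40 1099511627775
```

With 40 identical items the result set alone holds about 1.1 trillion itemsets, regardless of how few transactions there are.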

 

Is this an FPGrowth implementation limitation, or are there any tuning 
parameters that I am missing? Thank you.



> Spark MLlib FPGrowth not working with 40+ items in Frequent Item set
> --------------------------------------------------------------------
>
>                 Key: SPARK-38037
>                 URL: https://issues.apache.org/jira/browse/SPARK-38037
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 3.2.0
>         Environment: Standalone Linux server
> 32 GB RAM
> 4 core
>  
>            Reporter: RJ
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
