Folks: We are running into a problem where FPGrowth seems to choke on data sets
that we think are not too large. We have about 200,000 transactions, each
composed of about 50 items on average, drawn from roughly 17,000 unique
items (SKUs) that might show up in any transaction.
When running locally with 12 GB of RAM given to the PySpark process, the FPGrowth
code fails with an out-of-memory error at a minSupport of 0.001. The failure occurs
when we try to enumerate and save the frequent itemsets. Looking at the
FPGrowth code
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala),
it seems this is because the genFreqItems() method tries to collect() all
items. Is there a way the code could be rewritten so that it does not collect(),
and therefore hold, all frequent itemsets in memory?
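For reference, here is the back-of-the-envelope arithmetic for our setup (the
transaction and support numbers are from above; the subset count is only an
illustration of why enumerating itemsets can blow up, not a measurement):

```python
from math import comb

# Numbers from this post.
num_transactions = 200_000
min_support = 0.001

# FP-Growth turns the relative minSupport into an absolute count threshold:
# an itemset must appear in at least this many transactions to be frequent.
min_count = int(min_support * num_transactions)
print(min_count)  # 200 -- a fairly low bar at this scale

# With ~50 items per transaction, even short itemsets are numerous: a single
# 50-item transaction alone contains this many candidate 3-item subsets.
print(comb(50, 3))  # 19600
```

So at minSupport = 0.001 any itemset seen in just 200 of the 200,000
transactions is kept, which may explain why the set of frequent itemsets being
collected is far larger than the 17,000 distinct SKUs would suggest.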
Thanks for any insights.
-Raj