Folks: We are running into a problem where FPGrowth seems to choke on data sets
that we think are not too large. We have about 200,000 transactions, each
composed of about 50 items on average, drawn from roughly 17,000 unique
items (SKUs) that might show up in any transaction.
When running locally with 12 GB of RAM given to the PySpark process, the FPGrowth
code fails with an out-of-memory error at a minSupport of 0.001. The failure occurs
when we try to enumerate and save the frequent itemsets. Looking at the
FPGrowth code
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala),
it seems this is because the genFreqItems() method tries to collect() all
items. Is there a way the code could be rewritten so that it does not collect(),
and therefore hold, all frequent itemsets in memory?
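For reference, here is the back-of-the-envelope arithmetic for our setup (the
transaction and support numbers are from above; the subset count is only an
illustration of why enumerating itemsets can blow up, not a measurement):

```python
from math import comb

# Numbers from this post.
num_transactions = 200_000
min_support = 0.001

# FP-Growth turns the relative minSupport into an absolute count threshold:
# an itemset must appear in at least this many transactions to be frequent.
min_count = int(min_support * num_transactions)
print(min_count)  # 200 -- a fairly low bar at this scale

# With ~50 items per transaction, even short itemsets are numerous: a single
# 50-item transaction alone contains this many candidate 3-item subsets.
print(comb(50, 3))  # 19600
```

So at minSupport = 0.001 any itemset seen in just 200 of the 200,000
transactions is kept, which may explain why the set of frequent itemsets being
collected is far larger than the 17,000 distinct SKUs would suggest.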
Thanks for any insights.
-Raj