Thanks Sean! I'll start with a higher support threshold and work my way down.
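
For reference, here is a rough back-of-the-envelope sketch of the numbers discussed in the thread below. It is plain Python, not Spark code, and the helper names (min_transaction_count, candidate_itemsets) are made up for illustration. It only counts how many itemsets could exist; it says nothing about how many would actually be frequent in my data. Note that 625,000,000 is 25,000 squared (ordered pairs); the count of unordered pairs is about half of that, which does not change the conclusion.

```python
import math


def min_transaction_count(min_support: float, n_transactions: int) -> int:
    """Smallest number of transactions an itemset must appear in
    to reach the given support fraction (hypothetical helper)."""
    return math.ceil(min_support * n_transactions)


def candidate_itemsets(n_items: int, size: int) -> int:
    """Number of distinct unordered itemsets of the given size
    that can be formed from n_items items (hypothetical helper)."""
    return math.comb(n_items, size)


if __name__ == "__main__":
    n_transactions = 200_000  # figure from the thread
    n_items = 25_000          # figure from the thread

    # How many transactions an itemset needs at each support level,
    # starting high and working down as suggested.
    for support in (0.1, 0.01, 0.001):
        count = min_transaction_count(support, n_transactions)
        print(f"minSupport={support}: itemset must appear in >= {count} transactions")

    # Upper bound on the number of candidate itemsets of each size.
    for size in (1, 2, 3):
        print(f"possible {size}-item sets: {candidate_itemsets(n_items, size):,}")
```

With a threshold of only 200 transactions and trillions of possible 3-item sets, the number of itemsets that end up frequent could plausibly be far more than the driver can hold, which matches what I am seeing.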

On Wednesday, January 13, 2016 8:57 AM, Sean Owen <so...@cloudera.com> wrote:

You're looking for subsets of items that appear in at least 200 of
200,000 transactions, which could be a whole lot. Keep in mind there
are 25,000 items, sure, but already 625,000,000 possible pairs of
items, and trillions of possible 3-item subsets. This sounds like it's
just far too low. Start with 0.1 and work down. I don't think there's
a general formula since if each transaction just contained 1 item, no
sets would be frequent, and if every transaction has every item, then
all sets are frequent and that number is indescribably large.

On Wed, Jan 13, 2016 at 4:32 PM, Ritu Raj Tiwari
<rituraj_tiw...@yahoo.com> wrote:
> Hi Sean:
> Thanks for checking out my question here. It's possible I am making a newbie
> error. Based on my dataset of about 200,000 transactions and a minimum
> support level of 0.001, I am looking for items that appear at least 200
> times. Given that the items in my transactions are drawn from a set of about
> 25,000 (I previously thought 17,000), what would be a rational way to
> determine the (peak) memory needs of my driver node?
>
> -Raj
>
>
> On Wednesday, January 13, 2016 1:18 AM, Sean Owen <so...@cloudera.com>
> wrote:
>
> As I said in your JIRA, the collect() in question is bringing results
> back to the driver to return them. The assumption is that there aren't
> a vast number of frequent items. If there are, they aren't 'frequent'
> and your min support is too low.
>
> On Wed, Jan 13, 2016 at 12:43 AM, Ritu Raj Tiwari
> <rituraj_tiw...@yahoo.com.invalid> wrote:
>> Folks:
>> We are running into a problem where FPGrowth seems to choke on data sets
>> that we think are not too large. We have about 200,000 transactions. Each
>> transaction is composed of 50 items on average. There are about 17,000
>> unique items (SKUs) that might show up in any transaction.
>>
>> When running locally with 12G RAM given to the PySpark process, the
>> FPGrowth code fails with an out-of-memory error for minSupport of 0.001.
>> The failure occurs when we try to enumerate and save the frequent itemsets.
>> Looking at the FPGrowth code
>>
>> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala),
>> it seems this is because the genFreqItems() method tries to collect() all
>> items. Is there a way the code could be rewritten so it does not try to
>> collect and therefore store all frequent item sets in memory?
>>
>> Thanks for any insights.
>>
>> -Raj
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org