Thanks Sean! I'll start with a higher support threshold and work my way down.
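
For anyone following along, here is a rough sketch of what "start high and work down" could look like. It is a brute-force counter for toy data only, not FP-growth and not Spark's implementation. The function names, the `max_itemsets` budget, and the sample transactions are all made up for illustration.

```python
from collections import Counter
from itertools import combinations


def count_frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force count of itemsets (1..max_size items) that meet min_support.

    Only practical for tiny inputs; it enumerates every subset of every
    transaction, which is exactly the blow-up discussed in this thread.
    """
    min_count = min_support * len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # de-duplicate and fix an order for the tuples
        for k in range(1, max_size + 1):
            counts.update(combinations(items, k))
    return sum(1 for c in counts.values() if c >= min_count)


def lower_support_until(transactions, thresholds, max_itemsets):
    """Try thresholds from high to low and stop once the result set gets too big.

    Returns (threshold, itemset_count) for the lowest threshold that stayed
    within max_itemsets, or None if even the highest one exceeded it.
    """
    chosen = None
    for s in sorted(thresholds, reverse=True):
        n = count_frequent_itemsets(transactions, s)
        if n > max_itemsets:
            break
        chosen = (s, n)
    return chosen


# Hypothetical toy data: four small baskets.
transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["a", "b", "c", "d"],
]

result = lower_support_until(transactions, [1.0, 0.75, 0.5, 0.25], max_itemsets=10)
print(result)
```

On real data the counting step would be the actual FPGrowth run; the point of the sketch is only the loop that stops lowering the threshold once the number of itemsets passes a budget.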

On Wednesday, January 13, 2016 8:57 AM, Sean Owen wrote:

 You're looking for subsets of items that appear in at least 200 of
200,000 transactions, which could be a whole lot. Keep in mind there
are 25,000 items, sure, but already 625,000,000 possible pairs of
items, and trillions of possible 3-item subsets. A minimum support of
0.001 sounds far too low. Start with 0.1 and work down. I don't think
there's a general formula: if each transaction contained just 1 item,
no multi-item sets would be frequent, and if every transaction had
every item, then all sets would be frequent, and that number is
indescribably large.
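
The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch using the figures quoted in this thread; the variable names are made up. Note that 625,000,000 is 25,000 squared, which counts ordered pairs; counting unordered pairs gives roughly half that, which is still enormous.

```python
import math

n_transactions = 200_000   # from the thread
n_items = 25_000           # from the thread
min_support = 0.001        # from the thread

# Minimum number of transactions an itemset must appear in to be "frequent".
min_count = math.ceil(min_support * n_transactions)

# Number of distinct k-item subsets that could in principle be candidates.
candidates = {k: math.comb(n_items, k) for k in (1, 2, 3)}

print("min_count:", min_count)
for k, c in candidates.items():
    print(f"{k}-item subsets: {c:,}")
```

These are upper bounds on the candidate space, not a prediction of how many itemsets will actually be frequent; that depends on how the items co-occur in the data.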

On Wed, Jan 13, 2016 at 4:32 PM, Ritu Raj Tiwari wrote:
> Hi Sean:
> Thanks for checking out my question here. It's possible I am making a newbie
> error. Based on my dataset of about 200,000 transactions and a minimum
> support level of 0.001, I am looking for items that appear at least 200
> times. Given that the items in my transactions are drawn from a set of about
> 25,000 (I previously thought 17,000), what would be a rational way to
> determine the (peak) memory needs of my driver node?
> -Raj
> On Wednesday, January 13, 2016 1:18 AM, Sean Owen wrote:
> As I said in your JIRA, the collect() in question is bringing results
> back to the driver to return them. The assumption is that there aren't
> a vast number of frequent items. If there are, they aren't 'frequent'
> and your min support is too low.
> On Wed, Jan 13, 2016 at 12:43 AM, Ritu Raj Tiwari wrote:
>> Folks:
>> We are running into a problem where FPGrowth seems to choke on data sets
>> that we think are not too large. We have about 200,000 transactions. Each
>> transaction is composed of 50 items on average. There are about 17,000
>> unique items (SKUs) that might show up in any transaction.
>> When running locally with 12 GB of RAM given to the PySpark process, the
>> FPGrowth code fails with an out-of-memory error for a minSupport of 0.001.
>> The failure occurs when we try to enumerate and save the frequent itemsets.
>> Looking at the FPGrowth code, it seems this is because the genFreqItems()
>> method tries to collect() all items. Is there a way the code could be
>> rewritten so it does not try to collect, and therefore store, all frequent
>> itemsets in memory?
>> Thanks for any insights.
>> -Raj