You're looking for subsets of items that appear in at least 200 of
200,000 transactions, which could be a whole lot of subsets. Keep in
mind there are 25,000 items, sure, but already over 300 million
possible pairs of items, and trillions of possible 3-item subsets. A
min support of 0.001 sounds far too low. Start with 0.1 and work down.
I don't think there's a general formula, since if each transaction
contained just 1 item, no sets would be frequent, and if every
transaction contained every item, then all sets would be frequent and
that number is astronomically large.
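
If it helps, here's a rough sketch of what I mean, in Scala against the
MLlib API (the transactions RDD, the numPartitions value and the support
values are placeholders you'd swap for your own):

import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// transactions: your RDD of item arrays, e.g. RDD[Array[String]]
def sweepMinSupport(transactions: RDD[Array[String]]): Unit = {
  transactions.cache()
  for (support <- Seq(0.1, 0.05, 0.01, 0.005)) {
    val model = new FPGrowth()
      .setMinSupport(support)
      .setNumPartitions(10)  // placeholder; tune for your cluster
      .run(transactions)
    // count() runs on the executors, so it tells you how many itemsets
    // a collect() would pull onto the driver before you pay that cost
    val n = model.freqItemsets.count()
    println(s"minSupport=$support -> $n frequent itemsets")
  }
}

If the count explodes between two support values, the lower one is
probably past the point where the results will fit on the driver.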

On Wed, Jan 13, 2016 at 4:32 PM, Ritu Raj Tiwari
<rituraj_tiw...@yahoo.com> wrote:
> Hi Sean:
> Thanks for checking out my question here. It's possible I am making a newbie
> error. Based on my dataset of about 200,000 transactions and a minimum
> support level of 0.001, I am looking for items that appear at least 200
> times. Given that the items in my transactions are drawn from a set of about
> 25,000 (I previously thought 17,000), what would be a rational way to
> determine the (peak) memory needs of my driver node?
>
> -Raj
>
>
> On Wednesday, January 13, 2016 1:18 AM, Sean Owen <so...@cloudera.com>
> wrote:
>
>
> As I said in your JIRA, the collect() in question is bringing results
> back to the driver to return them. The assumption is that there aren't
> a vast number of frequent items. If they are, they aren't 'frequent'
> and your min support is too low.
>
> On Wed, Jan 13, 2016 at 12:43 AM, Ritu Raj Tiwari
> <rituraj_tiw...@yahoo.com.invalid> wrote:
>> Folks:
>> We are running into a problem where FPGrowth seems to choke on data sets
>> that we think are not too large. We have about 200,000 transactions. Each
>> transaction is composed of about 50 items on average. There are about 17,000
>> unique items (SKUs) that might show up in any transaction.
>>
>> When running locally with 12G of RAM given to the PySpark process, the
>> FPGrowth code fails with an out-of-memory error for minSupport of 0.001.
>> The failure occurs when we try to enumerate and save the frequent
>> itemsets. Looking at the FPGrowth code
>>
>> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala),
>> it seems this is because the genFreqItems() method tries to collect() all
>> items. Is there a way the code could be rewritten so it does not try to
>> collect and therefore store all frequent item sets in memory?
>>
>> Thanks for any insights.
>>
>> -Raj
>
>
