I have been giving it 8-12G

-Raj
> On Jan 12, 2016, at 6:50 PM, Sabarish Sasidharan <sabarish.sasidha...@manthan.com> wrote:
>
> How much RAM are you giving to the driver? Collecting 17,000 items shouldn't fail unless your driver memory is too low.
>
> Regards
> Sab
>
>> On 13-Jan-2016 6:14 am, "Ritu Raj Tiwari" <rituraj_tiw...@yahoo.com.invalid> wrote:
>> Folks:
>> We are running into a problem where FPGrowth seems to choke on data sets that we think are not too large. We have about 200,000 transactions. Each transaction contains about 50 items on average, and there are about 17,000 unique items (SKUs) that might appear in any transaction.
>>
>> When running locally with 12G of RAM given to the PySpark process, the FPGrowth code fails with an out-of-memory error for a minSupport of 0.001. The failure occurs when we try to enumerate and save the frequent itemsets. Looking at the FPGrowth code (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala), it seems this is because the genFreqItems() method tries to collect() all items. Is there a way the code could be rewritten so it does not try to collect, and therefore store, all frequent itemsets in memory?
>>
>> Thanks for any insights.
>>
>> -Raj
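For what it's worth, a back-of-the-envelope check (plain Python, numbers taken from the thread) suggests the 17,000 unique items are not the problem; it is the set of frequent *itemsets* that can explode combinatorially at minSupport=0.001. The size-5 subset count below is a worst-case illustration, not a measurement from this data set.

```python
# Why collecting frequent itemsets can blow up even though the number
# of unique items (17,000) is small. Numbers come from the thread; the
# explosion estimate is a worst-case illustration.
from math import comb

num_transactions = 200_000
min_support = 0.001
avg_items_per_transaction = 50

# FPGrowth keeps an itemset only if it appears in at least this many
# transactions: 0.001 * 200,000 = 200.
support_count_threshold = int(min_support * num_transactions)
print(support_count_threshold)  # 200

# A single 50-item transaction contains C(50, k) distinct k-item
# subsets. If overlap across transactions pushes many of these over the
# 200-transaction threshold, the set of frequent itemsets that the
# driver ultimately holds grows combinatorially.
subsets_of_size_5 = comb(avg_items_per_transaction, 5)
print(subsets_of_size_5)  # 2,118,760 size-5 subsets per transaction
```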
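One possible workaround, sketched below under stated assumptions: in the RDD-based MLlib API, `model.freqItemsets()` returns an RDD, so the itemsets can be streamed to disk with `saveAsTextFile()` instead of pulled into driver memory with `collect()`. (The `genFreqItems()` collect inside FPGrowth only gathers the frequent *single* items, so it is bounded by the 17,000 SKUs; it is the user-side collect of all frequent itemsets that grows large.) The input path, output path, and line format here are illustrative assumptions, and this sketch has not been tested at the scale described in the thread.

```python
# Sketch: save frequent itemsets without collecting them on the driver.
# Paths, partition count, and the input format are assumptions.
from pyspark import SparkContext
from pyspark.mllib.fpm import FPGrowth

sc = SparkContext(appName="fpgrowth-no-collect")

# Assumed format: one transaction per line, items separated by spaces.
transactions = sc.textFile("hdfs:///data/transactions.txt") \
                 .map(lambda line: line.strip().split(" "))

model = FPGrowth.train(transactions, minSupport=0.001, numPartitions=64)

# freqItemsets() is an RDD of FreqItemset(items, freq); writing it out
# keeps the itemsets distributed instead of materializing them all in
# driver memory the way collect() does.
model.freqItemsets() \
     .map(lambda fi: "{}\t{}".format(",".join(fi.items), fi.freq)) \
     .saveAsTextFile("hdfs:///output/frequent-itemsets")
```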