Hi Sean:
Thanks for checking out my question here. It's possible I am making a newbie
error. Based on my dataset of about 200,000 transactions and a minimum support
level of 0.001, I am looking for items that appear at least 200 times. Given
that the items in my transactions are drawn from a set of about 25,000 (I
previously thought 17,000), what would be a rational way to determine the
(peak) memory needs of my driver node?
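
A rough back-of-the-envelope sketch of that arithmetic (plain Python; the
itemset count and per-itemset size below are assumed placeholders, not
measurements):

num_transactions = 200000
min_support = 0.001
count_threshold = int(num_transactions * min_support)   # = 200 occurrences

# The driver has to hold every frequent *itemset* that collect() returns,
# not just the ~25,000 distinct items, so the peak depends mostly on how
# many itemsets clear the threshold. Assumed figures for illustration only:
assumed_itemsets = 5000000        # assumed number of frequent itemsets
assumed_bytes_per_itemset = 500   # assumed average in-memory size (JVM objects)
peak_gb = assumed_itemsets * assumed_bytes_per_itemset / 1e9
print("threshold:", count_threshold, "occurrences; rough peak:", peak_gb, "GB")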
-Raj 

On Wednesday, January 13, 2016 1:18 AM, Sean Owen <so...@cloudera.com> wrote:

As I said in your JIRA, the collect() in question is bringing results back to
the driver in order to return them. The assumption is that there isn't a vast
number of frequent items; if there is, then they aren't really 'frequent' and
your min support is too low.
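
One way to apply that advice in PySpark (a minimal sketch using the MLlib
RDD-based FPGrowth API; the input path, delimiter, support value and partition
count are assumptions):

from pyspark import SparkContext
from pyspark.mllib.fpm import FPGrowth

sc = SparkContext(appName="fpgrowth-support-check")

# Hypothetical input: one space-separated transaction per line.
transactions = (sc.textFile("transactions.txt")
                  .map(lambda line: line.strip().split(" ")))

# Start at a higher support level and lower it gradually.
model = FPGrowth.train(transactions, minSupport=0.01, numPartitions=48)

# Count the frequent itemsets on the executors first, instead of
# collecting them to the driver, to see whether the result is manageable.
print("frequent itemsets at minSupport=0.01:", model.freqItemsets().count())

# Write the itemsets out in a distributed way rather than collect()-ing them.
model.freqItemsets().saveAsTextFile("freq-itemsets-out")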

On Wed, Jan 13, 2016 at 12:43 AM, Ritu Raj Tiwari
<rituraj_tiw...@yahoo.com.invalid> wrote:
> Folks:
> We are running into a problem where FPGrowth seems to choke on data sets
> that we think are not too large. We have about 200,000 transactions, each
> composed of about 50 items on average. There are about 17,000 unique items
> (SKUs) that might show up in any transaction.
>
> When running locally with 12 GB of RAM given to the PySpark process, the
> FPGrowth code fails with an out-of-memory error at a minSupport of 0.001.
> The failure occurs when we try to enumerate and save the frequent itemsets.
> Looking at the FPGrowth code
> the FPGrowth code
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala),
> it seems this is because the genFreqItems() method tries to collect() all
> items. Is there a way the code could be rewritten so that it does not have to
> collect, and therefore hold, all frequent itemsets in driver memory?
>
> Thanks for any insights.
>
> -Raj
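
For reference, a minimal sketch that reproduces the data shape described above
with synthetic transactions and counts how many individual items clear the
support threshold, i.e. roughly the set genFreqItems() collects to the driver
(the generator, partition count and string item IDs are assumptions):

import random
from pyspark import SparkContext

NUM_TRANSACTIONS = 200000   # figures from the thread above
ITEMS_PER_TXN = 50
NUM_SKUS = 17000
MIN_SUPPORT = 0.001
THRESHOLD = int(NUM_TRANSACTIONS * MIN_SUPPORT)   # 200 occurrences

def make_txn(i):
    # Items within a transaction must be unique for FPGrowth, hence sample().
    return [str(x) for x in random.Random(i).sample(range(NUM_SKUS), ITEMS_PER_TXN)]

sc = SparkContext(appName="fpgrowth-shape-check")
transactions = sc.parallelize(range(NUM_TRANSACTIONS), 48).map(make_txn)

# Count occurrences per item and keep only those at or above the threshold;
# this is roughly the collection genFreqItems() pulls back to the driver.
frequent_items = (transactions.flatMap(lambda txn: txn)
                              .map(lambda item: (item, 1))
                              .reduceByKey(lambda a, b: a + b)
                              .filter(lambda kv: kv[1] >= THRESHOLD))
print("items clearing the threshold:", frequent_items.count())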

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org