As I said in your JIRA, the collect() in question is bringing results
back to the driver to return them. The assumption is that there aren't
a vast number of frequent items. If there are, they aren't 'frequent'
and your min support is too low.
On Wed, Jan 13, 2016 at 12:43 AM, Ritu Raj Tiwari
Thanks Sean! I'll start with a higher support threshold and work my way down.
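
For anyone following along, here is a rough sketch of what "start high and work down" could look like. It is plain Python on a toy dataset, not Spark's FPGrowth: the brute-force counter, the function names, and the `budget` cutoff are all made up for illustration, and the counter only looks at itemsets of one or two items.

```python
from collections import Counter
from itertools import combinations


def count_frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force count of itemsets (up to max_size items) whose
    support is at least min_support. Toy stand-in, not FPGrowth."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # dedupe and fix an order so tuples match
        for k in range(1, max_size + 1):
            counts.update(combinations(items, k))
    return sum(1 for c in counts.values() if c / n >= min_support)


def sweep_support(transactions, thresholds, budget):
    """Try thresholds from high to low and return the lowest
    (threshold, itemset_count) whose count stays within budget."""
    chosen = None
    for s in sorted(thresholds, reverse=True):
        found = count_frequent_itemsets(transactions, s)
        if found > budget:
            break  # result set is getting too big; keep the previous one
        chosen = (s, found)
    return chosen


# Hypothetical toy transactions, only to exercise the sweep.
toy = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["b", "d"],
    ["a", "b", "c", "d"],
]
print(sweep_support(toy, [0.8, 0.6, 0.4, 0.2], budget=6))
```

On real data the counting step would be the FPGrowth run itself, and the budget would be whatever the driver can comfortably hold after the collect().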
On Wednesday, January 13, 2016 8:57 AM, Sean Owen
wrote:
You're looking for subsets of items that appear in at least 200 of
200,000 transactions, which could be a whole lot. Keep in mind
Hi Sean: Thanks for checking out my question here. It's possible I am making a
newbie error. Based on my dataset of about 200,000 transactions and a minimum
support level of 0.001, I am looking for items that appear at least 200 times.
Given that the items in my transactions are drawn from a set
You're looking for subsets of items that appear in at least 200 of
200,000 transactions, which could be a whole lot. Keep in mind there
are 25,000 items, sure, but already 625,000,000 possible pairs of
items, and trillions of possible 3-item subsets. This sounds like it's
just far too low. Start
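
The arithmetic behind those numbers can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it uses the 25,000 item figure quoted in the reply (the original post mentions about 17,000 SKUs), and it counts all possible subsets, not the ones that would actually turn out frequent. Note that 625,000,000 is 25,000 squared; the number of unordered pairs is roughly half of that, which doesn't change the point.

```python
import math

n_transactions = 200_000
min_support = 0.001
n_items = 25_000  # figure from the reply; the original post says about 17,000

# Absolute count an itemset needs to reach the support threshold.
min_count = math.ceil(min_support * n_transactions)

# Candidate itemsets of size 2 and 3 (unordered, no repeats).
pairs = math.comb(n_items, 2)
triples = math.comb(n_items, 3)

print(f"min count:        {min_count:,}")
print(f"items squared:    {n_items ** 2:,}")
print(f"unordered pairs:  {pairs:,}")
print(f"3-item subsets:   {triples:,}")
```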
I have been giving it 8-12G
-Raj
> On Jan 12, 2016, at 6:50 PM, Sabarish Sasidharan
> wrote:
>
> How much RAM are you giving to the driver? 17000 items being collected
> shouldn't fail unless your driver memory is too low.
>
> Regards
>
Folks: We are running into a problem where FPGrowth seems to choke on data sets
that we think are not too large. We have about 200,000 transactions. Each
transaction is composed of, on average, 50 items. There are about 17,000
unique items (SKUs) that might show up in any transaction.
When