Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-22 Thread Ben Teeuwen
I don’t think scaling RAM is a sane strategy for fixing these problems with the dataframe / transformer approach to creating large sparse vectors. For one, though it will delay the point of failure, it will still fail. I tried this for the original case I emailed about, and after waiting 50

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-19 Thread Davies Liu
The OOM happens in the driver, so you may also need more memory for the driver. On Fri, Aug 19, 2016 at 2:33 PM, Davies Liu wrote: > You are using lots of tiny executors (128 executors with only 2G > memory), could you try with bigger executors (for example 16G x 16)? > > On Fri, Aug
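
Following up on the driver-memory point, a minimal sketch of where those settings live (values illustrative; in YARN client mode spark.driver.memory only takes effect if set before the driver JVM starts, so it is normally a launch flag):

    # Hedged sketch: raise driver memory and max result size, since StringIndexer.fit
    # collects the full set of distinct labels on the driver.
    # Equivalent launch flags: --driver-memory 30g --conf spark.driver.maxResultSize=25g
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("stringindexer-oom")
             .config("spark.driver.memory", "30g")        # only effective before the driver JVM starts
             .config("spark.driver.maxResultSize", "25g")
             .getOrCreate())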

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-19 Thread Davies Liu
You are using lots of tiny executors (128 executors with only 2G memory), could you try with bigger executors (for example 16G x 16)? On Fri, Aug 19, 2016 at 8:19 AM, Ben Teeuwen wrote: > > So I wrote some code to reproduce the problem. > > I assume here that a pipeline should
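
A sketch of the resizing being suggested here (fewer, larger executors); on YARN these are normally passed as launch flags, shown as builder configs only for brevity:

    # Hedged sketch: 16 executors x 16G instead of 128 x 2G, as suggested in the thread.
    # Equivalent launch flags: --num-executors 16 --executor-memory 16G
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.executor.instances", "16")
             .config("spark.executor.memory", "16g")
             .getOrCreate())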

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-19 Thread Ben Teeuwen
So I wrote some code to reproduce the problem. I assume here that a pipeline should be able to transform a categorical feature with a few million levels. So I create a dataframe with the categorical feature (‘id’), apply StringIndexer and OneHotEncoder transformers, and run a loop where I
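
A minimal sketch of that kind of reproduction (row and level counts are placeholders, not the original figures):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder

    spark = SparkSession.builder.getOrCreate()

    n_rows = 10 * 1000 * 1000      # scale towards 800m rows to reproduce
    n_levels = 1 * 1000 * 1000     # scale towards 56m distinct values

    # Synthetic dataframe with one high-cardinality categorical column 'id'.
    df = spark.range(n_rows).withColumn("id", (F.col("id") % n_levels).cast("string"))

    pipeline = Pipeline(stages=[
        StringIndexer(inputCol="id", outputCol="idIndex"),
        OneHotEncoder(inputCol="idIndex", outputCol="idOhe", dropLast=False),
    ])
    model = pipeline.fit(df)       # StringIndexer.fit collects every distinct label on the driver
    model.transform(df).count()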

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-11 Thread Nick Pentreath
Ok, interesting. Would be interested to see how it compares. By the way, the feature size you select for the hasher should be a power of 2 (e.g. 2**24 to 2**26 may be worth trying) to ensure the feature indexes are evenly distributed (see the section on HashingTF under
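
For concreteness, a sketch of the power-of-two sizing (column names are placeholders):

    from pyspark.ml.feature import HashingTF

    # numFeatures as a power of two so hashed values map onto feature indexes evenly;
    # 2**24 to 2**26 were the sizes suggested above.
    hasher = HashingTF(inputCol="idTokens", outputCol="idHashed", numFeatures=2 ** 24)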

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-11 Thread Ben Teeuwen
Thanks Nick, I played around with the hashing trick. When I set numFeatures to the number of distinct values of the largest sparse feature, I ended up with half of them colliding. When I raised numFeatures to get fewer collisions, I soon ran into the same memory problems as before. To
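
The collision rate observed here is roughly what uniform hashing predicts; a quick back-of-envelope check:

    # Expected fraction of n distinct values that share a bucket with at least one
    # other value when hashed uniformly into m buckets: 1 - (1 - 1/m)**(n - 1).
    def collision_fraction(n, m):
        return 1.0 - (1.0 - 1.0 / m) ** (n - 1)

    n = 56 * 1000 * 1000                      # distinct values from the thread
    for m in (n, 2 ** 26, 2 ** 28):
        print(m, round(collision_fraction(n, m), 3))
    # With m == n this is about 1 - 1/e ~= 0.63, so seeing roughly half the values
    # collide is in line with expectation rather than a sign of a bad hash function;
    # pushing the rate down requires m much larger than n, which is where the memory
    # problems reappear.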

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Nick Pentreath
Sure, I understand there are some issues with handling this missing value situation in StringIndexer currently. Your workaround is not ideal, but I see that it is probably the only mechanism available at the moment to avoid the problem. But the OOM issues seem to be more about the feature cardinality
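
If the union-train-and-test workaround is only there to guard against unseen labels at transform time, one alternative worth noting is StringIndexer's handleInvalid parameter (a sketch; "skip" simply drops rows whose label was not seen during fit, which may or may not be acceptable):

    from pyspark.ml.feature import StringIndexer

    # "error" is the default; "skip" drops rows with labels unseen at fit time.
    indexer = StringIndexer(inputCol="bigfeature", outputCol="bigfeatureIndex",
                            handleInvalid="skip")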

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Ben Teeuwen
Hi Nick, Thanks for the suggestion. Reducing the dimensionality is an option, thanks, but let’s say I really want to do this :). The reason why it’s so big is that I’m unifying my training and test data, and I don’t want to drop rows in the test data just because one of the features was

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Nick Pentreath
Hi Ben, perhaps with cardinality of this size it is worth looking at feature hashing for your problem. Spark has the HashingTF transformer that works on a column of "sentences" (i.e. [string]). For categorical features you can hack it a little by converting your feature value into a
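
A sketch of the hack being described: wrap each categorical value in a one-element array of strings (prefixing with the column name so different features don't land in the same region of hash space), then feed that column to HashingTF. Column names here are illustrative.

    from pyspark.sql import functions as F
    from pyspark.ml.feature import HashingTF

    # Turn the scalar categorical value into a single-token "sentence",
    # e.g. ["bigfeature=12345"], so HashingTF can hash it.
    df2 = df.withColumn(
        "bigfeature_tokens",
        F.array(F.concat(F.lit("bigfeature="), F.col("bigfeature").cast("string"))))

    hasher = HashingTF(inputCol="bigfeature_tokens", outputCol="bigfeature_hashed",
                       numFeatures=2 ** 24)
    hashed = hasher.transform(df2)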

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Ben Teeuwen
I raised driver memory to 30G and maxResultSize to 25G, this time in pyspark. Code run: cat_int = ['bigfeature'] stagesIndex = [] stagesOhe = [] for c in cat_int: stagesIndex.append(StringIndexer(inputCol=c, outputCol="{}Index".format(c))) stagesOhe.append(OneHotEncoder(dropLast=False,
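
The preview is cut off mid-call; a hedged reconstruction of how such a loop typically continues (the OneHotEncoder input/output column names and the final pipeline fit are assumptions, not the original code):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder

    cat_int = ['bigfeature']
    stagesIndex = []
    stagesOhe = []
    for c in cat_int:
        stagesIndex.append(StringIndexer(inputCol=c, outputCol="{}Index".format(c)))
        stagesOhe.append(OneHotEncoder(dropLast=False,
                                       inputCol="{}Index".format(c),
                                       outputCol="{}OHE".format(c)))

    pipeline = Pipeline(stages=stagesIndex + stagesOhe)
    model = pipeline.fit(df)    # fitting the 56m-label StringIndexer is where the driver OOM hits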

OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-03 Thread Ben Teeuwen
Hi, I want to one-hot encode a column containing 56 million distinct values. My dataset is 800m rows and 17 columns. I first apply a StringIndexer, but it already breaks there, giving an OOM java heap space error. I launch my app on YARN with: /opt/spark/2.0.0/bin/spark-shell --executor-memory 10G