Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-22 Thread Ben Teeuwen
I don’t think scaling RAM is a sane strategy for fixing these problems with the dataframe / transformer approach to creating large sparse vectors. For one, though it will delay the point of failure, it will still fail. I tried this for the original case I emailed about, and after waiting 50

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-19 Thread Davies Liu
The OOM happens in the driver, so you may also need more memory for the driver. On Fri, Aug 19, 2016 at 2:33 PM, Davies Liu wrote: > You are using lots of tiny executors (128 executors with only 2G > memory), could you try with bigger executors (for example 16G x 16)? > > On Fri, Aug
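
Following up on the driver-memory point, a minimal sketch of where those settings live (values illustrative; in YARN client mode spark.driver.memory only takes effect if set before the driver JVM starts, so it is normally a launch flag):

    # Hedged sketch: raise driver memory and max result size, since StringIndexer.fit
    # collects the full set of distinct labels on the driver.
    # Equivalent launch flags: --driver-memory 30g --conf spark.driver.maxResultSize=25g
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("stringindexer-oom")
             .config("spark.driver.memory", "30g")        # only effective before the driver JVM starts
             .config("spark.driver.maxResultSize", "25g")
             .getOrCreate())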

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-19 Thread Davies Liu
You are using lots of tiny executors (128 executors with only 2G memory), could you try with bigger executors (for example 16G x 16)? On Fri, Aug 19, 2016 at 8:19 AM, Ben Teeuwen wrote: > > So I wrote some code to reproduce the problem. > > I assume here that a pipeline should
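
A sketch of the resizing being suggested here (fewer, larger executors); on YARN these are normally passed as launch flags, shown as builder configs only for brevity:

    # Hedged sketch: 16 executors x 16G instead of 128 x 2G, as suggested in the thread.
    # Equivalent launch flags: --num-executors 16 --executor-memory 16G
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.executor.instances", "16")
             .config("spark.executor.memory", "16g")
             .getOrCreate())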

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-19 Thread Ben Teeuwen
So I wrote some code to reproduce the problem. I assume here that a pipeline should be able to transform a categorical feature with a few million levels. So I create a dataframe with the categorical feature (‘id’), apply StringIndexer and OneHotEncoder transformers, and run a loop where I
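
A minimal sketch of that kind of reproduction (row and level counts are placeholders, not the original figures):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder

    spark = SparkSession.builder.getOrCreate()

    n_rows = 10 * 1000 * 1000      # scale towards 800m rows to reproduce
    n_levels = 1 * 1000 * 1000     # scale towards 56m distinct values

    # Synthetic dataframe with one high-cardinality categorical column 'id'.
    df = spark.range(n_rows).withColumn("id", (F.col("id") % n_levels).cast("string"))

    pipeline = Pipeline(stages=[
        StringIndexer(inputCol="id", outputCol="idIndex"),
        OneHotEncoder(inputCol="idIndex", outputCol="idOhe", dropLast=False),
    ])
    model = pipeline.fit(df)       # StringIndexer.fit collects every distinct label on the driver
    model.transform(df).count()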

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-11 Thread Nick Pentreath
Ok, interesting. Would be interested to see how it compares. By the way, the feature size you select for the hasher should be a power of 2 (e.g. 2**24 to 2**26 may be worth trying) to ensure the feature indexes are evenly distributed (see the section on HashingTF under
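
For concreteness, a sketch of the power-of-two sizing (column names are placeholders):

    from pyspark.ml.feature import HashingTF

    # numFeatures as a power of two so hashed values map onto feature indexes evenly;
    # 2**24 to 2**26 were the sizes suggested above.
    hasher = HashingTF(inputCol="idTokens", outputCol="idHashed", numFeatures=2 ** 24)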

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-11 Thread Ben Teeuwen
Thanks Nick, I played around with the hashing trick. When I set numFeatures to the number of distinct values of the largest sparse feature, I ended up with half of them colliding. When I raised numFeatures to get fewer collisions, I soon ran into the same memory problems as before. To
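
The collision rate observed here is roughly what uniform hashing predicts; a quick back-of-envelope check:

    # Expected fraction of n distinct values that share a bucket with at least one
    # other value when hashed uniformly into m buckets: 1 - (1 - 1/m)**(n - 1).
    def collision_fraction(n, m):
        return 1.0 - (1.0 - 1.0 / m) ** (n - 1)

    n = 56 * 1000 * 1000                      # distinct values from the thread
    for m in (n, 2 ** 26, 2 ** 28):
        print(m, round(collision_fraction(n, m), 3))
    # With m == n this is about 1 - 1/e ~= 0.63, so seeing roughly half the values
    # collide is in line with expectation rather than a sign of a bad hash function;
    # pushing the rate down requires m much larger than n, which is where the memory
    # problems reappear.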

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Nick Pentreath
Sure, I understand there are some issues with handling this missing value situation in StringIndexer currently. Your workaround is not ideal, but I see that it is probably the only mechanism available at the moment to avoid the problem. But the OOM issues seem to be more about the feature cardinality
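
If the union-train-and-test workaround is only there to guard against unseen labels at transform time, one alternative worth noting is StringIndexer's handleInvalid parameter (a sketch; "skip" simply drops rows whose label was not seen during fit, which may or may not be acceptable):

    from pyspark.ml.feature import StringIndexer

    # "error" is the default; "skip" drops rows with labels unseen at fit time.
    indexer = StringIndexer(inputCol="bigfeature", outputCol="bigfeatureIndex",
                            handleInvalid="skip")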

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Ben Teeuwen
Hi Nick, Thanks for the suggestion. Reducing the dimensionality is an option, thanks, but let’s say I really want to do this :). The reason why it’s so big is that I’m unifying my training and test data, and I don’t want to drop rows in the test data just because one of the features was

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Nick Pentreath
Hi Ben, perhaps with cardinality of this size it is worth looking at feature hashing for your problem. Spark has the HashingTF transformer that works on a column of "sentences" (i.e. [string]). For categorical features you can hack it a little by converting your feature value into a
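
A sketch of the hack being described: wrap each categorical value in a one-element array of strings (prefixing with the column name so different features don't land in the same region of hash space), then feed that column to HashingTF. Column names here are illustrative.

    from pyspark.sql import functions as F
    from pyspark.ml.feature import HashingTF

    # Turn the scalar categorical value into a single-token "sentence",
    # e.g. ["bigfeature=12345"], so HashingTF can hash it.
    df2 = df.withColumn(
        "bigfeature_tokens",
        F.array(F.concat(F.lit("bigfeature="), F.col("bigfeature").cast("string"))))

    hasher = HashingTF(inputCol="bigfeature_tokens", outputCol="bigfeature_hashed",
                       numFeatures=2 ** 24)
    hashed = hasher.transform(df2)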

Re: OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-04 Thread Ben Teeuwen
I raised driver memory to 30G and maxResultSize to 25G, this time in pyspark. Code run: cat_int = ['bigfeature'] stagesIndex = [] stagesOhe = [] for c in cat_int: stagesIndex.append(StringIndexer(inputCol=c, outputCol="{}Index".format(c))) stagesOhe.append(OneHotEncoder(dropLast=False,
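
The preview is cut off mid-call; a hedged reconstruction of how such a loop typically continues (the OneHotEncoder input/output column names and the final pipeline fit are assumptions, not the original code):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder

    cat_int = ['bigfeature']
    stagesIndex = []
    stagesOhe = []
    for c in cat_int:
        stagesIndex.append(StringIndexer(inputCol=c, outputCol="{}Index".format(c)))
        stagesOhe.append(OneHotEncoder(dropLast=False,
                                       inputCol="{}Index".format(c),
                                       outputCol="{}OHE".format(c)))

    pipeline = Pipeline(stages=stagesIndex + stagesOhe)
    model = pipeline.fit(df)    # fitting the 56m-label StringIndexer is where the driver OOM hits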

OOM with StringIndexer, 800m rows & 56m distinct value column

2016-08-03 Thread Ben Teeuwen
Hi, I want to one-hot encode a column containing 56 million distinct values. My dataset is 800m rows and 17 columns. I first apply a StringIndexer, but it already breaks there, giving an OOM java heap space error. I launch my app on YARN with: /opt/spark/2.0.0/bin/spark-shell --executor-memory 10G