I don’t think scaling RAM is a sane strategy for fixing these problems with
using a dataframe / transformer approach to creating large sparse vectors.
For one, though yes it will delay the point at which it fails, it will still fail. In
the original case I emailed about I tried this, and after waiting 50 …
The OOM happens in the driver, so you may also need more memory for the driver.
On Fri, Aug 19, 2016 at 2:33 PM, Davies Liu wrote:
> You are using lots of tiny executors (128 executors with only 2G of
> memory each), could you try with bigger executors (for example 16G x 16)?
>
> On Fri, Aug …
You are using lots of tiny executors (128 executors with only 2G of
memory each), could you try with bigger executors (for example 16G x 16)?
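For illustration, a rough sketch (not from the original mail) of asking YARN for fewer, bigger executors when building the session, using the 16 x 16G figures above:

from pyspark.sql import SparkSession

# Sketch only: fewer, bigger executors. Note spark.driver.memory cannot be
# raised here in client mode; it has to go on the launch command
# (e.g. --driver-memory), as mentioned elsewhere in this thread.
spark = (SparkSession.builder
         .config("spark.executor.instances", "16")
         .config("spark.executor.memory", "16g")
         .getOrCreate())

In practice these are usually passed as --num-executors / --executor-memory on the spark-shell or spark-submit command line.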
On Fri, Aug 19, 2016 at 8:19 AM, Ben Teeuwen wrote:
>
> So I wrote some code to reproduce the problem.
>
> I assume here that a pipeline should …
So I wrote some code to reproduce the problem.
I assume here that a pipeline should be able to transform a categorical feature
with a few million levels.
So I create a dataframe with the categorical feature (‘id’), apply a
StringIndexer and OneHotEncoder transformer, and run a loop where I …
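A rough sketch of that kind of reproduction (illustrative names and sizes, not the exact code from the mail; the loop itself is cut off in the archive):

from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import StringIndexer, OneHotEncoder

spark = SparkSession.builder.getOrCreate()

# a dataframe whose 'id' column has a few million distinct levels
df = spark.range(0, 5000000).withColumn("id", F.col("id").cast("string"))

indexed = StringIndexer(inputCol="id", outputCol="idIndex").fit(df).transform(df)
# Spark 2.x API: OneHotEncoder is a plain transformer here; on 3.x it needs a fit() first
encoded = OneHotEncoder(dropLast=False, inputCol="idIndex",
                        outputCol="idOHE").transform(indexed)
encoded.select("id", "idOHE").show(3, truncate=False)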
Ok, interesting. Would be interested to see how it compares.
By the way, the feature size you select for the hasher should be a power of
2 (e.g. 2**24 to 2**26 may be worth trying) to ensure the feature indexes
are evenly distributed (see the section on HashingTF under …
Thanks Nick, I played around with the hashing trick. When I set numFeatures to
the number of distinct values for the largest sparse feature, I ended up with
half of them colliding. When raising numFeatures to get fewer collisions, I
soon ended up with the same memory problems as before. To …
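Back-of-the-envelope, the collision rate follows from hashing k distinct values uniformly into n buckets: a value keeps a bucket to itself with probability roughly (1 - 1/n)**(k - 1) ≈ exp(-k/n). A quick sketch using the 56M figure from the first mail:

import math

k = 56000000  # distinct values of the big feature
for n in [k, 2**26, 2**28, 2**30]:
    p_alone = math.exp(-float(k) / n)
    print("numFeatures = {:>13,}: ~{:.0%} of values collide".format(n, 1 - p_alone))

With numFeatures equal to the cardinality this gives roughly 1 - 1/e ≈ 63% of values sharing a bucket, so the collisions observed above are expected rather than a bug.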
Sure, I understand there are some issues with handling this missing value
situation in StringIndexer currently. Your workaround is not ideal but I
see that it is probably the only mechanism available currently to avoid the
problem.
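For reference, what StringIndexer itself offers here is the handleInvalid option, which at the time (Spark 2.0) could only error out or skip rows with unseen labels (later releases added "keep"). A rough sketch with illustrative dataframe names, not code from the thread:

from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="bigfeature", outputCol="bigfeatureIndex",
                        handleInvalid="skip")
model = indexer.fit(train_df)            # labels learned from the training data only
test_indexed = model.transform(test_df)  # rows with labels unseen at fit time are dropped

Dropping rows is exactly what the union-of-train-and-test workaround below avoids.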
But the OOM issues seem to be more about the feature cardinality …
Hi Nick,
Thanks for the suggestion. Reducing the dimensionality is an option, thanks,
but let’s say I really want to do this :).
The reason why it’s so big is that I’m unifying my training and test data, and
I don’t want to drop rows in the test data just because one of the features was …
Hi Ben
Perhaps with this size cardinality it is worth looking at feature hashing
for your problem. Spark has the HashingTF transformer that works on a
column of "sentences" (i.e. [string]).
For categorical features you can hack it a little by converting your
feature value into a …
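A rough sketch of that hack (illustrative column names, assuming df holds the 'bigfeature' column; 2**24 follows the power-of-two advice above):

from pyspark.sql import functions as F
from pyspark.ml.feature import HashingTF

# wrap the categorical value in a single-element array so HashingTF sees it
# as a one-"word" sentence, then hash it into a fixed-size sparse vector
df_arr = df.withColumn("bigfeatureArr", F.array(F.col("bigfeature").cast("string")))

hasher = HashingTF(inputCol="bigfeatureArr", outputCol="bigfeatureHashed",
                   numFeatures=2**24)  # keep this a power of 2
hashed = hasher.transform(df_arr)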
I raised driver memory to 30G and maxResultSize to 25G, this time in pyspark.
Code run:
from pyspark.ml.feature import StringIndexer, OneHotEncoder

cat_int = ['bigfeature']
stagesIndex = []
stagesOhe = []
for c in cat_int:
    stagesIndex.append(StringIndexer(inputCol=c, outputCol="{}Index".format(c)))
    # the original line is cut off in the archive; a "{}OHE" output column is assumed
    stagesOhe.append(OneHotEncoder(dropLast=False, inputCol="{}Index".format(c),
                                   outputCol="{}OHE".format(c)))
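The snippet is cut off in the archive; presumably the stages then go into a Pipeline, roughly:

from pyspark.ml import Pipeline

pipeline = Pipeline(stages=stagesIndex + stagesOhe)
model = pipeline.fit(df)          # df being the dataframe that holds 'bigfeature'
transformed = model.transform(df)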
Hi,
I want to one hot encode a column containing 56 million distinct values. My
dataset is 800M rows and 17 columns.
I first apply a StringIndexer, but it already breaks there, giving an OOM Java
heap space error.
I launch my app on YARN with:
/opt/spark/2.0.0/bin/spark-shell --executor-memory 10G