Hi all -
I can't do binary encoding because I need to trace back to the exact
categorical variable and that is difficult in binary encoding, I believe.
The number of categories varies per variable, but on average it is about 10
categories. I return a sparse matrix from the encoder. Regardless of the
enc
Hi Sarah, I have some reflection questions. You don't need to answer all
of them :) How many categories (approximately) do you have in each of those
20M categorical variables? How many samples do you have? Maybe you should
consider different encoding strategies such as binary encoding. Also, this
Hi Joel -
Are you sure? I ran it and it actually uses a bit more memory instead of
less; same code, just run with a different Docker container.
Max memory used by a single task: 50.41GB
vs
Max memory used by a single task: 51.15GB
Cheers,
Sarah
On Wed, Aug 1, 2018 at 7:19 PM, Sarah Wait Zaranek
In the developer version, yes? Looking for the new memory savings :)
On Wed, Aug 1, 2018, 17:29 Joel Nothman wrote:
> Use OneHotEncoder
>
Use OneHotEncoder
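For reference, a minimal sketch of what that looks like with the 0.20 dev
API, where the CategoricalEncoder functionality lives in OneHotEncoder
itself (the example data here is illustrative):

from sklearn.preprocessing import OneHotEncoder

# 0.20-style usage: categories are learned from the data directly.
enc = OneHotEncoder(categories='auto')
X_sparse = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1]])
print(enc.categories_)  # learned categories, one array per column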
Hello,
I have installed the dev version (0.20.dev0). Should I just use
CategoricalEncoder, or is the functionality already rolled up into OneHotEncoder? I get
the following message:
File "", line 1, in
File "/scikit-learn/sklearn/preprocessing/data.py", line 2839, in *init*
"CategoricalEncoder br
Thanks, this makes sense. I will try using the CategoricalEncoder to see
the difference. It wouldn't be such a big deal if my input matrix wasn't so
large. Thanks again for all your help.
Cheers,
Sarah
On Mon, Feb 5, 2018 at 10:33 PM, Joel Nothman
wrote:
> Yes, the output CSR representation re
Yes, the output CSR representation requires:
1 (dtype) value per entry
1 int32 per entry
1 int32 per row
The intermediate COO representation requires:
1 (dtype) value per entry
2 int32 per entry
So as long as the transformation from COO to CSR is done over the whole
data, it will occupy roughly 5
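A rough back-of-the-envelope sketch of that CSR/COO accounting (the
dimensions below are purely illustrative, not the actual matrix sizes from
this thread):

# Estimate memory for the one-hot output, following the accounting above:
# one non-zero entry per (row, categorical column).
def onehot_memory_gb(n_rows, n_cat_columns, dtype_bytes=8, index_bytes=4):
    n_entries = n_rows * n_cat_columns
    csr = n_entries * (dtype_bytes + index_bytes) + (n_rows + 1) * index_bytes
    coo = n_entries * (dtype_bytes + 2 * index_bytes)
    return csr / 1e9, coo / 1e9

print(onehot_memory_gb(n_rows=1000, n_cat_columns=1_000_000))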
Yes, of course. What I mean is that I start out with 19 Gigs (initial
matrix size) or so, it balloons to 100 Gigs *within the encoder function*
and returns 28 Gigs (sparse one-hot matrix size). These numbers aren't
exact, but you can see my point.
Cheers,
Sarah
On Mon, Feb 5, 2018 at 9:50 PM, Jo
OneHotEncoder will not magically reduce the size of your input. It will
necessarily use more memory than the input data as long as we are storing
the results in scipy.sparse matrices. The sparse representation will be
less expensive than the dense representation, but it won't be less
expensive th
Hi Joel -
I am also seeing a huge overhead in memory when calling the one-hot encoder.
I have hacked around it by splitting my matrix into 4-5 smaller
matrices (by columns), encoding each, and then concatenating the results. But I am seeing
upwards of 100 Gigs overhead. Should I file a bug report? Or is this
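A minimal sketch of the column-chunking workaround described above (the
function name and chunk count are illustrative and it assumes a
non-negative integer input array; this is not Sarah's actual code):

import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

def onehot_in_chunks(X, n_chunks=5):
    # Encode groups of columns separately so the encoder's intermediate
    # buffers stay small, then stack the sparse results horizontally.
    pieces = []
    for cols in np.array_split(np.arange(X.shape[1]), n_chunks):
        enc = OneHotEncoder()  # pre-0.20 integer-based encoder
        pieces.append(enc.fit_transform(X[:, cols]))
    return sparse.hstack(pieces).tocsr()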
Great. Thank you for all your help.
Cheers,
Sarah
On Mon, Feb 5, 2018 at 12:56 AM, Joel Nothman
wrote:
> If you specify n_values=[list_of_vals_for_column1,
> list_of_vals_for_column2], you should be able to engineer it to how you
> want.
>
> On 5 February 2018 at 16:31, Sarah Wait Zaranek
> w
If you specify n_values=[list_of_vals_for_column1,
list_of_vals_for_column2], you should be able to engineer it to how you
want.
On 5 February 2018 at 16:31, Sarah Wait Zaranek
wrote:
> If I use the n+1 approach, then I get the correct matrix, except with the
> columns of zeros:
>
> >>> test
> a
If I use the n+1 approach, then I get the correct matrix, except with the
columns of zeros:
>>> test
array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.],
[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.],
[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1.,
Hi Joel -
Conceptually, that makes sense. But when I assign n_values, I can't make
it match the result when you don't specify them. See below. I used the
number of unique levels per column.
>>> enc = OneHotEncoder(sparse=False)
>>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1,
If each input column is encoded as a value from 0 to (number of
possible values for that column - 1), then n_values for that column should
be the highest value + 1, which is also the number of levels per column.
Does that make sense?
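A minimal sketch of that rule, using the three complete rows from the
example earlier in this thread (pre-0.20 OneHotEncoder API with an explicit
n_values list):

from sklearn.preprocessing import OneHotEncoder

X = [[7, 0, 3], [1, 2, 0], [0, 2, 1]]
enc = OneHotEncoder(n_values=[8, 3, 4], sparse=False)  # highest value + 1 per column
print(enc.fit_transform(X).shape)  # (3, 15): 8 + 3 + 4 one-hot columns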
Actually, I've realised there's a somewhat slow and unnecessa
Sorry - your second message popped up when I was writing my response. I
will look at this as well. Thanks for being so speedy!
Cheers,
Sarah
On Sun, Feb 4, 2018 at 11:30 PM, Joel Nothman
wrote:
> You will also benefit from assume_finite (see http://scikit-learn.org/
> stable/modules/generat
Hi Joel -
20 million categorical variables. It comes from segmenting the genome into
20 million parts. Genomes are big :) For n_values, I am a bit confused.
Is the input the same as the output for n_values? Originally, I thought it
was just the number of levels per column, but it seems like it
You will also benefit from assume_finite (see
http://scikit-learn.org/stable/modules/generated/sklearn.config_context.html
)
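A minimal sketch of what that looks like (the encoder call itself is just
illustrative; config_context is the documented API from the link above):

import sklearn
from sklearn.preprocessing import OneHotEncoder

# Skip finite-value validation checks while fitting/transforming.
with sklearn.config_context(assume_finite=True):
    X_onehot = OneHotEncoder(n_values=[8, 3, 4]).fit_transform(
        [[7, 0, 3], [1, 2, 0], [0, 2, 1]])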
20 million categories, or 20 million categorical variables?
OneHotEncoder is pretty efficient if you specify n_values.
On 5 February 2018 at 15:10, Sarah Wait Zaranek
wrote:
> Hello -
>
> I was just wondering if there was a way to improve performance on the
> one-hot encoder. Or, are there any
Hello -
I was just wondering if there was a way to improve performance on the
one-hot encoder. Or, are there any plans to do so in the future? I am
working with a matrix that will ultimately have 20 million categorical
variables, and my bottleneck is the one-hot encoder.
Let me know if this isn'