Re: [scikit-learn] One-hot encoding

2018-08-03 Thread Sarah Wait Zaranek
Hi all - I can't do binary encoding because I need to trace back to the exact categorical variable and that is difficult in binary encoding, I believe. Each categorical variable has a range, but on average it is about 10 categories. I return a sparse matrix from the encoder. Regardless of the enc
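A minimal sketch (not from the thread) of the trace-back Sarah needs, assuming the 0.19-era OneHotEncoder and its feature_indices_ / active_features_ attributes; the data here is made up for illustration:

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    X = np.array([[0, 2], [1, 0], [2, 1]])        # stand-in categorical data
    enc = OneHotEncoder(sparse=True)
    Xt = enc.fit_transform(X)

    # feature_indices_[i]:feature_indices_[i+1] is the column range produced by
    # original feature i; active_features_ maps encoded columns back into that
    # full column space when n_values='auto' drops unseen values.
    col = 4                                        # a column of the encoded matrix
    col_full = enc.active_features_[col]
    source_feature = np.searchsorted(enc.feature_indices_, col_full, side='right') - 1
    print(source_feature)                          # index of the originating variable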

Re: [scikit-learn] One-hot encoding

2018-08-03 Thread Fernando Marcos Wittmann
Hi Sarah, I have some reflection questions. You don't need to answer all of them :) How many categories (approximately) do you have in each of those 20M categorical variables? How many samples do you have? Maybe you should consider different encoding strategies, such as binary encoding. Also, this
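A hedged sketch of the binary-encoding idea mentioned here, assuming the third-party category_encoders package (pip install category_encoders); the column name and data are made up:

    import pandas as pd
    import category_encoders as ce

    df = pd.DataFrame({'variant': [3, 7, 1, 7, 0]})
    enc = ce.BinaryEncoder(cols=['variant'])
    encoded = enc.fit_transform(df)
    # Each category is spread over roughly log2(n_categories) binary columns,
    # much narrower than one-hot but harder to trace back per category.
    print(encoded.head())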

Re: [scikit-learn] One-hot encoding

2018-08-02 Thread Sarah Wait Zaranek
Hi Joel - Are you sure? I ran it and it actually uses a bit more memory instead of less, same code just run with a different docker container. Max memory used by a single task: 50.41GB vs Max memory used by a single task: 51.15GB Cheers, Sarah On Wed, Aug 1, 2018 at 7:19 PM, Sarah Wait Zaranek

Re: [scikit-learn] One-hot encoding

2018-08-01 Thread Sarah Wait Zaranek
In the developer version, yes? Looking for the new memory savings :) On Wed, Aug 1, 2018, 17:29 Joel Nothman wrote: > Use OneHotEncoder

Re: [scikit-learn] One-hot encoding

2018-08-01 Thread Joel Nothman
Use OneHotEncoder

Re: [scikit-learn] One-hot encoding

2018-08-01 Thread Sarah Wait Zaranek
Hello, I have installed the dev version (0.20.dev0), should I just use CategoricalEncoder or is the functionality already rolled up into OneHotEncoder? I get the following message: File "<stdin>", line 1, in <module> File "/scikit-learn/sklearn/preprocessing/data.py", line 2839, in __init__ "CategoricalEncoder br
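For context, a sketch (not in the original message) of the replacement in the 0.20 development version: CategoricalEncoder's functionality lives in OneHotEncoder (and OrdinalEncoder), with a categories parameter in place of n_values; parameter names here are from the dev docs and could still change:

    from sklearn.preprocessing import OneHotEncoder

    enc = OneHotEncoder(categories='auto', sparse=True)
    Xt = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1]])
    print(enc.categories_)   # the distinct values learned for each column
    print(Xt.shape)          # (3, 8): one column per observed value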

Re: [scikit-learn] One-hot encoding

2018-02-05 Thread Sarah Wait Zaranek
Thanks, this makes sense. I will try using the CategoricalEncoder to see the difference. It wouldn't be such a big deal if my input matrix wasn't so large. Thanks again for all your help. Cheers, Sarah On Mon, Feb 5, 2018 at 10:33 PM, Joel Nothman wrote: > Yes, the output CSR representation re

Re: [scikit-learn] One-hot encoding

2018-02-05 Thread Joel Nothman
Yes, the output CSR representation requires: 1 (dtype) value per entry, 1 int32 per entry, 1 int32 per row. The intermediate COO representation requires: 1 (dtype) value per entry, 2 int32 per entry. So as long as the transformation from COO to CSR is done over the whole data, it will occupy roughly 5
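A back-of-the-envelope check of that accounting (mine, not Joel's), assuming float64 data, int32 indices, and made-up sizes:

    n_rows, nnz = 1_000_000, 20_000_000

    csr_bytes = nnz * 8 + nnz * 4 + (n_rows + 1) * 4   # data + indices + indptr
    coo_bytes = nnz * 8 + nnz * 4 + nnz * 4            # data + row + col
    print(csr_bytes / 1e9, coo_bytes / 1e9)            # ~0.24 GB vs ~0.32 GB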

Re: [scikit-learn] One-hot encoding

2018-02-05 Thread Sarah Wait Zaranek
Yes, of course. What I mean is that I start out with 19 Gigs (initial matrix size) or so, it balloons to 100 Gigs *within the encoder function* and returns 28 Gigs (sparse one-hot matrix size). These numbers aren't exact, but you can see my point. Cheers, Sarah On Mon, Feb 5, 2018 at 9:50 PM, Jo

Re: [scikit-learn] One-hot encoding

2018-02-05 Thread Joel Nothman
OneHotEncoder will not magically reduce the size of your input. It will necessarily increase the memory of the input data as long as we are storing the results in scipy.sparse matrices. The sparse representation will be less expensive than the dense representation, but it won't be less expensive th

Re: [scikit-learn] One-hot encoding

2018-02-05 Thread Sarah Wait Zaranek
Hi Joel - I am also seeing a huge overhead in memory for calling the onehot-encoder. I have hacked around it by splitting my matrix into 4-5 smaller matrices (by columns) and then concatenating the results. But, I am seeing upwards of 100 Gigs overhead. Should I file a bug report? Or is this
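A sketch of the column-splitting workaround described here, assuming the 0.19-era OneHotEncoder, a dense integer matrix, and an arbitrary block count; the data is a stand-in:

    import numpy as np
    import scipy.sparse as sp
    from sklearn.preprocessing import OneHotEncoder

    X = np.random.randint(0, 10, size=(1000, 200))   # stand-in categorical matrix
    blocks = []
    for cols in np.array_split(np.arange(X.shape[1]), 5):
        enc = OneHotEncoder(sparse=True)
        blocks.append(enc.fit_transform(X[:, cols]))  # encode each column block
    X_onehot = sp.hstack(blocks, format='csr')        # stitch the blocks back together
    print(X_onehot.shape)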

Re: [scikit-learn] One-hot encoding

2018-02-04 Thread Sarah Wait Zaranek
Great. Thank you for all your help. Cheers, Sarah On Mon, Feb 5, 2018 at 12:56 AM, Joel Nothman wrote: > If you specify n_values=[list_of_vals_for_column1, > list_of_vals_for_column2], you should be able to engineer it to how you > want. > > On 5 February 2018 at 16:31, Sarah Wait Zaranek > w

Re: [scikit-learn] One-hot encoding

2018-02-04 Thread Joel Nothman
If you specify n_values=[list_of_vals_for_column1, list_of_vals_for_column2], you should be able to engineer it to how you want. On 5 February 2018 at 16:31, Sarah Wait Zaranek wrote: > If I use the n+1 approach, then I get the correct matrix, except with the > columns of zeros: > > >>> test > a
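A sketch of this suggestion, assuming the 0.19-era n_values parameter and using only the first three (non-truncated) rows of Sarah's example data; n_values[i] is the number of possible values in column i, i.e. the highest value + 1:

    from sklearn.preprocessing import OneHotEncoder

    # columns take values in range(8), range(3), range(4) respectively
    enc = OneHotEncoder(n_values=[8, 3, 4], sparse=False)
    test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1]])
    print(test.shape)   # (3, 15): 8 + 3 + 4 columns, including the all-zero ones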

Re: [scikit-learn] One-hot encoding

2018-02-04 Thread Sarah Wait Zaranek
If I use the n+1 approach, then I get the correct matrix, except with the columns of zeros: >>> test array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.], [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1.,

Re: [scikit-learn] One-hot encoding

2018-02-04 Thread Sarah Wait Zaranek
Hi Joel - Conceptually, that makes sense. But when I assign n_values, I can't make it match the result when you don't specify them. See below. I used the number of unique levels per column. >>> enc = OneHotEncoder(sparse=False) >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1,

Re: [scikit-learn] One-hot encoding

2018-02-04 Thread Joel Nothman
If each input column is encoded as a value from 0 to the (number of possible values for that column - 1) then n_values for that column should be the highest value + 1, which is also the number of levels per column. Does that make sense? Actually, I've realised there's a somewhat slow and unnecessa
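A short illustration (not from the message) of why the column counts differ, assuming the 0.19-era API: with n_values='auto' (the default) only columns for values actually seen are kept, while an explicit n_values keeps a column for every value in range(n_values[i]), including values that never occur:

    from sklearn.preprocessing import OneHotEncoder

    X = [[7, 0, 3], [1, 2, 0], [0, 2, 1]]
    auto_cols = OneHotEncoder(sparse=False).fit_transform(X).shape[1]
    full_cols = OneHotEncoder(n_values=[8, 3, 4], sparse=False).fit_transform(X).shape[1]
    print(auto_cols, full_cols)   # 8 active columns vs 15 total columns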

Re: [scikit-learn] One-hot encoding

2018-02-04 Thread Sarah Wait Zaranek
Sorry - your second message popped up when I was writing my response. I will look at this as well. Thanks for being so speedy! Cheers, Sarah On Sun, Feb 4, 2018 at 11:30 PM, Joel Nothman wrote: > You will also benefit from assume_finite (see http://scikit-learn.org/stable/modules/generat

Re: [scikit-learn] One-hot encoding

2018-02-04 Thread Sarah Wait Zaranek
Hi Joel - 20 million categorical variables. It comes from segmenting the genome into 20 million parts. Genomes are big :) For n_values, I am a bit confused. Is the input the same as the output for n_values? Originally, I thought it was just the number of levels per column, but it seems like it

Re: [scikit-learn] One-hot encoding

2018-02-04 Thread Joel Nothman
You will also benefit from assume_finite (see http://scikit-learn.org/stable/modules/generated/sklearn.config_context.html )
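The config_context usage Joel points to; the fit_transform call inside the block is only an illustration, assume_finite itself is the standard scikit-learn option:

    import sklearn
    from sklearn.preprocessing import OneHotEncoder

    enc = OneHotEncoder()
    with sklearn.config_context(assume_finite=True):
        # skips the finiteness validation pass over the data inside this block
        Xt = enc.fit_transform([[0, 2], [1, 0], [2, 1]])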

Re: [scikit-learn] One-hot encoding

2018-02-04 Thread Joel Nothman
20 million categories, or 20 million categorical variables? OneHotEncoder is pretty efficient if you specify n_values. On 5 February 2018 at 15:10, Sarah Wait Zaranek wrote: > Hello - > > I was just wondering if there was a way to improve performance on the > one-hot encoder. Or, is there any

[scikit-learn] One-hot encoding

2018-02-04 Thread Sarah Wait Zaranek
Hello - I was just wondering if there was a way to improve performance on the one-hot encoder. Or, are there any plans to do so in the future? I am working with a matrix that will ultimately have 20 million categorical variables, and my bottleneck is the one-hot encoder. Let me know if this isn'