What do you mean? It's pretty trivial to implement a one-hot encoding; the
issue is that if you use a non-sparse format then you'll end up with a
matrix which is far too large to be practical for anything but trivial
examples.
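For illustration, a minimal sketch of such an encoding assembled directly in
sparse COO format (the integer-coded toy input and the offset arithmetic are
illustrative assumptions, not scikit-learn's implementation):

    import numpy as np
    import scipy.sparse as sp

    X = np.array([[0, 2], [1, 0], [0, 1]])       # integer-coded categories
    n_values = X.max(axis=0) + 1                 # cardinality of each column
    offsets = np.concatenate([[0], np.cumsum(n_values)[:-1]])
    rows = np.repeat(np.arange(X.shape[0]), X.shape[1])
    cols = (X + offsets).ravel()                 # one active column per value
    data = np.ones(len(cols), dtype=np.float32)
    X_onehot = sp.coo_matrix((data, (rows, cols)),
                             shape=(X.shape[0], n_values.sum())).tocsr()

Only the nonzero entries are stored, so memory scales with the number of
samples times the number of original columns, not with the full count of
binary features.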
On Fri, Jun 21, 2013 at 10:46 AM, Maheshakya Wijewardena <
pmaheshak...@gmail.com> wrote:
I'd like to analyse it a bit and encode using that method so that it works
with the random forests in scikit-learn.
On Fri, Jun 21, 2013 at 2:08 PM, Peter Prettenhofer <
peter.prettenho...@gmail.com> wrote:
> ? you already use one-hot encoding in your example
> (preprocessing.OneHotEncoder)
>
>
> 2013/6/21 Maheshakya Wijewardena
? you already use one-hot encoding in your example
(preprocessing.OneHotEncoder)
2013/6/21 Maheshakya Wijewardena
> can anyone give me a sample algorithm for one hot encoding used in
> scikit-learn?
>
>
> On Thu, Jun 20, 2013 at 8:37 PM, Peter Prettenhofer <
> peter.prettenho...@gmail.com> wrote:
can anyone give me a sample algorithm for one hot encoding used in
scikit-learn?
On Thu, Jun 20, 2013 at 8:37 PM, Peter Prettenhofer <
peter.prettenho...@gmail.com> wrote:
> you can try an ordinal encoding instead - just map each categorical value
> to an integer so that you end up with 8 numerical features
you can try an ordinal encoding instead - just map each categorical value
to an integer so that you end up with 8 numerical features - if you use
enough trees and grow them deep it may work
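A minimal sketch of that mapping, assuming string-valued columns and applying
LabelEncoder column by column (the toy array is an illustration, not the
actual data):

    import numpy as np
    from sklearn.preprocessing import LabelEncoder

    X_cat = np.array([["a", "x"], ["b", "y"], ["a", "y"]], dtype=object)
    X_ord = np.empty(X_cat.shape, dtype=np.float32)
    for j in range(X_cat.shape[1]):
        # each column gets its own integer codes 0..K-1
        X_ord[:, j] = LabelEncoder().fit_transform(X_cat[:, j])

The integer codes impose an arbitrary ordering on the categories, which is
why the trees need to be numerous and deep to carve the categories apart
again.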
2013/6/20 Maheshakya Wijewardena
> And yes Gilles, It is the Amazon challenge :D
>
>
> On Thu, Jun 20, 2013 at 8:21 PM, Maheshakya Wijewardena wrote:
And yes Gilles, It is the Amazon challenge :D
On Thu, Jun 20, 2013 at 8:21 PM, Maheshakya Wijewardena <
pmaheshak...@gmail.com> wrote:
> The shape of X after encoding is (32769, 16600). Seems as if that is too
> big to be converted into a dense matrix. Can Random forest handle this
> amount of features?
The shape of X after encoding is (32769, 16600). Seems as if that is too
big to be converted into a dense matrix. Can Random forest handle this
amount of features?
On Thu, Jun 20, 2013 at 7:31 PM, Olivier Grisel wrote:
> 2013/6/20 Lars Buitinck :
> > 2013/6/20 Olivier Grisel :
> >>> Actually twice as much, even on a 32-bit platform (float size is
> >>> always 64 bits).
2013/6/20 Lars Buitinck :
> 2013/6/20 Gilles Louppe :
>> This looks like the dataset from the Amazon challenge currently
>> running on Kaggle. When one-hot-encoded, you end up with roughly
>> 15000 binary features, which means that the dense representation
>> requires at least 32000*15000*4 bytes to hold in memory (or even twice
>> as much depending on your platform).
2013/6/20 Lars Buitinck :
> 2013/6/20 Olivier Grisel :
>>> Actually twice as much, even on a 32-bit platform (float size is
>>> always 64 bits).
>>
>> The decision tree code always uses 32 bits floats:
>>
>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L38
>>
>> but you have to cast your data to `dtype=np.float32`.
So Maheshakya's `toarray` might work with
`X.astype(np.float32).toarray('F')`...
(But by "might work" I mean won't throw a ValueError...)
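Something like this, on a toy matrix (shape and density are illustrative,
and scipy.sparse.random assumes a reasonably recent SciPy):

    import numpy as np
    import scipy.sparse as sp

    X = sp.random(1000, 500, density=0.01, format="csr")  # float64 by default
    X_dense = X.astype(np.float32).toarray("F")
    # float32 halves the dense footprint and matches the dtype the tree
    # code uses internally: 1000 * 500 * 4 bytes instead of * 8
    print(X_dense.dtype, X_dense.nbytes)                  # float32 2000000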
On Thu, Jun 20, 2013 at 11:56 PM, Olivier Grisel
wrote:
> 2013/6/20 Lars Buitinck :
> > 2013/6/20 Gilles Louppe :
> >> This looks like the dataset from the Amazon challenge currently
> >> running on Kaggle.
2013/6/20 Olivier Grisel :
>> Actually twice as much, even on a 32-bit platform (float size is
>> always 64 bits).
>
> The decision tree code always uses 32 bits floats:
>
> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L38
>
> but you have to cast your data to `dtype=np.float32`.
2013/6/20 Gilles Louppe :
> This looks like the dataset from the Amazon challenge currently
> running on Kaggle. When one-hot-encoded, you end up with roughly
> 15000 binary features, which means that the dense representation
> requires at least 32000*15000*4 bytes to hold in memory (or even twice
> as much depending on your platform).
What is the cardinality of each feature?
Hi,
This looks like the dataset from the Amazon challenge currently
running on Kaggle. When one-hot-encoded, you end up with roughly
15000 binary features, which means that the dense representation
requires at least 32000*15000*4 bytes to hold in memory (or even twice
as much depending on your platform).
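Back-of-the-envelope, using the exact shape reported earlier in the thread:

    n_samples, n_features = 32769, 16600
    print(n_samples * n_features * 4 / 1e9)  # ~2.2 GB dense as float32
    print(n_samples * n_features * 8 / 1e9)  # ~4.4 GB dense as float64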
Hi Maheshakya,
It's probably right: your feature space is too big and sparse to be
reasonable for random forests. What sort of categorical data are you
encoding? What is the shape of the matrix after applying one-hot encoding?
If you need to use random forests, and not a method that natively handles
sparse input…
2013/6/20 Maheshakya Wijewardena :
> The shape is (32769, 8). There are 8 categorical variables before applying
> OneHotEncoding.
And what is the shape after?
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
The shape is (32769, 8). There are 8 categorical variables before applying
OneHotEncoding.
On Thu, Jun 20, 2013 at 5:43 PM, Peter Prettenhofer <
peter.prettenho...@gmail.com> wrote:
>
> Hi,
>
> seems like your sparse matrix is too large to be converted to a dense
> matrix. What shape does X have?
Hi,
seems like your sparse matrix is too large to be converted to a dense
matrix. What shape does X have? How many categorical variables do you have
(before applying the OneHotEncoder)?
Hi,
I'm new to scikit-learn. I'm trying to use preprocessing.OneHotEncoder to
encode my training and test data. After encoding I tried to train a random
forest classifier on that data, but I get the following error when
fitting.
(Here is the error trace:)
    model.fit(X_train, y_train)
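A hedged reconstruction of that setup (the toy data, n_estimators, and the
variable names are assumptions; the point is that densifying a large one-hot
matrix is the step that fails):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.RandomState(0)
    X_train = rng.randint(0, 10, size=(100, 8))  # 8 categorical columns
    y_train = rng.randint(0, 2, size=100)

    enc = OneHotEncoder()
    X_enc = enc.fit_transform(X_train)           # sparse CSR matrix

    model = RandomForestClassifier(n_estimators=10)
    # fine at toy scale; at (32769, 16600) this dense conversion exhausts RAM
    model.fit(X_enc.toarray(), y_train)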