On 22 March 2014 06:16, Lars Buitinck <[email protected]> wrote:
> 2014-03-22 0:04 GMT+01:00 Anitha Gollamudi <[email protected]>:
>> Here the shapes of X_train and X_test are obviously different.
>>
>>>>> print X_train.shape
>> (11, 1617899)
>>>>> print X_test.shape
>> (3, 83715)
>>>>>
>>
>> So an exception is raised:
>>
>> ValueError: X has 83715 features per sample; expecting 1617899
>>
>> Is this expected behaviour?
>
> Yes, or we wouldn't do this explicit check. The number of columns in X
> should *always* be equal to the number at training time and the same
> columns should be used to indicate the same features. The vectorizers
> in sklearn.feature_extraction enforce this, so that the models
> themselves can be kept agnostic of the meaning of the columns.
>
> Question: how did you do the feature extraction?

OK. I misunderstood the documentation when it said the classifiers can
work with CSR matrices. I assumed feature extraction was not important
and so skipped it in this case. The input is multi-label, multi-class
data in libSVM format. I loaded it with load_svmlight_file and fed the
result straight to the classifier. Clearly, that was incorrect.
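
One way the column mismatch could be avoided when loading libSVM files
is to pass the training width to load_svmlight_file through its
n_features parameter. The sketch below is only an illustration: the two
tiny files and their contents are made up, and multi-label data would
additionally need multilabel=True.

```python
import os
import tempfile

from sklearn.datasets import load_svmlight_file

# Two tiny, made-up libSVM files; the test file mentions fewer columns.
train_text = "1 1:0.5 4:1.0\n0 2:2.0 6:1.5\n"
test_text = "1 1:1.0 3:0.5\n"

tmpdir = tempfile.mkdtemp()
train_path = os.path.join(tmpdir, "train.svm")
test_path = os.path.join(tmpdir, "test.svm")
with open(train_path, "w") as f:
    f.write(train_text)
with open(test_path, "w") as f:
    f.write(test_text)

# Without n_features, each file's width is inferred from its own largest
# feature index, so the two matrices end up with different column counts.
X_train, y_train = load_svmlight_file(train_path)
X_test_raw, _ = load_svmlight_file(test_path)

# Passing n_features forces the test matrix to the training width.
X_test, y_test = load_svmlight_file(test_path, n_features=X_train.shape[1])

print(X_train.shape, X_test_raw.shape, X_test.shape)
```

This only lines up the column counts; it still assumes that a given
feature index means the same thing in both files.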

A related question: since the feature set is quite large (~1.6 million
features and 2.6 million training samples), a TF-IDF transformation
failed with a memory error (my machine has 4 GB of memory). To work
around this limitation, I am trying out-of-core scaling with
mini-batches. Is this the correct approach? What other approaches could
I try?
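
For reference, a minimal sketch of mini-batch training on sparse input,
assuming a linear model with partial_fit such as SGDClassifier is
acceptable. The synthetic data, the batch size and the three class
labels are placeholders, not the real data set; on real data each batch
would be read from disk rather than sliced from one in-memory matrix.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for the real data: 200 samples, 50 features, 3 classes.
rng = np.random.RandomState(0)
X = sparse.random(200, 50, density=0.1, format="csr", random_state=rng)
y = rng.randint(0, 3, size=200)

# partial_fit needs the full set of class labels on the first call.
classes = np.array([0, 1, 2])

clf = SGDClassifier(random_state=0)
batch_size = 40  # hypothetical; pick whatever fits in memory

for start in range(0, X.shape[0], batch_size):
    stop = start + batch_size
    # Each call updates the model using only the current batch of rows.
    clf.partial_fit(X[start:stop], y[start:stop], classes=classes)

predictions = clf.predict(X[:5])
```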

Also, would you suggest using HashingVectorizer for feature extraction?
(I am doing document classification on a wiki data set.)
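
A rough sketch of how HashingVectorizer keeps the column count fixed,
assuming the raw document text is available (which may not hold for
data already converted to libSVM format). The documents and the
n_features value here are arbitrary examples.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Made-up documents standing in for wiki articles.
train_docs = [
    "sparse matrices save memory",
    "linear models scale to many features",
]
test_docs = ["hashing maps unseen words onto existing columns"]

# n_features fixes the output width up front; 2 ** 18 is an arbitrary choice.
vectorizer = HashingVectorizer(n_features=2 ** 18)

# The vectorizer is stateless, so transform() needs no prior fit() and
# no vocabulary is held in memory.
X_train = vectorizer.transform(train_docs)
X_test = vectorizer.transform(test_docs)

print(X_train.shape, X_test.shape)
```

Because the width is set by n_features rather than by the vocabulary,
training and test matrices have the same number of columns by
construction.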


-Anitha

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
