I see three places where you can run out of memory:

1. To get a sparse output from LabelBinarizer (in my PR), you need
to pass the argument dense_output=False to the constructor.
Otherwise, it will try to create a dense indicator matrix.

2. If the model of each SGD classifier is not sparse, you might run out
of memory just storing all the fitted models.

3. Depending on the number of test samples, you can run out of memory by
concatenating the decision function values of all test samples.
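
To make point 1 concrete, here is a rough back-of-the-envelope estimate of
the indicator matrix size in dense versus CSR form. It assumes that
2.6 million is the number of training samples; the number of labels and the
average number of labels per sample are not given in this thread, so the
values below are hypothetical, as are the two helper functions.

```python
def dense_bytes(n_samples, n_labels, itemsize=8):
    """Bytes needed for a dense (n_samples x n_labels) indicator matrix."""
    return n_samples * n_labels * itemsize


def csr_bytes(n_samples, nnz, itemsize=8, index_size=4):
    """Approximate bytes for the same matrix stored in CSR format.

    CSR keeps one value and one column index per nonzero entry,
    plus one row pointer per row (and one extra at the end).
    """
    return nnz * (itemsize + index_size) + (n_samples + 1) * index_size


n_samples = 2_600_000      # taken from the thread (assumed to be the sample count)
n_labels = 10_000          # hypothetical: not stated in the thread
labels_per_sample = 3      # hypothetical average

dense = dense_bytes(n_samples, n_labels)
sparse = csr_bytes(n_samples, n_samples * labels_per_sample)

print("dense : %.1f GB" % (dense / 1e9))
print("sparse: %.3f GB" % (sparse / 1e9))
```

With these made-up numbers the dense matrix is three orders of magnitude
larger than the sparse one, which is why the dense default fails here.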

Issue 2 is model dependent, but issue 3 can be handled by following Joel’s
advice.
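
A minimal sketch of one way to handle issue 3, assuming it is acceptable to
process the test samples in batches. This is not taken from Joel's message;
the function name, the batch size and the toy estimator class are made up
for illustration. Only one (batch_size x n_labels) block of decision values
exists at a time, and each sample keeps just the indices of its predicted
labels.

```python
import numpy as np


def predict_labels_in_batches(estimators, X, batch_size=1000, threshold=0.0):
    """Yield, per sample, a tuple of label indices scoring above threshold."""
    n_samples = X.shape[0]
    for start in range(0, n_samples, batch_size):
        batch = X[start:start + batch_size]
        # One column of decision values per binary classifier,
        # shape (rows in this batch, n_labels).
        scores = np.column_stack(
            [est.decision_function(batch) for est in estimators]
        )
        for row in scores:
            yield tuple(int(j) for j in np.flatnonzero(row > threshold))


class ToyLinear:
    """Hypothetical stand-in for a fitted binary classifier."""

    def __init__(self, w, b):
        self.w = np.asarray(w, dtype=float)
        self.b = float(b)

    def decision_function(self, X):
        return X @ self.w + self.b


# Tiny demonstration with two labels and four samples.
X_demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
toy_estimators = [ToyLinear([1, 0], -0.5), ToyLinear([0, 1], -0.5)]
predicted = list(predict_labels_in_batches(toy_estimators, X_demo, batch_size=3))
print(predicted)
```

The yielded indices refer to the columns of the binarized label matrix, so
they can be mapped back to the original labels through the binarizer's
classes_ attribute.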

Best,
Arnaud

On 24 Mar 2014, at 19:49, Anitha Gollamudi <[email protected]> wrote:

> Quoting Arnaud Joly <[email protected]>:
>> Can you provide a gist of your code so that we can help you?
> 
> I have an implementation that mimics OneVsRestClassifier. I want to
> eventually try partial_fit, since the number of samples is large. Here
> is the rough outline.
> 
> ========================================================
> # Train data = 2.6 million * 1.6 million matrix in dense form
> # Multi-label text classification, so the matrix is very sparse
> # Each sample has about 100 features (upper limit)
> import numpy as np
> from sklearn.base import clone, is_classifier
> from sklearn.datasets import load_svmlight_file
> from sklearn.linear_model import SGDClassifier
> from sklearn.preprocessing import LabelBinarizer
>
> X_train, y_train = load_svmlight_file("train.csv", multilabel=True)
> X_test, y_test = load_svmlight_file("test.csv", multilabel=True)
>
> classifier = SGDClassifier()
>
> # Binarize labels
> lb = LabelBinarizer()
> Y = lb.fit_transform(y_train)
>
> estimatorlist = []
> # Fit one clone of the classifier per label column
> for i in range(Y.shape[1]):
>    estimatorlist.append(clone(classifier))
>    estimatorlist[i].fit(X_train, Y[:, i])    # should be partial_fit eventually
>
> print("\n")
> print("Trained data successfully\n\n")
>
>
> # Testing: one column of decision values per label
> Y_tmp = np.array([est.decision_function(X_test) for est in estimatorlist]).T
>
> e = estimatorlist[0]
> thresh = 0 if hasattr(e, "decision_function") and is_classifier(e) else .5
> predicted = lb.inverse_transform(Y_tmp, threshold=thresh)
> 
> ================================================================
> 
> 
>> 
>> 
>> PR 2458 isn't finished yet, and there may be some corner cases where
>> it fails. However, in the branch
>> https://github.com/arjoly/scikit-learn/commits/sparse-label_binarizer,
>> I have almost finished the label binarizer part.
> 
> I have checked out this branch separately and tried. It still results
> in MemoryError.
> 
> 
>> 
>> I can try to find some time to finish the sparse label binarizer if you
>> want to add sparse output support to the OneVsRestClassifier.
> 
> Ideally I would like to use OneVsRestClassifier, but as explained
> above, without a partial_fit I am not sure the class can help me at
> this point. I see two issues:
>
> First - MemoryError due to LabelBinarizer
> Second - MemoryError due to training on a large data set.
>
> Because of the second issue, I am using an out-of-core classification
> approach and therefore mimicked the OneVsRestClassifier implementation
> as necessary. If this is not correct, suggestions are welcome.
> 
>> Several hints on how to do it properly have already been suggested in
>> - https://github.com/scikit-learn/scikit-learn/pull/2458
>> -
> 
> I applied the patch file from pull#2458 as suggested in githelp to my
> local clone.
> 
> 
> curl https://github.com/scikit-learn/scikit-learn/pull/2458.patch | git am
> 
> .git/rebase-apply/patch:31: trailing whitespace.
>        Y = csc_matrix((data, (row, col)), shape = (len(y), len(classes)))
> error: patch failed: sklearn/preprocessing/label.py:395
> error: sklearn/preprocessing/label.py: patch does not apply
> Patch failed at 0001 Modify label_binarize
> 
> -Anitha
> 
> 
> On 24 March 2014 03:17, Arnaud Joly <[email protected]> wrote:
>> Hi,
>> 
>> Can you provide a gist of your code so that we can help you?
>> 
>> 
>> PR 2458 isn't finished yet, and there may be some corner cases where
>> it fails. However, in the branch
>> https://github.com/arjoly/scikit-learn/commits/sparse-label_binarizer,
>> I have almost finished the label binarizer part.
>> 
>> I can try to find some time to finish the sparse label binarizer if you
>> want to add sparse output support to the OneVsRestClassifier.
>> Several hints on how to do it properly have already been suggested in
>> - https://github.com/scikit-learn/scikit-learn/pull/2458
>> -
>> https://github.com/arjoly/scikit-learn/commit/fc18c35047bd14b71862377cc34ad9b0ae9f5b65
>> 
>> Best,
>> Arnaud
>> 
>> On 24 Mar 2014, at 01:17, Anitha Gollamudi <[email protected]> wrote:
>> 
>> Hi
>> 
>> 
>> For a multi-label classification problem, I am using LabelBinarizer,
>> which is giving me a memory error. I found a pull request for this
>> issue. I am wondering if I can pull it now?
>> 
>> https://github.com/scikit-learn/scikit-learn/pull/2458
>> 
>> 
>> Also, applying it to the latest trunk (master) gives me an error. I am
>> not sure if I am doing it right. Can someone suggest an alternative to
>> using LabelBinarizer? (I am almost a newbie to scikit-learn, so any
>> suggestion is welcome.)
>> 
>> 
>> curl https://github.com/scikit-learn/scikit-learn/pull/2458.patch | git am
>> 
>> % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>>                                Dload  Upload   Total   Spent    Left  Speed
>> 100 61961  100 61961    0     0   123k      0 --:--:-- --:--:-- --:--:--  360k
>> Applying: Modify label_binarize
>> /home/ag68/scikit/scikit-learn/.git/rebase-apply/patch:31: trailing whitespace.
>>       Y = csc_matrix((data, (row, col)), shape = (len(y), len(classes)))
>> error: sklearn/preprocessing/label.py: does not match index
>> Patch failed at 0001 Modify label_binarize
>> 
>> 
>> -Anitha
>> 
>> ------------------------------------------------------------------------------
>> Learn Graph Databases - Download FREE O'Reilly Book
>> "Graph Databases" is the definitive new guide to graph databases and their
>> applications. Written by three acclaimed leaders in the field,
>> this first edition is now available. Download your free book today!
>> http://p.sf.net/sfu/13534_NeoTech
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>> 
>> 
>> 
>> 
> 


