Quoting Arnaud Joly <[email protected]>:
> Can you provide a gist of your code as to help you?

I have an implementation that mimics OneVsRestClassifier. I want to
eventually try partial_fit, since the number of samples is large. Here
is the rough outline.

========================================================
import numpy as np
from sklearn.base import clone, is_classifier
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import LabelBinarizer

# Training data: 2.6 million x 1.6 million matrix if stored densely
# Multi-label text classification, so the matrix is very sparse
# Each sample has about 100 features (upper limit)
X_train, y_train = load_svmlight_file("train.csv", multilabel=True)
# Pass n_features so the test matrix has the same width as the training matrix
X_test, y_test = load_svmlight_file("test.csv", n_features=X_train.shape[1],
                                    multilabel=True)

classifier = SGDClassifier()

# Binarize labels
lb = LabelBinarizer()
Y = lb.fit_transform(y_train)

estimatorlist = []
# Fit one clone of the classifier per label column
for i in range(Y.shape[1]):
    estimatorlist.append(clone(classifier))
    estimatorlist[i].fit(X_train, Y[:, i])    # should be partial_fit eventually

print("\n")
print("Trained data successfully\n\n")


# Testing: one column of decision values per label
Y_tmp = np.array([est.decision_function(X_test) for est in estimatorlist]).T

e = estimatorlist[0]
thresh = 0 if hasattr(e, "decision_function") and is_classifier(e) else .5
predicted = lb.inverse_transform(Y_tmp, threshold=thresh)

================================================================
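
For a sense of scale, here is a back-of-the-envelope estimate of the
training matrix size in dense versus CSR form. The 8-byte values and
4-byte indices are assumptions on my part, not something measured on
the real data.

```python
# Rough memory estimate for the training matrix described in the outline.
n_samples = 2_600_000
n_features = 1_600_000
nnz_per_row = 100        # upper limit on features per sample
bytes_per_value = 8      # assumption: float64 values
bytes_per_index = 4      # assumption: int32 indices

# Dense storage keeps every entry, zero or not.
dense_bytes = n_samples * n_features * bytes_per_value

# CSR keeps one value and one column index per nonzero,
# plus one row pointer per row (and one extra at the end).
sparse_bytes = (n_samples * nnz_per_row * (bytes_per_value + bytes_per_index)
                + (n_samples + 1) * bytes_per_index)

print("dense : %.1f TB" % (dense_bytes / 1e12))
print("sparse: %.1f GB" % (sparse_bytes / 1e9))
```

Under these assumptions the dense form is in the tens of terabytes,
while the sparse form is a few gigabytes.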


>
>
> The pr 2458 isnt finished yet and there is possibly some quirk cases where
> it might fail. However in the branch
> https://github.com/arjoly/scikit-learn/commits/sparse-label_binarizer,
> I almost finished the label binarizer part.

I checked out this branch separately and tried it. It still results
in a MemoryError.


>
> I can try to find some time to finish the sparse label binarizer if you
> want to add sparse output support to the OneVsRestClassifier.

Ideally I would like to use OneVsRestClassifier but, as explained
above, without a partial_fit I am not sure the class can help me at
this point. I see two issues:

First - MemoryError due to LabelBinarizer
Second - MemoryError due to training on a large data set.
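
For the first issue, one possible workaround is to build the label
indicator matrix directly in sparse form. This is only a sketch: the
helper name sparse_label_indicator is hypothetical (it is not part of
scikit-learn), and it assumes the labels arrive as a sequence of
tuples, as with load_svmlight_file(..., multilabel=True).

```python
import numpy as np
from scipy.sparse import csr_matrix


def sparse_label_indicator(y):
    """Build a sparse (n_samples, n_classes) 0/1 indicator matrix.

    Hypothetical helper: `y` is a sequence of label tuples, one per sample.
    """
    classes = sorted(set(label for labels in y for label in labels))
    col_of = {label: j for j, label in enumerate(classes)}

    indices = []   # column index of every nonzero entry
    indptr = [0]   # row i owns indices[indptr[i]:indptr[i + 1]]
    for labels in y:
        indices.extend(sorted(col_of[label] for label in set(labels)))
        indptr.append(len(indices))

    data = np.ones(len(indices), dtype=np.int8)
    Y = csr_matrix((data, indices, indptr), shape=(len(y), len(classes)))
    return Y, np.array(classes)


# Toy labels: four samples, three distinct classes, one sample unlabeled.
Y, classes = sparse_label_indicator([(1.0, 3.0), (), (3.0,), (2.0, 1.0)])
print(Y.shape, Y.nnz)
```

A single label column can then be pulled out as a dense vector only
when it is needed, for example with Y.tocsc()[:, i].toarray().ravel().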

Because of the second issue, I am using an out-of-core classification
approach, and therefore mimicked the OneVsRestClassifier implementation
as necessary. If this is not correct, suggestions are welcome.
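
Roughly, the out-of-core version I have in mind looks like the sketch
below. The helper name partial_fit_ovr, the toy random data and the
batch size of 50 are all made up for illustration, and the label
batches are assumed to be dense 0/1 arrays.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import SGDClassifier


def partial_fit_ovr(estimators, X_batch, Y_batch):
    """Update one binary estimator per label column with one mini-batch."""
    for j, est in enumerate(estimators):
        # `classes` is required on the first partial_fit call; passing the
        # same array on later calls is accepted as well.
        est.partial_fit(X_batch, Y_batch[:, j], classes=np.array([0, 1]))
    return estimators


# Toy data standing in for the real training set: 200 samples, 3 labels.
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
Y = (X[:, :3] > 0.5).astype(int)

estimators = [clone(SGDClassifier(random_state=0)) for _ in range(Y.shape[1])]

# Feed the data in mini-batches instead of calling fit on everything at once.
for start in range(0, X.shape[0], 50):
    batch = slice(start, start + 50)
    partial_fit_ovr(estimators, X[batch], Y[batch])

# One column of decision values per label, thresholded at 0.
scores = np.array([est.decision_function(X) for est in estimators]).T
predicted = (scores > 0).astype(int)
```

With the real data, each mini-batch would be read from disk rather
than sliced from an in-memory array.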

> Several hints on how to do it properly has already been suggested in
> - https://github.com/scikit-learn/scikit-learn/pull/2458
> -

I applied the patch file from pull request #2458 to my local clone, as
suggested in githelp.


curl https://github.com/scikit-learn/scikit-learn/pull/2458.patch | git am

.git/rebase-apply/patch:31: trailing whitespace.
        Y = csc_matrix((data, (row, col)), shape = (len(y), len(classes)))
error: patch failed: sklearn/preprocessing/label.py:395
error: sklearn/preprocessing/label.py: patch does not apply
Patch failed at 0001 Modify label_binarize

-Anitha


On 24 March 2014 03:17, Arnaud Joly <[email protected]> wrote:
> Hi,
>
> Can you provide a gist of your code as to help you?
>
>
> The pr 2458 isnt finished yet and there is possibly some quirk cases where
> it might fail. However in the branch
> https://github.com/arjoly/scikit-learn/commits/sparse-label_binarizer,
> I almost finished the label binarizer part.
>
> I can try to find some time to finish the sparse label binarizer if you
> want to add sparse output support to the OneVsRestClassifier.
> Several hints on how to do it properly has already been suggested in
> - https://github.com/scikit-learn/scikit-learn/pull/2458
> -
> https://github.com/arjoly/scikit-learn/commit/fc18c35047bd14b71862377cc34ad9b0ae9f5b65
>
> Best,
> Arnaud
>
> On 24 Mar 2014, at 01:17, Anitha Gollamudi <[email protected]> wrote:
>
> Hi
>
>
> For a multi-label classification problem, I am using labelBinarizer
> which is giving me memory error.  I found a pull request for this
> issue. I am wondering if I can pull it now?
>
> https://github.com/scikit-learn/scikit-learn/pull/2458
>
>
> Also applying to the latest trunk(master), gives me an error. I am not
> sure if I am doing right.  Can someone suggest an alternative to using
> LabelBinarizer? (almost a newbie to scikit, so any suggestion welcome)
>
>
> curl https://github.com/scikit-learn/scikit-learn/pull/2458.patch | git am
>
>  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                 Dload  Upload   Total   Spent    Left  Speed
> 100 61961  100 61961    0     0   123k      0 --:--:-- --:--:-- --:--:--  360k
> Applying: Modify label_binarize
> /home/ag68/scikit/scikit-learn/.git/rebase-apply/patch:31: trailing whitespace.
>        Y = csc_matrix((data, (row, col)), shape = (len(y), len(classes)))
> error: sklearn/preprocessing/label.py: does not match index
> Patch failed at 0001 Modify label_binarize
>
>
> -Anitha
>
> ------------------------------------------------------------------------------
> Learn Graph Databases - Download FREE O'Reilly Book
> "Graph Databases" is the definitive new guide to graph databases and their
> applications. Written by three acclaimed leaders in the field,
> this first edition is now available. Download your free book today!
> http://p.sf.net/sfu/13534_NeoTech
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>