Re: [Scikit-learn-general] Unable to test a dummy classifier with a score function that requires a probability estimate

2013-08-15 Thread Arnaud Joly
The class _ThresholdScorer in sklearn.metrics.scorer needs to be patched to accept multi-label input. A pull request is welcome! Best regards, Arnaud On 14 Aug 2013, at 17:35, Josh Wasserstein wrote: > Say I define the following scoring function: > > def multi_label_macro_auc(y_gt, y_pred):
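
The body of the scoring function quoted above is cut off in this archive, so what follows is not the poster's code. It is only a minimal sketch of what a macro-averaged multi-label AUC can look like, assuming a recent scikit-learn in which roc_auc_score accepts a label-indicator matrix. The function name macro_auc_sketch and the toy arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def macro_auc_sketch(y_true, y_score):
    """Hypothetical helper: unweighted mean of the per-label ROC AUC."""
    # average="macro" computes one AUC per label column, then averages them.
    return roc_auc_score(y_true, y_score, average="macro")


# Toy label-indicator matrix (rows = samples, columns = labels) and
# continuous scores of the same shape. Both are invented for this sketch.
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_score = np.array([[0.9, 0.2], [0.1, 0.8], [0.8, 0.7], [0.3, 0.1]])

auc = macro_auc_sketch(y_true, y_score)
print(auc)
```

Note that the scores must be continuous (probabilities or decision values), which is why a scorer for this metric needs more than hard predictions.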

Re: [Scikit-learn-general] Classification accuracy too high

2013-08-15 Thread Jason Williams
After removing the label from the training set and rerunning the process (fit, predict, etc.), the result looks reasonable. Thank you very much. - Original Message - From: Andreas Mueller To: Jason Williams ; scikit-learn-general@lists.sourceforge.net Cc: Sent: Thursday, 15 August 2013,
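
To illustrate why leaving the label among the input features inflates accuracy, here is a small sketch on synthetic data. The dataset, the noise level and the helper name holdout_accuracy are assumptions for illustration, not the poster's setup. The split uses the same kind of uniform random mask mentioned elsewhere in this thread.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Synthetic, deliberately noisy data standing in for the real dataset.
X, y = make_classification(n_samples=600, n_features=10, flip_y=0.3,
                           random_state=0)

# Random split of roughly 75% train / 25% test via a boolean mask.
is_train = rng.uniform(0, 1, len(y)) <= 0.75


def holdout_accuracy(features):
    """Fit on the training rows, then score on the held-out rows."""
    rfc = RandomForestClassifier(n_estimators=50, random_state=0)
    rfc.fit(features[is_train], y[is_train])
    return rfc.score(features[~is_train], y[~is_train])


# Leaky: the label itself is appended as an extra feature column.
leaky_accuracy = holdout_accuracy(np.column_stack([X, y]))
# Clean: features only.
clean_accuracy = holdout_accuracy(X)

print("with label column:    %.3f" % leaky_accuracy)
print("without label column: %.3f" % clean_accuracy)
```

With the label column present the forest can simply read off the answer, so the held-out accuracy says nothing about how well the other features predict the target.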

Re: [Scikit-learn-general] Classification accuracy too high

2013-08-15 Thread Andreas Mueller
On 08/15/2013 01:08 PM, Jason Williams wrote: > I follow the sample at > http://blog.yhathq.com/posts/random-forests-in-python.html where it randomly > assigns true, false to the dataset > > np.random.uniform(0, 1, len(df)) <= .75 > > then partition dataset into train set and test set. I use

Re: [Scikit-learn-general] Tackling Dataset bias

2013-08-15 Thread Yogesh Karpate
Thanks a lot, Olivier, for suggesting Alex's blog. My apologies! Let me rephrase my problem. I have two data sets of brain MR images; let's call them A and B. A was acquired in one country and B in another. Data set A contains both patients having pathology and healthy volunteers, whereas data set B contain

Re: [Scikit-learn-general] Classification accuracy too high

2013-08-15 Thread Jason Williams
I followed the sample at http://blog.yhathq.com/posts/random-forests-in-python.html, which randomly assigns true or false to each row of the dataset with np.random.uniform(0, 1, len(df)) <= .75 and then partitions the dataset into a train set and a test set. I created the model the same way: rfc = RandomForestC

Re: [Scikit-learn-general] Does LinearSVC support probability/soft outputs out of the box?

2013-08-15 Thread Olivier Grisel
LinearSVC does not predict probabilities but the linear decision function is made available as the decision_function method. It should be possible to train a calibration model to turn those raw decision values into probabilities using an IsotonicRegression model [1] and cross-validation. There is n
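
The calibration idea described above could be sketched roughly as follows. This is one possible reading of the suggestion, not code from the thread: it assumes a recent scikit-learn (sklearn.model_selection.cross_val_predict with method="decision_function"), a binary problem, and synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.svm import LinearSVC

# Synthetic binary problem standing in for real data.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

svc = LinearSVC(max_iter=10000)

# Out-of-fold decision values: each training point is scored by a model
# that did not see it, so the calibrator is not fit on overconfident scores.
oof_scores = cross_val_predict(svc, X_train, y_train, cv=5,
                               method="decision_function")

# Monotonic map from raw decision value to an estimate of P(y = 1).
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(oof_scores, y_train)

# Refit the SVM on all training data, then calibrate its test scores.
svc.fit(X_train, y_train)
proba_positive = calibrator.predict(svc.decision_function(X_test))
print(proba_positive[:5])
```

Later scikit-learn releases ship CalibratedClassifierCV, which packages this kind of cross-validated calibration behind a single estimator.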

Re: [Scikit-learn-general] Tackling Dataset bias

2013-08-15 Thread Olivier Grisel
I don't really understand what the samples, the labels and the features are in your case, how much unlabeled data you have, or what you mean by "I have completed the classification task on 1st database.": if you have labeled datasets, what does "completion of the classification task" mean?

[Scikit-learn-general] Tackling Dataset bias

2013-08-15 Thread Yogesh Karpate
Hello folks! I have two different brain MR image databases acquired in two different countries. I need to perform a patch-based supervised binary classification task (+ pathology and - normal). The 1st database contains both +pathology patients and -normal subjects whereas second

Re: [Scikit-learn-general] RidgeClassifier

2013-08-15 Thread Gilles Louppe
You can also try nearest neighbors, which have also accepted an output matrix since 0.14. On 15 August 2013 09:52, Gilles Louppe wrote: >> If I understand you correctly, you're trying to do multilabel classification >> by converting the problem to a multitask binary classification problem. >> Unfortun
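
As a small illustration of the nearest-neighbors suggestion, the sketch below fits KNeighborsClassifier directly on a 2-D label-indicator matrix. The synthetic data and parameter values are assumptions for illustration only.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.neighbors import KNeighborsClassifier

# Y is a (n_samples, n_labels) matrix of 0/1 indicators.
X, Y = make_multilabel_classification(n_samples=200, n_features=20,
                                      n_classes=4, random_state=0)

# The whole output matrix is passed to fit in one call.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, Y)

# Predictions come back with one column per label.
Y_pred = knn.predict(X[:10])
print(Y_pred.shape)
```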

Re: [Scikit-learn-general] RidgeClassifier

2013-08-15 Thread Gilles Louppe
> If I understand you correctly, you're trying to do multilabel classification > by converting the problem to a multitask binary classification problem. > Unfortunately, no classifier in scikit-learn can accept an output matrix. > You need to solve each task independently by fitting a classifier wi
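
The "solve each task independently" approach from the quoted text could look like the sketch below: one binary classifier per column of the output matrix. RidgeClassifier is used only because it is the subject of this thread, and the synthetic data is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import RidgeClassifier

# Y is a (n_samples, n_labels) matrix of 0/1 indicators.
X, Y = make_multilabel_classification(n_samples=200, n_features=20,
                                      n_classes=4, random_state=0)

# Fit one independent binary classifier per label column.
classifiers = []
for j in range(Y.shape[1]):
    clf = RidgeClassifier()
    clf.fit(X, Y[:, j])
    classifiers.append(clf)

# Stack the per-label predictions back into an output matrix.
Y_pred = np.column_stack([clf.predict(X) for clf in classifiers])
print(Y_pred.shape)
```

The OneVsRestClassifier wrapper in sklearn.multiclass performs the same per-label fitting when it is given an indicator matrix.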

Re: [Scikit-learn-general] Classification accuracy too high

2013-08-15 Thread Gilles Louppe
Hi Jason, It looks like you are evaluating your error on your training data, aren't you? It will give you a (very) poor estimate of the generalization error of your model. Instead, try your model on an independent part of your dataset (in particular, one which has not been used to fit to your fo
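
A minimal sketch of the held-out evaluation recommended above, assuming a recent scikit-learn (train_test_split lives in sklearn.model_selection there; 2013-era releases had it in sklearn.cross_validation) and synthetic noisy data in place of the poster's dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with label noise, so a perfect score is not achievable
# on unseen samples.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)

# Keep 25% of the samples aside; the model never sees them during fit.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

rfc = RandomForestClassifier(n_estimators=50, random_state=0)
rfc.fit(X_train, y_train)

# Accuracy on the training data is optimistic; the held-out accuracy is
# the estimate of generalization performance.
train_accuracy = rfc.score(X_train, y_train)
test_accuracy = rfc.score(X_test, y_test)
print("train accuracy: %.3f" % train_accuracy)
print("test accuracy:  %.3f" % test_accuracy)
```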

Re: [Scikit-learn-general] Selective multiclass

2013-08-15 Thread Joel Nothman
Or perhaps, since it's a bug in multiprocessing's queuing protocol, joblib could handle it by writing oversized objects to disk, assuming there's enough free space in $TMPDIR. On Thu, Aug 15, 2013 at 5:32 PM, Joel Nothman wrote: > I've been getting "SystemError: NULL result without error in > PyObj

Re: [Scikit-learn-general] Selective multiclass

2013-08-15 Thread Joel Nothman
I've been getting "SystemError: NULL result without error in PyObject_Call" when trying to perform a parallel grid search (with logistic regression, n_jobs>=2) with a very large matrix. So it comes down to http://bugs.python.org/issue17560. It would be good if we could make the error a little less

Re: [Scikit-learn-general] Does LinearSVC support probability/soft outputs out of the box?

2013-08-15 Thread Lars Buitinck
2013/8/15 Josh Wasserstein : > It looks like it doesn't, but I just wanted to make sure. No. You can use LogisticRegression, which uses the same training algorithm (Liblinear) but a different objective function (log-loss).
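
For illustration, the LogisticRegression alternative mentioned above exposes probabilities through predict_proba. The sketch assumes a recent scikit-learn, where the Liblinear backend is selected explicitly with solver="liblinear", and uses synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem standing in for real data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Liblinear-backed logistic regression (log-loss objective).
clf = LogisticRegression(solver="liblinear")
clf.fit(X, y)

# One row per sample, one column per class; each row sums to 1.
proba = clf.predict_proba(X[:5])
print(proba)
```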

Re: [Scikit-learn-general] Classification accuracy too high

2013-08-15 Thread Robert Layton
The first thing I'd do is publish the result (just kidding!). Try it with another data set first, especially one that has an example in the docs. If you are still getting top marks, it may be your "framework" around the code. (are you doing proper test/train splits, etc) If it drops, consider that

[Scikit-learn-general] Classification accuracy too high

2013-08-15 Thread Jason Williams
I ran a few tests based on the Random Forest Classifier (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) with default settings. The classification (I repeated the procedure several times) is nearly 100% correct. That seems to be overfitting.