Re: [scikit-learn] logistic regression results are not stable between solvers

2019-10-09 Thread Guillaume Lemaître
Oops, I did not see Roman's answer. Sorry about that. It comes back
to the same conclusion :)
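
For readers skimming the thread, that conclusion (a small regularization and scaled features stabilize the solution across solvers) can be sketched as follows. The data and parameters here are illustrative, not the ones from Ben's benchmark.

```python
# Hypothetical sketch: with a modest l2 penalty (C=1.0) and scaled features,
# lbfgs and saga converge to nearly identical coefficients.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=45, n_informative=10,
                           random_state=0)
X = StandardScaler().fit_transform(X)

coefs = {}
for solver in ("lbfgs", "saga"):
    clf = LogisticRegression(solver=solver, C=1.0, max_iter=5000, tol=1e-8,
                             random_state=0)
    clf.fit(X, y)
    coefs[solver] = clf.coef_.ravel()

# With regularization the optimum is unique, and both solvers reach it.
diff = np.max(np.abs(coefs["lbfgs"] - coefs["saga"]))
print(f"max coefficient difference: {diff:.2e}")
```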

On Wed, 9 Oct 2019 at 23:37, Guillaume Lemaître 
wrote:

> Hmm, actually increasing the number of samples solves the convergence issue.
> Most probably, SAGA is not designed to work with such a small sample size.
>
> On Wed, 9 Oct 2019 at 23:36, Guillaume Lemaître 
> wrote:
>
>> I slightly changed the benchmark so that it uses a pipeline, and plotted the
>> coefficients:
>>
>> https://gist.github.com/glemaitre/8fcc24bdfc7dc38ca0c09c56e26b9386
>>
>> I only see one of the 10 splits where SAGA is not converging; otherwise
>> the coefficients look very close (I don't attach the figure here, but they
>> can be plotted using the snippet).
>> So apart from this second split, the other differences seem to be
>> numerical instability.
>>
>> Where I do have some concern is the convergence rate of SAGA, but I have
>> no intuition as to whether this is normal or not.
>>
>> On Wed, 9 Oct 2019 at 23:22, Roman Yurchak  wrote:
>>
>>> Ben,
>>>
>>> I can confirm your results with penalty='none' and C=1e9. In both cases,
>>> you are running a mostly unpenalized logistic regression. Usually
>>> that is less numerically stable than with a small regularization,
>>> depending on the data collinearity.
>>>
>>> Running that same code with
>>>   - a larger penalty (smaller C values)
>>>   - or a larger number of samples
>>> yields for me the same coefficients (up to some tolerance).
>>>
>>> You can also see that SAGA's convergence is poor from the fact that it
>>> needs 196000 epochs/iterations to converge.
>>>
>>> Actually, I have often seen convergence issues with SAG on small
>>> datasets (in unit tests); I'm not fully sure why.
>>>
>>> --
>>> Roman
>>>
>>> On 09/10/2019 22:10, serafim loukas wrote:
>>> > The predictions across solvers are exactly the same when I run the code.
>>> > I am using version 0.21.3. What is yours?
>>> >
>>> >
>>> > In [13]: import sklearn
>>> >
>>> > In [14]: sklearn.__version__
>>> > Out[14]: '0.21.3'
>>> >
>>> >
>>> > Serafeim
>>> >
>>> >
>>> >
>>> >> On 9 Oct 2019, at 21:44, Benoît Presles <benoit.pres...@u-bourgogne.fr> wrote:
>>> >>
>>> >> (y_pred_lbfgs==y_pred_saga).all() == False
>>> >
>>> >


-- 
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn



Re: [scikit-learn] logistic regression results are not stable between solvers

2019-10-09 Thread Benoît Presles

Dear scikit-learn users,

I did what you suggested (see code below) and I still do not get the
same results between solvers: neither the predictions nor the
coefficients match.


Best regards,
Ben


Here is the new source code:

from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
#
RANDOM_SEED = 2
#
X_sim, y_sim = make_classification(n_samples=400,
                                   n_features=45,
                                   n_informative=10,
                                   n_redundant=0,
                                   n_repeated=0,
                                   n_classes=2,
                                   n_clusters_per_class=1,
                                   random_state=RANDOM_SEED,
                                   shuffle=False)
#
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2,
                             random_state=RANDOM_SEED)

for train_index_split, test_index_split in sss.split(X_sim, y_sim):
    X_split_train, X_split_test = X_sim[train_index_split], X_sim[test_index_split]
    y_split_train, y_split_test = y_sim[train_index_split], y_sim[test_index_split]

    ss = StandardScaler()
    X_split_train = ss.fit_transform(X_split_train)
    X_split_test = ss.transform(X_split_test)
    #
    classifier_lbfgs = LogisticRegression(fit_intercept=True, max_iter=2000,
                                          verbose=0, random_state=RANDOM_SEED,
                                          C=1e9, solver='lbfgs',
                                          penalty='none', tol=1e-6)
    classifier_lbfgs.fit(X_split_train, y_split_train)
    print('classifier lbfgs iter:', classifier_lbfgs.n_iter_)
    print(classifier_lbfgs.coef_)

    classifier_saga = LogisticRegression(fit_intercept=True, max_iter=2000,
                                         verbose=0, random_state=RANDOM_SEED,
                                         C=1e9, solver='saga',
                                         penalty='none', tol=1e-6)
    classifier_saga.fit(X_split_train, y_split_train)
    print('classifier saga iter:', classifier_saga.n_iter_)
    print(classifier_saga.coef_)
    #
    y_pred_lbfgs = classifier_lbfgs.predict(X_split_test)
    y_pred_saga = classifier_saga.predict(X_split_test)
    #
    if not (y_pred_lbfgs == y_pred_saga).all():
        print('lbfgs does not give the same results as saga :-( !')
        exit(1)
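
As a complement to the boolean prediction check, one could compare the two fits by coefficient distance and iteration count. This is a sketch, not part of the original benchmark; it mirrors the setup above but uses a single split for brevity.

```python
# Sketch: compare solvers by coefficient distance instead of only checking
# whether the predicted labels match.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

RANDOM_SEED = 2
X, y = make_classification(n_samples=400, n_features=45, n_informative=10,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1, random_state=RANDOM_SEED,
                           shuffle=False)
X = StandardScaler().fit_transform(X)

fits = {}
for solver in ("lbfgs", "saga"):
    # C=1e9 keeps the model essentially unpenalized, as in the benchmark.
    fits[solver] = LogisticRegression(solver=solver, C=1e9, max_iter=2000,
                                      tol=1e-6,
                                      random_state=RANDOM_SEED).fit(X, y)

delta = np.linalg.norm(fits["lbfgs"].coef_ - fits["saga"].coef_)
print("n_iter_ lbfgs:", fits["lbfgs"].n_iter_, "saga:", fits["saga"].n_iter_)
print(f"||coef_lbfgs - coef_saga||_2 = {delta:.4f}")
```

A large distance together with n_iter_ hitting max_iter points at non-convergence rather than a genuine solver disagreement.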



Re: [scikit-learn] logistic regression results are not stable between solvers

2019-10-09 Thread Guillaume Lemaître
Could you generate more samples, set the penalty to none, reduce the tolerance,
and check the coefficients instead of the predictions? This is to make sure
that this is not only a numerical error.
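
A sketch of that check follows. The sample size and tolerance are illustrative, and the unpenalized model is approximated with a very large C so the snippet stays portable across scikit-learn versions (penalty='none' was later renamed to penalty=None).

```python
# Sketch of the suggested check: more samples, an (almost) unpenalized model,
# a tight tolerance, and a direct comparison of the coefficients.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=45, n_informative=10,
                           random_state=2)
X = StandardScaler().fit_transform(X)

coef = {}
for solver in ("lbfgs", "saga"):
    # C=1e12 approximates no penalty while remaining version-portable.
    clf = LogisticRegression(solver=solver, C=1e12, max_iter=5000, tol=1e-8,
                             random_state=2).fit(X, y)
    coef[solver] = clf.coef_

print("max abs coefficient difference:",
      np.max(np.abs(coef["lbfgs"] - coef["saga"])))
```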




Sent from my phone - sorry for being brief and for potential misspellings.



  Original Message  



From: benoit.pres...@u-bourgogne.fr
Sent: 8 October 2019 20:27
To: scikit-learn@python.org
Reply to: scikit-learn@python.org
Subject: [scikit-learn] logistic regression results are not stable between 
solvers


Dear scikit-learn users,

I am using logistic regression to make some predictions. On my own data,
I do not get the same results between solvers. I managed to reproduce
this issue on synthetic data (see the code below).
All solvers seem to converge (n_iter_ < max_iter), so why do I get
different results?
If the results are not stable between solvers, which one should I choose?


Best regards,
Ben

--

Here is the code I used to generate synthetic data:

from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
#
RANDOM_SEED = 2
#
X_sim, y_sim = make_classification(n_samples=200,
                                   n_features=45,
                                   n_informative=10,
                                   n_redundant=0,
                                   n_repeated=0,
                                   n_classes=2,
                                   n_clusters_per_class=1,
                                   random_state=RANDOM_SEED,
                                   shuffle=False)
#
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2,
                             random_state=RANDOM_SEED)
for train_index_split, test_index_split in sss.split(X_sim, y_sim):
    X_split_train, X_split_test = X_sim[train_index_split], X_sim[test_index_split]
    y_split_train, y_split_test = y_sim[train_index_split], y_sim[test_index_split]
    ss = StandardScaler()
    X_split_train = ss.fit_transform(X_split_train)
    X_split_test = ss.transform(X_split_test)
    #
    classifier_lbfgs = LogisticRegression(fit_intercept=True, max_iter=2000,
                                          verbose=1, random_state=RANDOM_SEED,
                                          C=1e9, solver='lbfgs')
    classifier_lbfgs.fit(X_split_train, y_split_train)
    print('classifier lbfgs iter:', classifier_lbfgs.n_iter_)
    classifier_saga = LogisticRegression(fit_intercept=True, max_iter=2000,
                                         verbose=1, random_state=RANDOM_SEED,
                                         C=1e9, solver='saga')
    classifier_saga.fit(X_split_train, y_split_train)
    print('classifier saga iter:', classifier_saga.n_iter_)
    #
    y_pred_lbfgs = classifier_lbfgs.predict(X_split_test)
    y_pred_saga = classifier_saga.predict(X_split_test)
    #
    if not (y_pred_lbfgs == y_pred_saga).all():
        print('lbfgs does not give the same results as saga :-( !')
        exit()

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] logistic regression results are not stable between solvers

2019-10-09 Thread Benoît Presles

Dear scikit-learn users,

Do you think it is a bug in scikit-learn?

Best regards,
Ben
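
Andreas's scaling point quoted below can be illustrated with a quick sketch on hypothetical data: when the raw features have very different magnitudes, the same SAGA fit typically needs many more epochs than after standardization. The data and tolerances here are made up for illustration.

```python
# Sketch (hypothetical data): compare SAGA's n_iter_ with and without
# feature scaling when the raw features differ wildly in magnitude.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_raw = X * np.logspace(0, 3, X.shape[1])  # feature scales from 1 to 1000

iters = {}
for name, data in [("scaled", StandardScaler().fit_transform(X_raw)),
                   ("raw", X_raw)]:
    clf = LogisticRegression(solver="saga", C=1.0, max_iter=5000, tol=1e-4,
                             random_state=0).fit(data, y)
    iters[name] = int(clf.n_iter_[0])

print("epochs needed:", iters)
```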


On 08/10/2019 at 20:19, Benoît Presles wrote:

As you can notice in the code I posted, I do scale the data. I do not get any
convergence warning, and moreover I always have n_iter_ < max_iter.



On 8 Oct 2019 at 19:51, Andreas Mueller wrote:

I'm pretty sure SAGA is not converging. Unless you scale the data, SAGA is very 
slow to converge.



___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
