Re: [scikit-learn] why the modification in the tf-idf formula?

2024-05-29 Thread Sole Galli via scikit-learn
Hi Sebastian,

Thank you so much for sending the link. So, by the looks of it, the 
modification is introduced so that words that appear in all documents are 
weighted at 0 (or 1 after adding the plus 1 to the result of the log). 
Otherwise, they'd receive a negative value.
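
For the archive, a quick numeric sketch of the two formulas from the docs page
linked in the quoted message below (n and df are hypothetical values;
smooth_idf=True is the TfidfTransformer default):

import numpy as np

n = 4   # number of documents
df = 4  # a term that appears in every document

idf_textbook = np.log(n / (1 + df))           # negative when df == n
idf_sklearn = np.log((1 + n) / (1 + df)) + 1  # equals 1 for such a term
print(idf_textbook, idf_sklearn)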

Thank you!
Best
Sole

On Tuesday, May 28th, 2024 at 4:52 PM, Sebastian Raschka wrote:

> Hi Sole,
>
> It’s been a long time, but I remember helping with drafting the Tf-idf text 
> in the documentation as part of a scikit-learn sprint at SciPy a looong time 
> ago where I mentioned this difference (since it initially surprised me, 
> because I couldn’t get it to match my from-scratch implementation). As far as 
> I remember, the sklearn version addressed some instability issues for certain 
> edge cases.
>
> I am not sure if that helps, but I have briefly compared the textbook vs the 
> sklearn tf-idf here: 
> https://github.com/rasbt/machine-learning-book/blob/main/ch08/ch08.ipynb
>
> Best,
> Sebastian
>
> --
> Sebastian Raschka, PhD
> Machine learning and AI researcher, 
> [https://sebastianraschka.com](https://sebastianraschka.com/)
>
> Staff Research Engineer at Lightning AI, https://lightning.ai
>
> On May 28, 2024 at 9:43 AM -0500, Sole Galli via scikit-learn wrote:
>
>> Hi guys,
>>
>> I'd like to understand why sklearn's implementation of tf-idf is different 
>> from the standard textbook notation as described in the docs: 
>> https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting
>>
>> Do you have any reference that I could take a look at? I didn't manage to 
>> find them in the docs, maybe I missed something?
>>
>> Thank you!
>>
>> Best wishes
>> Sole
>>


[scikit-learn] why the modification in the tf-idf formula?

2024-05-28 Thread Sole Galli via scikit-learn
Hi guys,

I'd like to understand why sklearn's implementation of tf-idf is different from 
the standard textbook notation as described in the docs: 
https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting

Do you have any reference that I could take a look at? I didn't manage to find 
them in the docs, maybe I missed something?

Thank you!

Best wishes
Sole


[scikit-learn] target encoder: fit_transform vs fit.transform

2024-03-18 Thread Sole Galli via scikit-learn
Hey team,

I am going over the TargetEncoder documentation and I want to make sure I 
understand this correctly.

Is the intention of fit_transform's cross fitting just to understand / analyse 
/ determine somehow how this transformer would perform?

Because if I got this right, the attribute values (category-number mappings) 
are determined over the entire training set, both in fit_transform and fit, so 
when we call transform over a new data set, say test, we'd obtain the same 
result regardless of whether we fit the transformer with fit or fit_transform. 
Correct?
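
A minimal sketch of what I mean (assumes scikit-learn >= 1.3, where
TargetEncoder is available; the toy data is hypothetical):

import numpy as np
from sklearn.preprocessing import TargetEncoder

rng = np.random.RandomState(0)
X_train = rng.choice(["a", "b", "c"], size=(100, 1))
y_train = rng.randint(0, 2, size=100)
X_test = np.array([["a"], ["b"], ["c"]])

enc_fit = TargetEncoder().fit(X_train, y_train)
enc_cf = TargetEncoder()
enc_cf.fit_transform(X_train, y_train)  # cross fitting affects only this output

# transform() uses encodings learned on the full training set in both cases
print(np.allclose(enc_fit.transform(X_test), enc_cf.transform(X_test)))  # True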

Thank you for your input!
Best
Sole

Sent with [Proton Mail](https://proton.me/) secure email.


[scikit-learn] obtaining intervals from the decision tree structure

2023-03-07 Thread Sole Galli via scikit-learn
Hello,

I would like to obtain the final intervals from the decision tree structure. I 
am not interested in every node, just the limits that take a sample to a final 
decision/leaf.

For example, if the tree structure is this one:

|--- feature_0 <= 0.08
|   |--- class: 0
|--- feature_0 >  0.08
|   |--- feature_0 <= 8.50
|   |   |--- feature_0 <= 1.50
|   |   |   |--- class: 1
|   |   |--- feature_0 >  1.50
|   |   |   |--- class: 1
|   |--- feature_0 >  8.50
|   |   |--- feature_0 <= 60.25
|   |   |   |--- class: 0
|   |   |--- feature_0 >  60.25
|   |   |   |--- class: 0

Then, I would like to obtain these limits:

0-0.08; 0.08-1.50; 1.50-8.50; 8.50-60.25; >60.25

Potentially as the following numpy array:

[-np.inf, 0.08, 1.5, 8.5, 60.25, np.inf]

Is it possible?

I have a stackoverflow question here for more details and code
https://stackoverflow.com/questions/75663472/how-to-obtain-the-interval-limits-from-a-decision-tree-with-scikit-learn
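
One possible sketch (assumes a tree trained on a single feature, so every split
threshold is an interval limit; the data here is hypothetical):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_features=1, n_informative=1, n_redundant=0,
                           n_clusters_per_class=1, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

tree = clf.tree_
# internal nodes have children; leaves are marked with children_left == -1
thresholds = tree.threshold[tree.children_left != -1]
limits = np.concatenate(([-np.inf], np.sort(thresholds), [np.inf]))
print(limits)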

Thank you!
Sole

Sent with [Proton Mail](https://proton.me/) secure email.


Re: [scikit-learn] mutual information for continuous variables with scikit-learn

2023-02-01 Thread Sole Galli via scikit-learn
Hey,

My understanding is that with sklearn you can compare 2 continuous variables 
like this:

mutual_info_regression(data["var1"].to_frame(), data["var"], 
discrete_features=[False])

Where var1 and var are continuous.

You can also compare multiple continuous variables against one continuous 
variables like this:

mutual_info_regression(data[["var1", "var_2", "var_3"]], data["var"],

discrete_features=[False, False, False])

I understand sklearn uses nonparametric methods based on entropy estimation 
from k-nearest neighbor distances, i.e., a nearest-neighbor approach to 
estimate the MI, taken from Ross, 2014, PLoS ONE 9(2): e87357.

More details here: 
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html

And I've got a blog post about Mutual info with Python here: 
https://www.blog.trainindata.com/mutual-information-with-python/

Cheers
Sole

Soledad Galli
https://www.trainindata.com/

Sent with [Proton Mail](https://proton.me/) secure email.

--- Original Message ---
On Wednesday, February 1st, 2023 at 10:32 AM, m m wrote:

> Hello,
>
> I have two continuous variables (heart rate samples over a period of time), 
> and would like to compute mutual information between them as a measure of 
> similarity.
>
> I've read some posts suggesting to use the mutual_info_score from 
> scikit-learn but will this work for continuous variables? One stackoverflow 
> answer suggested converting the data into probabilities with np.histogram2d() 
> and passing the contingency table to the mutual_info_score.
>
> import numpy as np
> from sklearn.metrics import mutual_info_score
>
> def calc_MI(x, y, bins):
>     c_xy = np.histogram2d(x, y, bins)[0]
>     mi = mutual_info_score(None, None, contingency=c_xy)
>     return mi
>
> # generate data
> L = np.linalg.cholesky([[1.0, 0.60], [0.60, 1.0]])
> uncorrelated = np.random.standard_normal((2, 300))
> correlated = np.dot(L, uncorrelated)
> A = correlated[0]
> B = correlated[1]
> x = (A - np.mean(A)) / np.std(A)
> y = (B - np.mean(B)) / np.std(B)
>
> # calculate MI
> mi = calc_MI(x, y, 50)
>
> Is calc_MI a valid approach? I'm asking because I also read that when 
> variables are continuous, then the sums in the formula for discrete data 
> become integrals, but I'm not sure if this procedure is implemented in 
> scikit-learn?
>
> Thanks!


Re: [scikit-learn] methods available from last estimator in pipeline

2022-09-24 Thread Sole Galli via scikit-learn
Did you try:

pipeline.named_steps["the_string_name_for_knn"].kneighbors

?

pipeline should be replaced by the name you gave to your pipeline, and the 
string in named_steps is the name you gave to the knn step when setting up the 
pipe.
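
If you also need the earlier transforms applied to X first, a minimal sketch
(assumes scikit-learn >= 0.21, where pipelines support slicing; the pipeline
here is hypothetical):

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=0)
pipe = Pipeline([("scaler", StandardScaler()),
                 ("knn", KNeighborsClassifier())]).fit(X, y)

Xt = pipe[:-1].transform(X)          # apply every step before the knn
dist, ind = pipe[-1].kneighbors(Xt)  # then query the fitted knn directly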

Sole

Sent with Proton Mail secure email.

--- Original Message ---
On Friday, September 23rd, 2022 at 10:16 PM, Gregory, Matthew wrote:


> Hi all,
> 
> I have what is probably a silly question. I read this passage on [1]:
> 
> """
> The pipeline has all the methods that the last estimator in the pipeline has, 
> i.e. if the last estimator is a classifier, the Pipeline can be used as a 
> classifier. If the last estimator is a transformer, again, so is the pipeline.
> """
> 
> I'm trying to create a pipeline where my last estimator is a 
> KNeighborsClassifier and, instead of predict(), I was hoping to use 
> kneighbors(). But unfortunately, when in a pipeline, I'm getting this 
> AttributeError:
> 
> AttributeError: 'Pipeline' object has no attribute 'kneighbors'
> 
> Is kneighbors() really available from the Pipeline? Or is there an 
> alternative way to call an element in the Pipeline to use it? I tried 
> "pipe[-1].kneighbors(X)", but that doesn't seem to be applying the earlier 
> transforms in the pipeline.
> 
> Thanks for any pointers,
> matt
> 
> [1] https://scikit-learn.org/stable/modules/compose.html


Re: [scikit-learn] View full sized k_means.labels_

2022-05-29 Thread Sole Galli via scikit-learn
Maybe with numpy.set_printoptions?

See thread here:
https://stackoverflow.com/questions/1987694/how-to-print-the-full-numpy-array-without-truncation
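
For example, a minimal sketch (k_means is the fitted model from your snippet):

import sys
import numpy as np

np.set_printoptions(threshold=sys.maxsize)  # disable summarization
print(k_means.labels_)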


Soledad Galli
https://www.trainindata.com/

Sent with Proton Mail secure email.
--- Original Message ---
On Friday, May 13th, 2022 at 10:35 AM, Mahmood Naderan wrote:


> Hi,
> I have used the following lines of codes
>
> k_means = KMeans(n_clusters=i,
>                  random_state=4).fit(principalComponents_dataFrame)
> print(k_means.labels_)
>
> But the problem is for large vectors of labels, I see shortened
> version like this:
>
> [4 4 0 ... 0 0 0]
>
> How can I force it to print the full length vector?
>
> Regards,
> Mahmood
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] intermediate data state in a Pipeline

2022-04-11 Thread Sole Galli via scikit-learn
Hello community,

Say I have a pipeline with 3 data transformations, i.e., SimpleImputer, 
OrdinalEncoder and StandardScaler, and a Lasso at the end, and I want to obtain 
a copy of the transformed data that would be the input to the Lasso.

Is there a way other than selecting all the steps of the pipeline prior to the 
Lasso and applying transform sequentially?
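
A sketch of the setup I mean (hypothetical step names; pipeline slicing
requires scikit-learn >= 0.21):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.linear_model import Lasso

pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("encoder", OrdinalEncoder()),
    ("scaler", StandardScaler()),
    ("lasso", Lasso()),
])
# after pipe.fit(X, y), pipe[:-1].transform(X) yields the data fed to the Lasso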

Thank you!

Sent with [ProtonMail](https://protonmail.com/) secure email.


Re: [scikit-learn] random forests and multi-class probability

2021-07-27 Thread Sole Galli via scikit-learn
Thank you!

So when the multiclass documentation says that the algorithms with intrinsic 
multiclass support, which are listed 
[here](https://scikit-learn.org/stable/modules/multiclass.html), do not need to 
be wrapped by OneVsRest, it means that there is no need, because they can 
indeed handle multiclass natively, each one in its own way.

But if I want to plot PR curves or ROC curves, then I do need to wrap them, 
because those metrics are calculated in a one-vs-rest manner, and that is not 
how the algos handle it natively. Is my understanding correct?

Thank you!

‐‐‐ Original Message ‐‐‐
On Tuesday, July 27th, 2021 at 11:33 AM, Nicolas Hug  wrote:

> To add to Guillaume's answer: the native multiclass support for forests/trees 
> is described here: 
> https://scikit-learn.org/stable/modules/tree.html#multi-output-problems
>
> It's not a one-vs-rest strategy and can be summed up as:
>
> - Store n output values in leaves, instead of 1;
>
> - Use splitting criteria that compute the average reduction across all n
>   outputs.
>
> Nicolas
>
> On 27/07/2021 10:22, Guillaume Lemaître wrote:
>
>>> On 27 Jul 2021, at 11:08, Sole Galli via scikit-learn wrote:
>>>
>>> Hello community,
>>>
>>> Do I understand correctly that Random Forests are trained as a 1 vs rest 
>>> when the target has more than 2 classes? Say the target takes values 0, 1 
>>> and 2, then the model would train 3 estimators, 1 per class, under the hood?
>>
>> Each decision tree of the forest is natively supporting multi class.
>>
>>> The predict_proba output is an array with 3 columns, containing the 
>>> probability of each class. If it is 1 vs rest, am I correct to assume that 
>>> the sum of the probabilities for the 3 classes should not necessarily add 
>>> up to 1? Are they normalized? How is it done so that they do add up to 1?
>>
>> According to the above answer, the sum for each row of the array given by 
>> `predict_proba` will sum to 1.
>> According to the documentation, the probabilities are computed as:
>>
>> The predicted class probabilities of an input sample are computed as the 
>> mean predicted class probabilities of the trees in the forest. The class 
>> probability of a single tree is the fraction of samples of the same class in 
>> a leaf.
>>
>>> Thank you
>>> Sole


Re: [scikit-learn] random forests and multi-class probability

2021-07-27 Thread Sole Galli via scikit-learn
Thank you!

I was confused because the multiclass documentation says that for those 
estimators that have multiclass support built in, like decision trees and 
random forests, we do not need to use wrapper classes like OneVsRest.

Thus I have the following question: if I want to determine the PR curves or the 
ROC curve, say with micro-averaging, do I need to wrap them with the 1 vs rest? 
Or does it not matter? The probability values do change slightly.

Thank you!





‐‐‐ Original Message ‐‐‐

On Tuesday, July 27th, 2021 at 11:22 AM, Guillaume Lemaître wrote:

> > On 27 Jul 2021, at 11:08, Sole Galli via scikit-learn wrote:
> >
> > Hello community,
> >
> > Do I understand correctly that Random Forests are trained as a 1 vs rest 
> > when the target has more than 2 classes? Say the target takes values 0, 1 
> > and 2, then the model would train 3 estimators, 1 per class, under the hood?
>
> Each decision tree of the forest is natively supporting multi class.
>
> > The predict_proba output is an array with 3 columns, containing the 
> > probability of each class. If it is 1 vs rest, am I correct to assume that 
> > the sum of the probabilities for the 3 classes should not necessarily add 
> > up to 1? Are they normalized? How is it done so that they do add up to 1?
>
> According to the above answer, the sum for each row of the array given by 
> `predict_proba` will sum to 1.
>
> According to the documentation, the probabilities are computed as:
>
> The predicted class probabilities of an input sample are computed as the mean 
> predicted class probabilities of the trees in the forest. The class 
> probability of a single tree is the fraction of samples of the same class in 
> a leaf.
>
> > Thank you
> >
> > Sole


[scikit-learn] random forests and multi-class probability

2021-07-27 Thread Sole Galli via scikit-learn
Hello community,

Do I understand correctly that Random Forests are trained as a 1 vs rest when 
the target has more than 2 classes? Say the target takes values 0, 1 and 2, 
then the model would train 3 estimators, 1 per class, under the hood?

The predict_proba output is an array with 3 columns, containing the probability 
of each class. If it is 1 vs rest, am I correct to assume that the sum of the 
probabilities for the 3 classes should not necessarily add up to 1? Are they 
normalized? How is it done so that they do add up to 1?
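
A quick empirical check (a minimal sketch with hypothetical toy data):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_classes=3, n_informative=4, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

proba = clf.predict_proba(X)
print(proba.shape)                          # (100, 3): one column per class
print(np.allclose(proba.sum(axis=1), 1.0))  # True: each row sums to 1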

Thank you
Sole___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] function transformer

2021-06-21 Thread Sole Galli via scikit-learn
The FunctionTransformer will apply the transformation coded in your function to 
the entire dataset passed to the transform() method.

I find it hard to see how this could work to add additional columns to the 
dataset, but I guess it might depend on how you designed your function.

Did you try passing your function to the FunctionTransformer and then applying 
the transform() method to your data to see the result?
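
For instance, a minimal sketch (the function and columns are hypothetical) of
appending a column computed by element-wise division:

import numpy as np
from sklearn.preprocessing import FunctionTransformer

def add_ratio(X):
    # append one extra column: column 0 divided element-wise by column 1
    return np.c_[X, X[:, 0] / X[:, 1]]

transformer = FunctionTransformer(add_ratio)
X = np.array([[1.0, 2.0], [3.0, 4.0]])
print(transformer.fit_transform(X))  # shape (2, 3)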

Alternatively, you could create your own class to add additional columns to 
your data and pass that class within the pipeline.

Or, easier, use the 
[CombineWithReferenceFeature](https://feature-engine.readthedocs.io/en/latest/creation/CombineWithReferenceFeature.html)
 transformer from another open source package for feature engineering 
(Feature-engine), which does exactly what you want to do.

Hope this helps

Soledad Galli
https://www.trainindata.com/

‐‐‐ Original Message ‐‐‐
On Friday, June 18th, 2021 at 12:45 PM, Manprit Singh wrote:

> Dear sir ,
>
> Just need to know if I can use a function transformer to generate new columns 
> in the data set.
>
> Just see the below written pipeline
>
> num_pipeline = Pipeline([
>     ('imputer', SimpleImputer(strategy="median")),
>     ('attribs_adder', column_adder),
>     ('std_scaler', StandardScaler()),
> ])
> This pipeline is for numerical attributes in the dataset. Firstly it will 
> treat all missing values in the data set using SimpleImputer; then I have 
> made a function to add three more columns to the existing data, I have made a 
> function transformer with this function, and then StandardScaler.
>
> The columns being added are generated from existing columns (by element-wise 
> division of two columns). So using a function transformer is ok?


Re: [scikit-learn] check_estimator _NotAnArray

2021-05-12 Thread Sole Galli via scikit-learn
fyi, just posted a question in stackoverflow:

https://stackoverflow.com/questions/67500110/what-is-the-check-transformer-data-not-an-array-test-from-sklearns-check-estima

Are there any plans to expand the docs on the check_estimator tests?

It would be really helpful to have a general idea of why each test is 
important, and the consequences of failing this or that test. At least it would 
be useful for me :p

Thank you!

Sole

‐‐‐ Original Message ‐‐‐
On Monday, May 10, 2021 3:28 PM, Sole Galli via scikit-learn wrote:

> Hello everyone,
>
> I am trying to get Feature-engine transformers to pass the check_estimator 
> tests and there is one test that I am not too sure what it is intended for.
>
> The transformers fail the check_transformer_data_not_an_array because the 
> input is a _NotAnArray class, and Feature-engine transformers don't like that.
>
> What is this check intended for? Is it to ensure compatibility with some 
> other sklearn class? if yes, which ones?
>
> I would appreciate any info or links to docs/ issues.
>
> Thanks a lot!
>
> Sole


[scikit-learn] check_estimator _NotAnArray

2021-05-10 Thread Sole Galli via scikit-learn
Hello everyone,

I am trying to get Feature-engine transformers to pass the check_estimator 
tests and there is one test that I am not too sure what it is intended for.

The transformers fail the check_transformer_data_not_an_array because the input 
is a _NotAnArray class, and Feature-engine transformers don't like that.

What is this check intended for? Is it to ensure compatibility with some other 
sklearn class? if yes, which ones?

I would appreciate any info or links to docs/ issues.

Thanks a lot!

Sole


[scikit-learn] IterativeImputer

2021-01-04 Thread Sole Galli via scikit-learn
Hello team,

I am reading in some of the original MICE articles that, supposedly, each 
variable should be modelled upon the other ones in the data with a suitable 
model. So for example, if the variable with NA is binary, it should be modelled 
with classification, or if continuous, with a regression model.

Am I correct to understand that this is not possible yet with the 
IterativeImputer? Because I should set the estimator in the estimator 
parameter, and that will be used for all variables.
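
A minimal sketch of what I mean (BayesianRidge is just an example estimator):

from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# the single estimator passed here is applied to every feature with NAs
imputer = IterativeImputer(estimator=BayesianRidge())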

Is there a workaround?

Thanks a lot!

Regards

Soledad Galli
https://www.trainindata.com/


Re: [scikit-learn] sample_weight vs class_weight

2020-12-05 Thread Sole Galli via scikit-learn
Thank you guys! very helpful :)

Soledad Galli
https://www.trainindata.com/

‐‐‐ Original Message ‐‐‐
On Friday, December 4, 2020 12:06 PM, mrschots wrote:

> I have been using both in time-series classification. I put an exponential 
> decay in sample_weights AND class weights as a dictionary.
>
> BR/Schots
>
> On Fri., Dec. 4, 2020 at 12:01, Nicolas Hug wrote:
>
>> Basically passing class weights should be equivalent to passing 
>> per-class-constant sample weights.
>>
>>> why do some estimators allow to pass weights both as a dict in the init or 
>>> as sample weights in fit? what's the logic?
>>
>> SW is a per-sample property (aligned with X and y) so we avoid passing those 
>> to init because the data isn't known when initializing the estimator. It's 
>> only known when calling fit. In general we avoid passing data-related info 
>> into init so that the same instance can be fitted on any data (with 
>> different number of samples, different classes, etc.).
>>
>> We allow to pass class_weight in init because the 'balanced' option is 
>> data-agnostic. Arguably, allowing a dict with actual class values violates 
>> the above argument (of not having data-related stuff in init), so I guess 
>> that's where the logic ends ;)
>>
>> As to why one would use both, I'm not so sure honestly.
>>
>> Nicolas
>>
>> On 12/4/20 10:40 AM, Sole Galli via scikit-learn wrote:
>>
>>> Actually, I found the answer. Both seem to be modifying the loss function 
>>> for the various algorithms; below I include some links.
>>>
>>> If we pass class_weight and sample_weight, then the final cost / weight is 
>>> a combination of both.
>>>
>>> I have a follow-up question: in which scenario would we use both? Why do 
>>> some estimators allow passing weights both as a dict in the init and as 
>>> sample weights in fit? What's the logic? I found it a bit confusing at the 
>>> beginning.
>>>
>>> Thank you!
>>>
>>> https://stackoverflow.com/questions/30805192/scikit-learn-random-forest-class-weight-and-sample-weight-parameters
>>>
>>> https://stackoverflow.com/questions/30972029/how-does-the-class-weight-parameter-in-scikit-learn-work/30982811#30982811
>>>
>>> Soledad Galli
>>> https://www.trainindata.com/
>>>
>>> ‐‐‐ Original Message ‐‐‐
>>> On Thursday, December 3, 2020 11:55 AM, Sole Galli via scikit-learn wrote:
>>>
>>>> Hello team,
>>>>
>>>> What is the difference in the implementation of class_weight and 
>>>> sample_weight in those algorithms that support both? like random forest or 
>>>> logistic regression?
>>>>
>>>> Are both modifying the loss function? in a similar way?
>>>>
>>>> Thank you!
>>>>
>>>> Sole
>>>
>>> ___
>>> scikit-learn mailing list
>>> scikit-learn@python.org
>>>
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
> --
> Schots


Re: [scikit-learn] sample_weight vs class_weight

2020-12-04 Thread Sole Galli via scikit-learn
Actually, I found the answer. Both seem to be modifying the loss function for 
the various algorithms; below I include some links.

If we pass class_weight and sample_weight, then the final cost / weight is a 
combination of both.

I have a follow-up question: in which scenario would we use both? Why do some 
estimators allow passing weights both as a dict in the init and as sample 
weights in fit? What's the logic? I found it a bit confusing at the beginning.

Thank you!

https://stackoverflow.com/questions/30805192/scikit-learn-random-forest-class-weight-and-sample-weight-parameters

https://stackoverflow.com/questions/30972029/how-does-the-class-weight-parameter-in-scikit-learn-work/30982811#30982811
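
An empirical sketch of the equivalence (per-class-constant sample weights vs
class_weight; the toy data and weight values are hypothetical):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(weights=[0.9, 0.1], random_state=0)

clf_cw = LogisticRegression(class_weight={0: 1.0, 1: 5.0}).fit(X, y)
sw = np.where(y == 1, 5.0, 1.0)  # the same weighting, expressed per sample
clf_sw = LogisticRegression().fit(X, y, sample_weight=sw)

# the coefficients agree up to solver tolerance
print(np.allclose(clf_cw.coef_, clf_sw.coef_, atol=1e-6))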

Soledad Galli
https://www.trainindata.com/

‐‐‐ Original Message ‐‐‐
On Thursday, December 3, 2020 11:55 AM, Sole Galli via scikit-learn wrote:

> Hello team,
>
> What is the difference in the implementation of class_weight and 
> sample_weight in those algorithms that support both? like random forest or 
> logistic regression?
>
> Are both modifying the loss function? in a similar way?
>
> Thank you!
>
> Sole


[scikit-learn] sample_weight vs class_weight

2020-12-03 Thread Sole Galli via scikit-learn
Hello team,

What is the difference in the implementation of class_weight and sample_weight 
in those algorithms that support both? like random forest or logistic 
regression?

Are both modifying the loss function? in a similar way?

Thank you!

Sole


Re: [scikit-learn] imbalanced datasets return uncalibrated predictions - why?

2020-11-18 Thread Sole Galli via scikit-learn
Thank you guys, that was actually very helpful.

Best regards
Sole

Soledad Galli
https://www.trainindata.com/

‐‐‐ Original Message ‐‐‐

On Tuesday, November 17th, 2020 at 10:54 AM, Roman Yurchak wrote:

> On 17/11/2020 09:57, Sole Galli via scikit-learn wrote:
>
> > And I understand that it has to do with the cost function, because if we
> > re-balance the dataset with say class_weight = 'balanced', then the
> > probabilities seem to be calibrated as a result.
>
> As far as I know, logistic regression will have well calibrated
> probabilities even in the imbalanced case. However, with the default
> decision threshold at 0.5, some of the infrequent categories may never
> be predicted since their probability is too low.
>
> If you use class_weight = 'balanced' the probabilities will no longer
> be well calibrated, however you would predict some of those infrequent
> categories.
>
> See discussions in
> https://github.com/scikit-learn/scikit-learn/issues/10613 and linked issues.
>
> Roman


[scikit-learn] imbalanced datasets return uncalibrated predictions - why?

2020-11-17 Thread Sole Galli via scikit-learn
Hello team,

I am trying to understand why logistic regression returns uncalibrated 
probabilities, with values tending toward low probabilities for the positive 
(rare) class, when trained on an imbalanced dataset.

I've read a number of articles, all seem to agree that this is the case, many 
show empirical proof, but no mathematical demo. When I test it myself, I can 
see that this is indeed the case, Logit on imbalanced datasets returns 
uncalibrated probs.

And I understand that it has to do with the cost function, because if we 
re-balance the dataset with say class_weight = 'balanced', then the 
probabilities seem to be calibrated as a result.

I was wondering if any of you knows of a mathematical demonstration that 
supports this conclusion? Any mathematical demo, or clear explanation of why 
logit would return uncalibrated probs when trained on an imbalanced dataset?

Any link to a relevant article, video, presentation, etc, will be greatly 
appreciated.

Thanks a lot!

Sole


Re: [scikit-learn] Imputers and DataFrame objects

2020-08-19 Thread Sole Galli via scikit-learn
Did you have a look at the package Feature-engine? It has its own imputers and 
encoders that allow you to select the columns to transform and return a 
dataframe. It also has a sklearn wrapper that wraps sklearn transformers so 
that they return a dataframe instead of a numpy array.
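
For example, a minimal sketch of that wrapper (module path as per the
Feature-engine docs; the dataframe is hypothetical):

import pandas as pd
from sklearn.impute import SimpleImputer
from feature_engine.wrappers import SklearnTransformerWrapper

df = pd.DataFrame({"col": [1.0, None, 3.0], "other": [4.0, 5.0, 6.0]})
imputer = SklearnTransformerWrapper(
    transformer=SimpleImputer(strategy="mean"),
    variables=["col"],  # only this column is transformed
)
df_t = imputer.fit_transform(df)  # returns a DataFrame, not a numpy array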

Cheers.

Sole

Sent from ProtonMail mobile

--- Original Message ---
On 18 Aug 2020, 13:56, Ram Rachum wrote:

> On Mon, Aug 17, 2020 at 8:55 PM Kevin Markham wrote:
>
>> Hi Ram,
>>
>> These are great questions!
>
> Thank you for the detailed answers.
>
>>> The task was to remove these irregularities. So for the "?" items, replace 
>>> them with mean, and for the "one", "two" etc. replace with a numerical 
>>> value.
>>
>> If your primary task is "data cleaning", then pandas is usually the optimal 
>> tool. If "preprocessing your data for Machine Learning" is your primary 
>> task, then scikit-learn is usually the optimal tool. There is some overlap 
>> between what is considered "cleaning" and "preprocessing", but I mention 
>> this distinction because it can help you decide what tool to use.
>
> Okay, but here's one example where it gets tricky. For a column with numbers 
> written like "one", "two" and missing values "?", I had to do two things: 
> change them to numbers (1, 2), and then, instead of the missing values, add 
> the most common element, or mean or whatever. When I tried to use 
> LabelEncoder to do the first part, it complained about the missing values. I 
> couldn't fix these missing values until the labels were changed to ints. So 
> that put me in a frustrating Catch-22 situation, and all the while I'm 
> thinking "It would be so much simpler to just write my own logic in a 
> for-loop rather than try to get Pandas and scikit-learn working together."
>
> Any insights about that?
>
>>> For one, I couldn't figure out how to apply SimpleImputer on just one 
>>> column in the DataFrame, and then get the results in the form of a 
>>> dataframe.
>>
>> Like most scikit-learn transformers, SimpleImputer expects 2-dimensional 
>> input. In your case, this would be a 1-column DataFrame (such as 
>> df[['col']]) rather than a Series (such as df['col']).
>>
>> Also like most scikit-learn transformers, SimpleImputer outputs a NumPy 
>> array. If you need the output to be a DataFrame, one option is to convert 
>> the array to a pandas object and concatenate it to the original DataFrame.
>
> Well, I did do that in the `process_column` helper function in the code I 
> linked to above. But it kind of felt like... What am I using a framework for 
> to begin with? Because that kind of logistics is the reason I want to use a 
> framework instead of managing my own arrays and imputing logic.
>
> Thanks for your help Kevin.


Re: [scikit-learn] climate friendly software licence

2020-06-30 Thread Sole Galli via scikit-learn
Hi Olivier, Gabriel, and further team,

Thank you so much for your views.

I understand enforcement is an issue. And I don't yet have an answer on whether 
and how the license could be enforced.

I also think that this is a second step. First would be making the use of the 
software by these companies illegal. This would de-legitimise these companies 
from using these packages, which would then hopefully prevent them from 
presenting their destructive work in open source meetings like PyData, or from 
openly hosting tech hub communities where they share the use of this software 
in an attempt to recruit talent, because the use of the software would now be 
illegal. It would also make organisations like NumFOCUS stop accepting fossil 
fuel companies as sponsors, as they did in London in 2019, giving them a space 
to promote their work. Technical people may also think twice before joining 
these companies if the use of the software is no longer allowed, even at face 
value.

So I think, even if the license can't be enforced, it does have some power. 
But, as I said, at the moment I know very little of enforcement and whether 
package developers could get sued for adding this restriction.

Yes, there is a lot we can do as individuals to decrease our carbon footprint, 
some of us do, and certainly we should put the right people in power. But 
individual effort is not enough, and electing politicians happens only every so 
many years. We need to do more than that, because the climate situation is 
unfortunately very precarious and very urgent.

Art organisations, newspapers, some banks and many pensions are cutting ties 
with fossil fuel companies. I think tech should take the plunge as well. If 
this is not the right way, would you have any suggestions?

Cheers

Sole


‐‐‐ Original Message ‐‐‐
On Monday, June 29, 2020 3:50 PM, Olivier Grisel wrote:

> Hi Sole,
>
> I personally support climate change actions very much and I am
> convinced climate change is the number 1 challenge of our time. In an
> attempt to act in a consistent way with that belief, I declined
> several times to keynote at conferences either organized by the fossil
> fuel industry or to conferences that would have required me to fly a
> long distance to give a presentation.
>
> However, I don't think software licensing is the right tool to advance this 
> cause.
>
> How would we enforce it? What would happen if we don't enforce it? Who
> is "we", especially when our library is embedded in 3-rd party
> software product and the end-users are not necessarily aware of all
> the upstream dependencies?
>
> What about gray cases, e.g. a company that does not do fossil fuel
> extraction directly per se but works as a consultancy with a majority of
> customers in the fossil fuel extraction industry? What if a significant part
> of their consultancy is to help them detect methane leaks in satellite data?
> How would we audit this? With which resources? How would we get a consensual
> decision on those gray cases?
>
> What about the hypocrisy of using or contributing to software under
> that license while regularly using fossil fuel powered transportation,
> or working or living in a building heated with fossil fuels? Or
> buying goods transported this way over long distances?
>
> Instead, I would rather encourage everyone to vote for legislators and
> governments that progressively set bans on the development and
> commercialization of fossil fuel based technologies and to voice your
> support for such legislations in public debates. I encourage everybody
> to look twice before accepting to work for a company involved in
> fossil fuel extraction one way or another or involved in fossil-fuel
> intensive activities.
>

[scikit-learn] climate friendly software licence

2020-06-29 Thread Sole Galli via scikit-learn
Hello Scikit-learn team,

I've come across this:
https://twitter.com/tristanharris/status/1277136696568508418?s=12

Basically, it is an initiative to include in software licenses a prohibition on 
use by fossil fuel extractivist companies.

I would like to know your views on this. Is this something that you would pick 
up for scikit-learn?

Are there some legal concerns to be aware of? or anything else that should be 
considered?

Because it sounds quite powerful and straightforward to me.

I would be really keen to hear from you.

Thanks a lot

Sole