Re: [scikit-learn] why the modification in the tf-idf formula?
Hi Sebastian,

Thank you so much for sending the link.

So, by the looks of it, the modification was introduced so that words that appear in all documents are weighted at 0 (or at 1, after adding 1 to the result of the log). Otherwise, they'd receive a negative value.

Thank you!

Best
Sole

On Tuesday, May 28th, 2024 at 4:52 PM, Sebastian Raschka wrote:

> Hi Sole,
>
> It’s been a long time, but I remember helping with drafting the Tf-idf text
> in the documentation as part of a scikit-learn sprint at SciPy a looong time
> ago where I mentioned this difference (since it initially surprised me,
> because I couldn’t get it to match my from-scratch implementation). As far as
> I remember, the sklearn version addressed some instability issues for certain
> edge cases.
>
> I am not sure if that helps, but I have briefly compared the textbook vs the
> sklearn tf-idf here:
> https://github.com/rasbt/machine-learning-book/blob/main/ch08/ch08.ipynb
>
> Best,
> Sebastian
>
> --
> Sebastian Raschka, PhD
> Machine learning and AI researcher,
> [https://sebastianraschka.com](https://sebastianraschka.com/)
>
> Staff Research Engineer at Lightning AI, https://lightning.ai
>
> On May 28, 2024 at 9:43 AM -0500, Sole Galli via scikit-learn wrote:
>
>> Hi guys,
>>
>> I'd like to understand why sklearn's implementation of tf-idf is different
>> from the standard textbook notation as described in the docs:
>> https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting
>>
>> Do you have any reference that I could take a look at? I didn't manage to
>> find them in the docs, maybe I missed something?
>>
>> Thank you!
>>
>> Best wishes
>> Sole

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
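To make the point about words that appear in every document concrete, here is a small sketch that evaluates the idf formulas as they are written on the scikit-learn documentation page linked in the thread. The helper names `idf_textbook` and `idf_sklearn` are made up for this illustration, and only the idf term is computed (no term frequency, no normalisation).

```python
import math


def idf_textbook(n_docs, df):
    # Textbook variant quoted in the scikit-learn docs: log(n / (1 + df))
    return math.log(n_docs / (1 + df))


def idf_sklearn(n_docs, df, smooth_idf=True):
    # Variants documented for TfidfTransformer
    if smooth_idf:
        return math.log((1 + n_docs) / (1 + df)) + 1
    return math.log(n_docs / df) + 1


n_docs = 4
for df in (1, 2, 4):  # df == n_docs means the term occurs in every document
    print(
        df,
        round(idf_textbook(n_docs, df), 3),
        round(idf_sklearn(n_docs, df), 3),
        round(idf_sklearn(n_docs, df, smooth_idf=False), 3),
    )
```

For `df == n_docs` the textbook value is negative, while both scikit-learn variants give exactly 1, which matches the reading above.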
[scikit-learn] why the modification in the tf-idf formula?
Hi guys, I'd like to understand why sklearn's implementation of tf-idf is different from the standard textbook notation as described in the docs: https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting Do you have any reference that I could take a look at? I didn't manage to find them in the docs, maybe I missed something? Thank you! Best wishes Sole___ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
[scikit-learn] target encoder: fit_transform vs fit.transform
Hey team, I am going over the TargetEncoder documentation and I want to make sure I understand this correctly. Is the intention of fit_transform's cross fit just to understand/ analyse / determine somehow how this transformer would perform? Because if I got this right, the attribute values (category-number mappings) are determined over the entire training set, both in fit_transform and fit, so when we call transform over a new data set, say test, we'd obtain the same result regardless of whether we fit the transformer with fit or fit_transform. Correct? Thank you for your input! Best Sole Sent with [Proton Mail](https://proton.me/) secure email.___ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
[scikit-learn] obtaining intervals from the decision tree structure
Hello,

I would like to obtain the final intervals from the decision tree structure. I am not interested in every node, just the limits that take a sample to a final decision / leaf.

For example, if the tree structure is this one:

    |--- feature_0 <= 0.08
    |   |--- class: 0
    |--- feature_0 >  0.08
    |   |--- feature_0 <= 8.50
    |   |   |--- feature_0 <= 1.50
    |   |   |   |--- class: 1
    |   |   |--- feature_0 >  1.50
    |   |   |   |--- class: 1
    |   |--- feature_0 >  8.50
    |   |   |--- feature_0 <= 60.25
    |   |   |   |--- class: 0
    |   |   |--- feature_0 >  60.25
    |   |   |   |--- class: 0

Then, I would like to obtain these limits: 0-0.08; 0.08-1.50; 1.50-8.50; 8.50-60.25; >60.25

Potentially as the following numpy array:

    [-np.inf, 0.08, 1.5, 8.5, 60.25, np.inf]

Is it possible?

I have a Stack Overflow question here with more details and code:
https://stackoverflow.com/questions/75663472/how-to-obtain-the-interval-limits-from-a-decision-tree-with-scikit-learn

Thank you!
Sole
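A possible approach for the single-feature case, sketched under the assumption that every split uses the same feature (as in the tree above): read the split thresholds from the fitted `tree_` attribute, sort them, and pad with infinities. The toy data is made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(300, 1))             # a single feature
y = ((X[:, 0] > 10) & (X[:, 0] < 60)).astype(int)  # made-up target

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

t = tree.tree_
# Leaves have both children set to -1, so only split nodes satisfy this mask
is_split = t.children_left != t.children_right
thresholds = np.unique(t.threshold[is_split])      # sorted, de-duplicated
limits = np.concatenate(([-np.inf], thresholds, [np.inf]))

print(limits)
```

With several features, the thresholds would first need to be grouped by `tree_.feature`; that is left out here.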
Re: [scikit-learn] mutual information for continuous variables with scikit-learn
Hey,

My understanding is that with sklearn you can compare 2 continuous variables like this:

    mutual_info_regression(data["var1"].to_frame(), data["var"], discrete_features=[False])

where var1 and var are continuous.

You can also compare multiple continuous variables against one continuous variable like this:

    mutual_info_regression(data[["var1", "var_2", "var_3"]], data["var"], discrete_features=[False, False, False])

I understand sklearn uses nonparametric methods based on entropy estimation from k-nearest neighbors, as explained in Nearest-neighbor approach to estimate the MI. Taken from Ross, 2014, PLoS ONE 9(2): e87357.

More details here:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html

And I've got a blog post about mutual information with Python here:
https://www.blog.trainindata.com/mutual-information-with-python/

Cheers
Sole

Soledad Galli
https://www.trainindata.com/

--- Original Message ---
On Wednesday, February 1st, 2023 at 10:32 AM, m m wrote:

> Hello,
>
> I have two continuous variables (heart rate samples over a period of time),
> and would like to compute mutual information between them as a measure of
> similarity.
>
> I've read some posts suggesting to use the mutual_info_score from
> scikit-learn but will this work for continuous variables? One stackoverflow
> answer suggested converting the data into probabilities with np.histogram2d()
> and passing the contingency table to the mutual_info_score.
>
> import numpy as np
> from sklearn.metrics import mutual_info_score
>
> def calc_MI(x, y, bins):
>     c_xy = np.histogram2d(x, y, bins)[0]
>     mi = mutual_info_score(None, None, contingency=c_xy)
>     return mi
>
> # generate data
> L = np.linalg.cholesky([[1.0, 0.60], [0.60, 1.0]])
> uncorrelated = np.random.standard_normal((2, 300))
> correlated = np.dot(L, uncorrelated)
> A = correlated[0]
> B = correlated[1]
> x = (A - np.mean(A)) / np.std(A)
> y = (B - np.mean(B)) / np.std(B)
>
> # calculate MI
> mi = calc_MI(x, y, 50)
>
> Is calc_MI a valid approach? I'm asking because I also read that when
> variables are continuous, then the sums in the formula for discrete data
> become integrals, but I'm not sure if this procedure is implemented in
> scikit-learn?
>
> Thanks!
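As a quick sanity check of the `mutual_info_regression` route, the sketch below uses synthetic NumPy arrays (not the heart-rate data from the question) and compares a pair of dependent variables with a pair of independent ones. The variable names and the 0.8 / 0.6 mixing weights are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 1000
x = rng.standard_normal(n)
y_dep = 0.8 * x + 0.6 * rng.standard_normal(n)  # correlated with x (rho about 0.8)
y_ind = rng.standard_normal(n)                  # independent of x

X = x.reshape(-1, 1)  # the first argument must be 2-dimensional
mi_dep = mutual_info_regression(X, y_dep, discrete_features=False, random_state=0)[0]
mi_ind = mutual_info_regression(X, y_ind, discrete_features=False, random_state=0)[0]

print(mi_dep, mi_ind)
```

The estimate for the dependent pair should come out clearly larger than the one for the independent pair, which should be close to zero. The values are in nats and depend on `n_neighbors` and the sample size.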
Re: [scikit-learn] methods available from last estimator in pipeline
Did you try:

    pipeline.named_steps["the_string_name_for_knn"].kneighbors

? Replace pipeline with the name you gave to your pipeline; the string in named_steps is the name you gave to the knn when setting up the pipe.

Sole

--- Original Message ---
On Friday, September 23rd, 2022 at 10:16 PM, Gregory, Matthew wrote:

> Hi all,
>
> I have what is probably a silly question. I read this passage on [1]:
>
> """
> The pipeline has all the methods that the last estimator in the pipeline has,
> i.e. if the last estimator is a classifier, the Pipeline can be used as a
> classifier. If the last estimator is a transformer, again, so is the pipeline.
> """
>
> I'm trying to create a pipeline where my last estimator is a
> KNeighborsClassifier and, instead of predict(), I was hoping to use
> kneighbors(). But unfortunately, when in a pipeline, I'm getting this
> AttributeError:
>
> AttributeError: 'Pipeline' object has no attribute 'kneighbors'
>
> Is kneighbors() really available from the Pipeline? Or is there an
> alternative way to call an element in the Pipeline to use it? I tried
> "pipe[-1].kneighbors(X)", but that doesn't seem to be applying the earlier
> transforms in the pipeline.
>
> Thanks for any pointers,
> matt
>
> [1] https://scikit-learn.org/stable/modules/compose.html
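Since accessing the step directly skips the earlier transforms, one way to combine the two is to transform the data with every step except the last and then call `kneighbors` on the final estimator. This is a sketch on the iris data; the step names "scaler" and "knn" are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline(
    [("scaler", StandardScaler()), ("knn", KNeighborsClassifier(n_neighbors=3))]
).fit(X, y)

X_new = X[:5]
# pipe[:-1] is a sub-pipeline holding every step except the final estimator
X_new_t = pipe[:-1].transform(X_new)
distances, indices = pipe.named_steps["knn"].kneighbors(X_new_t)

print(distances.shape, indices.shape)
```

Here the query points are training points, so the nearest neighbour of each is at distance zero, which is a convenient check that the scaling was applied before the search.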
Re: [scikit-learn] View full sized k_means.labels_
Maybe with numpy.set_printoptions? See the thread here:
https://stackoverflow.com/questions/1987694/how-to-print-the-full-numpy-array-without-truncation

Soledad Galli
https://www.trainindata.com/

--- Original Message ---
On Friday, May 13th, 2022 at 10:35 AM, Mahmood Naderan wrote:

> Hi,
> I have used the following lines of code
>
> k_means = KMeans(n_clusters=i,
>                  random_state=4).fit(principalComponents_dataFrame)
> print(k_means.labels_)
>
> But the problem is for large vectors of labels, I see shortened
> version like this:
>
> [4 4 0 ... 0 0 0]
>
> How can I force it to print the full length vector?
>
> Regards,
> Mahmood
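A small sketch of the print-options route, using a made-up array of 5000 zeros as a stand-in for `k_means.labels_`. The `np.printoptions` context manager changes the summarisation threshold only inside the `with` block; `np.set_printoptions(threshold=sys.maxsize)` would change it globally.

```python
import sys

import numpy as np

labels = np.zeros(5000, dtype=int)  # stand-in for k_means.labels_

short = np.array2string(labels)     # summarised with "..." by default
with np.printoptions(threshold=sys.maxsize):
    full = np.array2string(labels)  # every element is written out

print("..." in short, "..." in full)
```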
[scikit-learn] intermediate data state in a Pipeline
Hello community, Say I have a pipeline with 3 data transformations, i.e., SimpleImputer, OrdinalEncoder and StandardScaler, and a Lasso at the end. And I want to obtain a copy of the transformed data that would be input to the Lasso. Is there a way other than selecting all the steps of the pipeline prior to the Lasso and applying transform sequentially? Thank you! Sent with [ProtonMail](https://protonmail.com/) secure email.___ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
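One option that avoids calling each transformer by hand is to slice the fitted pipeline, which returns a sub-pipeline containing every step but the last. This is a sketch on made-up numeric data; the OrdinalEncoder step from the question is left out to keep the example short.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[::7, 1] = np.nan  # a few missing values for the imputer
y = rng.normal(size=50)

pipe = Pipeline(
    [
        ("imputer", SimpleImputer()),
        ("scaler", StandardScaler()),
        ("lasso", Lasso(alpha=0.1)),
    ]
).fit(X, y)

# The data exactly as the Lasso step receives it
X_for_lasso = pipe[:-1].transform(X)

print(X_for_lasso.shape)
```

Feeding `X_for_lasso` to the final step directly should give the same predictions as calling `pipe.predict(X)`.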
Re: [scikit-learn] random forests and multi-class probability
Thank you!

So when the multiclass documentation says that the algorithms with intrinsic multiclass support, which are listed [here](https://scikit-learn.org/stable/modules/multiclass.html), do not need to be wrapped in OneVsRestClassifier, it means that there is no need because they can indeed handle multiclass, each one in its own way.

But if I want to plot PR curves or ROC curves, then I do need to wrap them, because those metrics are calculated in a one-vs-rest manner, and this is not how the algorithms handle multiclass.

Is my understanding correct?

Thank you!

‐‐‐ Original Message ‐‐‐
On Tuesday, July 27th, 2021 at 11:33 AM, Nicolas Hug wrote:

> To add to Guillaume's answer: the native multiclass support for forests/trees
> is described here:
> https://scikit-learn.org/stable/modules/tree.html#multi-output-problems
>
> It's not a one-vs-rest strategy and can be summed up as:
>
>> - Store n output values in leaves, instead of 1;
>> - Use splitting criteria that compute the average reduction across all n
>>   outputs.
>
> Nicolas
>
> On 27/07/2021 10:22, Guillaume Lemaître wrote:
>
>>> On 27 Jul 2021, at 11:08, Sole Galli via scikit-learn wrote:
>>>
>>> Hello community,
>>>
>>> Do I understand correctly that Random Forests are trained as a 1 vs rest
>>> when the target has more than 2 classes? Say the target takes values 0, 1
>>> and 2, then the model would train 3 estimators 1 per class under the hood?.
>>
>> Each decision tree of the forest is natively supporting multi class.
>>
>>> The predict_proba output is an array with 3 columns, containing the
>>> probability of each class. If it is 1 vs rest. am I correct to assume that
>>> the sum of the probabilities for the 3 classes should not necessarily add
>>> up to 1? are they normalized? how is it done so that they do add up to 1?
>>
>> According to the above answer, the sum for each row of the array given by
>> `predict_proba` will sum to 1.
>> According to the documentation, the probabilities are computed as:
>>
>> The predicted class probabilities of an input sample are computed as the
>> mean predicted class probabilities of the trees in the forest. The class
>> probability of a single tree is the fraction of samples of the same class in
>> a leaf.
>>
>>> Thank you
>>> Sole
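The two statements quoted above (rows of `predict_proba` sum to 1, and the forest output is the mean of the per-tree probabilities) can be checked with a short sketch on the iris data, which has 3 classes. The number of trees is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)  # 3 classes

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

proba = forest.predict_proba(X)    # shape (n_samples, 3)
row_sums = proba.sum(axis=1)

# Average the class probabilities of the individual (multiclass) trees
tree_mean = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)

print(len(forest.estimators_), proba.shape)
```

There are 25 fitted trees here, not 25 per class: each tree handles all three classes at once, so no one-vs-rest renormalisation is involved.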
Re: [scikit-learn] random forests and multi-class probability
Thank you! I was confused because in the multiclass documentation it says that for those estimators that have multiclass support built in, like Decision trees and Random Forests, then we do not need to use the wrapper classes like the OnevsRest. Thus I have the following question, if I want to determine the PR curves or the ROC curve, say with micro-average, do I need to wrap them with the 1 vs rest? Or it does not matter? The probability values do change slightly. Thank you! ‐‐‐ Original Message ‐‐‐ On Tuesday, July 27th, 2021 at 11:22 AM, Guillaume Lemaître wrote: > > On 27 Jul 2021, at 11:08, Sole Galli via scikit-learn > > scikit-learn@python.org wrote: > > > > Hello community, > > > > Do I understand correctly that Random Forests are trained as a 1 vs rest > > when the target has more than 2 classes? Say the target takes values 0, 1 > > and 2, then the model would train 3 estimators 1 per class under the hood?. > > Each decision tree of the forest is natively supporting multi class. > > > The predict_proba output is an array with 3 columns, containing the > > probability of each class. If it is 1 vs rest. am I correct to assume that > > the sum of the probabilities for the 3 classes should not necessarily add > > up to 1? are they normalized? how is it done so that they do add up to 1? > > According to the above answer, the sum for each row of the array given by > `predict_proba` will sum to 1. > > According to the documentation, the probabilities are computed as: > > The predicted class probabilities of an input sample are computed as the mean > predicted class probabilities of the trees in the forest. The class > probability of a single tree is the fraction of samples of the same class in > a leaf. > > > Thank you > > > > Sole > > > > scikit-learn mailing list > > > > scikit-learn@python.org > > > > https://mail.python.org/mailman/listinfo/scikit-learn ___ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
[scikit-learn] random forests and multi-class probability
Hello community,

Do I understand correctly that Random Forests are trained one-vs-rest when the target has more than 2 classes? Say the target takes values 0, 1 and 2: would the model then train 3 estimators, 1 per class, under the hood?

The predict_proba output is an array with 3 columns, containing the probability of each class. If it is one-vs-rest, am I correct to assume that the probabilities for the 3 classes should not necessarily add up to 1? Are they normalized? How is it done so that they do add up to 1?

Thank you
Sole
Re: [scikit-learn] function transformer
The FunctionTransformer will apply the transformation coded in your function to the entire dataset passed to the transform() method. I find it hard to see how this could work to add additional columns to the dataset, but I guess it might depend on how you designed your function.

Did you try passing your function to the FunctionTransformer, then applying the transform() method to your data and looking at the result?

Alternatively, you could create your own class to add additional columns to your data and pass that class within the pipeline.

Or, easier, use the [CombineWithReferenceFeature](https://feature-engine.readthedocs.io/en/latest/creation/CombineWithReferenceFeature.html) transformer from another open source package for feature engineering (Feature-engine), which does exactly what you want to do.

Hope this helps

Soledad Galli
https://www.trainindata.com/

‐‐‐ Original Message ‐‐‐
On Friday, June 18th, 2021 at 12:45 PM, Manprit Singh wrote:

> Dear sir ,
>
> Just need to know if I can use a function transformer to generate new columns
> in the data set .
>
> Just see the below written pipeline
>
> num_pipeline = Pipeline([('imputer', SimpleImputer(strategy="median")),
>                          ('attribs_adder', column_adder),
>                          ('std_scaler', StandardScaler()),
>                          ])
>
> This pipeline is for numerical attributes in the dataset, firstly it will
> treat all missing values in the data set using SimpleImputer , then i have
> made a function to add three more columns in the existing data, i have made a
> function transformer with this function and then StandardScaler .
>
> The columns being added are generated from existing columns (by element wise
> division of two columns) . So Using a function transformer is ok ?
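For what it is worth, a function that returns its input with extra columns appended does seem to work inside a FunctionTransformer. The sketch below uses a hypothetical `add_ratio_columns` helper and a tiny made-up array; it adds two ratio columns rather than the three mentioned in the question, and the original `column_adder` is not shown in the thread, so this is only a guess at its shape.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def add_ratio_columns(X):
    # Hypothetical helper: append col0 / col1 and col2 / col1 as new columns
    ratios = np.column_stack([X[:, 0] / X[:, 1], X[:, 2] / X[:, 1]])
    return np.hstack([X, ratios])


column_adder = FunctionTransformer(add_ratio_columns)

num_pipeline = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="median")),
        ("attribs_adder", column_adder),
        ("std_scaler", StandardScaler()),
    ]
)

X = np.array(
    [[1.0, 2.0, 3.0], [4.0, np.nan, 6.0], [7.0, 8.0, 9.0], [2.0, 4.0, 8.0]]
)
Xt = num_pipeline.fit_transform(X)

print(X.shape, Xt.shape)
```

The function receives a NumPy array here because SimpleImputer outputs one by default; with DataFrame input and no imputer in front, the column indexing would need to change.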
Re: [scikit-learn] check_estimator _NotAnArray
fyi, just posted a question in stackoverflow: https://stackoverflow.com/questions/67500110/what-is-the-check-transformer-data-not-an-array-test-from-sklearns-check-estima Are there any plans to expand the docs on the check_estimators test? it would be really helpful to have a general idea of why each test is important, and the consequences of failing this or that test. At least it would be useful for me :p Thank you! Sole ‐‐‐ Original Message ‐‐‐ On Monday, May 10, 2021 3:28 PM, Sole Galli via scikit-learn wrote: > Hello everyone, > > I am trying to get Feature-engine transformers pass the check_estimator tests > and there is one test, that I am not too sure what it is intended for. > > The transformers fail the check_transformer_data_not_an_array because the > input is a _NotAnArray class, and Feature-engine transformers don't like that. > > What is this check intended for? Is it to ensure compatibility with some > other sklearn class? if yes, which ones? > > I would appreciate any info or links to docs/ issues. > > Thanks a lot! > > Sole___ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
[scikit-learn] check_estimator _NotAnArray
Hello everyone, I am trying to get Feature-engine transformers pass the check_estimator tests and there is one test, that I am not too sure what it is intended for. The transformers fail the check_transformer_data_not_an_array because the input is a _NotAnArray class, and Feature-engine transformers don't like that. What is this check intended for? Is it to ensure compatibility with some other sklearn class? if yes, which ones? I would appreciate any info or links to docs/ issues. Thanks a lot! Sole___ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
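As far as I can tell, the point of this check is that estimators should accept array-like inputs, that is, objects that are not NumPy arrays but can be converted to one, and behave as they do with plain arrays. The sketch below uses a made-up `ArrayLikeStandIn` class (not scikit-learn's private `_NotAnArray`) to show what such an input looks like and how `np.asarray` turns it into an ndarray.

```python
import numpy as np


class ArrayLikeStandIn:
    """Hypothetical stand-in: not an ndarray, but convertible to one."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this hook when asked to convert the object
        return np.asarray(self._data, dtype=dtype)


X_like = ArrayLikeStandIn([[1.0, 2.0], [3.0, 4.0]])
X_arr = np.asarray(X_like)  # conversion a transformer could do on entry

print(isinstance(X_like, np.ndarray), isinstance(X_arr, np.ndarray))
```

A transformer that assumes a DataFrame or an ndarray without converting first would fail on `X_like`, which may be what happens in the failing test.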
[scikit-learn] IterativeImputer
Hello team, I am reading in some of the MICE original articles that supposedly, each variable should be modelled upon the other ones in the data, with a suitable model. So for example, if the variable with NA is binary, it should be modelled with classification, or if continuous with a regression model. Am I correct to understand that this is not possible yet with the IterativeImputer? because I should set the estimator in the estimator parameter and that will be used for all variables. Is there a workaround? Thanks a lot! Regards Soledad Galli https://www.trainindata.com/___ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
Re: [scikit-learn] sample_weight vs class_weight
Thank you guys! very helpful :)

Soledad Galli
https://www.trainindata.com/

‐‐‐ Original Message ‐‐‐
On Friday, December 4, 2020 12:06 PM, mrschots wrote:

> I have been using both in time-series classification. I put an exponential
> decay in sample_weights AND class weights as a dictionary.
>
> BR/Schots
>
> On Fri, Dec 4, 2020 at 12:01, Nicolas Hug wrote:
>
>> Basically passing class weights should be equivalent to passing
>> per-class-constant sample weights.
>>
>>> why do some estimators allow to pass weights both as a dict in the init or
>>> as sample weights in fit? what's the logic?
>>
>> SW is a per-sample property (aligned with X and y) so we avoid passing those
>> to init because the data isn't known when initializing the estimator. It's
>> only known when calling fit. In general we avoid passing data-related info
>> into init so that the same instance can be fitted on any data (with
>> different number of samples, different classes, etc.).
>>
>> We allow to pass class_weight in init because the 'balanced' option is
>> data-agnostic. Arguably, allowing a dict with actual class values violates
>> the above argument (of not having data-related stuff in init), so I guess
>> that's where the logic ends ;)
>>
>> As to why one would use both, I'm not so sure honestly.
>>
>> Nicolas
>>
>> On 12/4/20 10:40 AM, Sole Galli via scikit-learn wrote:
>>
>>> Actually, I found the answer. Both seem to be optimising the loss function
>>> for the various algorithms, below I include some links.
>>>
>>> If, we pass class_weight and sample_weight, then the final cost / weight is
>>> a combination of both.
>>>
>>> I have a follow up question: in which scenario would we use both? why do
>>> some estimators allow to pass weights both as a dict in the init or as
>>> sample weights in fit? what's the logic? I found it a bit confusing at the
>>> beginning.
>>>
>>> Thank you!
>>>
>>> https://stackoverflow.com/questions/30805192/scikit-learn-random-forest-class-weight-and-sample-weight-parameters
>>>
>>> https://stackoverflow.com/questions/30972029/how-does-the-class-weight-parameter-in-scikit-learn-work/30982811#30982811
>>>
>>> Soledad Galli
>>> https://www.trainindata.com/
>>>
>>> ‐‐‐ Original Message ‐‐‐
>>> On Thursday, December 3, 2020 11:55 AM, Sole Galli via scikit-learn wrote:
>>>
>>>> Hello team,
>>>>
>>>> What is the difference in the implementation of class_weight and
>>>> sample_weight in those algorithms that support both? like random forest or
>>>> logistic regression?
>>>>
>>>> Are both modifying the loss function? in a similar way?
>>>>
>>>> Thank you!
>>>>
>>>> Sole
>
> --
> Schots
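Nicolas's remark that class weights should be equivalent to per-class-constant sample weights can be illustrated with the sketch below, which fits a LogisticRegression both ways on a synthetic imbalanced dataset. The weight of 5 for the minority class and the dataset settings are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=500, n_features=5, weights=[0.9, 0.1], random_state=0
)

# Variant 1: class weights given in the constructor
clf_cw = LogisticRegression(class_weight={0: 1.0, 1: 5.0}, max_iter=1000).fit(X, y)

# Variant 2: the same weights expanded to one value per sample, given to fit
sample_weight = np.where(y == 1, 5.0, 1.0)
clf_sw = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=sample_weight)

max_diff = np.abs(clf_cw.coef_ - clf_sw.coef_).max()
print(max_diff)
```

The two coefficient vectors should agree up to solver tolerance. When both arguments are supplied, my understanding is that the per-sample weight is multiplied by the weight of the sample's class.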
Re: [scikit-learn] sample_weight vs class_weight
Actually, I found the answer. Both seem to modify the loss function of the various algorithms; I include some links below.

If we pass both class_weight and sample_weight, then the final cost / weight is a combination of both.

I have a follow-up question: in which scenario would we use both? Why do some estimators allow passing weights both as a dict in the init and as sample weights in fit? What's the logic? I found it a bit confusing at the beginning.

Thank you!

https://stackoverflow.com/questions/30805192/scikit-learn-random-forest-class-weight-and-sample-weight-parameters

https://stackoverflow.com/questions/30972029/how-does-the-class-weight-parameter-in-scikit-learn-work/30982811#30982811

Soledad Galli
https://www.trainindata.com/

‐‐‐ Original Message ‐‐‐
On Thursday, December 3, 2020 11:55 AM, Sole Galli via scikit-learn wrote:

> Hello team,
>
> What is the difference in the implementation of class_weight and
> sample_weight in those algorithms that support both? like random forest or
> logistic regression?
>
> Are both modifying the loss function? in a similar way?
>
> Thank you!
>
> Sole
[scikit-learn] sample_weight vs class_weight
Hello team, What is the difference in the implementation of class_weight and sample_weight in those algorithms that support both? like random forest or logistic regression? Are both modifying the loss function? in a similar way? Thank you! Sole___ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
Re: [scikit-learn] imbalanced datasets return uncalibrated predictions - why?
Thank you guys, that was actually very helpful.

Best regards
Sole

Soledad Galli
https://www.trainindata.com/

‐‐‐ Original Message ‐‐‐
On Tuesday, November 17th, 2020 at 10:54 AM, Roman Yurchak wrote:

> On 17/11/2020 09:57, Sole Galli via scikit-learn wrote:
>
> > And I understand that it has to do with the cost function, because if we
> > re-balance the dataset with say class_weight = 'balanced', then the
> > probabilities seem to be calibrated as a result.
>
> As far as I know, logistic regression will have well calibrated
> probabilities even in the imbalanced case. However, with the default
> decision threshold at 0.5, some of the infrequent categories may never
> be predicted since their probability is too low.
>
> If you use class_weight = 'balanced' the probabilities will no longer
> be well calibrated, however you would predict some of those infrequent
> categories.
>
> See discussions in
> https://github.com/scikit-learn/scikit-learn/issues/10613 and linked issues.
>
> --
> Roman
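Roman's point can be illustrated with a rough check on synthetic data (a sketch, not a proof): compare the mean predicted probability of the positive class with the actual share of positives, with and without `class_weight='balanced'`. The data-generating choices below (5% positives, one feature shifted by 1 for the positive class) are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
y = (rng.random(n) < 0.05).astype(int)           # about 5% positives
X = (rng.standard_normal(n) + y).reshape(-1, 1)  # positives shifted by +1

plain = LogisticRegression().fit(X, y)
balanced = LogisticRegression(class_weight="balanced").fit(X, y)

prevalence = y.mean()
mean_p_plain = plain.predict_proba(X)[:, 1].mean()
mean_p_balanced = balanced.predict_proba(X)[:, 1].mean()

print(prevalence, mean_p_plain, mean_p_balanced)
```

Matching the base rate on average is only a coarse calibration check, but it shows the direction of the effect: the unweighted model's average prediction should stay close to the share of positives, while the re-weighted model's predictions should be shifted upwards.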
[scikit-learn] imbalanced datasets return uncalibrated predictions - why?
Hello team,

I am trying to understand why logistic regression returns uncalibrated probabilities, with values tending towards low probabilities for the positive (rare) cases, when trained on an imbalanced dataset.

I've read a number of articles; all seem to agree that this is the case, and many show empirical evidence, but none gives a mathematical demonstration. When I test it myself, I can see that this is indeed the case: logit on imbalanced datasets returns uncalibrated probabilities.

And I understand that it has to do with the cost function, because if we re-balance the dataset with, say, class_weight='balanced', then the probabilities seem to be calibrated as a result.

I was wondering if any of you knows a mathematical demonstration that supports this conclusion, or a clear explanation of why logit would return uncalibrated probabilities when trained on an imbalanced dataset?

Any link to a relevant article, video, presentation, etc., will be greatly appreciated.

Thanks a lot!

Sole
Re: [scikit-learn] Imputers and DataFrame objects
Did you have a look at the package feature-engine? It has its own imputers and encoders that allow you to select the columns to transform and return a dataframe. It also has an sklearn wrapper that wraps sklearn transformers so that they return a dataframe instead of a numpy array.

Cheers.

Sole

Original Message
On 18 Aug 2020, 13:56, Ram Rachum wrote:

> On Mon, Aug 17, 2020 at 8:55 PM Kevin Markham wrote:
>
>> Hi Ram,
>>
>> These are great questions!
>
> Thank you for the detailed answers.
>
>>> The task was to remove these irregularities. So for the "?" items, replace
>>> them with mean, and for the "one", "two" etc. replace with a numerical
>>> value.
>>
>> If your primary task is "data cleaning", then pandas is usually the optimal
>> tool. If "preprocessing your data for Machine Learning" is your primary
>> task, then scikit-learn is usually the optimal tool. There is some overlap
>> between what is considered "cleaning" and "preprocessing", but I mention
>> this distinction because it can help you decide what tool to use.
>
> Okay, but here's one example where it gets tricky. For a column with numbers
> written like "one", "two" and missing values "?", I had to do two things:
> Change them to numbers (1, 2), and then, instead of the missing values, add
> the most common element, or mean or whatever. When I tried to use
> LabelEncoder to do the first part, it complained about the missing values. I
> couldn't fix these missing values until the labels were changed to ints. So
> that put me in a frustrating Catch-22 situation, and all the while I'm
> thinking "It would be so much simpler to just write my own logic in a
> for-loop rather than try to get Pandas and scikit-learn working together."
>
> Any insights about that?
>
>>> For one, I couldn't figure out how to apply SimpleImputer on just one
>>> column in the DataFrame, and then get the results in the form of a
>>> dataframe.
>>
>> Like most scikit-learn transformers, SimpleImputer expects 2-dimensional
>> input. In your case, this would be a 1-column DataFrame (such as
>> df[['col']]) rather than a Series (such as df['col']).
>>
>> Also like most scikit-learn transformers, SimpleImputer outputs a NumPy
>> array. If you need the output to be a DataFrame, one option is to convert
>> the array to a pandas object and concatenate it to the original DataFrame.
>
> Well, I did do that in the `process_column` helper function in the code I
> linked to above. But it kind of felt like... What am I using a framework for
> to begin with? Because that kind of logistics is the reason I want to use a
> framework instead of managing my own arrays and imputing logic.
>
> Thanks for your help Kevin.
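A minimal sketch of the split Kevin describes, on a made-up two-column DataFrame (the column names `doors` and `price` are invented): pandas maps the spelled-out numbers and turns "?" into NaN, then SimpleImputer is applied to a 1-column DataFrame and the result is written back into the same column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame(
    {
        "doors": ["two", "four", "?", "two"],
        "price": [10.0, np.nan, 30.0, 20.0],
    }
)

# Cleaning step in pandas: spelled-out numbers become numeric,
# and the unmapped "?" becomes NaN
df["doors"] = df["doors"].map({"two": 2, "four": 4})

# Imputation step in scikit-learn: 2-D input (df[["col"]]), written back in place
df[["doors"]] = SimpleImputer(strategy="most_frequent").fit_transform(df[["doors"]])
df[["price"]] = SimpleImputer(strategy="mean").fit_transform(df[["price"]])

print(df)
```

Doing the mapping first sidesteps the Catch-22 described above, since the imputer only ever sees numeric values and NaN.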
Re: [scikit-learn] climate friendly software licence
Hi Olivier, Gabriel, and further team, Thank you so much for your views. I understand enforcement is an issue. And I don't have yet an answer on if and how the license could be enforced. I also think that this is a second step. First would be making the use of the software illegal. This would de-legitimise these companies from using these packages, which would then hopefully prevent these companies from presenting their destructive work in open source meetings like pydata, or openly hosting tech hub communities where they share the use of this software in an attempt to recruit talent, because now the use of the software is illegal. It would also make organisations like NumFocus stop accepting fossil fuel companies as sponsors, as they did in London 2019 and giving them a space to promote their work. Technical people may also ask twice before joining these companies, if now the use of software is not allowed, even at face value. So I think, even if the license can't be enforced, it does have some power. But, as I said, at the moment I know very little of enforcement and whether package developers could get sued for adding this restriction. Yes, there is a lot we can do as individuals to decrease our carbon footprint, some of us do, and certainly we should put the right people in power, but individual effort is not enough and electing politicians happens only every so many years. We need to do more than that, because the climate situation is very precarious and very urgent unfortunately. Art organisations, newspapers, some banks and many pensions are cutting ties with fossil fuel companies. I think tech should take the plunge as well. If this is not the right way, would you have any suggestions? Cheers Sole ‐‐‐ Original Message ‐‐‐ On Monday, June 29, 2020 3:50 PM, Olivier Grisel wrote: > Hi Sole, > > I personally support climate change actions very much and I am > convinced climate change is the number 1 challenge of our time. 
> In an attempt to act in a consistent way with that belief, I declined
> several times to keynote at conferences either organized by the fossil
> fuel industry or to conferences that would have required me to fly a
> long distance to give a presentation.
>
> However, I don't think software licensing is the right tool to advance this
> cause.
>
> How would we enforce it? What would happen if we don't enforce it? Who
> is "we", especially when our library is embedded in 3rd-party
> software products and the end-users are not necessarily aware of all
> the upstream dependencies?
>
> What about gray cases, e.g. a company that does not do fossil fuel
> extraction directly per se but works as a consultancy with a majority of
> customers in the fossil fuel extraction industry? What if a
> significant part of their consultancy is to help them detect methane
> leaks in satellite data? How would we audit this? With which
> resources? How would we get a consensual decision on those gray cases?
>
> What about the hypocrisy of using or contributing to software under
> that license while regularly using fossil fuel powered transportation
> or working or living in a building heated with fossil fuels? Or
> buying goods transported this way over long distances?
>
> Instead, I would rather encourage everyone to vote for legislators and
> governments that progressively set bans on the development and
> commercialization of fossil fuel based technologies and to voice your
> support for such legislation in public debates. I encourage everybody
> to look twice before accepting to work for a company involved in
> fossil fuel extraction one way or another or involved in fossil-fuel
> intensive activities.
>
[scikit-learn] climate friendly software licence
Hello Scikit-learn team, I've come across this: https://twitter.com/tristanharris/status/1277136696568508418?s=12 Basically, it is an initiative to include in software license a prohibition of use by fossil fuel extractivist companies. I would like to know your views on this? Is this something that you would pick up from Scikit-learn? Are there some legal concerns to be aware of? or anything else that should be considered? Because it sounds quite powerful and straightforward to me. I would be really keen to hear from you. Thanks a lot Sole___ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn