Re: [scikit-learn] why the modification in the df-idf formula?

2024-05-28 Thread Sebastian Raschka
-- Sebastian Raschka, PhD Machine learning and AI researcher, https://sebastianraschka.com Staff Research Engineer at Lightning AI, https://lightning.ai On May 28, 2024 at 9:43 AM -0500, Sole Galli via scikit-learn wrote: > Hi guys, > > I'd like to understand why sklearn

Re: [scikit-learn] New core developer: Tim Head

2023-03-08 Thread Sebastian Raschka
Awesome news! Congrats Tim! Cheers, Sebastian On Mar 8, 2023, 8:35 AM -0600, Ruchika Nayyar , wrote: > Congratulations Tim! Good to see you virtually :) > > Thanks, > Ruchika > > > Dr. Ruchika Nayyar > Data Scientist, Greene Tweed & Co. > > > > On Wed, Mar 8, 2023 at 5:09 

Re: [scikit-learn] [ANNOUNCEMENT] scikit-learn 1.0 release

2021-09-24 Thread Sebastian Raschka
A 1.0 release is huge, and this is really awesome news! Very exciting! Congrats to the scikit-learn team and everyone who helped make this possible! Cheers, Sebastian On Sep 24, 2021, 11:40 AM -0500, Adrin wrote: > Hi everyone, > > We're happy to announce the 1.0 release which you can install

Re: [scikit-learn] Regarding negative value of sklearn.metrics.r2_score and sklearn.metrics.explained_variance_score

2021-08-12 Thread Sebastian Raschka
The R2 function in scikit-learn works fine. A negative value means that the regression model fits the data worse than a horizontal line representing the sample mean. E.g. you usually get that if you are overfitting the training set a lot and then apply that model to the test set. The econometrics book
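To make the point about negative R2 concrete, here is a minimal sketch with made-up numbers (the arrays are illustrative, not from the thread). Predicting the sample mean gives R2 = 0, and predictions that are worse than the mean give a negative value.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy targets (illustrative values only)
y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Baseline: always predict the sample mean (the "horizontal line")
mean_pred = np.full_like(y_true, y_true.mean())

# A model that is worse than the baseline: predictions in reversed order
bad_pred = y_true[::-1].copy()

r2_mean = r2_score(y_true, mean_pred)  # equals 0 for the mean baseline
r2_bad = r2_score(y_true, bad_pred)    # negative: worse than the baseline

print(r2_mean, r2_bad)
```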

Re: [scikit-learn] Presented scikit-learn to the French President

2020-12-05 Thread Sebastian Raschka
This is really awesome news! Thanks a lot to everyone developing scikit-learn. I am just wrapping up another successful semester, teaching students ML basics. Most coming from an R background, they really loved scikit-learn and appreciated its ease of use and well-thought-out API. Best, Sebast

Re: [scikit-learn] make_classification question

2020-08-12 Thread Sebastian Raschka
Hi Anna, You can set shuffle=False (it's set to True by default in the make_classification function). Then, the resulting features will be sorted as follows: X[:, :n_informative + n_redundant + n_repeated]. I.e., if you set “n_features=1000” and “n_informative=20”, the first 20 features will b
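A small sketch of the column layout described above, using arbitrary sizes chosen for illustration (10 features instead of 1000). With shuffle=False, the informative, redundant, and repeated columns come first and the remaining columns are noise.

```python
import numpy as np
from sklearn.datasets import make_classification

# Sizes are arbitrary and only serve the illustration
n_informative, n_redundant, n_repeated = 3, 2, 1

X, y = make_classification(
    n_samples=200,
    n_features=10,
    n_informative=n_informative,
    n_redundant=n_redundant,
    n_repeated=n_repeated,
    shuffle=False,   # keep the feature columns in their generated order
    random_state=0,
)

n_useful = n_informative + n_redundant + n_repeated
X_useful = X[:, :n_useful]   # informative + redundant + repeated columns
X_noise = X[:, n_useful:]    # remaining columns are random noise

# The repeated column (index 5 here) duplicates one of the first five columns
repeated_is_copy = any(np.allclose(X[:, 5], X[:, j]) for j in range(5))
print(X_useful.shape, X_noise.shape, repeated_is_copy)
```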

Re: [scikit-learn] The exact formula used to compute the tf-idf

2020-02-01 Thread Sebastian Raschka
Hi there, unfortunately I currently don't have time to walk through your example, but I wrote down how the Tf-idf in sklearn works using some examples here: https://github.com/rasbt/pattern_classification/blob/90710922e4f4d7e3f432221b8a4d2ec1dd2d9dc9/machine_learning/scikit-learn/tfidf_scikit-le

Re: [scikit-learn] What are the stopwords used by CountVectorizer?

2020-01-27 Thread Sebastian Raschka
Hi Peng, check out https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/_stop_words.py Best, Sebastian > On Jan 27, 2020, at 2:30 PM, Peng Yu wrote: > > Hi, > > I don't see what stopwords are used by CountVectorizer with > stop_wordsstring = ‘english’. > > htt
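Besides reading the source file linked above, the list can also be inspected from Python. This is a minimal sketch using the public ENGLISH_STOP_WORDS constant and the vectorizer's get_stop_words method.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer

# The built-in English list is exposed as a frozenset
vect = CountVectorizer(stop_words="english")
stop_words = vect.get_stop_words()

# Show the size of the list and a few entries
print(len(stop_words), sorted(stop_words)[:5])
```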

Re: [scikit-learn] scikit-learn twitter account

2019-11-04 Thread Sebastian Raschka
I think that a twitter account for scikit-learn would be awesome. I could envision it for announcements (new features, package releases, etc.), but it would be cool to share interesting applications of scikit-learn, upcoming events (tutorials, conference talks) as well -- somewhat similar to wha

Re: [scikit-learn] Can we say stochastic gradient descent as an ML model?

2019-10-28 Thread Sebastian Raschka
Hi Bulbul, I would rather say SGD is a method for optimizing the objective function of certain ML models, or optimize the loss function of certain ML models / learn the parameters of certain ML models. Best, Sebastian > On Oct 28, 2019, at 4:00 PM, Bulbul Ahmmed via scikit-learn > wrote: >

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-10-06 Thread Sebastian Raschka
igure?). > > > On 10/6/19 10:40 AM, Sebastian Raschka wrote: >> Sure, I just ran an example I made with graphviz via plot_tree, and it looks >> like there's an issue with overlapping boxes if you use class (and/or >> feature) names. I made a reproducible example here so

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-10-06 Thread Sebastian Raschka
_tree/tree-demo-1.ipynb Happy to add this to the sklearn issue list if there's no issue filed for that yet. Best, Sebastian > On Oct 6, 2019, at 9:10 AM, Andreas Mueller wrote: > > > > On 10/4/19 11:28 PM, Sebastian Raschka wrote: >> The docs show a way such that yo

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-10-04 Thread Sebastian Raschka
g on your computer. > That's a lot work for just one plot. Is there something like a matplotlib? > > Thanks! > > On Fri, Oct 4, 2019 at 9:42 PM Sebastian Raschka > wrote: > Yeah, think of it more as a computational workaround for achieving the same > thing more eff

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-10-04 Thread Sebastian Raschka
think I get it. > > It's just have never seen it this way. Quite different from what I'm used in > Elements of Statistical Learning. > > On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka > wrote: > Not sure if there's a website for that. In any case, to explain

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-10-04 Thread Sebastian Raschka
't understand your answer. > > Why after one-hot-encoding it still outputs greater than 0.5 or less than? > Does sklearn website have a working example on categorical input? > > Thanks! > > On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka > wrote: > Like Nicolas s

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-10-04 Thread Sebastian Raschka
>> and split at 0.5. This is not right. Perhaps, I'm doing something wrong? >> >> Is there a good toy example on the sklearn website? I am only see this: >> https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html >> <https

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-10-04 Thread Sebastian Raschka
Hi, > The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, > Audi=2) as numerical values, not category. The tree splits at 0.5 and 1.5 That's not a one-hot encoding then. For an Audi datapoint, it should be BMW=0 Toyota=0 Audi=1, for BMW it should be BMW=1 Toyota=0 Audi=0, and for Toyota

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-09-13 Thread Sebastian Raschka
ionTreeClassifier()? > > Best, > > Mike > > On Fri, Sep 13, 2019 at 11:59 PM Sebastian Raschka > wrote: > Hi, > > if you have the category "car" as shown in your example, this would > effectively be something like > > BMW=0 > Toyota=1 > A

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-09-13 Thread Sebastian Raschka
Hi, if you have the category "car" as shown in your example, this would effectively be something like BMW=0 Toyota=1 Audi=2 Sure, the algorithm will execute just fine on the feature column with values in {0, 1, 2}. However, the problem is that it will come up with binary rules like x_i>= 0.5,
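The binary rules mentioned above can be seen directly in a fitted tree. The following is a toy sketch with an invented label pattern (not the poster's data): the integer-coded "car" column is treated as a number, so the tree splits at the midpoints 0.5 and 1.5.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Integer-coded category: BMW=0, Toyota=1, Audi=2 (toy data, repeated 5 times)
X = np.array([[0], [1], [2]] * 5, dtype=float)
# Invented labels: only Toyota belongs to class 1
y = np.array([0, 1, 0] * 5)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Internal (non-leaf) nodes have a feature index >= 0
is_split = clf.tree_.feature >= 0
thresholds = sorted(clf.tree_.threshold[is_split])

# The tree treats the codes as ordered numbers and cuts between them
print(thresholds)
```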

Re: [scikit-learn] No convergence warning in logistic regression

2019-08-30 Thread Sebastian Raschka
Hi Ben, I can recall seeing convergence warnings for scikit-learn's logistic regression model on datasets in the past as well. Which solver did you use for LogisticRegression in sklearn? If you haven't done so, have you tried the lbfgs solver? I.e., LogisticRegression(..., solver='lbfgs')? Best, S

Re: [scikit-learn] Predict Method of OneVsRestClassifier Integration with Google Cloud ML

2019-04-10 Thread Sebastian Raschka
roblem that I would have to have that custom estimator defined on the Cloud > ML end, which I'm unsure how to do. > > Thanks, > Liam > > On Wed, Apr 10, 2019 at 2:06 PM Sebastian Raschka > wrote: > Hi Liam, > > not sure what your exact error message is, but it may

Re: [scikit-learn] Predict Method of OneVsRestClassifier Integration with Google Cloud ML

2019-04-10 Thread Sebastian Raschka
Hi Liam, not sure what your exact error message is, but it may also be that the XGBClassifier only accepts dense arrays? I think the TfidfVectorizer returns sparse arrays. You could probably fix your issues by inserting a "DenseTransformer" into your pipeline (a simple class that just transform
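A rough sketch of what such a "DenseTransformer" could look like. The class below is a hypothetical helper written for illustration, and GaussianNB is only a stand-in for an estimator that needs dense input (XGBClassifier is not used here).

```python
import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline


class DenseTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical helper: converts sparse matrices to dense NumPy arrays."""

    def fit(self, X, y=None):
        # Nothing to learn; present only for pipeline compatibility
        return self

    def transform(self, X):
        return X.toarray() if sparse.issparse(X) else np.asarray(X)


# Toy documents and labels (illustrative only)
docs = ["good movie", "great film", "bad movie", "awful film"]
labels = [1, 1, 0, 0]

# GaussianNB rejects sparse input, so the dense step sits in between
pipe = make_pipeline(TfidfVectorizer(), DenseTransformer(), GaussianNB())
pipe.fit(docs, labels)
preds = pipe.predict(["great movie"])

dense = DenseTransformer().fit_transform(TfidfVectorizer().fit_transform(docs))
print(type(dense), preds)
```

Note that converting a large tf-idf matrix to a dense array can use a lot of memory, so this is only practical for small vocabularies.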

Re: [scikit-learn] GridsearchCV returns worse scoring the broader parameter space gets

2019-03-31 Thread Sebastian Raschka
Hi Andreas, the best score is determined by computing the test fold performance (I think R^2 by default) and then averaging over them. Since you chose cv=10, you have 10 test folds, and the performance is the average performance over those for choosing the best hyper parameter setting. Then,
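The averaging over test folds can be checked on a toy problem. This is a sketch with a synthetic dataset and a Ridge model chosen only for illustration; it reads the per-fold scores from cv_results_ and compares their mean to best_score_.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data and parameter grid are assumptions for this example
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

n_folds = 10
gs = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=n_folds).fit(X, y)

# Test-fold scores (R^2 for regressors by default) of the best setting
fold_scores = np.array(
    [gs.cv_results_[f"split{i}_test_score"][gs.best_index_] for i in range(n_folds)]
)

# best_score_ is the mean of those test-fold scores
print(fold_scores.mean(), gs.best_score_)
```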

Re: [scikit-learn] What theory cause SGDRegressor can partial_fit but RandomForestRegressor can't?

2019-03-13 Thread Sebastian Raschka
It's not necessarily unique to stochastic gradient descent, it's more that some other algorithms are generally not well suited for "partial_fit". For SGD, partial fit is a more natural thing to do since you estimate the training loss from minibatches anyway -- i.e., you do SGD step by step anywa

Re: [scikit-learn] AUCROC/MAP confidence intervals in scikit

2019-02-07 Thread Sebastian Raschka
ibution independent and doesn't need > bootstrapping, so it looks indeed quite nice. > > > On 2/6/19 1:19 PM, Sebastian Raschka wrote: > > Hi Stuart, > > > > I don't think so because there is no standard way to compute CI's. That > > go

Re: [scikit-learn] AUCROC/MAP confidence intervals in scikit

2019-02-06 Thread Sebastian Raschka
Hi Stuart, I don't think so because there is no standard way to compute CI's. That goes for all performance measures (accuracy, precision, recall, etc.). Some people use simple binomial approximation intervals, some people prefer bootstrapping etc. And it also depends on the data you have. In l

Re: [scikit-learn] Does model consider about previous training results after reloading model and then training with new data?

2019-01-31 Thread Sebastian Raschka
12:52 AM, lampahome wrote: > > > > Sebastian Raschka 於 2019年2月1日 週五 下午1:48寫道: > Hi there, > > if you call the "fit" method, the learning will essentially start from > scratch. So no, it doesn't consider previous training results. > However, certain alg

Re: [scikit-learn] Does model consider about previous training results after reloading model and then training with new data?

2019-01-31 Thread Sebastian Raschka
Hi there, if you call the "fit" method, the learning will essentially start from scratch. So no, it doesn't consider previous training results. However, certain algorithms are implemented with an additional partial_fit method that would consider previous training rounds. Best, Sebastian > On
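A short sketch of the difference, using SGDClassifier as one example of an estimator that has partial_fit (the data is synthetic). Calling fit a second time gives the same coefficients as a fresh model trained only on the second batch, whereas partial_fit continues from the current weights.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X1, y1, X2, y2 = X[:100], y[:100], X[100:], y[100:]

# fit() twice: the second call discards what was learned from batch 1
refit = SGDClassifier(random_state=0).fit(X1, y1)
refit.fit(X2, y2)

# A fresh model that has only ever seen batch 2
fresh = SGDClassifier(random_state=0).fit(X2, y2)

# partial_fit(): the second call continues from the weights after batch 1
incremental = SGDClassifier(random_state=0)
incremental.partial_fit(X1, y1, classes=np.unique(y))
incremental.partial_fit(X2, y2)

print(np.allclose(refit.coef_, fresh.coef_))
```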

Re: [scikit-learn] LogisticRegression coef_ greater than n_features?

2019-01-08 Thread Sebastian Raschka
> array(['American', 'Southwest'], dtype=object) > > > > On Tue, Jan 8, 2019 at 9:51 AM pisymbol wrote: > If that is the case, what order are the coefficients in then? > > -aps > > On Tue, Jan 8, 2019 at 12:48 AM Sebastian Raschka > wrote: >

Re: [scikit-learn] LogisticRegression coef_ greater than n_features?

2019-01-07 Thread Sebastian Raschka
E.g., if you have a feature with values 'a' , 'b', 'c', then applying the one hot encoder will transform this into 3 features. Best, Sebastian > On Jan 7, 2019, at 11:02 PM, pisymbol wrote: > > > > On Mon, Jan 7, 2019 at 11:50 PM pisymbol wrote: > According to the doc (0.20.2) the coef_ vari
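As a quick illustration of the 'a', 'b', 'c' case (toy data), the following sketch shows one input column becoming three output columns after one-hot encoding, which is why coef_ can have more entries than the original number of features.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical feature with three distinct values
X = np.array([["a"], ["b"], ["c"], ["a"]])

enc = OneHotEncoder()
X_onehot = enc.fit_transform(X).toarray()  # sparse by default, densified here

# One input column, three output columns (one per category)
print(enc.categories_, X_onehot.shape)
```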

Re: [scikit-learn] LogisticRegression coef_ greater than n_features?

2019-01-07 Thread Sebastian Raschka
Maybe check a) if the actual labels of the training examples don't start at 0 b) if you have gaps, e.g., if your unique training labels are 0, 1, 4, ..., 23 Best, Sebastian > On Jan 7, 2019, at 10:50 PM, pisymbol wrote: > > According to the doc (0.20.2) the coef_ variables are supposed to be s

Re: [scikit-learn] How GridSearchCV to get best_params?

2019-01-03 Thread Sebastian Raschka
I think it refers to the test folds via the k-fold cross-validation that is internally used via the `cv` parameter of GridSearchCV (or the test folds of an alternative cross validation scheme that you may pass as an iterator to cv) Best, Sebastian > On Jan 3, 2019, at 9:44 PM, lampahome wrote:

Re: [scikit-learn] Any way to tune the parameters better than GridSearchCV?

2018-12-24 Thread Sebastian Raschka
I would like to make a related suggestion, but instead of focusing on the upper bound for the number of trees, I would rather focus on choosing the lower bound. From a theoretical perspective, it doesn't make sense to me how fewer trees can result in a better performing random forest model in terms of generali

Re: [scikit-learn] time complexity of tree-based model?

2018-12-20 Thread Sebastian Raschka
Say n is the number of examples and m is the number of features, then a naive implementation of a balanced binary decision tree is O(m * n^2 log n). I think scikit-learn's decision trees cache the sorted features, so this reduces to O(m * n log n). Then, to your O(m * n log n) you can multiply th

Re: [scikit-learn] plan to add the association rule classification algorithm in scikit learn

2018-12-16 Thread Sebastian Raschka
Hi Rui, I agree with Joel that association rule mining could be a bit tricky to fit nicely within the scikit-learn API. Maybe this could be some transformer class? I thought about that a few years ago but remember that I couldn't come up with a good solution at that point. In any case, I have

Re: [scikit-learn] make all new parameters keyword-only?

2018-11-15 Thread Sebastian Raschka
Also want to say that I really welcome this decision/change. Personally, as far as I am aware, I've been trying to use keyword arguments consistently for years, except for cases where it is really obvious, like .fit(X_train, y_train), and I believe that it really helped me regarding writing less

Re: [scikit-learn] How does the random state influence the decision tree splits?

2018-10-28 Thread Sebastian Raschka
ki wrote: > Just a small side note that I've come across with Random Forests which in the > end form an ensemble of Decision Trees. I ran a thousand iterations of RFs on > multi-label data and managed to get a 4-10 percentage points difference in > subset accuracy, depending o

Re: [scikit-learn] How does the random state influence the decision tree splits?

2018-10-27 Thread Sebastian Raschka
nt? > > I’d at least try that before diving into the source code... > > Cheers, > > -- > Julio > >> El 28 oct 2018, a las 2:24, Sebastian Raschka >> escribió: >> >> Thanks, Javier, >> >> however, the max_features is n_features by def

Re: [scikit-learn] How does the random state influence the decision tree splits?

2018-10-27 Thread Sebastian Raschka
z wrote: > > Hi Sebastian, > > I think the random state is used to select the features that go into each > split (look at the `max_features` parameter) > > Cheers, > Javier > > On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka > wrote: > Hi all, >

[scikit-learn] How does the random state influence the decision tree splits?

2018-10-27 Thread Sebastian Raschka
Hi all, when I was implementing a bagging classifier based on scikit-learn's DecisionTreeClassifier, I noticed that the results were not deterministic and found that this was due to the random_state in the DecisionTreeClassifier (which is set to None by default). I am wondering what exactly t

Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-10-03 Thread Sebastian Raschka
The ONNX-approach sounds most promising, esp. because it will also allow library interoperability but I wonder if this is for parametric models only and not for the nonparametric ones like KNN, tree-based classifiers, etc. All-in-all I can definitely see the appeal for having a way to export skl

Re: [scikit-learn] Splitting Method on RandomForestClassifier

2018-10-02 Thread Sebastian Raschka
This is explained here http://scikit-learn.org/stable/modules/ensemble.html#random-forests: "In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among

Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-09-28 Thread Sebastian Raschka
> > > I think model serialization should be a priority. > > There is also the ONNX specification that is gaining industrial adoption and > that already includes open source exporters for several families of > scikit-learn models: > > https://github.com/onnx/onnxmltools Didn't know about that

Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-09-27 Thread Sebastian Raschka
Congrats everyone, this is awesome!!! I just started teaching an ML course this semester and introduced scikit-learn this week -- it was a great timing to demonstrate how well maintained the library is and praise all the efforts that go into it :). > I think model serialization should be a pri

Re: [scikit-learn] Contribute to Scikit-learn

2018-09-03 Thread Sebastian Raschka
Hi all, first of all, I think that having more feature selection capabilities in scikit-learn would be nice, especially, an algorithm from the wrapper category that also regards dependence/interaction between features. Regarding the SequentialFeatureSelection class... We actually decided to si

Re: [scikit-learn] ANN Scikit-learn 0.20rc1 release candidate available

2018-08-31 Thread Sebastian Raschka
That's awesome! Congrats and thanks everyone for all the work that went into this! Just finished reading through the What's New docs... Wow, that took a while -- here, in a positive sense ;). It's a huge release with lots of important fixes. It's great to see that you prioritized the maintenanc

Re: [scikit-learn] Unable to connect HDInsight hive to python

2018-08-12 Thread Sebastian Raschka
Hi Debu, since Azure HDInsight is a commercial service, their customer support should handle questions like this > On Aug 12, 2018, at 7:16 AM, Debabrata Ghosh wrote: > > Hi All, >Greetings ! Wish you are doing good ! I am just > reaching out to you in case if you hav

Re: [scikit-learn] Using GPU in scikit learn

2018-08-08 Thread Sebastian Raschka
Hi, scikit-learn doesn't support computations on the GPU, unfortunately. Specifically for random forests, there's CudaTree, which implements a GPU version of scikit-learn's random forests. It doesn't look like the library is actively developed (hard to tell whether that's a good thing or a bad

Re: [scikit-learn] Help with Pull Request( Checks failing)

2018-07-24 Thread Sebastian Raschka
I am not a core dev, but I think I can see what's wrong there (mostly Flake8 issues). Let me comment about that over there. > On Jul 24, 2018, at 7:34 PM, Prathusha Jonnagaddla Subramanyam Naidu > wrote: > > This is the link to the PR - > https://github.com/scikit-learn/scikit-learn/pull/1167

Re: [scikit-learn] RFE with logistic regression

2018-07-24 Thread Sebastian Raschka
In addition to checking n_iter_ and fixing the random seed as I suggested, maybe also try normalizing the features (e.g., z-scores via the StandardScaler) to see if that stabilizes the training Sent from my iPhone > On Jul 24, 2018, at 1:07 PM, Benoît Presles > wrote: > > I did the same tests

Re: [scikit-learn] RFE with logistic regression

2018-07-24 Thread Sebastian Raschka
Agreed. But then the setting is C=1e9 in this context (where C is the inverse regularization strength), so the regularization effect should be very small. Probably shouldn't matter much for convex optimization, but I would still try to a) set the random_state to some fixed value b) make sure

Re: [scikit-learn] New core dev: Joris Van den Bossche

2018-06-23 Thread Sebastian Raschka
That's great news! I am glad to hear that you joined the project, Joris Van den Bossche! I am a scikit-learn user (and sometimes contributor) and really appreciate all the time and effort that the core developers and contributors spend on maintaining and extending the library. Best regards, S

Re: [scikit-learn] Jeff Levesque: association rules

2018-06-11 Thread Sebastian Raschka
Hi Jeff, had a similar question 1-2 years ago and ended up using Chris Borgelt's C command line tools but for convenience, I also implemented basic association rule & frequent pattern mining in Python here: http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/ Best, Seb

Re: [scikit-learn] Supervised prediction of multiple scores for a document

2018-06-03 Thread Sebastian Raschka
Hi, > I quickly read about multinomial regression, is it something you recommend > I use? Maybe you think about something else? Multinomial regression (or Softmax Regression) should give you results somewhat similar to a linear SVC (or logistic regression with OvO or OvR). The theoretical d

Re: [scikit-learn] Supervised prediction of multiple scores for a document

2018-06-03 Thread Sebastian Raschka
sorry, I had a copy & paste error, I meant "LogisticRegression(..., multi_class='multinomial')" and not "LogisticRegression(..., multi_class='ovr')" > On Jun 3, 2018, at 5:19 PM, Sebastian Raschka > wrote: > > Hi, > >> I

Re: [scikit-learn] DBScan freezes my computer !!!

2018-05-13 Thread Sebastian Raschka
> So I suggest that there is a test version that shows a proper message when an > error occurs. I think the freezing that happens in your case is operating system specific and it would require some weird workarounds to detect at which RAM usage the combination of machine and operating system mi

Re: [scikit-learn] Breiman vs. scikit-learn definition of Feature Importance

2018-05-04 Thread Sebastian Raschka
Not sure how it compares in practice, but it's certainly more efficient to rank the features by impurity decrease rather than by OOB permutation performance, since you wouldn't need to a) compute the OOB performance (an extra inference pass) b) permute a feature column and do another inference pas

Re: [scikit-learn] Retracting model from the 'blackbox' SVM

2018-05-04 Thread Sebastian Raschka
Dear Wouter, for the SVM, scikit-learn wraps the LIBSVM and LIBLINEAR. I think the scikit-learn class SVC uses LIBSVM for every kernel. Since you are using the linear kernel, you could use the more efficient LinearSVC scikit-learn class to get similar results. I guess this in turn is easier to

Re: [scikit-learn] MLPClassifier - Softmax activation function

2018-04-18 Thread Sebastian Raschka
That's a good question since the outputs would be differently scaled if the logistic sigmoid vs the softmax is used in the output layer. I think you don't need to worry about setting anything though, since the "activation" only applies to the hidden layers, and the softmax is, regardless of "act

Re: [scikit-learn] Using KMeans cluster labels in KNN

2018-03-12 Thread Sebastian Raschka
Hi, If you want to predict the KMeans cluster membership, you can use KMeans' predict method instead of training a KNN model on the cluster assignments. This will be computationally more efficient and give you the correct assignment at the borders between clusters. Best, Sebastian > On Mar 12,

Re: [scikit-learn] Need help in dealing with large dataset

2018-03-05 Thread Sebastian Raschka
Like Guillaume suggested, you don't want to load the whole array into memory if it's that large. There are many different ways to deal with this. The most naive way would be to break up your NumPy array into smaller NumPy arrays and load them iteratively with a running accuracy calculatio

Re: [scikit-learn] custom loss function in RandomForestRegressor

2018-03-01 Thread Sebastian Raschka
nt Kendall's tau correlation coefficient and a combination of R, tau > and RMSE. :) > > On Mar 1, 2018 15:49, "Sebastian Raschka" wrote: > Hi, Thomas, > > as far as I know, it's all the same and doesn't matter, and you would get the > same splits, sinc

Re: [scikit-learn] custom loss function in RandomForestRegressor

2018-03-01 Thread Sebastian Raschka
the > impurities of the left and right split? In MSE class they are (sum_i^n > y_i)**2 where n is the number of samples in the respective split. It is not > clear how this is related to variance in order to adapt it for my purpose. > > Best, > Thomas > > > On Mar

Re: [scikit-learn] custom loss function in RandomForestRegressor

2018-03-01 Thread Sebastian Raschka
Hi, Thomas, in regression trees, minimizing the variance among the target values is equivalent to minimizing the MSE between targets and predicted values. This is also called variance reduction: https://en.wikipedia.org/wiki/Decision_tree_learning#Variance_reduction Best, Sebastian > On Mar 1
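The equivalence can be verified numerically for a single candidate split. This is a NumPy-only sketch with made-up target values: each child node predicts its own mean, so the MSE of the split equals the size-weighted variance of the two children.

```python
import numpy as np

# Toy target values, split into a left and a right child node
y = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
left, right = y[:3], y[3:]

# A regression tree predicts the mean of the targets in each leaf
pred = np.concatenate([np.full(len(left), left.mean()),
                       np.full(len(right), right.mean())])
mse = np.mean((y - pred) ** 2)

# Size-weighted variance of the children
weighted_var = (len(left) * left.var() + len(right) * right.var()) / len(y)

print(mse, weighted_var)
```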

Re: [scikit-learn] KMeans cluster

2018-02-20 Thread Sebastian Raschka
Inertia simply means the sum of the squared distances from sample points to their cluster centroid. The smaller the inertia, the closer the cluster members are to their cluster centroid (that's also what KMeans optimizes when choosing centroids). In this context, the elbow method may be helpful
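To connect the definition with the fitted attribute, here is a small sketch on synthetic blobs (the dataset and number of clusters are arbitrary). It recomputes the sum of squared distances by hand and compares it with inertia_.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Sum of squared distances from each sample to its assigned centroid
manual_inertia = np.sum((X - km.cluster_centers_[km.labels_]) ** 2)

print(manual_inertia, km.inertia_)
```

For the elbow method, the same computation would be repeated for several values of n_clusters and the inertia values plotted against them.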

Re: [scikit-learn] Applying clustering to cosine distance matrix

2018-02-12 Thread Sebastian Raschka
Hi, by default, the clustering classes from sklearn, (e.g., DBSCAN), take an [num_examples, num_features] array as input, but you can also provide the distance matrix directly, e.g., by instantiating it with metric='precomputed' my_dbscan = DBSCAN(..., metric='precomputed') my_dbscan.fit(my_dis
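A runnable version of the idea, with toy vectors and an eps value picked for this example only. The cosine distance matrix is computed first and then passed to DBSCAN with metric='precomputed'.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_distances

# Two groups of vectors pointing in roughly orthogonal directions (toy data)
X = np.array([
    [1.0, 0.0], [1.0, 0.05], [0.9, 0.0],
    [0.0, 1.0], [0.05, 1.0], [0.0, 0.8],
])

# Square [num_examples, num_examples] distance matrix
my_dist_matrix = cosine_distances(X)

# eps and min_samples are illustrative and need tuning on real data
my_dbscan = DBSCAN(eps=0.1, min_samples=2, metric="precomputed")
labels = my_dbscan.fit(my_dist_matrix).labels_

print(labels)
```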

Re: [scikit-learn] CountVectorizer: Additional Feature Suggestion

2018-01-28 Thread Sebastian Raschka
Good point Joel, and I actually forgot that you can set the norm param in the TfidfVectorizer, so one could basically do vect = TfidfVectorizer(use_idf=False, norm='l1') to have the CountVectorizer behavior but normalizing by the document length. Best, Sebastian > On Jan 28, 2018, at 1:29 AM,
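A quick check of that setting on two toy documents: with use_idf=False and norm='l1', each row holds the term counts divided by the document length, so every row sums to 1.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (illustrative only)
docs = ["apple banana banana", "cherry"]

vect = TfidfVectorizer(use_idf=False, norm="l1")
X = vect.fit_transform(docs).toarray()

# Columns follow the sorted vocabulary: apple, banana, cherry
row_sums = X.sum(axis=1)
print(X, row_sums)
```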

Re: [scikit-learn] CountVectorizer: Additional Feature Suggestion

2018-01-28 Thread Sebastian Raschka
Hi, Yacine, Just on a side note, you can set use_idf=False in the TfidfVectorizer and only normalize the vectors by their L2 norm. But yeah, the normalization you suggest might be really handy in certain cases. I am not sure though if it's worth making this another parameter in the CountVectorizer (which al

Re: [scikit-learn] a dataset suitable for logistic regression

2017-12-03 Thread Sebastian Raschka
As far as I know, no. But you could simply truncate the iris dataset for binary classification, e.g., from sklearn import datasets iris = datasets.load_iris() X = iris.data[:100] y = iris.target[:100] Best, Sebastian > On Dec 3, 2017, at 3:54 PM, Peng Yu wrote: > > Hi, iris is a three-class
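Written out as a script, the truncation looks as follows. The LogisticRegression fit at the end is an added illustration and not part of the original reply. The first 100 rows of iris contain only the first two classes (50 samples each).

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()

# Keep the first 100 samples: classes 0 and 1 only
X = iris.data[:100]
y = iris.target[:100]

# Illustrative binary classifier on the truncated data
clf = LogisticRegression(max_iter=1000).fit(X, y)

print(np.unique(y), clf.score(X, y))
```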

Re: [scikit-learn] How to get centroids from SciPy's hierarchical agglomerative clustering?

2017-10-20 Thread Sebastian Raschka
Independent from the implementation, and unless you use the 'centroid' or 'average linkage' method, cluster centroids don't need to be computed when performing the agglomerative hierarchical clustering. But you can always compute it manually by simply averaging all samples from a cluster (for e
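The manual averaging could be sketched like this with SciPy. The data is synthetic, and the choice of complete linkage and two flat clusters is an assumption made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two well-separated synthetic groups around (0, 0) and (5, 5)
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(5.0, 0.5, (20, 2))])

# Agglomerative clustering, then cut the tree into 2 flat clusters
Z = linkage(X, method="complete")
labels = fcluster(Z, t=2, criterion="maxclust")

# Centroid of a cluster = mean of all samples assigned to it
centroids = np.array([X[labels == k].mean(axis=0) for k in np.unique(labels)])

print(centroids)
```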

Re: [scikit-learn] 1. Re: unclear help file for sklearn.decomposition.pca

2017-10-16 Thread Sebastian Raschka
Oh, never mind my previous email, because while the components should be the same, the projection of the data points onto those components would still be affected by centering vs non-centering I guess. Best, Sebastian > On Oct 16, 2017, at 3:25 PM, Sebastian Raschka wrote: > > Hi

Re: [scikit-learn] 1. Re: unclear help file for sklearn.decomposition.pca

2017-10-16 Thread Sebastian Raschka
Hi, if you compute the principal components (i.e., eigendecomposition) from the covariance matrix, it shouldn't matter whether the data is centered or not, since the covariance matrix is computed as CovMat = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}) (x_i - \bar{x})^T where \bar{x} = vector o
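A NumPy-only check of the statement about the covariance matrix, on random data with an arbitrary offset. Note that np.cov divides by n - 1 rather than n, which rescales the matrix but does not affect the comparison.

```python
import numpy as np

# Random data shifted away from the origin (values are arbitrary)
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3)) + np.array([10.0, -5.0, 2.0])

# np.cov subtracts the mean internally, so centering beforehand changes nothing
cov_raw = np.cov(X, rowvar=False)
cov_centered = np.cov(X - X.mean(axis=0), rowvar=False)

print(np.allclose(cov_raw, cov_centered))
```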

Re: [scikit-learn] Combine already fitted models

2017-10-07 Thread Sebastian Raschka
me reason I thought we had a "prefit" parameter. > > I think we should. > > >> On 10/01/2017 07:39 PM, Sebastian Raschka wrote: >> Hi, Rares, >> >>> vc = VotingClassifier(...) >>> vc.estimators_ = [e1, e2, ...] >>> vc.le_ = ..

Re: [scikit-learn] Combine already fitted models

2017-10-01 Thread Sebastian Raschka
Hi, Rares, > vc = VotingClassifier(...) > vc.estimators_ = [e1, e2, ...] > vc.le_ = ... > vc.predict(...) > > But I am not sure it is recommended to modify the "private" estimators_ and > le_ attributes. I think that this may work if you don't call the fit method of the VotingClassifier after

Re: [scikit-learn] Combine already fitted models

2017-10-01 Thread Sebastian Raschka
Hi, Rares, > I am looking at VotingClassifier but it seems that it is expected that the > estimators are fitted when VotingClassifier.fit() is called. I don't see how > I can have already fitted classifiers combined under a VotingClassifier. I think the opposite is true: The classifiers provide

Re: [scikit-learn] Commercial use of ML algorithms and scikit-learn

2017-09-30 Thread Sebastian Raschka
Hi, Paul, I think there should be no issue with that as scikit-learn is distributed under a BSD 3-clause license as long as you uphold the terms of that license. It's a bit tricky to find that license note as it's not called "LICENSE" in the GitHub repo like it is usually done for open source project

Re: [scikit-learn] anti-correlated predictions by SVR

2017-09-26 Thread Sebastian Raschka
I'd agree with Gael that a potential explanation could be the distribution shift upon splitting (usually the smaller the dataset, the more this is of an issue). As potential solutions/workarounds, you could try a) stratified sampling for regression, if you'd like to stick with the 2-way holdout

Re: [scikit-learn] batch_size for small training sets

2017-09-24 Thread Sebastian Raschka
Small batch sizes are typically used to speed up the training (more iterations) and to avoid the issue that training sets usually don’t fit into memory. Okay, the additional noise from the stochastic approach may also be helpful to escape local minima and/or help with generalization performance

Re: [scikit-learn] Help needed

2017-09-14 Thread Sebastian Raschka
again for your advice. > > Li Yuan > > From: Sebastian Raschka > Sent: Thursday, September 14, 2017 9:36 PM > To: Scikit-learn mailing list > Subject: Re: [scikit-learn] Help needed > > Hi, Li, > > to me, it looks like you are importing matplotlib in your c

Re: [scikit-learn] Help needed

2017-09-14 Thread Sebastian Raschka
Hi, Li, to me, it looks like you are importing matplotlib in your code, but matplotlib is not being installed on the CI instances that are running the scikit-learn unit tests. Or in other words, the Travis instance is trying to execute an "import matplotlib..." and fails because matplotlib is n

Re: [scikit-learn] custom loss function

2017-09-13 Thread Sebastian Raschka
of new ANN architectures. I > am in urgent need to reproduce in Keras the results obtained with > MLPRegressor and the set of hyperparameters that I have optimized for my > problem and later change the loss function. > > > > On 13 September 2017 at 18:14, Sebastian Raschka wr

Re: [scikit-learn] custom loss function

2017-09-13 Thread Sebastian Raschka
> M the number of features? > > http://scikit-learn.org/stable/modules/svm.html#kernel-functions > > > > On 12 September 2017 at 00:37, Sebastian Raschka wrote: > Hi Thomas, > > > For the MLPRegressor case so far my conclusion was that it is not possible >

Re: [scikit-learn] custom loss function

2017-09-11 Thread Sebastian Raschka
Hi Thomas, > For the MLPRegressor case so far my conclusion was that it is not possible > unless you modify the source code. Also, I suspect that this would be non-trivial. I haven't looked too closely at how the MLPClassifier/MLPRegressor are implemented but since you perform the weight update

Re: [scikit-learn] control value range of MLPRegressor predictions

2017-09-10 Thread Sebastian Raschka
ote: > > > > On 10 September 2017 at 22:03, Sebastian Raschka wrote: > You could normalize the outputs (e.g., via min-max scaling). However, I think > the more intuitive way would be to clip the predictions. E.g., say you are > predicting house prices, it probably makes no

Re: [scikit-learn] control value range of MLPRegressor predictions

2017-09-10 Thread Sebastian Raschka
You could normalize the outputs (e.g., via min-max scaling). However, I think the more intuitive way would be to clip the predictions. E.g., say you are predicting house prices, it probably makes no sense to have a negative prediction, so you would clip the output at some value >0$ PS: -820 an
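
The clipping suggestion in the message above can be sketched in a few lines. This is plain Python so it runs without dependencies; the function name `clip_predictions` and the sample numbers are invented for illustration. On NumPy arrays, `numpy.clip` does the same job.

```python
def clip_predictions(preds, lower=None, upper=None):
    """Clamp each prediction into [lower, upper]; a bound of None is ignored."""
    clipped = []
    for p in preds:
        if lower is not None and p < lower:
            p = lower
        if upper is not None and p > upper:
            p = upper
        clipped.append(p)
    return clipped


# Hypothetical raw regressor outputs; a negative house price makes no sense.
raw = [-12.5, 150.0, 30.0]
clipped = clip_predictions(raw, lower=0.0)
```

Clipping only touches the out-of-range predictions, whereas rescaling the outputs changes every prediction.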

Re: [scikit-learn] combining datasets from different sources

2017-09-05 Thread Sebastian Raschka
Another approach would be to pose this as a "ranking" problem to predict relative affinities rather than absolute affinities. E.g., if you have data from one (or more) molecules that has/have been tested under 2 or more experimental conditions, you can rank the other molecules accordingly or no

Re: [scikit-learn] Problem found when testing DecisionTreeClassifier within the source folder

2017-09-04 Thread Sebastian Raschka
Hi, Hanna, I think Joel is right and the renaming is probably causing the issues. Instead of renaming the package to sklearn1, consider modifying, compiling, and installing sklearn in a virtual environment. I am not sure if you are using conda, in this case, creating a new virtual env for devel

Re: [scikit-learn] Random Forest Regressor criterion

2017-08-30 Thread Sebastian Raschka
Hi, regarding MSE minimization vs. variance reduction: it's been a few years but I remember that we had a discussion about that, where Gilles Louppe explained that those two are identical when I was confused about the Wikipedia equation at https://en.wikipedia.org/wiki/Decision_tree_learning#Va
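
A small numeric check of the equivalence mentioned above, assuming (as is usual for regression trees) that each node predicts the mean of its targets. Under that assumption the node's MSE is exactly the population variance of its targets, so the weighted child MSE and the weighted child variance coincide. The helper names and toy targets are made up for illustration.

```python
from statistics import fmean, pvariance


def node_mse(targets):
    """MSE of a node that predicts the mean of its own targets."""
    mean = fmean(targets)
    return fmean([(t - mean) ** 2 for t in targets])


def weighted_child_impurity(left, right, impurity):
    """Size-weighted average impurity of the two children of a split."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)


# Toy targets falling into the left and right child of a candidate split.
y_left = [1.0, 2.0, 3.0]
y_right = [10.0, 12.0]

mse_split = weighted_child_impurity(y_left, y_right, node_mse)
var_split = weighted_child_impurity(y_left, y_right, pvariance)
```

Since the parent's impurity is fixed for a given node, picking the split with the lowest weighted child MSE is the same as picking the one with the largest variance reduction.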

Re: [scikit-learn] imbalanced-learn 0.3.0 is chasing scikit-learn 0.19.0

2017-08-24 Thread Sebastian Raschka
Just read through the summary of the new features and browsed through the user guide. The guide is really well structured and easy to navigate, thanks for putting all the work into it. Overall, thanks for this great contribution and new version :) Best, Sebastian > On Aug 24, 2017, at 8:14 PM,

Re: [scikit-learn] scikit-learn 0.19.0 is out!

2017-08-11 Thread Sebastian Raschka
Yay, as an avid user, thanks to all the developers! This is a great release indeed -- no breaking changes (at least for my code base) and so many improvements and additions (that I need to check out in detail) :) > On Aug 12, 2017, at 1:14 AM, Gael Varoquaux > wrote: > > Hurray, thank you ev

Re: [scikit-learn] transform categorical data to numerical representation

2017-08-06 Thread Sebastian Raschka
rds, > Georg > > Joel Nothman schrieb am So., 6. Aug. 2017 um 00:49 > Uhr: > We are working on CategoricalEncoder in > https://github.com/scikit-learn/scikit-learn/pull/9151 to help users more > with this kind of thing. Feedback and testing is welcome. > > On 6

Re: [scikit-learn] transform categorical data to numerical representation

2017-08-05 Thread Sebastian Raschka
Hi, Georg, I bring this up every time here on the mailing list :), and you are probably aware of this issue, but it makes a difference whether your categorical data is nominal or ordinal. For instance if you have an ordinal variable with values like {small, medium, large} you probably want to
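
To illustrate the nominal vs. ordinal distinction from the message above, here is a sketch in plain Python. The integer mapping for {small, medium, large} and the color feature are assumptions chosen for illustration, not something stated in the thread.

```python
# Ordinal: the categories have an order, so map them to integers that keep it.
size_order = {"small": 0, "medium": 1, "large": 2}


def encode_ordinal(values, order):
    """Replace each category by its rank in the given order."""
    return [order[v] for v in values]


def encode_one_hot(values):
    """Nominal: no order, so give each category its own 0/1 column."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


sizes = ["medium", "small", "large"]   # ordinal feature
colors = ["red", "green", "red"]       # nominal feature (hypothetical)

size_codes = encode_ordinal(sizes, size_order)
color_categories, color_onehot = encode_one_hot(colors)
```

The ordinal codes preserve small < medium < large, while the one-hot columns avoid implying any order among the colors.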

Re: [scikit-learn] Classifiers for dataset with categorical features

2017-07-21 Thread Sebastian Raschka
't gotten traction. > Overshadowed by GBM & random forests? > > > On Fri, Jul 21, 2017 at 11:52 AM, Sebastian Raschka > wrote: >> Just to throw some additional ideas in here. Based on a conversation with a >> colleague some time ago, I think learning c

Re: [scikit-learn] Classifiers for dataset with categorical features

2017-07-21 Thread Sebastian Raschka
ifference imho. I.e., treating ordinal variables like continuous variables probably makes more sense than one-hot encoding them. Looking forward to the PR :) > On Jul 21, 2017, at 2:52 PM, Sebastian Raschka wrote: > > Just to throw some additional ideas in here. Based on a conversation w

Re: [scikit-learn] Classifiers for dataset with categorical features

2017-07-21 Thread Sebastian Raschka
Just to throw some additional ideas in here. Based on a conversation with a colleague some time ago, I think learning classifier systems (https://en.wikipedia.org/wiki/Learning_classifier_system) are particularly useful when working with large, sparse binary vectors (like from a one-hot encodin

Re: [scikit-learn] Max f1 score for soft classifier?

2017-07-17 Thread Sebastian Raschka
>> Does scikit have a function to find the maximum f1 score (and decision >> threshold) for a (soft) classifier? Hm, I don't think so. F1-score is typically used as evaluation metric; hence, it's something optimized via hyperparameter tuning. There's an interesting publication though, where the
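
For the question quoted above, a brute-force sketch: try every distinct score as a decision threshold and keep the one with the highest F1. This is not the approach of the publication the message refers to, just a simple baseline in plain Python; the function names and toy data are invented. With scikit-learn, `precision_recall_curve` returns the precision/recall pairs per threshold, from which F1 can be computed in the same spirit.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 of the positive class when predicting 1 for score >= threshold."""
    tp = fp = fn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, scores):
    """Return (threshold, f1) maximizing F1 over all distinct scores."""
    best_t, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        f1 = f1_at_threshold(y_true, scores, t)
        if f1 > best_f1:          # ties keep the lowest threshold
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy labels and soft-classifier scores (illustrative only).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
threshold, f1 = best_f1_threshold(y_true, scores)
```

A threshold chosen this way should be picked on validation data, not on the test set used to report the final F1.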

Re: [scikit-learn] Replacing the Boston Housing Prices dataset

2017-07-06 Thread Sebastian Raschka
I think there can be some middle ground. I.e., adding a new, simple dataset to demonstrate regression (maybe auto-mpg, wine quality, or something like that) and use that for the scikit-learn examples in the main documentation etc., but leave the Boston dataset in the code base for now. Whether it's a weak

Re: [scikit-learn] [Feature] drop_one in one hot encoder

2017-06-25 Thread Sebastian Raschka
Hi, hm, I think that dropping a column in one-hot encoded features is quite uncommon in machine learning practice -- based on the applications and implementations I've seen. My guess is that the one-hot encoded features are multicollinear anyway!? There may be certain algorithms that benefit from
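
A tiny check of what the multicollinearity remark above refers to, as I read it: every row of a full one-hot encoding sums to 1, so any single column can be reconstructed from the others (which is why some linear models with an intercept drop one column). The helper name and categories are made up for illustration.

```python
def one_hot(values, categories):
    """Encode each value as a 0/1 row with one column per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]


categories = ["a", "b", "c"]
rows = one_hot(["a", "c", "b", "a"], categories)

# Each row has exactly one 1, so the columns always sum to 1 ...
row_sums = [sum(r) for r in rows]

# ... which means the last column is fully determined by the other columns.
reconstructed_last = [1 - sum(r[:-1]) for r in rows]
```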

Re: [scikit-learn] R user trying to learn Python

2017-06-18 Thread Sebastian Raschka
r me, I have some sense of machine learning, but none of Python. > > Unlike R, which is specifically for statistics analysis. Python is broad! > > Maybe some expert here with R can tell me how to go about this. :) > > On Sun, Jun 18, 2017 at 12:53 PM, Sebastian Raschka

Re: [scikit-learn] R user trying to learn Python

2017-06-18 Thread Sebastian Raschka
Hi, > I am extremely frustrated using this thing. Everything comes after a dot! Why > would you type the same thing at the beginning of every line. It's not > efficient. > > code 1: > y_sin = np.sin(x) > y_cos = np.cos(x) > > I know you can import the entire package without the "as np", but I s
