[scikit-learn] VOTE SLEP 17

2022-10-31 Thread Andreas Mueller
Hey Everybody! SLEP 17 (by Joel Nothman) introduces an __sklearn_clone__ protocol & method that allows estimators to overload what sklearn's clone function does. An implementation for this SLEP is
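
The dispatch idea behind such a protocol can be sketched in a few lines. This is only an illustration under stated assumptions: the `clone` function here is a simplified stand-in written for this example, not scikit-learn's implementation, and `PlainEstimator` and `FrozenEstimator` are hypothetical classes.

```python
import copy


def clone(estimator):
    """Simplified stand-in for a clone function (not scikit-learn's code)."""
    # If the estimator defines the protocol method, let it decide how to clone.
    if hasattr(estimator, "__sklearn_clone__"):
        return estimator.__sklearn_clone__()
    # Otherwise rebuild an unfitted copy from the constructor parameters.
    params = copy.deepcopy(estimator.get_params())
    return type(estimator)(**params)


class PlainEstimator:
    """Hypothetical estimator that relies on the default clone behaviour."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def get_params(self):
        return {"alpha": self.alpha}

    def fit(self):
        self.fitted_ = True
        return self


class FrozenEstimator(PlainEstimator):
    """Hypothetical estimator that wants clone() to keep its fitted state."""

    def __sklearn_clone__(self):
        return self


plain = PlainEstimator(alpha=0.5).fit()
frozen = FrozenEstimator(alpha=0.5).fit()

plain_copy = clone(plain)    # new, unfitted object with the same parameters
frozen_copy = clone(frozen)  # the estimator overrode cloning and returned itself
```

Putting the hook on the estimator keeps the decision about what "a clone" means with the class that knows its own state.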

Re: [scikit-learn] major league hacking summer internship program

2020-05-29 Thread Andreas Mueller
Thanks folks! That gives us a good start I think! Re documentation: honestly I'm not entirely sure if those are good issues because I'm not sure if we have consensus what we want to recommend. We can certainly include these but they require some decisions and a lot of expertise. Maybe we can

[scikit-learn] Vote: Add Adrin Jalali to the scikit-learn technical committee

2020-04-27 Thread Andreas Mueller
Hi All. Given all his recent contributions, I want to nominate Adrin Jalali to the Technical Committee: https://scikit-learn.org/stable/governance.html#technical-committee According to the governance document, this will require a discussion and vote. I think we can move to the vote

Re: [scikit-learn] Number of informative features vs total number of features

2020-04-03 Thread Andreas Mueller
Hi Ben. I'd recommend you check the code to see how the data is generated. Best, Andy On 4/3/20 7:00 AM, Benoît Presles wrote: Dear sklearn users, I have just checked if the generated features were independent by computing the covariance and correlation matrices and it seems they are, so I

Re: [scikit-learn] A basic question about kmeans algorithms elkan and llyod

2020-03-30 Thread Andreas Mueller
scikit-learn since elkan is an optimized one with better performance. Best regards, George From: scikit-learn On Behalf Of Andreas Mueller Sent: Saturday, March 28, 2020 12:37 AM To: scikit-learn@python.org Subject: Re: [scikit-learn] A basic question about kmeans algorithms el

Re: [scikit-learn] Analysis of sklearn and other python libraries on github by MS team

2020-03-30 Thread Andreas Mueller
Also see https://github.com/scikit-learn/scikit-learn/issues/14268 which is discussing how to make things faster *and* more stable! On 3/30/20 10:30 AM, Andreas Mueller wrote: On 3/27/20 6:20 PM, Gael Varoquaux wrote: Thanks for the link Andy. This is indeed very interesting! On Fri, Mar

Re: [scikit-learn] Analysis of sklearn and other python libraries on github by MS team

2020-03-30 Thread Andreas Mueller
On 3/27/20 6:20 PM, Gael Varoquaux wrote: Thanks for the link Andy. This is indeed very interesting! On Fri, Mar 27, 2020 at 06:10:28PM +0100, Roman Yurchak wrote: Regarding learners, Top-5 in both GH17 and GH19 are LogisticRegression, MultinomialNB, SVC, LinearRegression, and

Re: [scikit-learn] A basic question about kmeans algorithms elkan and llyod

2020-03-27 Thread Andreas Mueller
There's an interesting analysis in this paper: Fast K-Means with Accurate Bounds http://proceedings.mlr.press/v48/newling16.pdf On 3/26/20 3:40 AM, Alexandre Gramfort wrote: hi, I suspect Elkan is really winning when you have many centroids so the conclusion is not systematic my 2c Alex
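
Why Elkan tends to pay off with many centroids can be illustrated with the triangle-inequality test it relies on: if d(c_best, c_j) >= 2 * d(x, c_best), then c_j cannot be closer to x than c_best, so that distance need not be computed. The sketch below applies only this one test in a single assignment pass; the full algorithm also keeps per-point bounds across iterations, which is not shown. The function names and the toy data are made up for this example.

```python
import math


def assign_with_pruning(points, centers):
    """Assign each point to its nearest center, skipping provably useless distances."""
    k = len(centers)
    # Center-to-center distances are computed once and reused for every point.
    cc = [[math.dist(centers[i], centers[j]) for j in range(k)] for i in range(k)]
    labels, skipped = [], 0
    for x in points:
        best, d_best = 0, math.dist(x, centers[0])
        for j in range(1, k):
            # Triangle inequality: center j cannot beat the current best.
            if cc[best][j] >= 2 * d_best:
                skipped += 1
                continue
            d = math.dist(x, centers[j])
            if d < d_best:
                best, d_best = j, d
        labels.append(best)
    return labels, skipped


def assign_brute_force(points, centers):
    """Plain Lloyd-style assignment: every point against every center."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(x, centers[j]))
        for x in points
    ]


points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.2, 9.9)]
centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]

labels, skipped = assign_with_pruning(points, centers)
reference = assign_brute_force(points, centers)
```

The more centers there are, the more point-to-center distances this kind of test can avoid, at the price of the extra center-to-center matrix.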

[scikit-learn] Analysis of sklearn and other python libraries on github by MS team

2020-03-27 Thread Andreas Mueller
Hey all. There's a pretty cool paper by a team at MS that analyses public github repos for their use of sklearn and related libraries: https://arxiv.org/abs/1912.09536 Thought it might be of interest. Cheers, Andy ___ scikit-learn mailing list

Re: [scikit-learn] distances

2020-03-05 Thread Andreas Mueller
Thanks for a great summary of issues! I agree there's lots to do, though I think most of the issues that you list are quite hard and require thinking about API pretty hard. So they might not be super amenable to being solved by a shorter-term project. I was hoping there would be some more

Re: [scikit-learn] Need for multioutput multivariate algorithm for Random Forest in Python (using Mahalanobis distance)

2020-02-17 Thread Andreas Mueller
On 2/14/20 5:47 PM, Paul Chike Ofoche via scikit-learn wrote: Many thanks Nicolas and Andreas. I was wondering whether this multioutput handling capability of the RandomForestRegressor has been added recently. In order to verify, I went on a fact-finding mission by re-running the exact

Re: [scikit-learn] Need for multioutput multivariate algorithm for Random Forest in Python (using Mahalanobis distance)

2020-02-13 Thread Andreas Mueller
On 2/9/20 12:21 PM, Paul Chike Ofoche via scikit-learn wrote: Hello all, My name is Paul and I am enthused about data science. I have been using Python and other programming languages for close to two years. There is an issue that I have been facing since I began applying Python to the

Re: [scikit-learn] Why is subset invariance necessary for transfom()?

2020-01-21 Thread Andreas Mueller
On 1/21/20 8:23 PM, Charles Pehlivanian wrote: I understand - I'm kind of conflating the idea of data sample with test set, my view assumes there is a sample space of samples, might require rethinking the cross-validation setup... I also think that part of it relies on the notion of online

Re: [scikit-learn] ask a question about weights for features in svc with rbf kernel

2020-01-21 Thread Andreas Mueller
There is no coef_ for kernel SVMs. What exactly are you looking for? On 1/20/20 9:52 AM, Rujing Zha wrote: Hi Guillaume Is it OK for rbf kernel? As the document said: Weights assigned to the features (coefficients in the primal problem). This is only available in the case of a linear

Re: [scikit-learn] logistic regression results are not stable between solvers

2020-01-08 Thread Andreas Mueller
lit_train, sm.tools.add_constant(X_split_train))     logit_res = logit.fit(maxiter=2)     print("Coef statsmodels")     print(logit_res.params) On 11/10/2019 15:42, Andreas Mueller wrote: On 10/10/19 1:14 PM, Benoît Presles wrote: Thanks for your answe

Re: [scikit-learn] SVM-RFE

2019-12-04 Thread Andreas Mueller
PR welcome ;) On 12/3/19 11:02 PM, Brown J.B. via scikit-learn wrote: Tue, Dec 3, 2019, 5:36 Andreas Mueller: It does provide the ranking of features in the ranking_ attribute and it provides the cross-validation accuracies for all subsets in gri

Re: [scikit-learn] Version 0.21! and plot_tree!

2019-12-04 Thread Andreas Mueller
ewhowe.com> I live to learn, so I can learn to live. - me On Thu, May 23, 2019 at 4:24 PM Andreas Mueller wrote: Hey Andrew. Thanks for saying thanks! I share your frustration with export_graphviz, in par

Re: [scikit-learn] Vote on SLEP010: n_features_in_ attribute

2019-12-04 Thread Andreas Mueller
On 12/4/19 5:05 AM, Trevor Stephens wrote: Makes sense Joel, wasn't mentioned in the docs, so was a bit strange. Still feels a bit weird but I'm sure I'll adapt_in and thrive_out. Indeed, and as Joel said, we'll have n_features_out_ added soon. Having both is quite helpful in many
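
What an attribute like n_features_in_ buys can be sketched with a toy transformer: the width of the training data is recorded at fit time and checked again later. `MeanCenterer` is a hypothetical class written for this example and works on plain lists; it is not scikit-learn code and does not use the private _validate_data helper mentioned in the thread.

```python
class MeanCenterer:
    """Toy transformer that records how many features it saw during fit."""

    def fit(self, X):
        # Remember the number of columns seen at fit time.
        self.n_features_in_ = len(X[0])
        self.means_ = [sum(col) / len(X) for col in zip(*X)]
        return self

    def transform(self, X):
        for row in X:
            if len(row) != self.n_features_in_:
                raise ValueError(
                    f"X has {len(row)} features, but this transformer "
                    f"was fitted with {self.n_features_in_}."
                )
        return [[v - m for v, m in zip(row, self.means_)] for row in X]


centerer = MeanCenterer().fit([[1.0, 2.0], [3.0, 4.0]])
centered = centerer.transform([[1.0, 2.0]])

try:
    centerer.transform([[1.0, 2.0, 3.0]])
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```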

Re: [scikit-learn] Vote on SLEP010: n_features_in_ attribute

2019-12-04 Thread Andreas Mueller
I agree / recall that that was what we settled on. So a) but even more conservative ;) On 12/4/19 5:03 AM, Joel Nothman wrote: Oh... I remember what we landed up on, actually... we've made _validate_data private so downstream estimators can't technically expect to use it reliably across any

Re: [scikit-learn] ANN: scikit-learn 0.22 final release

2019-12-04 Thread Andreas Mueller
Maybe we can discuss this in https://github.com/scikit-learn/scikit-learn/issues/14386 ? I think I have come to agree that we should just do 1.0 and if we want to make any big changes that should be 2.0. On 12/4/19 6:19 AM, Andrew Howe wrote: That is an impressive roadmap, and I certainly

Re: [scikit-learn] Vote on SLEP010: n_features_in_ attribute

2019-12-03 Thread Andreas Mueller
+1 On 12/3/19 5:09 PM, Nicolas Hug wrote: As per our Governance document, changes to API principles are to be established through an Enhancement Proposal (SLEP) from which any core developer can call for a vote on its acceptance.

Re: [scikit-learn] ANN: scikit-learn 0.22 final release

2019-12-03 Thread Andreas Mueller
Awesome! Thank you for all the work on the release! This is a big one! Are we tweeting with the repurposed twitter account? Andy On 12/3/19 7:50 AM, Adrin wrote: We're happy to announce the 0.22 release. You can read the release highlights under

Re: [scikit-learn] SVM-RFE

2019-12-02 Thread Andreas Mueller
On Mon, Nov 25, 2019 at 1:36 PM Brown J.B. via scikit-learn wrote: Sat, Nov 23, 2019, 2:12 Andreas Mueller: I think you can also use RFECV directly without doing any

Re: [scikit-learn] SVM-RFE

2019-11-22 Thread Andreas Mueller
I think you can also use RFECV directly without doing any wrapping. On 11/20/19 12:24 AM, Brown J.B. via scikit-learn wrote: Dear Malik, Your request to do performance checking of the steps of SVM-RFE is a pretty common task. Since the contributors to scikit-learn have done great to make

Re: [scikit-learn] scikit-learn twitter account

2019-11-04 Thread Andreas Mueller
Should we re-purpose the existing twitter account or make a new one? https://twitter.com/scikit_learn We do have 6k followers already! On 11/4/19 3:08 PM, Nelle Varoquaux wrote: I think that's a good idea as well! On Mon, 4 Nov 2019 at 15:06, Chiara Marmo

Re: [scikit-learn] logistic regression results are not stable between solvers

2019-10-11 Thread Andreas Mueller
On 10/10/19 1:14 PM, Benoît Presles wrote: Thanks for your answers. On my real data, I do not have so many samples. I have a bit more than 200 samples in total and I also would like to get some results with unpenalized logistic regression. What do you suggest? Should I switch to the

Re: [scikit-learn] logistic regression results are not stable between solvers

2019-10-08 Thread Andreas Mueller
I'm pretty sure SAGA is not converging. Unless you scale the data, SAGA is very slow to converge. On 10/8/19 7:19 PM, Benoît Presles wrote: Dear scikit-learn users, I am using logistic regression to make some predictions. On my own data, I do not get the same results between solvers. I
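
The scaling step referred to here amounts to putting every feature on zero mean and unit variance before the solver runs. A minimal sketch in plain Python follows; the `standardize` helper and the toy matrix are made up for this example. In scikit-learn the usual route would be a StandardScaler placed in a pipeline ahead of LogisticRegression(solver="saga"), assuming the original setup allows for that.

```python
import statistics


def standardize(X):
    """Rescale each column of X to zero mean and unit (population) variance."""
    cols = list(zip(*X))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [
        # Constant columns are mapped to 0.0 to avoid dividing by zero.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in X
    ]


# Two features on very different scales, as is common in raw data.
X = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]]
X_scaled = standardize(X)
scaled_cols = list(zip(*X_scaled))
```

After this step both columns carry the same values, so a single step size suits all coordinates, which is what stochastic gradient-type solvers tend to need.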

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-10-06 Thread Andreas Mueller
this to the sklearn issue list if there's no issue filed for that yet. Best, Sebastian On Oct 6, 2019, at 9:10 AM, Andreas Mueller wrote: On 10/4/19 11:28 PM, Sebastian Raschka wrote: The docs show a way such that you don't need to write it as png file using tree.plot_tree: https://scikit-learn.org/stable

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-10-06 Thread Andreas Mueller
e sklearn know internally 0 vs. 1 is categorical, not numerical? In R for instance, you do as.factor(), which explicitly states the data type. Thank you! On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller wrote: On 9/15/19 8:16 AM, Guillaume Lemaître wrote: On Sat, 14 Sep 2019 at 20:59, C W

Re: [scikit-learn] Vote on SLEP009: keyword only arguments

2019-09-18 Thread Andreas Mueller
On 9/17/19 3:42 AM, Joel Nothman wrote: I think you mean keyword-only, Alex On Tue., 17 Sep. 2019, 4:11 pm Alexandre Gramfort wrote: Yes I am +1 for positional arguments for the __init__ of the estimators. Alex Albert: my position when

Re: [scikit-learn] Vote on SLEP009: keyword only arguments

2019-09-18 Thread Andreas Mueller
Sorry, I was on vacation ;)  +1 from me. On 9/17/19 7:28 PM, Joel Nothman wrote: If we were to assume Andy's vote in the positive, him having been a major proponent of this change, we would say this was accepted by a unanimous vote of a majority of core developers. Having tentatively

Re: [scikit-learn] Vote on SLEP009: keyword only arguments

2019-09-18 Thread Andreas Mueller
The SLEP says: This proposal suggests making only most commonly used parameters positional. The most commonly used parameters are defined per method or function, to be defined as either of the following two ways: * The set defined and agreed upon by the core developers, which should
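
In Python the distinction being voted on is expressed with a bare `*` in the signature: everything after it must be passed by name. The sketch below uses a hypothetical `ToyEstimator`; which parameters stay positional is an arbitrary choice made for this example, not the set chosen by the core developers.

```python
class ToyEstimator:
    """Hypothetical estimator: C may be positional, the rest is keyword-only."""

    def __init__(self, C=1.0, *, kernel="rbf", tol=1e-3):
        self.C = C
        self.kernel = kernel
        self.tol = tol


# Passing the commonly used parameter positionally still works.
est = ToyEstimator(2.0, kernel="linear")

# Passing a keyword-only parameter positionally is rejected by Python itself.
try:
    ToyEstimator(2.0, "linear")
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

Keyword-only parameters let a signature be reordered or extended later without silently changing the meaning of existing calls.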

Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

2019-09-18 Thread Andreas Mueller
On 9/15/19 8:16 AM, Guillaume Lemaître wrote: On Sat, 14 Sep 2019 at 20:59, C W wrote: Thanks, Guillaume. Column transformer looks pretty neat. I've also heard though, this pipeline can be tedious to set up? Specifying what you want for every

Re: [scikit-learn] Clustering Algorithm based on correlation distance

2019-09-03 Thread Andreas Mueller
There are many that allow "metric='precomputed'". On 9/2/19 10:06 AM, Safi Ullah Marwat wrote: Dear List, Is there any clustering algorithm, which is based on correlation coefficient instead of Euclidean/Manhattan distance? Regards
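
For the correlation case this means building the pairwise distance matrix yourself and handing it to the estimator. A minimal sketch, assuming the common definition distance = 1 - Pearson correlation; the helper names and toy rows are made up for this example.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def correlation_distance_matrix(rows):
    """Square matrix of 1 - r for every pair of rows."""
    return [[1.0 - pearson(a, b) for b in rows] for a in rows]


rows = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with the first row
    [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with the first row
]
D = correlation_distance_matrix(rows)

# Assumption: D would then be passed as the "X" of an estimator constructed
# with metric="precomputed", for example DBSCAN(metric="precomputed").fit(D).
```

Under this definition, rows that rise and fall together end up at distance 0 even when their magnitudes differ, which Euclidean distance would not give.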

Re: [scikit-learn] No convergence warning in logistic regression

2019-09-03 Thread Andreas Mueller
Having correlated data is not the same as not converging. We could warn on correlated data but I don't think that's actually useful for scikit-learn. I actually recently argued to remove the warning in linear discriminant analysis: https://github.com/scikit-learn/scikit-learn/issues/14361 As

Re: [scikit-learn] Monthly meetings between core developers + "Hello World"

2019-08-05 Thread Andreas Mueller
As usual, I agree ;) I think it would be good to call out particularly important bugfixes so they get reviews. We might also want to think about how we can organize the issue tracker better. Having more full-time people on the project certainly means more activity but ideally we can use some

Re: [scikit-learn] question using GridSearchCV

2019-07-24 Thread Andreas Mueller
scoring is not a parameter. It needs to be passed to GridSearchCV: selfCLF = GridSearchCV(GradientBoostingClassifier(), parameters, verbose=3, n_jobs=4, scoring='roc_auc') On 7/24/19 1:24 PM, Glenn Schultz via scikit-learn wrote: I am using GBClassifier, the below works if I use the

Re: [scikit-learn] Long term roadmap and moonshot goals

2019-07-23 Thread Andreas Mueller
, Piotr On Sun, Jul 14, 2019 at 8:44 PM Andreas Mueller wrote: Hi all. At SciPy, Brian Granger raised a good point about their planning for the Jupyter Project, which is the importance of long-term goals. I think it's great that we

Re: [scikit-learn] Long term roadmap and moonshot goals

2019-07-23 Thread Andreas Mueller
4 AM Adrin wrote: It may be worth doing a user survey to get a feeling of what people care about, we may or may not take them into account afterwards. Here's how Dask is doing it: https://github.com/dask/dask/issues/4748 On Sun, Jul 14, 2019

Re: [scikit-learn] Monthly meetings between core developers + "Hello World"

2019-07-22 Thread Andreas Mueller
On 7/22/19 9:22 AM, Adrin wrote: Awesome, excited to have your help around :) We already have the @core-devs team on github, we can use it more often/more organized. Why wouldn't we just use the scikit-learn repo projects? On Fri, Jul 19, 2019 at 2:48 PM Chiara Marmo

Re: [scikit-learn] Monthly meetings between core developers

2019-07-17 Thread Andreas Mueller
On 7/17/19 2:17 PM, Guillaume Lemaître wrote: I am +1. This is a great initiative. IMO, we could make it really regular (i.e., a specific week-day of a specific week in a month), with a rolling time (for the time-zone issue). In this matter, we could maybe clear more in advance our agenda

[scikit-learn] Long term roadmap and moonshot goals

2019-07-14 Thread Andreas Mueller
Hi all. At SciPy, Brian Granger raised a good point about their planning for the Jupyter Project, which is the importance of long-term goals. I think it's great that we now have a detailed short-term roadmap (https://scikit-learn.org/dev/roadmap.html). Given that we now have about 6(!) full

Re: [scikit-learn] titanic dataset, use for book

2019-06-25 Thread Andreas Mueller
Hi Sole. I would suggest not to use this version of the titanic dataset. It's a personal repository of mine and might not exist forever. Ideally you (and we) would use fetch_openml. However, the current version doesn't have support for returning dataframes. That's addressed in

Re: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split?

2019-06-13 Thread Andreas Mueller
t you need to know how the data are generated. @ Brown, I know nothing about molecular modeling. The paper you linked, "Beware of q2!", raises some interesting points; as far as I see, in sklearn linear regression, score is R^2. On Wed, Jun 5, 2019 at 9:11 AM Andreas Mueller

Re: [scikit-learn] LogisticRegression

2019-06-11 Thread Andreas Mueller
On 6/11/19 11:47 AM, Eric J. Van der Velden wrote: Hi Nicolas, Andrew, Thanks! I found out that it is the regularization term. Sklearn always has that term. When I program logistic regression with that term too, with \lambda=1, I get exactly the same answer as sklearn, when I look at the
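
For reference, the L2-penalised objective as written in the scikit-learn user guide for labels y_i in {-1, +1} is shown next to a textbook form with a penalty weight \lambda. The textbook form is an assumption about how the hand-written version was set up (summed loss, penalty \lambda/2 times the squared norm); with a different normalisation the correspondence changes.

```latex
% scikit-learn user guide form (C is the inverse regularisation strength)
\min_{w, c} \; \frac{1}{2} w^{\top} w
  + C \sum_{i=1}^{n} \log\!\left( 1 + \exp\!\left( -y_i \left( x_i^{\top} w + c \right) \right) \right)

% assumed textbook form with penalty weight \lambda
\min_{w, c} \; \sum_{i=1}^{n} \log\!\left( 1 + \exp\!\left( -y_i \left( x_i^{\top} w + c \right) \right) \right)
  + \frac{\lambda}{2} w^{\top} w
```

Dividing the first objective by C shows that the two have the same minimiser when \lambda = 1/C, so the default C = 1.0 corresponds to \lambda = 1 under this assumption.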

Re: [scikit-learn] Google code reviews

2019-06-07 Thread Andreas Mueller
not On Sat., May 25, 2019, 16:10 Joel Nothman wrote: For some of the larger PRs, this might be helpful. Not going to help where the intricacies of Scikit-learn API come in play. On Sat, 25 May 2019 at 04:17, Andreas Mueller

Re: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split?

2019-06-05 Thread Andreas Mueller
On 6/4/19 8:44 PM, C W wrote: Thank you all for the replies. I agree that prediction accuracy is great for evaluating black-box ML models. Especially advanced models like neural networks, or not-so-black models like LASSO, because they are NP-hard to solve. Linear regression is not a

Re: [scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split?

2019-06-03 Thread Andreas Mueller
This classical paper on statistical practices (Breiman's "two cultures") might be helpful to understand the different viewpoints: https://projecteuclid.org/euclid.ss/1009213726 On 6/3/19 12:19 AM, Brown J.B. via scikit-learn wrote: As far as I understand: Holding out a test set is

Re: [scikit-learn] Difference in normalization between Lasso and LogisticRegression + L1

2019-05-29 Thread Andreas Mueller
That is not very ideal indeed. I think we just went with what liblinear did, and when saga was introduced kept that behavior. It should probably be scaled as in Lasso, I would imagine? On 5/29/19 1:42 PM, Michael Eickenberg wrote: Hi Jesse, I think there was an effort to compare
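
The difference in normalisation can be read off the two objectives as given in the scikit-learn user guide, written here for n samples and labels y_i in {-1, +1}:

```latex
% Lasso: the data term is averaged over the n samples
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1

% L1-penalised logistic regression: the data term is summed and weighted by C
\min_{w, c} \; \lVert w \rVert_1
  + C \sum_{i=1}^{n} \log\!\left( 1 + \exp\!\left( -y_i \left( x_i^{\top} w + c \right) \right) \right)
```

Because the Lasso loss is divided by n while the logistic loss is not, a fixed \alpha keeps roughly the same meaning as n grows, whereas a fixed C penalises relatively less as more samples are added.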

Re: [scikit-learn] Version 0.21! and plot_tree!

2019-05-23 Thread Andreas Mueller
Hey Andrew. Thanks for saying thanks! I share your frustration with export_graphviz, in particular for teaching. I feel like plot_tree is not ideal yet, though. In particular the layout is not as compact as the graphviz one. If you have any feedback or suggestions, I'd be very happy to hear

Re: [scikit-learn] Regularization in Tree Models

2019-05-22 Thread Andreas Mueller
Hi Prudvi. What exactly do you mean by that? There is regularization in the new HistGradientBoosting, and we're working on post-pruning for decision trees. I'm not sure what l2 regularization for decision tree classifiers or for decision tree regressors would mean. Do you have a reference?

Re: [scikit-learn] Release Candidate for Scikit-learn 0.21

2019-05-01 Thread Andreas Mueller
Thank you for all the amazing work y'all! On 4/30/19 10:09 PM, Joel Nothman wrote: PyPI now has source and binary releases for Scikit-learn 0.21rc2. * Documentation at https://scikit-learn.org/0.21 * Release Notes at https://scikit-learn.org/0.21/whats_new * Download source or wheels at

Re: [scikit-learn] Feature engineering functionality - new package

2019-04-15 Thread Andreas Mueller
1) was indeed a design decision. Your design is certainly an alternative design, that might be more convenient in some situations, but requires adding this feature to all transformers, which basically just adds a bunch of boilerplate code everywhere. So you could argue our design decision was

Re: [scikit-learn] API Discussion: Where shall we put the plotting functions?

2019-04-04 Thread Andreas Mueller
). This is under the assumption that one plot function will want to import trees, another GPs, etc. Unless we move to lazy imports, that would be against the current convention that importing sklearn is fairly minimal. I do like Andy's idea

Re: [scikit-learn] New core developers: thomasjpfan and nicolashug

2019-04-03 Thread Andreas Mueller
Congratulations guys! Great work! Looking forward to much more! Proud to have you on the team! Now we in NYC can approve our own pull requests ;) Sent from phone. Please excuse spelling and brevity. On Wed, Apr 3, 2019, 21:08 Hanmin Qin wrote: Congratulations and welcome to the team!

Re: [scikit-learn] API Discussion: Where shall we put the plotting functions?

2019-04-03 Thread Andreas Mueller
I think what was not clear from the question is that there are actually quite different kinds of plotting functions, and many of these are tied to existing code. Right now we have some that are specific to trees (plot_tree) and to gradient boosting (plot_partial_dependence). I think we want

Re: [scikit-learn] Why is cross_val_predict discouraged?

2019-04-03 Thread Andreas Mueller
On 4/3/19 7:59 AM, Joel Nothman wrote: The equations in Murphy and Hastie very clearly assume a metric decomposable over samples (a loss function). Several popular metrics are not. For a metric like MSE it will be almost identical assuming the test sets have almost the same size. For
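
The point about decomposable metrics can be checked with a tiny numeric example. With equally sized folds, the mean of per-fold MSE equals the MSE over the pooled predictions, while a metric that is not a plain per-sample average, here RMSE, gives two different numbers. The residuals below are made up for illustration.

```python
import math

# Hypothetical residuals (prediction minus truth) from two equally sized folds.
fold_errors = [[1.0, -1.0], [3.0, -3.0]]


def mse(errors):
    return sum(e * e for e in errors) / len(errors)


def rmse(errors):
    return math.sqrt(mse(errors))


pooled = [e for fold in fold_errors for e in fold]

# Averaging per-fold scores (what cross_val_score-style evaluation does).
mean_fold_mse = sum(mse(f) for f in fold_errors) / len(fold_errors)
mean_fold_rmse = sum(rmse(f) for f in fold_errors) / len(fold_errors)

# Scoring once on the pooled out-of-fold predictions
# (what scoring the output of cross_val_predict amounts to).
pooled_mse = mse(pooled)
pooled_rmse = rmse(pooled)
```

Here the two MSE values agree, but the two RMSE values do not, so the pooled score is not the same quantity as the cross-validated average for such metrics.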

Re: [scikit-learn] Sprint discussion points?

2019-02-26 Thread Andreas Mueller
Was that the same that Vlad used? https://github.com/scikit-learn/scikit-learn-speed We might want to just replace that, given that it hasn't been touched in 7 years? On 2/26/19 5:22 AM, Jeremie du Boisberranger wrote: I totally forgot to mention it before the sprint started but i'd like

[scikit-learn] New Governance document accepted!

2019-02-25 Thread Andreas Mueller
Hey y'all. It's my pleasure to announce that the new governance document has been accepted by a core-dev vote. Out of the 49 eligible core-devs, 22 voted "yes" on the mailing list and 4 voted "yes" on the issue tracker. The remaining core devs did not vote. You can find the document on

Re: [scikit-learn] Sprint discussion points?

2019-02-25 Thread Andreas Mueller
, Feb 19, 2019 at 06:16:20PM -0500, Andreas Mueller wrote: I put a draft schedule here: https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events#technical-discussions-schedule I'd like to discuss sample_props. They are important to me. Should I add them somewhere on the schedule? Maybe

Re: [scikit-learn] Sprint discussion points?

2019-02-20 Thread Andreas Mueller
On 2/20/19 4:40 PM, Gael Varoquaux wrote: On Tue, Feb 19, 2019 at 06:16:20PM -0500, Andreas Mueller wrote: I put a draft schedule here: https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events#technical-discussions-schedule I'd like to discuss sample_props. They are important to me

Re: [scikit-learn] Sprint discussion points?

2019-02-20 Thread Andreas Mueller
I messaged them and also tasked Thomas Fan with working with Microsoft to set up Azure Pipelines. On 2/20/19 12:12 PM, Guillaume Lemaître wrote: @Andy You were the one contacting Travis. On Wed, 20 Feb 2019 at 17:23, Andreas Mueller wrote: Thanks for

Re: [scikit-learn] Sprint discussion points?

2019-02-20 Thread Andreas Mueller
Sure, we can change it up on Tuesday. I agree having things that we can implement during the week would be good. I was actually kind of optimistic and was hoping we could make some dent into the freezing, and the convergence issues might be less controversial and more a technical challenge. I

Re: [scikit-learn] Sprint discussion points?

2019-02-20 Thread Andreas Mueller
ussion can happen without you, Andy? On Wed, 20 Feb 2019 at 10:17, Andreas Mueller wrote: I put a draft schedule here: https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events#technical-discussions-schedule it's obviously somewhat opinionat

Re: [scikit-learn] Sprint discussion points?

2019-02-19 Thread Andreas Mueller
elp. On Tue, 19 Feb 2019 at 22:23, Andreas Mueller wrote: Yeah, sounds good. I didn't want to unilaterally post a schedule, but doing some google form or similar seems a bit heavy-handed? Not sure if Guillaume had ideas about the schedule, g

Re: [scikit-learn] Sprint discussion points?

2019-02-19 Thread Andreas Mueller
Yeah, sounds good. I didn't want to unilaterally post a schedule, but doing some google form or similar seems a bit heavy-handed? Not sure if Guillaume had ideas about the schedule, given that he seems to be running the show? On 2/19/19 4:17 PM, Joel Nothman wrote: I don't think optics

Re: [scikit-learn] Sprint discussion points?

2019-02-19 Thread Andreas Mueller
On 2/14/19 11:40 AM, Nicolas Hug wrote: or we could go as far as to schedule meetings on the different topics. Given the number of issues to discuss this is probably the best approach IMO If we want to schedule meetings we could do one of two things: have a scheduling meeting first

Re: [scikit-learn] Reddit thread with complaints about scikit-learn

2019-02-19 Thread Andreas Mueller
I agree with most of their points and have tried to prioritize some (and I think you were the victim of me trying to address some of these ;). The question about structuring the estimators is really something tricky. Maybe it's worth putting it on the roadmap to discuss this at some point?

Re: [scikit-learn] VOTE: scikit-learn governance document

2019-02-19 Thread Andreas Mueller
On 2/19/19 10:55 AM, Paolo Losi wrote: +1 if my opinion matters Thank you and it does :) ___ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn

Re: [scikit-learn] VOTE: scikit-learn governance document

2019-02-19 Thread Andreas Mueller
A good time to remind all core devs to vote (or abstain). +1 from me as well (as might be expected), I didn't want to put my vote in my call for the vote. Participation is not super high (as might be expected), 13 of the 49 core devs voted so far. There are some people who have voiced

Re: [scikit-learn] Sprint discussion points?

2019-02-18 Thread Andreas Mueller
On 2/18/19 3:06 PM, Joel Nothman wrote: And here I was thinking we'd better just push out 0.20.3 this week with what's been listed for it. I wouldn't mind this, just don't expect me to help ;)

Re: [scikit-learn] Sprint discussion points?

2019-02-14 Thread Andreas Mueller
t implemented the same as euclidean. On Thu, 14 Feb 2019 at 12:56, Andreas Mueller wrote: Do you have a reference for the logistic regression stability? Is it convergence warnings? Happy to discuss the other two issues, t

Re: [scikit-learn] Sprint discussion points?

2019-02-14 Thread Andreas Mueller
On 2/13/19 11:28 PM, Joel Nothman wrote: Convergence in logistic regression (https://github.com/scikit-learn/scikit-learn/issues/11536) is indeed one problem (and it presents a general issue of what max_iter means when you have several solvers, or how good defaults are selected). But I was

Re: [scikit-learn] Sprint discussion points?

2019-02-13 Thread Andreas Mueller
props (supporting weighted scoring at least), but perhaps I'm being persuaded this will never be a top-priority requirement, and the solutions add much complexity. On Thu, 14 Feb 2019 at 07:39, Andreas Mueller wrote: Hey all. Should we collect some d

[scikit-learn] Sprint discussion points?

2019-02-13 Thread Andreas Mueller
Hey all. Should we collect some discussion points for the sprint? There's an unusual amount of core-devs present and I think we should seize the opportunity. Maybe we should create a page in the wiki or add it to the sprint page? Things that are high on my list of priorities are: * slicing

[scikit-learn] VOTE: scikit-learn governance document

2019-02-08 Thread Andreas Mueller
Hey all. I want to call a vote on the final version of the scikit-learn governance document, which can be found in this PR: https://github.com/scikit-learn/scikit-learn/pull/12878 It underwent some significant changes in the last couple of weeks. The two-sentence summary is: conflicts are

Re: [scikit-learn] Possible bug in BayesianGaussianMixture?

2019-02-07 Thread Andreas Mueller
Hey Stefan. I would expect that to depend on the prior. It could either be a bug or an issue with the variational inference. Maybe comparing against an MCMC implementation might be helpful? Though if that works, I'm not sure what the conclusion would be tbh. (I hate debugging variational

Re: [scikit-learn] AUCROC/MAP confidence intervals in scikit

2019-02-07 Thread Andreas Mueller
The paper definitely looks interesting and the authors are certainly some giants in the field. But it is actually not widely cited (139 citations since 2005), and I've never seen it used. I don't know why that is, and looking at the citations there doesn't seem to be a lot of follow-up work.

Re: [scikit-learn] Scikit-learn porting strategy

2019-02-05 Thread Andreas Mueller
There's some stuff already: https://github.com/SciRuby/ And in terms of strategy: No, you can go estimator by estimator and at some point implement cross-validation and grid-search and pipelines and metrics pretty independently. It looks like daru is written in ruby which I expect to be too

Re: [scikit-learn] Scikit-learn porting strategy

2019-02-04 Thread Andreas Mueller
Hi Eljay. Which language? And you want to reimplement it? How many full-time developers do you have for how many years? ;) Openhub estimates scikit-learn took 39 person-years: https://www.openhub.net/p/scikit-learn/estimated_cost I'm asking about the language because there are similar

[scikit-learn] Scipy 2019 Tutorial

2019-01-18 Thread Andreas Mueller
Hey Folks. The scipy tutorial chairs just pinged me about submitting a tutorial. I'm planning to, and wanted to ask if anyone is interested in co-teaching with me. I might transition from the "scipy tutorial" materials (evolved over maybe 5 years) to my own materials, but not sure yet. Nicolas

Re: [scikit-learn] Next Sprint

2019-01-10 Thread Andreas Mueller
Do you or anyone in your team have cycles to do that? I certainly don't, but I could try to delegate (to the single person I delegate everything to ;) On 1/10/19 12:36 PM, Gael Varoquaux wrote: On Thu, Jan 10, 2019 at 12:32:17PM -0500, Andreas Mueller wrote: Any sprint specific funding

Re: [scikit-learn] Next Sprint

2019-01-10 Thread Andreas Mueller
On 1/10/19 10:34 AM, Gael Varoquaux wrote: On Wed, Jan 09, 2019 at 02:09:58PM -0500, Andreas Mueller wrote: Gaël, does the foundation have funds and do you want to use them? And/or do you/Inria have funds you want to use? Neither myself nor Inria has funds to use outside the foundation

Re: [scikit-learn] Next Sprint

2019-01-09 Thread Andreas Mueller
il.com>> wrote: It'll be the least favourable week of February for me, but I can make do. On Thu, 20 Dec 2018 at 18:45 Andreas Mueller wrote: Works for me!

[scikit-learn] Draft of a Scikit-learn governance document

2018-12-27 Thread Andreas Mueller
Hi all. I just posted a proposal for a scikit-learn governance document as a PR: https://github.com/scikit-learn/scikit-learn/pull/12878 The core devs discussed this already to some degree but I think it would be great to involve the greater community in finalizing this. Any feedback is

Re: [scikit-learn] How to grab subsets from train sets when bootstrap=False in RF regressor?

2018-12-27 Thread Andreas Mueller
It uses all the data. On 12/26/18 4:26 AM, lampahome wrote: As title RF regressor decide a tree by grabbing part of train data aka bootstrap. If set bootstrap=False, how would the model grab data? The reason I'm interested is when I set it to False, it makes the mse and mae down, that's
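
The difference between the two settings can be sketched as the choice of row indices each tree is trained on. The `training_indices` helper is hypothetical and written for this example; it only mimics the sampling idea and is not the forest implementation.

```python
import random


def training_indices(n_samples, bootstrap, rng):
    """Row indices one tree would be trained on (illustrative only)."""
    if bootstrap:
        # Draw n_samples rows with replacement: some repeat, some are left out.
        return [rng.randrange(n_samples) for _ in range(n_samples)]
    # Without bootstrapping every tree sees every row exactly once.
    return list(range(n_samples))


rng = random.Random(0)
n = 10
boot = training_indices(n, bootstrap=True, rng=rng)
full = training_indices(n, bootstrap=False, rng=rng)
```

With bootstrap=False all trees share the same rows, so any remaining variation between trees has to come from elsewhere, such as random feature subsets at each split when max_features restricts them.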

Re: [scikit-learn] Next Sprint

2018-12-20 Thread Andreas Mueller
Works for me! On 12/19/18 5:33 PM, Gael Varoquaux wrote: I would propose the week of Feb 25th, as I heard people say that they might be available at this time. Is it good for many people, or should we organize a doodle? G On Wed, Dec 19, 2018 at 05:27:21PM -0500, Andreas Mueller wrote: Can

Re: [scikit-learn] benchmarking TargetEncoder Was: ANN Dirty_cat: learning on dirty categories

2018-12-19 Thread Andreas Mueller
On 12/15/18 7:35 AM, Joris Van den Bossche wrote: On Fri, 14 Dec 2018 at 16:46, Andreas Mueller <t3k...@gmail.com> wrote: As far as I understand, the open PR is not a leave-one-out TargetEncoder? I would want it to be :-/ I also did not yet add the CountF

Re: [scikit-learn] Next Sprint

2018-12-19 Thread Andreas Mueller
Can we please nail down dates for a sprint? On 11/20/18 2:25 PM, Gael Varoquaux wrote: On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote: We can also do Paris in April / May or June if that's ok with Joel and better for Andreas. Absolutely. My thoughts here are that I want to

Re: [scikit-learn] plan to add the association rule classification algorithm in scikit learn

2018-12-17 Thread Andreas Mueller
Can we add this to the FAQ as out of scope? Sebastian: feel free to put more into mlxtend :P On 12/17/18 1:46 AM, Sebastian Raschka wrote: Hi Rui, I agree with Joel that association rule mining could be a bit tricky to fit nicely within the scikit-learn API. Maybe this could be some

Re: [scikit-learn] benchmarking TargetEncoder Was: ANN Dirty_cat: learning on dirty categories

2018-12-14 Thread Andreas Mueller
On 12/13/18 4:16 AM, Joris Van den Bossche wrote: Hi all, I finally had some time to start looking at it the last days. Some preliminary work can be found here: https://github.com/jorisvandenbossche/target-encoder-benchmarks. You continue to be my hero. Probably can not look at it in detail

Re: [scikit-learn] New core dev: Adrin Jalali

2018-12-08 Thread Andreas Mueller
Congratulations and welcome Adrin! On 12/5/18 5:32 PM, Joel Nothman wrote: The Scikit-learn core development team has welcomed a new member, Adrin Jalali, who has been doing some really amazing work in contributing code and reviews since July (aside from occasional contributions since 2014).

[scikit-learn] [ANN] Scikit-learn 0.20.1 released

2018-11-27 Thread Andreas Mueller
Hey Everybody. I'm happy to announce that we released scikit-learn 0.20.1. This is a minor release containing mostly bugfixes and small improvements, though it's probably one of the bigger minor releases we've done. In particular there've been several enhancements to the ColumnTransformer,

Re: [scikit-learn] Recurrent questions about speed for TfidfVectorizer

2018-11-26 Thread Andreas Mueller
I think tries might be an interesting data structure, but it really depends on where the bottleneck is. I'm really surprised they are not used more, but maybe that's just because implementations are missing? On 11/26/18 8:39 AM, Roman Yurchak via scikit-learn wrote: Hi Matthieu, if you are

Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-21 Thread Andreas Mueller
On 11/21/18 10:34 AM, Gael Varoquaux wrote: Joris has just agreed to help with benchmarking. We can have preliminary results earlier. The question really is: out of the different variants that exist, which one should we choose? I think that it is a legitimate question that arises on many of

Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-21 Thread Andreas Mueller
On 11/21/18 12:38 AM, Gael Varoquaux wrote: On Tue, Nov 20, 2018 at 09:58:49PM -0500, Andreas Mueller wrote: On 11/20/18 4:43 PM, Gael Varoquaux wrote: We are planning to do heavy benchmarking of those strategies, to figure out the tradeoffs. But we won't get to it before February, I am afraid

Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Andreas Mueller
On 11/20/18 4:43 PM, Gael Varoquaux wrote: We are planning to do heavy benchmarking of those strategies, to figure out the tradeoffs. But we won't get to it before February, I am afraid. Does that mean you'd be opposed to adding the leave-one-out TargetEncoder before you do this? I would really

Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Andreas Mueller
On 11/20/18 4:16 PM, Gael Varoquaux wrote: - the naive way is not the right one: just computing the average of y for each category leads to overfitting quite fast - it can be done cross-validated, splitting the train data, in a "cross-fit" strategy
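The two strategies mentioned above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the dirty_cat or scikit-learn implementation: the function names, the index-modulo fold assignment, and the fallback to the global mean for categories unseen in the other folds are all choices made here for the example.

```python
from collections import defaultdict
from statistics import mean


def naive_target_encode(categories, y):
    """Replace each category by the mean of y over ALL rows of that category."""
    groups = defaultdict(list)
    for c, t in zip(categories, y):
        groups[c].append(t)
    means = {c: mean(v) for c, v in groups.items()}
    return [means[c] for c in categories]


def cross_fit_target_encode(categories, y, n_folds=2):
    """Encode each row with category means computed on the other folds only."""
    n = len(y)
    global_mean = mean(y)  # fallback when a category is absent from the other folds
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums = defaultdict(float)
        counts = defaultdict(int)
        # Accumulate statistics on rows that are NOT in the held-out fold.
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += y[i]
                counts[categories[i]] += 1
        # Encode the held-out rows without ever looking at their own targets.
        for i in range(fold, n, n_folds):
            c = categories[i]
            encoded[i] = sums[c] / counts[c] if counts[c] else global_mean
    return encoded


cats = ["a", "a", "b", "b", "c"]
y = [1.0, 3.0, 10.0, 20.0, 7.0]
naive = naive_target_encode(cats, y)
cross = cross_fit_target_encode(cats, y, n_folds=2)
print(naive)
print(cross)
```

In the naive version the singleton category "c" is encoded with its own target value, which is the leakage that leads to overfitting; in the cross-fit version no row's encoding depends on its own target. Setting n_folds equal to the number of rows gives a leave-one-out variant of the same idea.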

Re: [scikit-learn] ANN Dirty_cat: learning on dirty categories

2018-11-20 Thread Andreas Mueller
I would love to see the TargetEncoder ported to scikit-learn. The CountFeaturizer is pretty stalled: https://github.com/scikit-learn/scikit-learn/pull/9614 :-/ Have you benchmarked the other encoders in the category_encoding lib? I would be really curious to know when/how they help. On
