Very interesting! A few comments:

> From GH17, we managed to extract only 10.5k pipelines. The relatively low frequency (with respect to the number of notebooks using SCIKIT-LEARN [..]) indicates a non-wide adoption of this specification. However, the number of pipelines in the GH19 corpus is 132k pipelines (i.e., an increase of 13× [..] since 2017).

It's nice to see that pipelines are indeed widely used.

> Top-5 transformers [from imports] in GH19 are StandardScaler, CountVectorizer, TfidfTransformer, PolynomialFeatures, TfidfVectorizer (in this order). The results for GH17 are the same, except that PCA appears instead of TfidfVectorizer.

Hmm, I would have expected OneHotEncoder somewhere at the top and much less text processing. If there is real usage of CountVectorizer and TfidfTransformer separately, then maybe deprecating TfidfVectorizer could be done (https://github.com/scikit-learn/scikit-learn/issues/14951). Still, this ranking looks quite unexpected; I wonder if they have the full list and not just the top 5.
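
For reference, the deprecation argument rests on TfidfVectorizer being documented as equivalent to a CountVectorizer followed by a TfidfTransformer. A quick sketch of that equivalence (the toy corpus is made up, everything else is standard scikit-learn):

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import (
        CountVectorizer, TfidfTransformer, TfidfVectorizer,
    )

    docs = ["the cat sat", "the dog sat", "the cat ran"]  # toy corpus

    # One-step TfidfVectorizer ...
    X_single = TfidfVectorizer().fit_transform(docs)

    # ... versus CountVectorizer followed by TfidfTransformer in a pipeline
    pipe = make_pipeline(CountVectorizer(), TfidfTransformer())
    X_composed = pipe.fit_transform(docs)

    assert np.allclose(X_single.toarray(), X_composed.toarray())
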

> Regarding learners, Top-5 in both GH17 and GH19 are LogisticRegression, MultinomialNB, SVC, LinearRegression, and RandomForestClassifier (in this order).

Maybe the LinearRegression docstring should more strongly suggest using Ridge with a small regularization in practice.
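
Concretely, something along these lines (a toy sketch, not from the paper; the random data and alpha=1e-3 are only illustrative):

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(100, 5)
    y = X @ rng.randn(5) + 0.1 * rng.randn(100)

    # Plain OLS; can be numerically unstable with (nearly) collinear columns
    ols = LinearRegression().fit(X, y)

    # Ridge with a small alpha gives almost the same coefficients on a
    # well-posed problem, but stays stable when X is ill-conditioned
    ridge = Ridge(alpha=1e-3).fit(X, y)

    print(np.max(np.abs(ols.coef_ - ridge.coef_)))  # tiny difference here
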

--
Roman

On 27/03/2020 17:32, Andreas Mueller wrote:
Hey all.
There's a pretty cool paper by a team at MS that analyses public github repos for their use of the sklearn and related libraries:
https://arxiv.org/abs/1912.09536

Thought it might be of interest.

Cheers,
Andy
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
