Very interesting! A few comments:

> From GH17, we managed to extract only 10.5k pipelines. The relatively low frequency (with respect to the number of notebooks using SCIKIT-LEARN [..]) indicates a non-wide adoption of this specification. However, the number of pipelines in the GH19 corpus is 132k pipelines (i.e., an increase of 13× [..] since 2017).

It's nice to see that pipelines are indeed widely used.

> Top-5 transformers [from imports] in GH19 are StandardScaler, CountVectorizer, TfidfTransformer, PolynomialFeatures, TfidfVectorizer (in this order). The results for GH17 are the same, except that PCA appears instead of TfidfVectorizer.

Hmm, I would have expected OneHotEncoder somewhere at the top and much less text processing. If there is real usage of CountVectorizer and TfidfTransformer separately, then maybe deprecating TfidfVectorizer could be done (https://github.com/scikit-learn/scikit-learn/issues/14951). Still, this ranking looks quite unexpected; I wonder if they have the full list and not just the top 5.
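
For reference, the deprecation argument rests on TfidfVectorizer being documented as equivalent to a CountVectorizer followed by a TfidfTransformer. A quick sketch of that equivalence (the toy corpus is made up, everything else is standard scikit-learn):

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import (
        CountVectorizer, TfidfTransformer, TfidfVectorizer,
    )

    docs = ["the cat sat", "the dog sat", "the cat ran"]  # toy corpus

    # One-step TfidfVectorizer ...
    X_single = TfidfVectorizer().fit_transform(docs)

    # ... versus CountVectorizer followed by TfidfTransformer in a pipeline
    pipe = make_pipeline(CountVectorizer(), TfidfTransformer())
    X_composed = pipe.fit_transform(docs)

    assert np.allclose(X_single.toarray(), X_composed.toarray())
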

> Regarding learners, Top-5 in both GH17 and GH19 are LogisticRegression, MultinomialNB, SVC, LinearRegression, and RandomForestClassifier (in this order).

Maybe the LinearRegression docstring should more strongly suggest using Ridge with a small regularization in practice.
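
Concretely, something along these lines (a toy sketch, not from the paper; the random data and alpha=1e-3 are only illustrative):

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(100, 5)
    y = X @ rng.randn(5) + 0.1 * rng.randn(100)

    # Plain OLS; can be numerically unstable with (nearly) collinear columns
    ols = LinearRegression().fit(X, y)

    # Ridge with a small alpha gives almost the same coefficients on a
    # well-posed problem, but stays stable when X is ill-conditioned
    ridge = Ridge(alpha=1e-3).fit(X, y)

    print(np.max(np.abs(ols.coef_ - ridge.coef_)))  # tiny difference here
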

--
Roman

On 27/03/2020 17:32, Andreas Mueller wrote:
Hey all.
There's a pretty cool paper by a team at MS that analyses public github repos for their use of the sklearn and related libraries:
https://arxiv.org/abs/1912.09536

Thought it might be of interest.

Cheers,
Andy
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
