Re: [scikit-learn] About the Boston housing prices dataset

Christian Lorentzen Wed, 14 Oct 2020 03:04:04 -0700

Hi

As was recently mentioned in PR #18594, the problem with the Boston housing dataset does not go away just because we remove it from scikit-learn. On the contrary, it is a valuable dataset to show and teach bias and discrimination - issue #16715 is still waiting for someone to write an example - in particular because we have access to the variable "B".

Most, if not all, of the datasets in scikit-learn are available elsewhere, even in Python. So I don't think this is a good argument for removal either.

As we've now removed it from tests and examples, the question for me is: What more do we want to achieve?

Answers I can think of go down a political road...

I'm fine with Olivier's suggestion: https://github.com/scikit-learn/scikit-learn/pull/18594#issuecomment-707626543



All the best,
Christian


On 14.10.20 10:34, Adrin wrote:

Most of those are not talking about the ethical issues of the dataset. Let's talk about the alternatives we have:

Keep the loader, but raise a warning:
- This will result in most people not changing their code/material and, IMO, mostly ignoring the warning. Some people may see the warning and care about it.

Deprecate, and point them to an alternative dataset, and if they really really want the same dataset, point them to the openml ID:
- People will have to change something, and if we give them a nice copy/paste-able alternative which is not boston, they'll use that instead.
- Some people will keep using boston from openml, and not care about the ethical implications.

As an addition, we can keep the load_boston in the docs only, and point users to alternatives even after removing the loader.
On Wed, Oct 14, 2020 at 10:11 AM Olivier Grisel <olivier.gri...@ensta.org> wrote:
    On Tue, 13 Oct 2020 at 16:19, Adrin <adrin.jal...@gmail.com> wrote:
    >
    > Isn't the Boston dataset available through openml? Maybe here:
    > https://www.openml.org/d/531
    >
    > I'm happy to have the dataset out there on openml, and for any
    > material that addresses some of the issues with it.
    > But for educational purposes, we don't need to have the dataset
    > in the package as long as users can still download it
    > with a one-liner using fetch_openml.

    That would be an argument in favor of a deprecation warning with a
    message stating the motivation for deprecation and pointing to
    fetch_openml.

    However, it's going to break examples written in slow-to-update
    tutorials or books once the deprecation period is over. But one could
    argue that this is also the case for any other deprecation in
    scikit-learn. It's just that sklearn.datasets.load_boston is used A
    LOT: https://github.com/search?q=load_boston&type=code

    -- Olivier
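
The option discussed above (warn, state the motivation, and point users to fetch_openml) could look roughly like the sketch below. It is an illustration only, not scikit-learn's actual implementation: the names `load_boston_with_warning` and `fetch_boston_from_openml` are hypothetical, the warning text is made up for this example, and the choice of `FutureWarning` is an assumption. The dataset ID 531 comes from the OpenML link quoted in the thread.

```python
import warnings

# OpenML dataset ID taken from https://www.openml.org/d/531 (quoted in the thread).
OPENML_BOSTON_ID = 531


def fetch_boston_from_openml():
    """Download the dataset from OpenML (needs scikit-learn and network access)."""
    # Imported lazily so that the warning logic below can be exercised without
    # scikit-learn installed.
    from sklearn.datasets import fetch_openml

    return fetch_openml(data_id=OPENML_BOSTON_ID, as_frame=True)


def load_boston_with_warning(loader=fetch_boston_from_openml):
    """Hypothetical deprecated loader: warn first, then delegate to `loader`."""
    warnings.warn(
        "This loader is deprecated because of ethical concerns with the "
        "dataset. If you still need it, download it explicitly with "
        f"sklearn.datasets.fetch_openml(data_id={OPENML_BOSTON_ID}).",
        FutureWarning,  # assumption: the category used for this illustration
        stacklevel=2,   # attribute the warning to the caller's line
    )
    return loader()
```

Passing a stub, for example `load_boston_with_warning(loader=lambda: "stub")`, shows the warning without touching the network. Existing tutorials keep working during the deprecation period, and the message gives readers a copy/paste-able replacement.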
    _______________________________________________
    scikit-learn mailing list
    scikit-learn@python.org <mailto:scikit-learn@python.org>
    https://mail.python.org/mailman/listinfo/scikit-learn


