Hi

As was recently mentioned in PR #18594, the problem with the Boston housing dataset does not go away just because we remove it from scikit-learn. On the contrary, it is a valuable dataset to show and teach bias and discrimination (issue #16715 is still waiting for someone to write an example), in particular because we have access to the variable "B".

Most, if not all, of the datasets in scikit-learn are available elsewhere, even in Python. So I don't think this is a good argument for removal either.

As we've now removed it from tests and examples, the question for me is: what more do we want to achieve?
Answers I can think of go down a political road...

I'm fine with Olivier's suggestion https://github.com/scikit-learn/scikit-learn/pull/18594#issuecomment-707626543.


All the best,
Christian


On 14.10.20 10:34, Adrin wrote:
Most of those are not talking about the ethical issues of the dataset. Let's talk about the alternatives we have:

Keep the loader, but raise a warning:
- most people will not change their code/material and, IMO, will mostly ignore the warning. Some
people may see the warning and care about it.

Deprecate, and point them to an alternative dataset, and if they really really want the same dataset, point them
to the openml ID:
- People will have to change something, and if we give them a nice copy/paste-able alternative which is not boston,
they'll use that instead.
- Some people will keep using boston from openml, and not care about the ethical implications

In addition, we can keep load_boston in the docs only, and point users to alternatives even after removing
the loader.
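
To make the "warn and point to an alternative" option concrete, here is a minimal sketch of what such a warning could look like. It is only an illustration: the message wording, the `deprecated_loader` helper, and the `_fake_load_boston` stand-in are hypothetical and not scikit-learn code. The OpenML ID 531 is the one suggested earlier in this thread, and `fetch_california_housing` is used as one possible alternative dataset.

```python
import functools
import warnings

# Hypothetical message text; the wording is an assumption for illustration.
BOSTON_DEPRECATION_MESSAGE = (
    "load_boston is deprecated because of ethical problems with the dataset. "
    "As an alternative, consider sklearn.datasets.fetch_california_housing(). "
    "If you really need the original data, use "
    "sklearn.datasets.fetch_openml(data_id=531)."
)


def deprecated_loader(loader, message=BOSTON_DEPRECATION_MESSAGE):
    """Wrap a dataset loader so that every call emits a FutureWarning first."""

    @functools.wraps(loader)
    def wrapper(*args, **kwargs):
        # stacklevel=2 attributes the warning to the caller's line, not this one.
        warnings.warn(message, category=FutureWarning, stacklevel=2)
        return loader(*args, **kwargs)

    return wrapper


def _fake_load_boston():
    # Stand-in for the real loader so the sketch runs without scikit-learn.
    return {"data": [], "target": []}


load_boston = deprecated_loader(_fake_load_boston)
```

With something like this, existing code keeps working during the deprecation period, and the message itself carries the copy/paste-able replacement.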

On Wed, Oct 14, 2020 at 10:11 AM Olivier Grisel <olivier.gri...@ensta.org> wrote:

    On Tue, Oct 13, 2020 at 16:19, Adrin <adrin.jal...@gmail.com> wrote:
    >
    > Isn't the Boston dataset available through openml? Maybe here:
    https://www.openml.org/d/531
    >
    > I'm happy to have the dataset out there on openml, and for any
    material that addresses some of the issues with it.
    > But for educational purposes, we don't need to have the dataset
    in the package as long as users can still download it
    > with a one-liner using fetch_openml.

    That would be an argument in favor of deprecation warning with a
    message stating the motivation for deprecation and pointing to
    fetch_openml.

    However, it's going to break examples written in slow-to-update
    tutorials or books once the deprecation period is over. But one could
    argue that this is also the case for any other deprecation in
    scikit-learn. It's just that sklearn.datasets.load_boston is used A
    LOT: https://github.com/search?q=load_boston&type=code

-- Olivier
    _______________________________________________
    scikit-learn mailing list
    scikit-learn@python.org <mailto:scikit-learn@python.org>
    https://mail.python.org/mailman/listinfo/scikit-learn

