Hi
As was recently mentioned in PR #18594, the problem with the Boston
housing dataset does not go away just because we remove it from
scikit-learn. On the contrary, it is a valuable dataset for showing and
teaching about bias and discrimination (issue #16715 is still waiting
for someone to write an example), in particular because we have access
to the variable "B".
Most, if not all, of the datasets in scikit-learn are available
elsewhere, even in Python. So I don't think this is a good argument
for removal either.
As we've now removed it from tests and examples, the question for me is:
what more do we want to achieve?
The answers I can think of go down a political road...
I'm fine with Olivier's suggestion
https://github.com/scikit-learn/scikit-learn/pull/18594#issuecomment-707626543.
All the best,
Christian
On 14.10.20 10:34, Adrin wrote:
Most of those are not talking about the ethical issues of the dataset.
Let's talk about the alternatives we have:
Keep the loader, but raise a warning:
- This will result in most people not changing their code/material
and, IMO, mostly ignoring the warning. Some people may see the warning
and care about it.
Deprecate, and point them to an alternative dataset, and if they
really, really want the same dataset, point them to the openml ID:
- People will have to change something, and if we give them a nice
copy/paste-able alternative which is not Boston, they'll use that
instead.
- Some people will keep using Boston from openml, and not care about
the ethical implications.
In addition, we can keep load_boston in the docs only, and point users
to alternatives even after removing the loader.
On Wed, Oct 14, 2020 at 10:11 AM Olivier Grisel
<olivier.gri...@ensta.org> wrote:
On Tue, Oct 13, 2020 at 16:19, Adrin <adrin.jal...@gmail.com> wrote:
>
> Isn't the Boston dataset available through openml? Maybe here:
> https://www.openml.org/d/531
>
> I'm happy to have the dataset out there on openml, and for any
> material that addresses some of the issues with it.
> But for educational purposes, we don't need to have the dataset
> in the package as long as users can still download it
> with a one-liner using fetch_openml.
That would be an argument in favor of a deprecation warning with a
message stating the motivation for the deprecation and pointing to
fetch_openml.
However, it's going to break examples written in slow-to-update
tutorials or books once the deprecation period is over. But one could
argue that this is also the case for any other deprecation in
scikit-learn. It's just that sklearn.datasets.load_boston is used A
LOT: https://github.com/search?q=load_boston&type=code
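For illustration, the deprecation path discussed above could look
roughly like the sketch below. It is only a sketch under assumptions:
the wrapper name, the stand-in loader and the message wording are all
hypothetical, and the OpenML ID 531 is the one tentatively mentioned
earlier in the thread. It is not how scikit-learn actually implements
its deprecations.

```python
import warnings

# Hypothetical message wording; nothing in this thread settles the text.
DEPRECATION_MESSAGE = (
    "load_boston is deprecated because of ethical concerns with the "
    "dataset. If you still need it, it can be downloaded with "
    "sklearn.datasets.fetch_openml(data_id=531)."
)


def deprecate_loader(loader):
    """Wrap a loader so that every call first emits a FutureWarning."""

    def wrapper(*args, **kwargs):
        # stacklevel=2 makes the warning point at the caller's line.
        warnings.warn(DEPRECATION_MESSAGE, FutureWarning, stacklevel=2)
        return loader(*args, **kwargs)

    return wrapper


def _stand_in_loader():
    """Hypothetical stand-in for the real loader, so the sketch runs offline."""
    return {"data": [], "target": []}


# Existing user code keeps working, but now sees the warning.
load_boston = deprecate_loader(_stand_in_loader)
```

With this shape, old code calling load_boston() still runs during the
deprecation period, and the warning text carries both the motivation
and the fetch_openml pointer.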
--
Olivier
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn