*A major AI training data set contains millions of examples of personal
data*
/Personally identifiable information has been found in DataComp
CommonPool, one of the largest open-source data sets used to train image
generation models./
Eileen Guo
July 18, 2025
Millions of images of passports, credit cards, birth certificates, and
other documents containing personally identifiable information are
likely included in one of the biggest open-source AI training sets, new
research has found.
Thousands of images—including identifiable faces—were found in a small
subset of DataComp CommonPool, a major AI training set for image
generation scraped from the web. Because the researchers audited just
0.1% of CommonPool’s data, they estimate that the real number of images
containing personally identifiable information, including faces and
identity documents, is in the hundreds of millions. The study that
details the breach was published on arXiv earlier this month
<https://arxiv.org/pdf/2506.17185>.
[...]
continues here:
https://www.technologyreview.com/2025/07/18/1120466/a-major-ai-training-data-set-contains-millions-of-examples-of-personal-data/