[nexa] "A major AI training data set contains millions of examples of personal data"

J.C. DE MARTIN Thu, 24 Jul 2025 08:17:13 -0700

*A major AI training data set contains millions of examples of personal data*

/Personally identifiable information has been found in DataComp CommonPool, one of the largest open-source data sets used to train image generation models./


Eileen Guo

July 18, 2025

Millions of images of passports, credit cards, birth certificates, and other documents containing personally identifiable information are likely included in one of the biggest open-source AI training sets, new research has found.

Thousands of images, including identifiable faces, were found in a small subset of DataComp CommonPool, a major AI training set for image generation scraped from the web. Because the researchers audited just 0.1% of CommonPool’s data, they estimate that the real number of images containing personally identifiable information, including faces and identity documents, is in the hundreds of millions. The study that details the breach was published on arXiv earlier this month <https://arxiv.org/pdf/2506.17185>.



[...]

continues here: https://www.technologyreview.com/2025/07/18/1120466/a-major-ai-training-data-set-contains-millions-of-examples-of-personal-data/
