The topic had also come up, with a different line of discussion, in:
Gerard, David. «How AI slop generators started talking about ‘vegetative
electron microscopy’». /Pivot to AI/ (blog), 15 February 2025.
https://pivot-to-ai.com/2025/02/15/how-ai-slop-generators-started-talking-about-vegetative-electron-microscopy/.
The two pieces complement each other well and together give a broad view of the problem.
Some of the articles containing this “vegetative electron microscopy”
chimera have hundreds of citations. It would be worth finding out
whether these are fake citations, as was discussed for an earlier case
here:
Besançon, Lonni, Guillaume Cabanac, Cyril Labbé, and Alexander Magazinov.
«Sneaked references: Fabricated reference metadata distort citation
counts». /Journal of the Association for Information Science and
Technology/ n/a, no. n/a. Accessed 16 August 2024.
https://doi.org/10.1002/asi.24896.
Cabanac, Guillaume, and Lonni Besançon. «When scientific citations go
rogue: Uncovering ‘sneaked references’». The Conversation, 9 July
2024.
http://theconversation.com/when-scientific-citations-go-rogue-uncovering-sneaked-references-233858.
Ibrahim, Hazem, Fengyuan Liu, Yasir Zaki, and Talal Rahwan. «Citation
manipulation through citation mills and pre-print servers». /Scientific
Reports/ 15, no. 1 (14 February 2025): 5480.
https://doi.org/10.1038/s41598-025-88709-7.
On another note, even among these articles that do not have hundreds
of citations, some exist in dozens of versions identical in title and
in the internet domain of publication.
Maurizio
On 03/05/25 18:48, Alberto Cammozzo via nexa wrote:
<https://theconversation.com/a-weird-phrase-is-plaguing-scientific-papers-and-we-traced-it-back-to-a-glitch-in-ai-training-data-254463>
A weird phrase is plaguing scientific papers – and we traced it back
to a glitch in AI training data
Rayane El Masri
Earlier this year, scientists discovered a peculiar term appearing in
published papers: “vegetative electron microscopy”.
This phrase, which sounds technical but is actually nonsense, has
become a “digital fossil” – an error preserved and reinforced in
artificial intelligence (AI) systems that is nearly impossible to
remove from our knowledge repositories.
Like biological fossils trapped in rock, these digital artefacts may
become permanent fixtures in our information ecosystem.
The case of vegetative electron microscopy offers a troubling glimpse
into how AI systems can perpetuate and amplify errors throughout our
collective knowledge.
A bad scan and an error in translation
Vegetative electron microscopy appears to have originated through a
remarkable coincidence of unrelated errors.
First, two papers from the 1950s, published in the journal
Bacteriological Reviews, were scanned and digitised.
However, the digitising process erroneously combined “vegetative” from
one column of text with “electron” from another, creating the phantom
term.
Decades later, “vegetative electron microscopy” turned up in some
Iranian scientific papers. In 2017 and 2019, two papers used the term
in English captions and abstracts.
This appears to be due to a translation error. In Farsi, the words for
“vegetative” and “scanning” differ by only a single dot.
An error on the rise
The upshot? As of today, “vegetative electron microscopy” appears in
22 papers, according to Google Scholar. One was the subject of a
contested retraction from a Springer Nature journal, and Elsevier
issued a correction for another.
The term also appears in news articles discussing subsequent integrity
investigations.
Vegetative electron microscopy began to appear more frequently in the
2020s. To find out why, we had to peer inside modern AI models – and
do some archaeological digging through the vast layers of data they
were trained on.
Empirical evidence of AI contamination
The large language models behind modern AI chatbots such as ChatGPT
are “trained” on huge amounts of text to predict the likely next word
in a sequence. The exact contents of a model’s training data are often
a closely guarded secret.
To test whether a model “knew” about vegetative electron microscopy,
we input snippets of the original papers to find out if the model
would complete them with the nonsense term or more sensible alternatives.
The results were revealing. OpenAI’s GPT-3 consistently completed
phrases with “vegetative electron microscopy”. Earlier models such as
GPT-2 and BERT did not. This pattern helped us isolate when and where
the contamination occurred.
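The probe described above can be sketched in a few lines. This is a minimal, illustrative version: the `complete` function is a stub standing in for a real language-model call (the study queried GPT-2, GPT-3, and later models), and the sample prompts are hypothetical, not snippets from the actual 1950s papers.

```python
# Phantom term the probe looks for in model completions.
PHANTOM = "vegetative electron microscopy"

def complete(prompt: str) -> str:
    """Stand-in for a real language-model completion call.

    Here it simulates a contaminated model that has memorised the
    phantom term for one (hypothetical) prompt; in the actual
    experiment this call would query each model under test.
    """
    if prompt.endswith("examined by "):
        return PHANTOM
    return "scanning electron microscopy"

def probe(prompt: str) -> bool:
    """Return True if the model completes the prompt with the nonsense term."""
    return PHANTOM in complete(prompt)
```

Running the same probe against several model generations is what lets one bracket when the contamination entered the training data: models trained before the contaminated crawl should fail the probe, later ones pass it.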
We also found the error persists in later models including GPT-4o and
Anthropic’s Claude 3.5. This suggests the nonsense term may now be
permanently embedded in AI knowledge bases.
By comparing what we know about the training datasets of different
models, we identified the CommonCrawl dataset of scraped internet
pages as the most likely vector where AI models first learned this term.
The scale problem
Finding errors of this sort is not easy. Fixing them may be almost
impossible.
One reason is scale. The CommonCrawl dataset, for example, is millions
of gigabytes in size. For most researchers outside large tech
companies, the computing resources required to work at this scale are
inaccessible.
Another reason is a lack of transparency in commercial AI models.
OpenAI and many other developers refuse to provide precise details
about the training data for their models. Research efforts to reverse
engineer some of these datasets have also been stymied by copyright
takedowns.
When errors are found, there is no easy fix. Simple keyword filtering
could deal with specific terms such as vegetative electron microscopy.
However, it would also eliminate legitimate references (such as this
article).
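The false-positive problem with keyword filtering is easy to see in a toy example. The sketch below (the flag list is hypothetical) shows that a naive filter cannot distinguish a paper that unknowingly uses the phantom term from an article that legitimately discusses it:

```python
import re

# Hypothetical list of known phantom terms / tortured phrases.
FLAGGED = [
    r"vegetative electron microscop\w*",
    r"counterfeit consciousness",
]
PATTERN = re.compile("|".join(FLAGGED), re.IGNORECASE)

def flag(text: str) -> bool:
    """Return True if the text contains any known phantom term.

    Note the limitation: a sentence *about* the error is flagged
    exactly like a sentence that commits it.
    """
    return bool(PATTERN.search(text))
```

Both a contaminated methods section and a sentence such as “the phrase ‘vegetative electron microscopy’ is a digital fossil” would be flagged, which is why simple keyword filtering cannot be applied blindly to training data.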
More fundamentally, the case raises an unsettling question. How many
other nonsensical terms exist in AI systems, waiting to be discovered?
Implications for science and publishing
This “digital fossil” also raises important questions about knowledge
integrity as AI-assisted research and writing become more common.
Publishers have responded inconsistently when notified of papers
including vegetative electron microscopy. Some have retracted affected
papers, while others defended them. Elsevier notably attempted to
justify the term’s validity before eventually issuing a correction.
We do not yet know if other such quirks plague large language models,
but it is highly likely. Either way, the use of AI systems has already
created problems for the peer-review process.
For instance, observers have noted the rise of “tortured phrases” used
to evade automated integrity software, such as “counterfeit
consciousness” instead of “artificial intelligence”. Additionally,
phrases such as “I am an AI language model” have been found in other
retracted papers.
Some automatic screening tools such as Problematic Paper Screener now
flag vegetative electron microscopy as a warning sign of possible
AI-generated content. However, such approaches can only address known
errors, not undiscovered ones.
Living with digital fossils
The rise of AI creates opportunities for errors to become permanently
embedded in our knowledge systems, through processes no single actor
controls. This presents challenges for tech companies, researchers,
and publishers alike.
Tech companies must be more transparent about training data and
methods. Researchers must find new ways to evaluate information in the
face of AI-generated convincing nonsense. Scientific publishers must
improve their peer review processes to spot both human and
AI-generated errors.
Digital fossils reveal not just the technical challenge of monitoring
massive datasets, but the fundamental challenge of maintaining
reliable knowledge in systems where errors can become self-perpetuating.
------------------------------------------------------------------------
citations like brigands at the roadside
who leap out armed
and rob the idle wanderer of his assent
walter benjamin, one-way street
------------------------------------------------------------------------
Maurizio Lana
Università del Piemonte Orientale
Dipartimento di Studi Umanistici
Piazza Roma 36 - 13100 Vercelli