The topic had also come up, with a different line of discussion, in:
Gerard, David. «How AI slop generators started talking about ‘vegetative
electron microscopy’». /Pivot to AI/ (blog), 15 February 2025.
https://pivot-to-ai.com/2025/02/15/how-ai-slop-generators-started-talking-about-vegetative-electron-microscopy/.
The two pieces complement each other well and together give a broad view of the problem.
Some of the articles containing this “vegetative electron microscopy”
chimera have hundreds of citations. It would be worth finding out
whether these are fake citations, as was discussed for an earlier case
here:
Besançon, Lonni, Guillaume Cabanac, Cyril Labbé, and Alexander Magazinov.
«Sneaked references: Fabricated reference metadata distort citation
counts». /Journal of the Association for Information Science and
Technology/ n/a, no. n/a. Accessed 16 August 2024.
https://doi.org/10.1002/asi.24896.
Cabanac, Guillaume, and Lonni Besançon. «When scientific citations go
rogue: Uncovering ‘sneaked references’». The Conversation, 9 July
2024.
http://theconversation.com/when-scientific-citations-go-rogue-uncovering-sneaked-references-233858.
Ibrahim, Hazem, Fengyuan Liu, Yasir Zaki, and Talal Rahwan. «Citation
manipulation through citation mills and pre-print servers». /Scientific
Reports/ 15, no. 1 (14 February 2025): 5480.
https://doi.org/10.1038/s41598-025-88709-7.
On another note, even among these articles that do not have hundreds
of citations, some exist in dozens of versions identical in title and
in the internet domain of publication.
Maurizio
On 03/05/25 18:48, Alberto Cammozzo via nexa wrote:
<https://theconversation.com/a-weird-phrase-is-plaguing-scientific-papers-and-we-traced-it-back-to-a-glitch-in-ai-training-data-254463>
A weird phrase is plaguing scientific papers – and we traced it back
to a glitch in AI training data
Rayane El Masri
Earlier this year, scientists discovered a peculiar term appearing in
published papers: “vegetative electron microscopy”.
This phrase, which sounds technical but is actually nonsense, has
become a “digital fossil” – an error preserved and reinforced in
artificial intelligence (AI) systems that is nearly impossible to
remove from our knowledge repositories.
Like biological fossils trapped in rock, these digital artefacts may
become permanent fixtures in our information ecosystem.
The case of vegetative electron microscopy offers a troubling glimpse
into how AI systems can perpetuate and amplify errors throughout our
collective knowledge.
A bad scan and an error in translation
Vegetative electron microscopy appears to have originated through a
remarkable coincidence of unrelated errors.
First, two papers from the 1950s, published in the journal
Bacteriological Reviews, were scanned and digitised.
However, the digitising process erroneously combined “vegetative” from
one column of text with “electron” from another, creating the phantom
term.
Decades later, “vegetative electron microscopy” turned up in some
Iranian scientific papers. In 2017 and 2019, two papers used the term
in English captions and abstracts.
This appears to be due to a translation error. In Farsi, the words for
“vegetative” and “scanning” differ by only a single dot.
An error on the rise
The upshot? As of today, “vegetative electron microscopy” appears in
22 papers, according to Google Scholar. One was the subject of a
contested retraction from a Springer Nature journal, and Elsevier
issued a correction for another.
The term also appears in news articles discussing subsequent integrity
investigations.
Vegetative electron microscopy began to appear more frequently in the
2020s. To find out why, we had to peer inside modern AI models – and
do some archaeological digging through the vast layers of data they
were trained on.
Empirical evidence of AI contamination
The large language models behind modern AI chatbots such as ChatGPT
are “trained” on huge amounts of text to predict the likely next word
in a sequence. The exact contents of a model’s training data are often
a closely guarded secret.
To test whether a model “knew” about vegetative electron microscopy,
we input snippets of the original papers to find out if the model
would complete them with the nonsense term or more sensible alternatives.
The results were revealing. OpenAI’s GPT-3 consistently completed
phrases with “vegetative electron microscopy”. Earlier models such as
GPT-2 and BERT did not. This pattern helped us isolate when and where
the contamination occurred.
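The probe described above can be sketched in a few lines. This is a minimal, illustrative version: the `complete` function is a stub standing in for a real language-model call (the study queried GPT-2, GPT-3, and later models), and the sample prompts are hypothetical, not snippets from the actual 1950s papers.

```python
# Phantom term the probe looks for in model completions.
PHANTOM = "vegetative electron microscopy"

def complete(prompt: str) -> str:
    """Stand-in for a real language-model completion call.

    Here it simulates a contaminated model that has memorised the
    phantom term for one (hypothetical) prompt; in the actual
    experiment this call would query each model under test.
    """
    if prompt.endswith("examined by "):
        return PHANTOM
    return "scanning electron microscopy"

def probe(prompt: str) -> bool:
    """Return True if the model completes the prompt with the nonsense term."""
    return PHANTOM in complete(prompt)
```

Running the same probe against several model generations is what lets one bracket when the contamination entered the training data: models trained before the contaminated crawl should fail the probe, later ones pass it.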
We also found the error persists in later models including GPT-4o and
Anthropic’s Claude 3.5. This suggests the nonsense term may now be
permanently embedded in AI knowledge bases.
By comparing what we know about the training datasets of different
models, we identified the CommonCrawl dataset of scraped internet
pages as the most likely vector where AI models first learned this term.
The scale problem
Finding errors of this sort is not easy. Fixing them may be almost
impossible.
One reason is scale. The CommonCrawl dataset, for example, is millions
of gigabytes in size. For most researchers outside large tech
companies, the computing resources required to work at this scale are
inaccessible.
Another reason is a lack of transparency in commercial AI models.
OpenAI and many other developers refuse to provide precise details
about the training data for their models. Research efforts to reverse
engineer some of these datasets have also been stymied by copyright
takedowns.
When errors are found, there is no easy fix. Simple keyword filtering
could deal with specific terms such as vegetative electron microscopy.
However, it would also eliminate legitimate references (such as this
article).
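The false-positive problem with keyword filtering is easy to see in a toy example. The sketch below (the flag list is hypothetical) shows that a naive filter cannot distinguish a paper that unknowingly uses the phantom term from an article that legitimately discusses it:

```python
import re

# Hypothetical list of known phantom terms / tortured phrases.
FLAGGED = [
    r"vegetative electron microscop\w*",
    r"counterfeit consciousness",
]
PATTERN = re.compile("|".join(FLAGGED), re.IGNORECASE)

def flag(text: str) -> bool:
    """Return True if the text contains any known phantom term.

    Note the limitation: a sentence *about* the error is flagged
    exactly like a sentence that commits it.
    """
    return bool(PATTERN.search(text))
```

Both a contaminated methods section and a sentence such as “the phrase ‘vegetative electron microscopy’ is a digital fossil” would be flagged, which is why simple keyword filtering cannot be applied blindly to training data.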
More fundamentally, the case raises an unsettling question. How many
other nonsensical terms exist in AI systems, waiting to be discovered?
Implications for science and publishing
This “digital fossil” also raises important questions about knowledge
integrity as AI-assisted research and writing become more common.
Publishers have responded inconsistently when notified of papers
including vegetative electron microscopy. Some have retracted affected
papers, while others defended them. Elsevier notably attempted to
justify the term’s validity before eventually issuing a correction.
We do not yet know if other such quirks plague large language models,
but it is highly likely. Either way, the use of AI systems has already
created problems for the peer-review process.
For instance, observers have noted the rise of “tortured phrases” used
to evade automated integrity software, such as “counterfeit
consciousness” instead of “artificial intelligence”. Additionally,
phrases such as “I am an AI language model” have been found in other
retracted papers.
Some automatic screening tools such as Problematic Paper Screener now
flag vegetative electron microscopy as a warning sign of possible
AI-generated content. However, such approaches can only address known
errors, not undiscovered ones.
Living with digital fossils
The rise of AI creates opportunities for errors to become permanently
embedded in our knowledge systems, through processes no single actor
controls. This presents challenges for tech companies, researchers,
and publishers alike.
Tech companies must be more transparent about training data and
methods. Researchers must find new ways to evaluate information in the
face of AI-generated convincing nonsense. Scientific publishers must
improve their peer review processes to spot both human and
AI-generated errors.
Digital fossils reveal not just the technical challenge of monitoring
massive datasets, but the fundamental challenge of maintaining
reliable knowledge in systems where errors can become self-perpetuating.
------------------------------------------------------------------------
citations like brigands at the roadside
who leap out armed
and rob the idle wanderer of his assent
walter benjamin, one-way street
------------------------------------------------------------------------
Maurizio Lana
Università del Piemonte Orientale
Dipartimento di Studi Umanistici
Piazza Roma 36 - 13100 Vercelli