Very interesting, thank you very much; I was not aware of this kind of problem. AB
On Sat, 3 May 2025 at 18:48, Alberto Cammozzo via nexa <[email protected]> wrote:

> <https://theconversation.com/a-weird-phrase-is-plaguing-scientific-papers-and-we-traced-it-back-to-a-glitch-in-ai-training-data-254463>
>
> theconversation.com
>
> A weird phrase is plaguing scientific papers – and we traced it back to a
> glitch in AI training data
>
> Rayane El Masri
>
> Earlier this year, scientists discovered a peculiar term appearing in
> published papers: “vegetative electron microscopy”.
>
> This phrase, which sounds technical but is actually nonsense, has become a
> “digital fossil” – an error preserved and reinforced in artificial
> intelligence (AI) systems that is nearly impossible to remove from our
> knowledge repositories.
>
> Like biological fossils trapped in rock, these digital artefacts may
> become permanent fixtures in our information ecosystem.
>
> The case of vegetative electron microscopy offers a troubling glimpse into
> how AI systems can perpetuate and amplify errors throughout our collective
> knowledge.
>
> A bad scan and an error in translation
>
> Vegetative electron microscopy appears to have originated through a
> remarkable coincidence of unrelated errors.
>
> First, two papers from the 1950s, published in the journal Bacteriological
> Reviews, were scanned and digitised.
>
> However, the digitising process erroneously combined “vegetative” from one
> column of text with “electron” from another. As a result, the phantom term
> was created.
>
> Decades later, “vegetative electron microscopy” turned up in some Iranian
> scientific papers. In 2017 and 2019, two papers used the term in English
> captions and abstracts.
>
> This appears to be due to a translation error. In Farsi, the words for
> “vegetative” and “scanning” differ by only a single dot.
>
> An error on the rise
>
> The upshot? As of today, “vegetative electron microscopy” appears in 22
> papers, according to Google Scholar.
> One was the subject of a contested
> retraction from a Springer Nature journal, and Elsevier issued a correction
> for another.
>
> The term also appears in news articles discussing subsequent integrity
> investigations.
>
> Vegetative electron microscopy began to appear more frequently in the
> 2020s. To find out why, we had to peer inside modern AI models – and do
> some archaeological digging through the vast layers of data they were
> trained on.
>
> Empirical evidence of AI contamination
>
> The large language models behind modern AI chatbots such as ChatGPT are
> “trained” on huge amounts of text to predict the likely next word in a
> sequence. The exact contents of a model’s training data are often a closely
> guarded secret.
>
> To test whether a model “knew” about vegetative electron microscopy, we
> input snippets of the original papers to find out if the model would
> complete them with the nonsense term or more sensible alternatives.
>
> The results were revealing. OpenAI’s GPT-3 consistently completed phrases
> with “vegetative electron microscopy”. Earlier models such as GPT-2 and
> BERT did not. This pattern helped us isolate when and where the
> contamination occurred.
>
> We also found the error persists in later models including GPT-4o and
> Anthropic’s Claude 3.5. This suggests the nonsense term may now be
> permanently embedded in AI knowledge bases.
>
> By comparing what we know about the training datasets of different models,
> we identified the CommonCrawl dataset of scraped internet pages as the most
> likely vector where AI models first learned this term.
>
> The scale problem
>
> Finding errors of this sort is not easy. Fixing them may be almost
> impossible.
>
> One reason is scale. The CommonCrawl dataset, for example, is millions of
> gigabytes in size. For most researchers outside large tech companies, the
> computing resources required to work at this scale are inaccessible.
>
> Another reason is a lack of transparency in commercial AI models. OpenAI
> and many other developers refuse to provide precise details about the
> training data for their models. Research efforts to reverse engineer some
> of these datasets have also been stymied by copyright takedowns.
>
> When errors are found, there is no easy fix. Simple keyword filtering
> could deal with specific terms such as vegetative electron microscopy.
> However, it would also eliminate legitimate references (such as this
> article).
>
> More fundamentally, the case raises an unsettling question. How many other
> nonsensical terms exist in AI systems, waiting to be discovered?
>
> Implications for science and publishing
>
> This “digital fossil” also raises important questions about knowledge
> integrity as AI-assisted research and writing become more common.
>
> Publishers have responded inconsistently when notified of papers including
> vegetative electron microscopy. Some have retracted affected papers, while
> others defended them. Elsevier notably attempted to justify the term’s
> validity before eventually issuing a correction.
>
> We do not yet know if other such quirks plague large language models, but
> it is highly likely. Either way, the use of AI systems has already created
> problems for the peer-review process.
>
> For instance, observers have noted the rise of “tortured phrases” used to
> evade automated integrity software, such as “counterfeit consciousness”
> instead of “artificial intelligence”. Additionally, phrases such as “I am
> an AI language model” have been found in other retracted papers.
>
> Some automatic screening tools such as Problematic Paper Screener now flag
> vegetative electron microscopy as a warning sign of possible AI-generated
> content. However, such approaches can only address known errors, not
> undiscovered ones.
>
> Living with digital fossils
>
> The rise of AI creates opportunities for errors to become permanently
> embedded in our knowledge systems, through processes no single actor
> controls. This presents challenges for tech companies, researchers, and
> publishers alike.
>
> Tech companies must be more transparent about training data and methods.
> Researchers must find new ways to evaluate information in the face of
> AI-generated convincing nonsense. Scientific publishers must improve their
> peer review processes to spot both human and AI-generated errors.
>
> Digital fossils reveal not just the technical challenge of monitoring
> massive datasets, but the fundamental challenge of maintaining reliable
> knowledge in systems where errors can become self-perpetuating.
