Let’s have Claude formulate the contradiction in Lean 4 and delegate the reasoning to a tool that is good at that. (Just like I wouldn’t do long division by hand.)
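Below is a minimal Lean 4 sketch of the Socrates example from Russ's list further down the thread; the type and predicate names (Person, Man, Mortal) are illustrative, not from the thread. It states the three premises as hypotheses and derives False, which is the formal sense in which the sentence set is contradictory.

theorem socrates_contradiction
    (Person : Type) (Socrates : Person)
    (Man Mortal : Person → Prop)
    -- Premise 1: all men are mortal.
    (all_men_mortal : ∀ p, Man p → Mortal p)
    -- Premise 2: Socrates is a man.
    (socrates_is_man : Man Socrates)
    -- Premise 3: Socrates is immortal, i.e. not mortal.
    (socrates_immortal : ¬ Mortal Socrates) : False :=
  -- Premises 1 and 2 give `Mortal Socrates`, which premise 3 refutes.
  socrates_immortal (all_men_mortal Socrates socrates_is_man)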
From: Russ Abbott <[email protected]>
Sent: Friday, September 12, 2025 3:51 PM
To: Marcus Daniels <[email protected]>
Cc: The Friday Morning Applied Complexity Coffee Group <[email protected]>
Subject: Re: [FRIAM] Hallucinations

Marcus,

You're right, and I was wrong. I was much too insistent that LLMs don't understand the text they manipulate.

A couple of weeks ago, I asked ChatGPT to embed (encode) a sentence and then decode it back to natural language. It said it didn't have access to the tools to do exactly that, but it would show me what the result would look like. The input sentence was: “I ate an apple because I was hungry. The apple was rotten. I got sick. My friend ate a banana. The banana was not rotten. My friend didn’t get sick.”

ChatGPT simulated embedding/encoding the sentence as a vector. It then produced what it claimed was a reasonable natural language approximation of that vector. The result was: "A person and their friend ate fruit. One of the fruits was rotten, which caused sickness, while the other was fresh and did not cause illness."

If ChatGPT can be believed, this is quite impressive. It implies that the embedding/encoding of natural language text includes something like the essential semantics of the original text. I had forgotten all about this when I wrote my post about hallucinations. I apologize.

What I would like to do now -- and perhaps someone can help figure out if any tools are available to do this -- is to explore more carefully the sorts of information embeddings/encodings contain. For example, what would one get if one encoded and then decoded Chomsky's famous sentence: "Colorless green ideas sleep furiously." What would one get if one encoded -> decoded a contradiction: "All men are mortal. Socrates is a man; Socrates is immortal." What about: "The integer 3 is larger than the integer 9." Or "The American Revolutionary War occurred during the 19th century. George Washington led the American troops in that war. George Washington's tenure as the inaugural president of the United States began on April 30, 1789." Etc.

-- Russ Abbott <https://russabbott.substack.com/> (Click for my Substack)
Professor Emeritus, Computer Science
California State University, Los Angeles
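One way to start on the exploration Russ asks about above is with an open-source sentence encoder; the sketch below assumes the sentence-transformers library and a small general-purpose model, which is my choice, not something from the thread (Russ's experiment used whatever internal representation ChatGPT has). It only encodes and compares -- genuinely decoding an embedding back to text would need a separately trained inverter -- but nearest-neighbor similarity already gives a rough probe of how much of the semantics the vector retains.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose encoder

original = ("I ate an apple because I was hungry. The apple was rotten. "
            "I got sick. My friend ate a banana. The banana was not rotten. "
            "My friend didn't get sick.")

candidates = [
    # ChatGPT's proposed "decoding"
    "A person and their friend ate fruit. One of the fruits was rotten, "
    "which caused sickness, while the other was fresh and did not cause illness.",
    # a semantically scrambled control
    "A person and their friend ate fruit. Both fruits were fresh and nobody got sick.",
    # Chomsky's sentence, for contrast
    "Colorless green ideas sleep furiously.",
]

emb_orig = model.encode(original, convert_to_tensor=True)
emb_cand = model.encode(candidates, convert_to_tensor=True)

# Cosine similarity: higher means the encoder places the texts closer in meaning.
for text, score in zip(candidates, util.cos_sim(emb_orig, emb_cand)[0]):
    print(f"{score.item():.3f}  {text[:60]}...")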
On Thu, Sep 11, 2025 at 6:59 PM Marcus Daniels <[email protected]> wrote:

It often works with the frontier models to take a computational science or theory paper and to have them implement the idea expressed in some computer language. One can also often invert that program back into natural language (and/or with LaTeX equations). Further, one can translate between very different formal languages (imperative vs. functional), which would be hard work for most people. These summaries and transformations work so well that tools like GitHub Copilot will periodically perform a conversation summary, and simply drop the whole conversation and start over with crystallized context (due to context window limitations). When it picks up after that, one will often see a few syntax or API misunderstandings before it regroups to where it was. What this pivoting ease implies to me is that LLMs have a deep semantic representation of the conversation (and knowledge and these skills). It certainly is not just a matter of mating token sequences with some deft smoothing.

Another example that has come up for me recently is using LLMs to predict simulation or solver outputs. When faced with learning large arrays of numbers, what a model does is more like capturing a picture than memorizing a sequence of digits. It doesn't know, without some help, why number boundaries, signs, and decimal points are important. Only through hard-won experience does it learn that the most and least significant digits should be treated differently. Syntax is a hint one can offer through weak scaffolding penalties (outside of the training material). It learns the semantics first; strong syntax penalties can get in the way of learning semantics by creating problematic energy barriers. While LLMs are huge, the Chinchilla optimality criterion (roughly 20 training tokens per parameter) forces regularization. There's some flood fill, but I don't think it can hold up for idiosyncratic lexical patterns.
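For scale on Marcus's Chinchilla remark, a back-of-envelope calculation (the model sizes below are illustrative, not from the thread): the compute-optimal recipe pairs roughly 20 training tokens with each parameter, so the data budget grows with the model and leaves little slack for memorizing idiosyncratic lexical patterns.

# Rough Chinchilla-style data budgets: ~20 tokens per parameter.
for params in (7e9, 70e9, 400e9):
    tokens = 20 * params
    print(f"{params / 1e9:>5.0f}B params -> ~{tokens / 1e12:.1f}T training tokens")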
From: Friam <[email protected]> on behalf of Santafe <[email protected]>
Date: Thursday, September 11, 2025 at 5:12 PM
To: [email protected], The Friday Morning Applied Complexity Coffee Group <[email protected]>
Subject: Re: [FRIAM] Hallucinations

In your post, Russ, you say: “They are trained to produce fluent language, not to produce valid statements.” Is that actually, operationally, what they are trained to do? I speak from a position of ignorance here, but my impression is that they are trained to effectively stitch together fragments of varying lengths, according to rules for what stitchings are “compatible”.

My thinking here is metaphorical, to homologous recombination in DNA. Some regions that don’t start out contiguous can be concatenated by DNA repair machinery, because under the physics to which it responds, they have plausible enough overlap that it considers them “compatible” or eligible to be identified at the join region, their “mismatches” edited out. Other pairs are so dissimilar that, under its operating physics, the repair machinery will effectively never join them. My metaphor isn’t great, in the sense that if what LLMs (for human speech) are doing is “next-word prediction”, that says that the notion of “joining” is reduced formally to appending next-words onto strings. Though, to the extent that certain substrings of next-words are extremely frequently attested across the corpus of all the training expressions, one would expect to see extended sequences essentially reproduced as fragments with large probability.

If my previous two characterizations aren’t fundamentally wrong, it would follow that fluent speech generation becomes possible because the compatible-joining relations are sufficiently strong in human languages that the attention structures or other feed-forward aspects of the architecture have no trouble capturing them in parameters, even though human linguists trying to write them as rewrite rules from which a computer could generate native-like speech failed for decades to get anywhere close to that. My interpretation here would be consistent with what I believed was the main watershed change in the LLMs: that the parametric models would, ultimately, have terribly few parameters, whereas the LLMs can flood-fill a corpus with parameters, and then try to drip out the parts that don’t “stick to” some pattern in the data, and are regarded as the excess entropy from the sampling algorithm that the training is supposed to recognize and remove.

It is easy to imagine that fluent speech has far more regularities than rule-book linguists captured parametrically, but still few enough that LLMs have no trouble attaching to almost all of them, with parameters to spare. Hence fluent speech could be epiphenomenal on what they are (operationally, mechanistically) being trained to do, but a natural summary statistic for the effectiveness of that training, and of course the one that drives market engagement.

But if the above is the case, then the question of when they get “the syntax” right and “the semantics” wrong would seem to turn on how much context from the training set is needed to identify semantically as well as syntactically appropriate “allowed joins” of fragments. When short fragments contain enough of their own context to constrain most of the semantics, the stitching training algorithm has no reason to perform any worse at revealing the semantic signal in the training set than the syntactic one. But if probability needs to be withheld for a long time in the prediction model, driving it to prioritize a much smaller number of longer or more remote assembled inputs from the training data, it could still do fine on syntax but fail to “find” and “render” the semantic signal in the training data, even if that signal is present in principle.

I would not feel a need to use terms like “understanding” anywhere in the above to make predictions of what kinds of successes or failures an LLM might deliver from the user’s perspective. It seems to me like something that all lives in the domain of hardness-of-search combinatorics in data-spaces with a lot of difficult structure.

Eric

On Sep 10, 2025, at 7:02, Russ Abbott <[email protected]> wrote:

OpenAI just published a paper on hallucinations <https://cdn.openai.com/pdf/d04913be-3f6f-4d2b-b283-ff432ef4aaa5/why-language-models-hallucinate.pdf> as well as a post summarizing the paper <https://openai.com/index/why-language-models-hallucinate/>. The two of them seem wrong-headed in such a simple and obvious way that I'm surprised the issue they discuss is still alive.

The paper and post point out that LLMs are trained to generate fluent language--which they do extraordinarily well. The paper and post also point out that LLMs are not trained to distinguish valid from invalid statements. Given those facts about LLMs, it's not clear why one should expect LLMs to be able to distinguish true statements from false statements--and hence why one should expect to be able to prevent LLMs from hallucinating. In other words, LLMs are built to generate text; they are not built to understand the texts they generate, and certainly not to be able to determine whether the texts they generate make factually correct or incorrect statements. Please see my post <https://russabbott.substack.com/p/why-language-models-hallucinate-according> elaborating on this. Why is this not obvious, and why is OpenAI still talking about it?

-- Russ Abbott <https://russabbott.substack.com/> (Click for my Substack)
Professor Emeritus, Computer Science
California State University, Los Angeles
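As a toy illustration of Eric's point above that "joining" reduces formally to appending next words, with frequently attested continuations dominating, here is a bigram counter over a scrap of text from the thread. Real LLMs condition on long contexts with learned attention; this is only meant to make the stitching picture concrete.

import random
from collections import defaultdict, Counter

corpus = (
    "the apple was rotten . the banana was not rotten . "
    "i ate an apple because i was hungry . my friend ate a banana ."
).split()

# Count how often each word follows each other word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def generate(start: str, length: int = 8) -> str:
    """Append next words sampled in proportion to how often they follow."""
    out = [start]
    for _ in range(length):
        counts = follows.get(out[-1])
        if not counts:
            break
        words, weights = zip(*counts.items())
        out.append(random.choices(words, weights=weights)[0])
    return " ".join(out)

print(generate("the"))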
.- .-.. .-.. / ..-. --- --- - . .-. ... / .- .-. . / .-- .-. --- -. --. / ... --- -- . / .- .-. . / ..- ... . ..-. ..- .-..
FRIAM Applied Complexity Group listserv
Fridays 9a-12p Friday St. Johns Cafe / Thursdays 9a-12p Zoom https://bit.ly/virtualfriam
to (un)subscribe http://redfish.com/mailman/listinfo/friam_redfish.com
FRIAM-COMIC http://friam-comic.blogspot.com/
archives: 5/2017 thru present https://redfish.com/pipermail/friam_redfish.com/
1/2003 thru 6/2021 http://friam.383.s1.nabble.com/
