Hi Richard,

On Fri, 24 Apr 2026 06:56:14 -0700 (PDT) Richard Kenner wrote:

> > Any sequence of bytes   
> 
> Any sequence?
> 
> So an auto-complete done by an LLM counts, but one generated by
> a less-sophisticated plugin doesn't?  What's the difference?

Well, there are so many deep differences that I wonder whether yours
is an honest question or not.

I won't list them all, since many might be considered off-topic here,
but the ecological and geopolitical issues alone should be enough to
show how naive such a comparison is.

However, let's stick with what is well known among developers:

- LLMs are lossy compressions of the source data used during their
  "training", thus they are likely to output (corrupted) snippets of
  that source data (we all remember the GPL'd Quake III Arena code in
  the output of the early GitHub Copilot, and there is plenty of
  research showing this simple fact)
- LLM output cannot be subject to copyright (and thus cannot be
  protected through a license, copyleft or not)
- LLM output can include subtle vulnerabilities encoded in a highly
  plausible form (often justified with highly persuasive text)
- LLM source data is often unknown, making the supply chain vast and
  totally opaque
- coding LLMs are the target of military-grade data-poisoning
  campaigns trying to spread vulnerabilities

The tools you dismiss as "less-sophisticated plugins" do not pose any
of these problems.

So to summarize, LLM output poses high legal risks (both in the US and
world-wide) and exposes codebases to new threats and supply chain
attacks.

That's why some high-impact projects simply ban its use.

> > computed by a local or remote large language model  
> 
> I notice you didn't say "generated".  But what does "computed" mean?

For better clarity, I tend to avoid marketing slang.

LLMs do not generate anything: they just iteratively compute a point
in a multidimensional space somewhat near another point (the
vectorization of prompt + context + behind-the-scenes system prompt +
behind-the-scenes randomized input), so that the new point minimizes
its distance from the data compressed in the lossy archive (the
"model").
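
To make "iteratively compute" a bit more concrete, here is a toy sketch
in Python. It is not how any real LLM is implemented: the vocabulary,
the score table and the function names are all made up for
illustration, and a real model conditions on the whole context rather
than on the last token only. It only shows the shape of the loop: a
fixed table of numbers plus a prompt plus a random seed fully determine
the output.

```python
import math
import random

# Hypothetical toy "model": fixed scores for the next token, indexed by
# the previous token only. Real models are vastly larger.
VOCAB = ["int", "x", "=", "0", ";"]
SCORES = {
    "int": [0.0, 3.0, 0.0, 0.0, 0.0],
    "x":   [0.0, 0.0, 3.0, 0.0, 1.0],
    "=":   [0.0, 1.0, 0.0, 3.0, 0.0],
    "0":   [0.0, 0.0, 0.0, 0.0, 3.0],
    ";":   [3.0, 0.0, 0.0, 0.0, 0.0],
}


def softmax(scores, temperature):
    """Turn raw scores into probabilities that sum to 1."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def compute_output(prompt, steps, seed, temperature=1.0):
    """Append `steps` tokens, one at a time, to a copy of `prompt`."""
    rng = random.Random(seed)  # the "behind-the-scenes randomized input"
    tokens = list(prompt)
    for _ in range(steps):
        probs = softmax(SCORES[tokens[-1]], temperature)
        tokens.append(rng.choices(VOCAB, weights=probs, k=1)[0])
    return tokens


if __name__ == "__main__":
    print(" ".join(compute_output(["int"], steps=4, seed=1)))
```

Calling compute_output() twice with the same prompt and seed returns
the same tokens; changing only the seed may change them.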

In the same way, I never talk about "hallucinations": no output of an
LLM is grounded, but I admit that extracting competitors' code
submitted to these tools over time (however corrupted) might prove
useful in the short term.

> Do you only include literal code or do you include algorithms?

In the context of GCC, I only really care about what influences the
binary output of building the repo.

> So if I asked an LLM to generate pseudo-code to create some
> specific GCC tree and then manually translated that pseudo-code into
> C, that doesn't count, but if I asked the LLM to generate the C
> directly, it does? What's the difference?

The difference is your brain. ;-)

However fallible, you have a mental model of the algorithm you are
encoding in C. There's nothing like that in an LLM.

However, if you asked an LLM to generate pseudo-code and then included
an open transpiler for that pseudo-code in the GCC build, to generate
the C code that was actually included in any GCC binary, that
pseudo-code would influence the binary output, and I'd ask you to
share detailed information about the tool and the prompts used to
obtain it.


Giacomo
