Re: [agi] Token Coloring - Simple architectural defense against prompt injection?

Matt Mahoney Thu, 01 Jan 2026 16:24:07 -0800

What happens if you tell the LLM to ignore the token flags? How could you
test an LLM to make sure this can't happen? Have you done any actual tests?


Defenses against prompt injection should be coded in the training data
before the weights are frozen. It seems this already distinguishes it from
user data.

-- Matt Mahoney, [email protected]

On Wed, Dec 31, 2025, 3:38 PM stefan.reich.maker.of.eye via AGI <
[email protected]> wrote:

> Yo...
>
> I've been thinking about prompt injection defenses and had an idea I'd
> like your feedback on. It's simple enough that I assume either (a) it's
> been tried and doesn't work, or (b) there's an obvious flaw I'm missing.
>
> *The idea:* Add a binary flag to each token during training -
>
> flag=1 for instructions the model should follow (system prompts, user
> queries),
> flag=0 for data the model should only process (user-provided content,
> documents, etc.).
>
> The flag is an additional input channel (like an extra embedding
> dimension), completely separate from the token stream itself - users cannot
> inject it through text.
>
> (I call this Token Coloring because the original idea was to make 
> *instructions
> red* and *data green*.)
>
> *Training approach:*
>
>    - Base model: trained with neutral/no flag (learns language
>    understanding)
>    - Instruction tuning: command following dataset is augmented with
>    flags (learns to only execute flag=1 commands, ignore flag=0 commands even
>    if they look like instructions). Also, there are adversarial examples
>    reinforcing the notion of NOT following flag=0 commands.
>
> *Why it might work:* The flag creates an architectural separation between
> instruction and data channels. Unlike special tokens (which can be
> injected), the flag is out-of-band. Unlike prompt engineering, it's not
> relying on the model's semantic understanding of "ignore this" - it's a
> structural privilege boundary.
>
> *My questions:*
>
>    1. Has this approach been explored? (I couldn't find it in the
>    literature, but might be using wrong search terms)
>    2. What are the obvious problems I'm missing?
>    3. Could this work with existing pretrained models + instruction
>    fine-tuning, or would it require training from scratch?
>
>
> BTW this is what Claude said when I mentioned this group:
>
>
>
> Nice to be famous among AIs... lol
>
> Cheers, Stefan
> *Artificial General Intelligence List <https://agi.topicbox.com/latest>*
> / AGI / see discussions <https://agi.topicbox.com/groups/agi> +
> participants <https://agi.topicbox.com/groups/agi/members> +
> delivery options <https://agi.topicbox.com/groups/agi/subscription>
> Permalink
> <https://agi.topicbox.com/groups/agi/T2faee51273b20a92-Mbe5d3cf378db0180ebc49981>
>

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/T2faee51273b20a92-M5738e37fc042c71a36ed85b3
Delivery options: https://agi.topicbox.com/groups/agi/subscription

Re: [agi] Token Coloring - Simple architectural defense against prompt injection?

Reply via email to