Re: [agi] Re: Lexical model learning for LLMs

James Bowery Wed, 22 Nov 2023 12:53:18 -0800

Matt wrote:
> I am doing experiments on learning the rules for tokenization. Back in
> 2000 I experimented in finding word boundaries in text without spaces.
> These occur where there is low mutual information across boundaries.
>
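
A minimal sketch of the boundary idea quoted above, not Matt's actual experiment. It assumes character-level statistics and uses the pointwise mutual information (PMI) of each adjacent character pair as a stand-in for "mutual information across boundaries". The names `boundary_scores` and `segment` and the default `threshold` are hypothetical choices for illustration.

```python
import math
from collections import Counter

def boundary_scores(text):
    """Return the PMI (in bits) of each adjacent pair (text[i], text[i+1])."""
    unigrams = Counter(text)
    bigrams = Counter(zip(text, text[1:]))
    n_uni = len(text)
    n_bi = len(text) - 1
    scores = []
    for a, b in zip(text, text[1:]):
        p_ab = bigrams[(a, b)] / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores.append(math.log2(p_ab / (p_a * p_b)))
    return scores

def segment(text, threshold=0.0):
    """Split text wherever the PMI across a gap falls below threshold (assumed cutoff)."""
    pieces, start = [], 0
    for i, score in enumerate(boundary_scores(text)):
        if score < threshold:
            pieces.append(text[start:i + 1])
            start = i + 1
    pieces.append(text[start:])
    return pieces

if __name__ == "__main__":
    sample = "thecatsatonthematthecatatethehat"
    print(segment(sample))
```

On a sample this short the estimates are noisy. A real experiment would need far more text and probably longer contexts than single characters.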


Possibly relevant: the "Variable-length tokens" sub-answer to a question I
posted on cs.stackexchange several years ago, titled "Finding a Simple
Distribution In a Binary String"
<https://cs.stackexchange.com/questions/68693/finding-a-simple-distribution-in-a-binary-string>.

That answer rather "cheats" by imposing an inductive bias at the outset, since
it says: "In particular, I suggest you identify a set of tokens t1,…,tk that
you're confident will be a superset of the ones in the real model, and then use
optimization methods to solve for the repeat-factors n1,…,nk that maximize the
likelihood of the model." That is, the candidate token set is supplied up front
rather than learned from the data.
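
A rough sketch of that kind of fit, under an assumption the linked answer does not confirm: the string is a concatenation of independent draws from the candidate tokens, and each repeat-factor acts as an unnormalized token weight. It uses expectation-maximization with a forward-backward pass as the "optimization method". The name `fit_token_weights` and its parameters are hypothetical.

```python
def fit_token_weights(s, tokens, iters=50):
    """EM estimate of token probabilities for a non-empty string s.

    Assumes (for illustration only) that s is a concatenation of
    independent draws from the non-empty candidate strings in `tokens`.
    """
    if not s:
        raise ValueError("string must be non-empty")
    n = len(s)
    p = {t: 1.0 / len(tokens) for t in tokens}
    for _ in range(iters):
        # alpha[j]: total probability of all tokenizations of s[:j]
        alpha = [1.0] + [0.0] * n
        for j in range(1, n + 1):
            for t in tokens:
                if j >= len(t) and s.startswith(t, j - len(t)):
                    alpha[j] += alpha[j - len(t)] * p[t]
        if alpha[n] == 0.0:
            raise ValueError("string cannot be tokenized with the candidate set")
        # beta[i]: total probability of all tokenizations of s[i:]
        beta = [0.0] * n + [1.0]
        for i in range(n - 1, -1, -1):
            for t in tokens:
                if s.startswith(t, i):
                    beta[i] += p[t] * beta[i + len(t)]
        # Expected number of times each token is used, given s
        counts = {t: 0.0 for t in tokens}
        for i in range(n):
            for t in tokens:
                if s.startswith(t, i):
                    counts[t] += alpha[i] * p[t] * beta[i + len(t)] / alpha[n]
        total = sum(counts.values())
        p = {t: c / total for t, c in counts.items()}
    return p

if __name__ == "__main__":
    print(fit_token_weights("0110110011", ["0", "1", "011"]))
```

Tokens in the superset that the data does not support should end up with weights near zero. The plain floating-point products will underflow on long strings, so a serious version would work in log space.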
------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/Tdc371ce11a040352-M3207e3dda49b8b5104888b19
Delivery options: https://agi.topicbox.com/groups/agi/subscription
