Matt wrote:
> I am doing experiments on learning the rules for tokenization. Back in
> 2000 I experimented in finding word boundaries in text without spaces.
> These occur where there is low mutual information across boundaries.
> 
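
For concreteness, here is a minimal sketch of the low-mutual-information idea. It is an assumption on my part, not a reconstruction of the 2000 experiments: it scores each adjacent character pair by pointwise mutual information (PMI) estimated from the text itself, and cuts wherever the score falls below a threshold. The names `boundary_scores` and `segment` and the `threshold` parameter are hypothetical.

```python
import math
from collections import Counter


def boundary_scores(text):
    """Return the PMI (in bits) of every adjacent character pair in text."""
    unigrams = Counter(text)
    bigrams = Counter(zip(text, text[1:]))
    n_uni = len(text)
    n_bi = len(text) - 1
    scores = []
    for a, b in zip(text, text[1:]):
        p_ab = bigrams[(a, b)] / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        # Low PMI: the two characters co-occur little more than chance predicts.
        scores.append(math.log2(p_ab / (p_a * p_b)))
    return scores


def segment(text, threshold=0.0):
    """Split text after every position whose pair PMI is below threshold."""
    pieces, start = [], 0
    for i, score in enumerate(boundary_scores(text)):
        if score < threshold:
            pieces.append(text[start:i + 1])
            start = i + 1
    pieces.append(text[start:])
    return pieces
```

On the toy string "ababcdcdababcdcd", `segment(text, threshold=1.5)` cuts it into "ab" and "cd" pieces, because the pairs "ab" and "cd" score about 2.1 bits while every other pair scores about 1.1 bits or less. Real text would need longer contexts than single characters and a threshold chosen from the data.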

Possibly relevant is the sub-answer "Variable-length tokens" given to a 
question I posed several years ago on cs.stackexchange, titled "Finding a 
Simple Distribution In a Binary String": 
<https://cs.stackexchange.com/questions/68693/finding-a-simple-distribution-in-a-binary-string>.

That answer rather "cheats" by imposing an inductive bias at the outset, 
since it says: "In particular, I suggest you identify a set of tokens t1,…,tk 
that you're confident will be a superset of the ones in the real model, and 
then use optimization methods to solve for the repeat-factors n1,…,nk that 
maximize the likelihood of the model."
------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/Tdc371ce11a040352-M3207e3dda49b8b5104888b19