Matt wrote:
> I am doing experiments on learning the rules for tokenization. Back in
> 2000 I experimented in finding word boundaries in text without spaces.
> These occur where there is low mutual information across boundaries.
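To make the quoted idea concrete, here is a minimal sketch of splitting unspaced text where the mutual information across a gap is low. It is not the method from the 2000 experiments, whose details are not given here. It assumes pointwise mutual information (PMI) between adjacent characters, estimated from the text itself, as a stand-in for "mutual information across boundaries". The names adjacent_pmi and segment and the threshold parameter are hypothetical.

```python
import math
from collections import Counter


def adjacent_pmi(text):
    """Return the PMI (in bits) of each adjacent character pair in text.

    Assumption: probabilities are plain relative frequencies estimated
    from the text itself, with no smoothing.
    """
    unigrams = Counter(text)
    bigrams = Counter(zip(text, text[1:]))
    n_uni = len(text)
    n_bi = len(text) - 1
    scores = []
    for a, b in zip(text, text[1:]):
        p_ab = bigrams[(a, b)] / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores.append(math.log2(p_ab / (p_a * p_b)))
    return scores


def segment(text, threshold=0.0):
    """Split text at every gap whose PMI falls below threshold.

    The threshold is a hypothetical tuning knob, not a value taken
    from the thread.
    """
    if len(text) < 2:
        return [text] if text else []
    tokens = []
    start = 0
    for i, score in enumerate(adjacent_pmi(text)):
        if score < threshold:
            # Low PMI between text[i] and text[i + 1]: place a boundary.
            tokens.append(text[start:i + 1])
            start = i + 1
    tokens.append(text[start:])
    return tokens
```

A call such as segment(unspaced_text, threshold=0.5) returns a list of candidate tokens whose concatenation is the original text. Character bigrams are a crude estimate; longer contexts on each side of the gap would presumably find better boundaries.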
Possibly relevant is the "Variable-length tokens" part of an answer to a question I posted several years ago on cs.stackexchange, titled "Finding a Simple Distribution In a Binary String" <https://cs.stackexchange.com/questions/68693/finding-a-simple-distribution-in-a-binary-string>. That answer rather "cheats" by imposing an inductive bias at the outset when it says:

"In particular, I suggest you identify a set of tokens t1,…,tk that you're confident will be a superset of the ones in the real model, and then use optimization methods to solve for the repeat-factors n1,…,nk that maximize the likelihood of the model."

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/Tdc371ce11a040352-M3207e3dda49b8b5104888b19