"Purple numbers" are a way to let people link to every paragraph of every web page you publish, by adding unobtrusive short serial numbers to every newly created paragraph. Chris Dent and Eugene Eric Kim invented them (I think), and they've been implemented in IRC-logging software, Wikis, and blogging software. In PurpleWiki, you can even transclude pragraphs from arbitrary Wiki pages by purple number, so that edits to those other Wiki pages are immediately reflected in the page that transcludes them.
But purple numbers only solve the problem for sites you control. I'd often like to link to, or transclude, some paragraph in a page I don't control. How can I do this without the site owner providing any stable naming scheme for paragraphs?

A few years ago, some folks (Phelps and Wilensky, if I remember correctly) came up with a "lexical signature" scheme for persistent naming of web pages. They picked the five nonunique words from the page with the highest TF/IDF scores, using the entire Web as the corpus of reference for these scores. Then they could use a full-text search engine such as AltaVista or Google to find the document if its author irresponsibly allowed its URL to become invalid. (Unique words were excluded because they were usually misspellings.)

Their scheme appended the lexical signature as query terms to the URL, as in "?lexical_signature=purple+wiki+transclude+lexical+idf". The idea was to modify web browsers to automatically indirect through a search engine for these links.

Clearly you could do the same thing for paragraphs with a fragment identifier: "#lexical_signature=few+corpus+reference+altavista+google". But that seems like an unnecessarily large fragment identifier for a closed-world problem like distinguishing one paragraph from another in a single web page.

So I suggest the following instead. Make a bit-vector of zeroes; hash the paragraph's terms, in decreasing order of TF/IDF (calculating scores relative to this document as the corpus), to index bits in your bit-vector, setting each indexed bit to 1 in turn, until the bit-vector is half full. To dereference a link expressed as such a bit-vector, calculate these same bit-vectors for each paragraph in the document, and take the one with the smallest Hamming distance from the original fragment ID.

You shouldn't need very many bits in your bit-vector; I think 12 or 18 bits should be adequate. You can encode these bit-vectors in base64 in two or three characters, perhaps with some prepended special character to prevent ordinary fragment IDs from being misinterpreted in this way. I suggest ".", since an SGML NAME can't begin with it, but it's easy to type.

So now you'll have links that look like "foo.html#.2c" or "walnuts.html#.X/.". I propose calling these identifiers "queer numbers", because they're a lot like "purple numbers", but they're much more aggressive and "in your face", willing to go everywhere, even where they're not welcome and not invited, all in the cause of improving discourse and referenceability.
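Here's a rough sketch of the whole scheme in Python, to make it concrete. The tokenizer, the exact TF/IDF formula, and the use of CRC-32 as the hash are all arbitrary choices on my part; any two implementations would have to standardize on them, since both ends of a link must compute identical bit-vectors.

    import math
    import re
    import zlib

    BITS = 12                # 12 or 18 bits should be adequate
    B64 = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
           "abcdefghijklmnopqrstuvwxyz0123456789+/")

    def paragraphs(text):
        """Split a document into paragraphs at blank lines."""
        return [p for p in re.split(r'\n\s*\n', text) if p.strip()]

    def words(paragraph):
        return re.findall(r"[a-z0-9_']+", paragraph.lower())

    def ranked_terms(paragraph, paras):
        """The paragraph's words in decreasing TF/IDF order, scored
        relative to this document as the corpus; paragraph must be one
        of paras.  Words unique to a single paragraph are excluded."""
        df = {}
        for p in paras:
            for w in set(words(p)):
                df[w] = df.get(w, 0) + 1
        tf = {}
        for w in words(paragraph):
            tf[w] = tf.get(w, 0) + 1
        scored = sorted(((tf[w] * math.log(len(paras) / df[w]), w)
                         for w in tf if df[w] > 1), reverse=True)
        return [w for _, w in scored]

    def queer_vector(paragraph, paras):
        """Hash top terms to bit indices until the vector is half full."""
        vec = 0
        for w in ranked_terms(paragraph, paras):
            if bin(vec).count('1') >= BITS // 2:
                break
            vec |= 1 << (zlib.crc32(w.encode()) % BITS)
        return vec

    def encode(vec):
        """Encode a 12-bit vector as "." plus two base64 characters."""
        return '.' + B64[(vec >> 6) & 63] + B64[vec & 63]

    def decode(fragment):
        vec = 0
        for c in fragment.lstrip('.'):
            vec = (vec << 6) | B64.index(c)
        return vec

    def dereference(fragment, text):
        """Find the paragraph whose bit-vector is nearest the
        fragment's, by Hamming distance."""
        paras = paragraphs(text)
        target = decode(fragment)
        return min(paras, key=lambda p:
                   bin(queer_vector(p, paras) ^ target).count('1'))

So, assuming you had a local copy of the hypothetical walnuts.html, you could do something like:

    doc = open('walnuts.html').read()
    paras = paragraphs(doc)
    link = encode(queer_vector(paras[3], paras))    # e.g. ".2c"
    print(dereference(link, doc) == paras[3])       # True, barring collisions

Note that the links are deliberately fuzzy: even if an edit or a hash collision flips a bit or two, the nearest paragraph by Hamming distance is still likely to be the right one.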
*** As a sample, here are the top TF/IDF words from the paragraphs in an earlier version of this document:

people number adding serial implemented are by been edits other
-> are by them every software wiki pages even way that
control often don't site owner any sites how only without
-> control paragraphs naming do problem some scheme transclude link i
irresponsibly text years picked allowed find if use unique they
-> find they its words were reference scores altavista google because
idea browsers through query indirect their appended was modify automatically
-> lexical links engine terms lexical_signature search url signature wiki as
an single another clearly seems distinguishing closed world unnecessarily large
-> identifier google paragraphs same few do fragment altavista lexical_signature could
is relative expressed hash index calculate make id 1 smallest
-> each vector bit one document suggest same tf full terms
prevent since being two need special perhaps base64 character three
-> character it's bits suggest way vectors or some your not
all have queer go aggressive calling much discourse html lot
-> html they're not and because links even numbers like your

The lines preceded by '->' are the same as the previous lines, but contain only nonunique words. (Generally, unique words have high TF/IDF!) As you would expect from TF/IDF, the lists seem to have done a pretty good job of excluding words that appear in other '->' lines. This brings up an interesting point: you might get better Hamming distances with bit-vectors less than half full, since the first few words are likely to be more significant than later ones, and you could actually use this feature to build a richer scoring function than mere Hamming distance: a set bit corresponding to a paragraph's #1 word should perhaps count for more than one corresponding to its #2 word, and so on.
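Here's what that richer scoring function might look like, reusing the helpers from the sketch above. The 1/(rank + 1) weights are just one arbitrary choice:

    def weighted_score(target, paragraph, paras):
        """Weight each matching bit by the rank of the word that set it,
        so a paragraph's #1 word counts for more than its #2 word."""
        score, vec = 0.0, 0
        for rank, w in enumerate(ranked_terms(paragraph, paras)):
            if bin(vec).count('1') >= BITS // 2:
                break
            bit = 1 << (zlib.crc32(w.encode()) % BITS)
            vec |= bit
            if target & bit:
                score += 1.0 / (rank + 1)
        return score

    def dereference_weighted(fragment, text):
        paras = paragraphs(text)
        target = decode(fragment)
        return max(paras, key=lambda p: weighted_score(target, p, paras))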
Here's the same computation run on a more recent version of this document (a version that happened to be 60% larger):

eric irc pragraphs eugene let unobtrusive reflected wikis short kim
-> every them software people number adding serial implemented been pages
stable i'd solve providing control often don't site owner any
-> control often don't site owner any sites how without naming
invalid excluded folks then persistent five ago entire using author
-> its were irresponsibly reference text years picked allowed find if
lexical idea browsers through query indirect their appended was modify
-> lexical idea browsers through query indirect their appended was modify
thing identifier single another clearly seems distinguishing closed world unnecessarily
-> identifier single another clearly seems distinguishing closed world unnecessarily large
take instead setting zeroes calculating dereference until turn following original
-> vector each is one relative expressed hash index calculate make
valid shouldn't encode adequate many type misinterpreted 18 be easy
-> it's prevent being two need special base64 character three should
foo cause referenceability propose welcome walnuts willing everywhere improving now
-> they're all queer go aggressive calling much discourse html lot
here top sample earlier paragraphs all don't being text years
-> paragraphs all don't being text years browsers through queer go
preceded high done seem previous would generally expect pretty good
-> ' lines nonunique contain have only other tf unique idf
distances set point merely corresponds likely scoring better interesting feature
-> than word hamming should more half distance up contain use
on run computation here's recent version more same document this
-> version more same document this of a the

From a cursory inspection, it appears that the word lists remained relatively stable through the 60% enlargement of the document: stable enough, anyway, that they would mostly retain linkage. However, repeating the experiment again was not terribly promising, although I suspect that this document (to which I have appended lists of unique words from its previous versions) may be a particularly tough nut to crack.