"Purple numbers" are a way to let people link to every paragraph of
every web page you publish, by adding unobtrusive short serial numbers
to every newly created paragraph.  Chris Dent and Eugene Eric Kim
invented them (I think), and they've been implemented in IRC-logging
software, Wikis, and blogging software.  In PurpleWiki, you can even
transclude paragraphs from arbitrary Wiki pages by purple number, so
that edits to those other Wiki pages are immediately reflected in the
page that transcludes them.

But purple numbers only solve the problem for sites you control.  I'd
often like to link to, or transclude, some paragraph in a page I don't
control.  How can I do this, without the site owner providing any
stable naming scheme for paragraphs?

A few years ago, some folks [find reference!] came up with a "lexical
signature" scheme for persistent naming of web pages.  They picked the
five nonunique words from the page with the highest TF/IDF scores,
using the entire Web as the corpus of reference for these scores.
Then they could use a full-text search engine such as AltaVista or
Google to find the document if its author irresponsibly allowed its
URL to become invalid.  (Unique words were excluded because they were
usually misspellings.)
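
Here's a rough Python sketch of how such a signature might be
computed.  The doc_freq table and num_docs are stand-ins for
Web-scale corpus statistics I don't actually have:

    import math
    import re
    from collections import Counter

    def lexical_signature(page_text, doc_freq, num_docs, k=5):
        # doc_freq: word -> number of corpus documents containing it
        # (a stand-in for statistics over the entire Web).
        tf = Counter(re.findall(r"[a-z']+", page_text.lower()))
        # Exclude unique words, which are usually misspellings.
        nonunique = [w for w in tf if doc_freq.get(w, 0) > 1]
        def tfidf(w):
            return tf[w] * math.log(num_docs / doc_freq[w])
        return sorted(nonunique, key=tfidf, reverse=True)[:k]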

Their scheme appended the lexical signature as query terms to the URL,
as in "?lexical_signature=purple+wiki+transclude+lexical+idf".  The
idea was to modify web browsers to automatically indirect through a
search engine for these links.

Clearly you could do the same thing for paragraphs with a fragment
identifier:
"#lexical_signature=few+corpus+reference+altavista+google".  But that
seems like an unnecessarily large fragment identifier for a
closed-world problem like distinguishing one paragraph from another in
a single web page.

So I suggest the following instead.  Start with a bit-vector of
zeroes; take the paragraph's terms in descending order of TF/IDF
(calculating scores relative to this document as the corpus), hash
each term to a bit index in your bit-vector, and set that bit to 1,
stopping once the bit-vector is half full.  To dereference a link
expressed as such a bit-vector, calculate the same bit-vector for
each paragraph in the document, and take the paragraph whose
bit-vector has the smallest Hamming distance from the one in the
fragment ID.
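
A minimal sketch of both halves in Python.  The choice of SHA-1 as
the term hash is my own assumption (any stable hash would do), and
each paragraph is represented by its terms in descending TF/IDF
order:

    import hashlib

    def paragraph_signature(ranked_words, nbits=12):
        # Set bits indexed by hashing each term, taking terms in
        # descending TF/IDF order, until half the bits are 1.
        bits = 0
        for word in ranked_words:
            if bin(bits).count("1") >= nbits // 2:
                break
            h = int(hashlib.sha1(word.encode()).hexdigest(), 16)
            bits |= 1 << (h % nbits)
        return bits

    def closest_paragraph(fragment_bits, paragraphs, nbits=12):
        # paragraphs: a list of ranked-word lists, one per paragraph.
        # Return the index of the paragraph whose signature has the
        # smallest Hamming distance from the fragment's bit-vector.
        def hamming(a, b):
            return bin(a ^ b).count("1")
        return min(range(len(paragraphs)),
                   key=lambda i: hamming(
                       fragment_bits,
                       paragraph_signature(paragraphs[i], nbits)))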

You shouldn't need very many bits in your bit-vector.  I think 12 or
18 bits should be adequate.  You can encode these bit-vectors in
base64 in two or three characters, perhaps with some prepended special
character to prevent ordinary fragment IDs from being misinterpreted
in this way.  I suggest ".", since it's not a valid SGML NAME
character, but it's easy to type.
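
With twelve bits that's two base64 digits of six bits each; with
eighteen, three.  A sketch (the exact base64 alphabet here is an
assumption):

    B64 = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
           "abcdefghijklmnopqrstuvwxyz0123456789+/")

    def encode_fragment(bits, nbits=12):
        # Six bits per character, most significant group first,
        # prefixed with "." to mark it as one of these fragment IDs.
        return "." + "".join(B64[(bits >> shift) & 0x3f]
                             for shift in range(nbits - 6, -1, -6))

    def decode_fragment(frag):
        bits = 0
        for c in frag.lstrip("."):
            bits = (bits << 6) | B64.index(c)
        return bits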

So now you'll have links that look like "foo.html#.2c" or
"walnuts.html#.X/.".  I propose calling these identifiers "queer
numbers", because they're a lot like "purple numbers", but they're
much more aggressive and "in your face", willing to go everywhere,
even where they're not welcome and not invited, all in the cause of
improving discourse and referenceability.

***

As a sample, here are the top TF/IDF words from the paragraphs in an
earlier version of this document:
people number adding serial implemented are by been edits other
-> are by them every software wiki pages even way that
control often don't site owner any sites how only without
-> control paragraphs naming do problem some scheme transclude link i
irresponsibly text years picked allowed find if use unique they
-> find they its words were reference scores altavista google because
idea browsers through query indirect their appended was modify automatically
-> lexical links engine terms lexical_signature search url signature wiki as
an single another clearly seems distinguishing closed world unnecessarily large
-> identifier google paragraphs same few do fragment altavista lexical_signature could
is relative expressed hash index calculate make id 1 smallest
-> each vector bit one document suggest same tf full terms
prevent since being two need special perhaps base64 character three
-> character it's bits suggest way vectors or some your not
all have queer go aggressive calling much discourse html lot
-> html they're not and because links even numbers like your

The lines preceded by '->' are the same as the previous lines, but
contain only nonunique words.  (Generally, unique words have high
TF/IDF!)  As you would expect from TF/IDF, the lists seem to have
done a pretty good job of excluding words that appear in other '->'
lines.

This brings up an interesting point: you might get better Hamming
distances with bit-vectors less than half full, since the first few
words are likely to be more significant than later ones.  You could
also use this feature for a richer scoring function than mere
Hamming distance: a set bit that corresponds to the paragraph's #1
word should perhaps count more than one for its #2 word, and so on.
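
A sketch of such a scoring function, reusing the hashing above.  The
1/(rank+1) weights are an arbitrary assumption; any decreasing
series would do:

    import hashlib

    def weighted_score(fragment_bits, ranked_words, nbits=12):
        # Count a match on the paragraph's #1 word more heavily than
        # one on its #2 word, and so on; higher scores are better.
        score = 0.0
        for rank, word in enumerate(ranked_words):
            h = int(hashlib.sha1(word.encode()).hexdigest(), 16)
            if fragment_bits & (1 << (h % nbits)):
                score += 1.0 / (rank + 1)
        return score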

Here's the same computation run on a more recent version of this
document (a version that happened to be 60% larger):
eric irc pragraphs eugene let unobtrusive reflected wikis short kim
-> every them software people number adding serial implemented been pages
stable i'd solve providing control often don't site owner any
-> control often don't site owner any sites how without naming
invalid excluded folks then persistent five ago entire using author
-> its were irresponsibly reference text years picked allowed find if
lexical idea browsers through query indirect their appended was modify
-> lexical idea browsers through query indirect their appended was modify
thing identifier single another clearly seems distinguishing closed world unnecessarily
-> identifier single another clearly seems distinguishing closed world unnecessarily large
take instead setting zeroes calculating dereference until turn following original
-> vector each is one relative expressed hash index calculate make
valid shouldn't encode adequate many type misinterpreted 18 be easy
-> it's prevent being two need special base64 character three should
foo cause referenceability propose welcome walnuts willing everywhere improving now
-> they're all queer go aggressive calling much discourse html lot

-> 
here top sample earlier paragraphs all don't being text years
-> paragraphs all don't being text years browsers through queer go
preceded high done seem previous would generally expect pretty good
-> ' lines nonunique contain have only other tf unique idf
distances set point merely corresponds likely scoring better interesting feature
-> than word hamming should more half distance up contain use
on run computation here's recent version more same document this
-> version more same document this of a the

From a cursory inspection, it appears that the word lists remained
relatively stable through the 60% enlargement of the document ---
stable enough, anyway, that they would mostly retain linkage.
However, repeating the experiment again was not terribly promising,
although I suspect that this document (to which I have appended lists
of unique words from its previous versions) may be a particularly
tough nut to crack.
