In my kragen-tol post earlier this month about [predictive text input methods][0], I wrote:
> You could do *much better* than SwiftKey with dimensionality reduction
> techniques, which could allow much more context to be brought to bear
> on the prediction problem using a much smaller model. SwiftKey often
> makes predictions that yield clearly ungrammatical 3-grams, which even
> a simple 3-gram model over part-of-speech tags ought to be able to
> reject.

[0]: http://lists.canonical.org/pipermail/kragen-tol/2012-July/000961.html

But part-of-speech tagging is sort of hard, especially on incomplete
and possibly erroneous sentences containing words that aren't in the
dictionary. Still, it seems like you ought to be able to do a good
enough job with simple statistics to be useful.

Now, the rest of this post may be a case of "a few hours of hacking can
save you several minutes in the library", so take that under
advisement. I haven't checked out the literature enough to know if this
idea is already known, let alone an improvement over what already
exists.

One simple approach is to consider a Markov-chain model of text and
cluster together words whose transition probabilities are similar,
either for in-transitions (what they tend to follow) or out-transitions
(what they tend to precede). However, this runs into a practical
problem immediately: even with a first-order Markov model, you're
trying to perform unsupervised classification on thousands of vectors
in a space with thousands of dimensions. This is, as far as I know, a
totally intractable problem, because in high-dimensional spaces, almost
all points are closer to the boundaries of the space than they are to
any other point. So you need to reduce the dimensionality of the term
space before you can reduce the dimensionality of the term space.
Rather than eat my tail through that rathole, I thought maybe I'd pick
some kind of arbitrary characteristic of those vectors to cluster them
by.
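Concretely, the vectors in question are just rows of a word-by-word
transition-count table. A toy sketch (the names here are mine, purely
for illustration):

```python
from collections import Counter, defaultdict

def out_transitions(words):
    # Map each word to a count of the words that immediately follow it.
    # Each value is one row of the vocabulary-by-vocabulary transition
    # table; with roughly 14,000 distinct words, each vector has
    # roughly 14,000 dimensions, which is where the trouble starts.
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

toy = "the cat sat on the mat and the dog sat on the rug".split()
vectors = out_transitions(toy)
# vectors['the'] counts the followers {cat: 1, mat: 1, dog: 1, rug: 1}.
```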
The vectors are finite discrete probability distributions of preceding
or following words, so given a mapping from the words onto the real
line, you can use the usual set of characteristics of finite discrete
probability distributions, such as mean, median, mode, standard
deviation, skewness, kurtosis, and so on. Of these, the median is
dependent only on the ordering of the words, and the mode isn't even
dependent on the ordering. So if we cluster these distributions by
their median, an idea due to a friend of mine who should get credit if
he wants it, we should be able to easily find words that are used in
similar contexts.

So, grouping the words into clusters that tend to precede the same
word, using the KJV bible for my example text, I got reasonable-looking
clusters like the following. First, a cluster of mostly verbs, mostly
naughty ones, that precede "not", although interestingly "Is", "am",
and "ye" made it in there too:

not: Accuse, Be, Boast, Cast, Cease, Could, Count, Curse, Deceive,
Defile, Desire, Despise, Destroy, Devise, Did, Didst, Distress, Do,
Doth, Dread, Eat, Enter, Fear, Forget, Fret, Give, Grudge, Harden,
Hide, Hurt, Is, Judge, Labour, Lay, Lodge, Lust, Marvel, Meddle, Mind,
Murmur, Neglect, Quench, Regard, Rejoice, Rejoiceth, Remove, Reprove,
Respect, Rob, Slack, Think, Told, Touch, Trust, Uncover, Was, Weep,
Went, Withhold, agreed, altereth, am, appertaineth, approveth, are,
attained, backbiteth, bearest, believed, believeth, bewray, brawler,
bridleth, buildedst, burdened, canst, casteth, ceased, circumspectly,
confesseth, consent, considerest, consisteth, continueth, could,
couldest, defer, deferred, defile, defileth, delightest, descendeth,
dieth, diggedst, dishonesty, displease, do, does, doeth, doubletongued,
doubt, durst, edify, envieth, epistle, extendeth, falleth, feared,
filledst, followedst, forbid, fret, gathereth, gnaw, grieve, harden,
heareth, hearkenedst, hide, housetop, kept, knew, knewest, knowest,
lacketh, lady, laid, layedst, layeth,
let, lingereth, needed, needest, needeth, obey, obeyed, obeyedst,
obeyeth, observest, oppress, patient, payeth, perceivest, perfection,
prefer, profane, purifieth, rained, receive, regard, regarded,
regardest, regardeth, remained, repented, respected, respecteth,
return, returneth, reviled, sacrificeth, savourest, saw, seek,
servedst, shalt, shone, slack, slumbereth, sobriety, sowest, spared,
spin, staggered, striker, strive, stumbleth, suffereth, swelled,
tarrieth, tarry, they, toil, tortured, touch, travailest, trow,
understand, understood, upbraideth, vaunteth, wasted, wax, wipe, wist,
withdraweth, withheldest, wot, wotteth, wouldest, wrestle, ye

And second, a cluster of mostly proper names that precede capitalized
"And":

And: Abez, Abiezrites, Abihud, Abishalom, Achshaph, Adadah, Ader,
Ahlai, Ahoah, Alamoth, Allonbachuth, Ammonitess, Anaharath, Aniam,
Anim, Antothijah, Aphekah, Ara, Ardon, Aridatha, Armageddon, Asareel,
Ashbea, Ashima, Asiel, Aspatha, Athach, Athlai, Attalia, Avith, Azem,
Azzan, Bealoth, Beera, Berith, Bethbaalmeon, Bethdiblathaim,
Bethgader, Bethmeon, Bethpalet, Bethpazzez, Bethphelet, Birzavith,
Boscath, Camon, Canaanitess, Caphthorim, Caphtorim, Chelubai, Chislon,
Chronicles, Dalmanutha, Diklah, Dinhabah, Dothan, EARTH, Edar, Eker,
EleloheIsrael, Elpalet, Enhazor, Ephesdammim, Ephod, Eshbaal, Eshean,
Ethnan, Euroclydon, Ezel, Gaash, Gabbatha, Galeed, Galilaean, Gazer,
Gennesaret, Gilboa, Girgashite, Girgasite, Goath, Hamongog, Hamul,
Hararite, Harum, Hasenuah, Hathath, Hazarsusah, Hazelelponi,
Hazezontamar, Helekites, Hepherites, Hormah, Huphamites, Ibnijah,
Imla, Irshemesh, Ishmeelite, Ivah, Jaasau, Jagur, Japhia, Japho, ...
Tyrannus, Urias, Uzzensherah, Zacher, Zaham, Zarthan, Zered, Zithri,
Zorathites, Zorites, abolish, alloweth, amen, amethyst, badness,
bakers, bat, begging, casement, chamois, clearness, conquer, consist,
crew, cupbearer, cymbal, delicacies, dropsy, dryshod, effected,
excused, fishhooks, flatteries, foals, foreheads, hungered,
inclosings, inheriteth, manger, marketplaces, meadow, meant,
mightiest, mouldy, mowings, murrain, occurrent, offenders, ospray,
overspread, perfectness, perpetually, pollution, pounds, pricks,
recorder, repenting, rigour, rushes, shrubs, size, socket, sorrowing,
stanched, target, tens, tentmakers, therefrom, traitor, vails, woods,
wretchedness

Unsurprisingly, the "begat" cluster is entirely proper names:

begat: Abia, Abishua, Abiud, Achaz, Achim, Amariah, Aminadab,
Amminadab, Attai, Azor, Bukki, Canaan, Coz, Eliud, Ephlal, Eshton,
Esrom, Ezekias, Hezron, Irad, Jahath, Jarah, Jechonias, Jehoadah,
Jekamiah, Joatham, Joiada, Josaphat, Josias, Kish, Manasses, Matthan,
Mehujael, Meonothai, Meraioth, Meribbaal, Methusael, Mikloth, Mizraim,
Ner, Obed, Ozias, Phares, Pharez, Ram, Roboam, Sadoc, Salma, Salmon,
Shaharaim, Shema, Shobal, Sisamai, Zabad, Zerahiah, Zorobabel

All in all, the 14000 or so distinct words get clustered by this
approach into 1793 clusters, of which 1185 have only one word.
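To be explicit about the characteristic all of these clusters are
built on: sort a word's list of followers and take the middle element.
A minimal sketch (the function name is mine, for illustration):

```python
def median_follower_of(followers):
    # The median under an alphabetical mapping of words onto the real
    # line: sort the follower tokens and take the middle one.  Note
    # that it depends only on the ordering of the words, not on any
    # distances between them.
    s = sorted(followers)
    return s[len(s) // 2]

# e.g. a word followed twice by "shalt" and once by "art":
# median_follower_of(['shalt', 'art', 'shalt']) == 'shalt'
```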
Many of the smaller clusters very clearly express some kind of affinity
between words:

support: feebleminded, financial, volunteer
form: hypertext, proprietary, readable
falsely: science, swearing, you
Where: Puteoli, Telassar, Thelasar
brother: weak, younger, youngest
appear: flowers, menchildren, plainly
honour: husbands, retaineth, without
hither: Hasten, Reach, description
Duke: Jetheth, Mibzar, Pinon
branches: Took, fruitful, natural
She: Tirhanah, distaff, tributary
besides: Stephanas, loins, self
knoweth: Man, certainly, unjust
Till: intermission, purity, stocks
slept: Jehoiakim, Menahem, Omri
died: Achbor, Chilion, Jether, Pirathonite, Seled, Terah
When: Hareth, discretion, guile, patrimony, thereat, unequal
F: DAMAGE, PARAGRAPH, PURPOSE, below, problem, provisions
such: Defects, lettest, marvels, perhaps, pigeons, saveth
year: eighteenth, eightieth, fiftieth, fortieth, nineteenth, thirtieth
cities: Berothai, Chun, defenced, fenced, ruined, thirteen
children: Little, Zebedees, backsliding, beget, impudent, sottish
two: Asuppim, Telaim, Teresh, calamus, containing, spearmen
So: corpses, debt, eater, freewoman, peacocks, saying
none: So, burned, finding, put, quarter, smiths
might: I, mortality, mother, purpose, scripture, whereto
bare: Adah, Aramitess, Bashemath, Hammoleketh, Jehudijah, Milcah
OR: DISTRIBUTE, EXPRESS, MERCHANTIBILITY, PUNITIVE, REPLACEMENT, WARRANTY
some: Arm, oppressed, save, sixtyfold, sowed, thirtyfold

It seems like this kind of clustering might be a useful
dimensionality-reduction technique for natural-language processing in
general; and perhaps it could be iterated to reduce the numbers
further. For example, when applying Katz smoothing to a language model,
as an alternative to backing off to shorter N-grams when you don't have
enough N-grams for a given value of N, you could back off to N-grams
computed over a vocabulary of these clusters.
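That backoff order might look something like the following sketch.
This is only the lookup skeleton, with Katz's discounting omitted and
all the names (backoff_estimate, cluster_of, the probability tables)
invented for illustration:

```python
def backoff_estimate(context, word, word_probs, cluster_probs, cluster_of):
    # 1. Use the word-level N-gram estimate if we have one.
    key = context + (word,)
    if key in word_probs:
        return word_probs[key]
    # 2. Otherwise try the same N-gram over cluster labels, instead of
    #    going straight to a shorter context.
    ckey = tuple(cluster_of[w] for w in key)
    if ckey in cluster_probs:
        return cluster_probs[ckey]
    # 3. Finally back off to a shorter context, as usual.
    if context:
        return backoff_estimate(context[1:], word, word_probs,
                                cluster_probs, cluster_of)
    return 0.0
```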
Efficient computation
---------------------

An interesting thing about this follower-median characteristic is that
it can be computed efficiently from a suffix array. Although I'm not
familiar with the performance characteristics of modern suffix-array
construction algorithms (I only know that they're linear-time), I
suspect that computing the suffix array is, in practice, more expensive
than computing the pdfs of the contexts using a hash table and then
computing their cdfs in order to find the median, which I seem to
recall is what you were doing; but if you're already paying the cost of
building a suffix array for some reason, this may be useful. For
example, I'm thinking about using a suffix array for Katz smoothing of
a language model.

To find the distribution median of followers for a single context:

1. Find the beginning B and end E of that context's span in the suffix
   array --- that is, the first index B such that
   text[SA[B]:].startswith(context), and the last index E meeting the
   same criterion. This is O(lg len(text)).

2. Find the median index M = (B+E)/2, or under some circumstances,
   B+(E-B)/2.

3. Find text[SA[M]+len(context)], and that's your median follower
   character.

Step 3 is slightly more complicated if you want something more than a
single character, but not much; this is an advantage of the
suffix-array structure.

Finding the distribution medians for all M contexts this way is
O(M lg len(text)), which might be less than O(len(text)). It's very
likely faster than computing the cdfs, which is 256*M work, albeit
sequential. There's a much simpler linear-time algorithm to get them
all at once, of course: read through the suffix array in order,
remembering the last boundary between contexts.

Simple implementation
---------------------

This is the code I used to get the results above.
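For contrast with the simple implementation, the three suffix-array
steps above can be sketched as follows, with a deliberately naive
suffix-array construction (a real one would be linear-time) and
single-character followers only:

```python
def build_suffix_array(text):
    # Naive O(n^2 log n) construction, for illustration only; real
    # implementations use linear-time algorithms.
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_span(text, sa, context):
    # Step 1: binary-search for the half-open span [begin, end) of
    # suffix-array entries whose suffixes start with context.
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose prefix is >= context
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(context)] < context:
            lo = mid + 1
        else:
            hi = mid
    begin, hi = lo, len(sa)
    while lo < hi:  # first suffix whose prefix is > context
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(context)] <= context:
            lo = mid + 1
        else:
            hi = mid
    return begin, lo

def median_follower(text, sa, context):
    begin, end = find_span(text, sa, context)
    if begin == end:
        return None                      # context never occurs
    m = begin + (end - begin) // 2       # step 2: the median index
    pos = sa[m] + len(context)           # step 3: the character after it
    return text[pos] if pos < len(text) else None
```

With text = 'banana', median_follower(text, build_suffix_array(text),
'an') finds the span [1, 3), takes median index 2, and returns 'a'.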
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import re
import textwrap

def words(file, pattern='[a-zA-Z]+'):
    return (word for line in file for word in re.findall(pattern, line))

def ngrams(n, words):
    queue = []
    for word in words:
        queue.append(word)
        if len(queue) == n:
            yield tuple(queue)  # Lists aren’t hashable.
            queue.pop(0)

def multimap(kvpairs):
    rv = {}
    for key, value in kvpairs:
        if key not in rv:
            rv[key] = []
        rv[key].append(value)
    return rv

following = multimap(ngrams(2, words(open(sys.argv[1]))))
clusters = multimap((sorted(next_words)[len(next_words)//2], prev_word)
                    for prev_word, next_words in following.iteritems())
items = sorted(clusters.iteritems(),
               key=lambda (name, contents): len(contents))
for next_word, prev_words in items:
    for line in textwrap.wrap(', '.join(sorted(prev_words)),
                              initial_indent=next_word+': ',
                              subsequent_indent=' '):
        print line

-- 
To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-tol