In my kragen-tol post earlier this month about [predictive text input methods][0], I wrote:
> You could do *much better* than SwiftKey with dimensionality reduction
> techniques, which could allow much more context to be brought to bear
> on the prediction problem using a much smaller model. SwiftKey often
> makes predictions that yield clearly ungrammatical 3-grams, which even
> a simple 3-gram model over part-of-speech tags ought to be able to
> reject.

[0]: http://lists.canonical.org/pipermail/kragen-tol/2012-July/000961.html

But part-of-speech tagging is sort of hard, especially on incomplete
and possibly erroneous sentences containing words that aren't in the
dictionary. Still, it seems like you ought to be able to do a good
enough job with simple statistics to be useful.

Now, the rest of this post may be a case of "a few hours of hacking can
save you several minutes in the library", so take that under
advisement. I haven't checked out the literature enough to know if this
idea is already known, let alone an improvement over what already
exists.

One simple approach is to consider a Markov-chain model of text and
cluster together words whose transition probabilities are similar,
either for in-transitions (what they tend to follow) or out-transitions
(what they tend to precede). However, this runs into a practical
problem immediately: even with a first-order Markov model, you're
trying to perform unsupervised classification on thousands of vectors
in a space with thousands of dimensions. This is, as far as I know, a
totally intractable problem, because in high-dimensional spaces, almost
all points are closer to the boundaries of the space than they are to
any other point. So you need to reduce the dimensionality of the term
space before you can reduce the dimensionality of the term space.
Rather than eat my tail through that rathole, I thought maybe I'd pick
some kind of arbitrary characteristic of those vectors to cluster them
by.
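Concretely, the vectors in question are just rows of a word-by-word
transition-count table. A toy sketch (the names here are mine, purely
for illustration):

```python
from collections import Counter, defaultdict

def out_transitions(words):
    # Map each word to a count of the words that immediately follow it.
    # Each value is one row of the vocabulary-by-vocabulary transition
    # table; with roughly 14,000 distinct words, each vector has
    # roughly 14,000 dimensions, which is where the trouble starts.
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

toy = "the cat sat on the mat and the dog sat on the rug".split()
vectors = out_transitions(toy)
# vectors['the'] counts the followers {cat: 1, mat: 1, dog: 1, rug: 1}.
```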
The vectors are finite discrete probability distributions of preceding
or following words, so given a mapping from the words onto the real
line, you can use the usual set of characteristics of finite discrete
probability distributions, such as mean, median, mode, standard
deviation, skewness, kurtosis, and so on. Of these, the median is
dependent only on the ordering of the words, and the mode isn't even
dependent on the ordering. So if we cluster these distributions by
their median, an idea due to a friend of mine who should get credit if
he wants it, we should be able to easily find words that are used in
similar contexts.

So, grouping the words into clusters that tend to precede the same
word, using the KJV bible for my example text, I got reasonable-looking
clusters like the following. First, a cluster of mostly verbs, mostly
naughty ones, that precede "not", although interestingly "Is", "am",
and "ye" made it in there too:

not: Accuse, Be, Boast, Cast, Cease, Could, Count, Curse, Deceive,
Defile, Desire, Despise, Destroy, Devise, Did, Didst, Distress, Do,
Doth, Dread, Eat, Enter, Fear, Forget, Fret, Give, Grudge, Harden,
Hide, Hurt, Is, Judge, Labour, Lay, Lodge, Lust, Marvel, Meddle, Mind,
Murmur, Neglect, Quench, Regard, Rejoice, Rejoiceth, Remove, Reprove,
Respect, Rob, Slack, Think, Told, Touch, Trust, Uncover, Was, Weep,
Went, Withhold, agreed, altereth, am, appertaineth, approveth, are,
attained, backbiteth, bearest, believed, believeth, bewray, brawler,
bridleth, buildedst, burdened, canst, casteth, ceased, circumspectly,
confesseth, consent, considerest, consisteth, continueth, could,
couldest, defer, deferred, defile, defileth, delightest, descendeth,
dieth, diggedst, dishonesty, displease, do, does, doeth, doubletongued,
doubt, durst, edify, envieth, epistle, extendeth, falleth, feared,
filledst, followedst, forbid, fret, gathereth, gnaw, grieve, harden,
heareth, hearkenedst, hide, housetop, kept, knew, knewest, knowest,
lacketh, lady, laid, layedst, layeth,
let, lingereth, needed, needest, needeth, obey, obeyed, obeyedst,
obeyeth, observest, oppress, patient, payeth, perceivest, perfection,
prefer, profane, purifieth, rained, receive, regard, regarded,
regardest, regardeth, remained, repented, respected, respecteth,
return, returneth, reviled, sacrificeth, savourest, saw, seek,
servedst, shalt, shone, slack, slumbereth, sobriety, sowest, spared,
spin, staggered, striker, strive, stumbleth, suffereth, swelled,
tarrieth, tarry, they, toil, tortured, touch, travailest, trow,
understand, understood, upbraideth, vaunteth, wasted, wax, wipe, wist,
withdraweth, withheldest, wot, wotteth, wouldest, wrestle, ye

And second, a cluster of mostly proper names that precede capitalized
"And":

And: Abez, Abiezrites, Abihud, Abishalom, Achshaph, Adadah, Ader,
Ahlai, Ahoah, Alamoth, Allonbachuth, Ammonitess, Anaharath, Aniam,
Anim, Antothijah, Aphekah, Ara, Ardon, Aridatha, Armageddon, Asareel,
Ashbea, Ashima, Asiel, Aspatha, Athach, Athlai, Attalia, Avith, Azem,
Azzan, Bealoth, Beera, Berith, Bethbaalmeon, Bethdiblathaim,
Bethgader, Bethmeon, Bethpalet, Bethpazzez, Bethphelet, Birzavith,
Boscath, Camon, Canaanitess, Caphthorim, Caphtorim, Chelubai, Chislon,
Chronicles, Dalmanutha, Diklah, Dinhabah, Dothan, EARTH, Edar, Eker,
EleloheIsrael, Elpalet, Enhazor, Ephesdammim, Ephod, Eshbaal, Eshean,
Ethnan, Euroclydon, Ezel, Gaash, Gabbatha, Galeed, Galilaean, Gazer,
Gennesaret, Gilboa, Girgashite, Girgasite, Goath, Hamongog, Hamul,
Hararite, Harum, Hasenuah, Hathath, Hazarsusah, Hazelelponi,
Hazezontamar, Helekites, Hepherites, Hormah, Huphamites, Ibnijah,
Imla, Irshemesh, Ishmeelite, Ivah, Jaasau, Jagur, Japhia, Japho, ...
Tyrannus, Urias, Uzzensherah, Zacher, Zaham, Zarthan, Zered, Zithri,
Zorathites, Zorites, abolish, alloweth, amen, amethyst, badness,
bakers, bat, begging, casement, chamois, clearness, conquer, consist,
crew, cupbearer, cymbal, delicacies, dropsy, dryshod, effected,
excused, fishhooks, flatteries, foals, foreheads, hungered,
inclosings, inheriteth, manger, marketplaces, meadow, meant,
mightiest, mouldy, mowings, murrain, occurrent, offenders, ospray,
overspread, perfectness, perpetually, pollution, pounds, pricks,
recorder, repenting, rigour, rushes, shrubs, size, socket, sorrowing,
stanched, target, tens, tentmakers, therefrom, traitor, vails, woods,
wretchedness

Unsurprisingly, the "begat" cluster is entirely proper names:

begat: Abia, Abishua, Abiud, Achaz, Achim, Amariah, Aminadab,
Amminadab, Attai, Azor, Bukki, Canaan, Coz, Eliud, Ephlal, Eshton,
Esrom, Ezekias, Hezron, Irad, Jahath, Jarah, Jechonias, Jehoadah,
Jekamiah, Joatham, Joiada, Josaphat, Josias, Kish, Manasses, Matthan,
Mehujael, Meonothai, Meraioth, Meribbaal, Methusael, Mikloth, Mizraim,
Ner, Obed, Ozias, Phares, Pharez, Ram, Roboam, Sadoc, Salma, Salmon,
Shaharaim, Shema, Shobal, Sisamai, Zabad, Zerahiah, Zorobabel

All in all, the 14000 or so distinct words get clustered by this
approach into 1793 clusters, of which 1185 have only one word.
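To be explicit about the characteristic all of these clusters are
built on: sort a word's list of followers and take the middle element.
A minimal sketch (the function name is mine, for illustration):

```python
def median_follower_of(followers):
    # The median under an alphabetical mapping of words onto the real
    # line: sort the follower tokens and take the middle one.  Note
    # that it depends only on the ordering of the words, not on any
    # distances between them.
    s = sorted(followers)
    return s[len(s) // 2]

# e.g. a word followed twice by "shalt" and once by "art":
# median_follower_of(['shalt', 'art', 'shalt']) == 'shalt'
```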
Many of the smaller clusters very clearly express some kind of affinity
between words:

support: feebleminded, financial, volunteer
form: hypertext, proprietary, readable
falsely: science, swearing, you
Where: Puteoli, Telassar, Thelasar
brother: weak, younger, youngest
appear: flowers, menchildren, plainly
honour: husbands, retaineth, without
hither: Hasten, Reach, description
Duke: Jetheth, Mibzar, Pinon
branches: Took, fruitful, natural
She: Tirhanah, distaff, tributary
besides: Stephanas, loins, self
knoweth: Man, certainly, unjust
Till: intermission, purity, stocks
slept: Jehoiakim, Menahem, Omri
died: Achbor, Chilion, Jether, Pirathonite, Seled, Terah
When: Hareth, discretion, guile, patrimony, thereat, unequal
F: DAMAGE, PARAGRAPH, PURPOSE, below, problem, provisions
such: Defects, lettest, marvels, perhaps, pigeons, saveth
year: eighteenth, eightieth, fiftieth, fortieth, nineteenth, thirtieth
cities: Berothai, Chun, defenced, fenced, ruined, thirteen
children: Little, Zebedees, backsliding, beget, impudent, sottish
two: Asuppim, Telaim, Teresh, calamus, containing, spearmen
So: corpses, debt, eater, freewoman, peacocks, saying
none: So, burned, finding, put, quarter, smiths
might: I, mortality, mother, purpose, scripture, whereto
bare: Adah, Aramitess, Bashemath, Hammoleketh, Jehudijah, Milcah
OR: DISTRIBUTE, EXPRESS, MERCHANTIBILITY, PUNITIVE, REPLACEMENT, WARRANTY
some: Arm, oppressed, save, sixtyfold, sowed, thirtyfold

It seems like this kind of clustering might be a useful
dimensionality-reduction technique for natural-language processing in
general; and perhaps it could be iterated to reduce the numbers
further. For example, when applying Katz smoothing to a language model,
as an alternative to backing off to shorter N-grams when you don't have
enough N-grams for a given value of N, you could back off to N-grams
computed over a vocabulary of these clusters.
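That backoff order might look something like the following sketch.
This is only the lookup skeleton, with Katz's discounting omitted and
all the names (backoff_estimate, cluster_of, the probability tables)
invented for illustration:

```python
def backoff_estimate(context, word, word_probs, cluster_probs, cluster_of):
    # 1. Use the word-level N-gram estimate if we have one.
    key = context + (word,)
    if key in word_probs:
        return word_probs[key]
    # 2. Otherwise try the same N-gram over cluster labels, instead of
    #    going straight to a shorter context.
    ckey = tuple(cluster_of[w] for w in key)
    if ckey in cluster_probs:
        return cluster_probs[ckey]
    # 3. Finally back off to a shorter context, as usual.
    if context:
        return backoff_estimate(context[1:], word, word_probs,
                                cluster_probs, cluster_of)
    return 0.0
```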
Efficient computation
---------------------

An interesting thing about this follower-median characteristic is that
it can be computed efficiently from a suffix array. Although I'm not
familiar with the performance characteristics of modern suffix-array
construction algorithms (I only know that they're linear-time), I
suspect that computing the suffix array is, in practice, more expensive
than computing the pdfs of the contexts using a hash table and then
computing their cdfs in order to find the median, which I seem to
recall is what you were doing; but if you're already paying the cost of
building a suffix array for some reason, this may be useful. For
example, I'm thinking about using a suffix array for Katz smoothing of
a language model.

To find the distribution median of followers for a single context:

1. Find the beginning B and end E of that context's span in the suffix
   array --- that is, the first index B such that
   text[SA[B]:].startswith(context), and the last index E meeting the
   same criterion. This is O(lg len(text)).

2. Find the median index M = (B+E)/2, or under some circumstances,
   B+(E-B)/2.

3. Find text[SA[M]+len(context)], and that's your median follower
   character.

Step 3 is slightly more complicated if you want something more than a
single character, but not much; this is an advantage of the
suffix-array structure.

Finding the distribution medians for all M contexts this way is
O(M lg len(text)), which might be less than O(len(text)). It's very
likely faster than computing the cdfs, which is 256*M work, albeit
sequential. There's a much simpler linear-time algorithm to get them
all at once, of course: read through the suffix array in order,
remembering the last boundary between contexts.

Simple implementation
---------------------

This is the code I used to get the results above.
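For contrast with the simple implementation, the three suffix-array
steps above can be sketched as follows, with a deliberately naive
suffix-array construction (a real one would be linear-time) and
single-character followers only:

```python
def build_suffix_array(text):
    # Naive O(n^2 log n) construction, for illustration only; real
    # implementations use linear-time algorithms.
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_span(text, sa, context):
    # Step 1: binary-search for the half-open span [begin, end) of
    # suffix-array entries whose suffixes start with context.
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose prefix is >= context
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(context)] < context:
            lo = mid + 1
        else:
            hi = mid
    begin, hi = lo, len(sa)
    while lo < hi:  # first suffix whose prefix is > context
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(context)] <= context:
            lo = mid + 1
        else:
            hi = mid
    return begin, lo

def median_follower(text, sa, context):
    begin, end = find_span(text, sa, context)
    if begin == end:
        return None                      # context never occurs
    m = begin + (end - begin) // 2       # step 2: the median index
    pos = sa[m] + len(context)           # step 3: the character after it
    return text[pos] if pos < len(text) else None
```

With text = 'banana', median_follower(text, build_suffix_array(text),
'an') finds the span [1, 3), takes median index 2, and returns 'a'.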
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import re
import textwrap

def words(file, pattern='[a-zA-Z]+'):
    return (word for line in file for word in re.findall(pattern, line))

def ngrams(n, words):
    queue = []
    for word in words:
        queue.append(word)
        if len(queue) == n:
            yield tuple(queue)  # Lists aren’t hashable.
            queue.pop(0)

def multimap(kvpairs):
    rv = {}
    for key, value in kvpairs:
        if key not in rv:
            rv[key] = []
        rv[key].append(value)
    return rv

following = multimap(ngrams(2, words(open(sys.argv[1]))))
clusters = multimap((sorted(next_words)[len(next_words)//2], prev_word)
                    for prev_word, next_words in following.iteritems())
items = sorted(clusters.iteritems(),
               key=lambda (name, contents): len(contents))
for next_word, prev_words in items:
    for line in textwrap.wrap(', '.join(sorted(prev_words)),
                              initial_indent=next_word+': ',
                              subsequent_indent=' '):
        print line

-- 
To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-tol