The SwiftKey X trial just expired on me on the Android phone loaner I’m using from my work. I still have the keyboard layout, but the predictive text no longer works. This makes it a real bummer to use the machine.
I guess it serves me right for depending on proprietary software. I should have known better. So it occurred to me to think about what would be needed to replace it with free software. The first thing to note is that SwiftKey X can result in major disclosures of confidential information. If you borrow someone else’s Android phone to check your Facebook account or whatever, and you start accepting the default suggestion over and over again, it will frequently reproduce substantial chunks of text from their history. If they used SwiftKey to enter confidential information into their phone, you may end up seeing bits of it purely by accident if you use some of the same words. So, given that this is apparently acceptable, a quick-and-dirty input method that works more or less the same way could be done as follows: 1. Use one of the n-gram datasets published by Google (e.g. the 2009 books dataset from <http://books.google.com/ngrams/datasets>; in English, 2 gigabytes raw compressed for 1-grams, 26 gigabytes raw for 2-grams, 88 gigabytes raw for 3-grams; other languages available are French, Chinese, German, Hebrew, Russian, and Spanish, but not Arabic or Hindi; there’s also the smaller 2006 web dataset, <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13> linked from <http://googleresearch.blogspot.com.ar/2006/08/all-our-n-gram-are-belong-to-you.html> for US$150) as a baseline. A reasonable 1-gram dataset for English, like the British National Corpus frequency file I frequently use, might be half a megabyte compressed; the most common 2-grams might be another megabyte or two. SwiftKey has apparently shown that you can get by with very few or no 3-grams. 2. Store the history of everything the user types --- maybe up to some limit, like 16 megabytes uncompressed. (At 60wpm, that would be about 800 hours of typing.) Maintain a prefix array --- a suffix array of the reversed text --- of this history. The suffix array allows you to find the longest suffix of the text before the cursor that has previously occurred, and what followed it, in logarithmic time: 24 fetches from flash, say, or 0.5 ms. An initial prototype could skip the suffix array and just do linear search of the text, which should scale to a few hundred kilobytes, while a version that does the search using a hash table indexed by three-character contexts (as described in the DEFLATE RFC) should scale to a megabyte or more. From the n-grams, you can use e.g. Katz smoothing to derive enough of a probability distribution to do Viterbi decoding of an observed keystroke sequence. SwiftKey does a little more than this, because it can handle insertion and deletion errors as well, but that can be handled with a simple dynamic-programming algorithm that’s very similar to Viterbi in spirit. Making the jump from there to best-first search should give you the three possibilities displayed by SwiftKey. SwiftKey doesn’t seem to take into account how far from the center of the key your finger is, but it does seem to take into account how accurate your hitting of a particular key usually is. The second item can be used not merely to update 3-gram-based language models, but also to find the longest context immediately preceding the cursor that has ever occurred before, then all of the places that context has occurred, and what followed. Probably the most recent occurrence should be used for the #1 prediction, rather than the most frequent; in general it will not be practical to retrieve all of the occurrences, but perhaps enough could be retrieved to adjust the language model substantially. Kärkkäinen and Sanders’s “skew algorithm”, although not the first linear-time suffix-array construction algorithm, nor the most efficient, is probably the simplest. The C++ implementation in their paper is only 50 lines of code. Another outcome of all of the above, apparently not used by SwiftKey (but, I suspect, used by iOS) is a probability distribution for your next keystroke. This could be used to expand and contract the hotspots for each key on the keyboard, so that key taps on the border between two keys will be interpreted as the correct key. You could do *much better* than SwiftKey with dimensionality reduction techniques, which could allow much more context to be brought to bear on the prediction problem using a much smaller model. SwiftKey often makes predictions that yield clearly ungrammatical 3-grams, which even a simple 3-gram model over part-of-speech tags ought to be able to reject. There’s a lot of dimensionality-reduction work in the last decade of statistical computational linguistics that could be brought to bear. -- To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-tol