how to do a better predictive text input method for Android

Kragen Javier Sitaker Sat, 14 Jul 2012 22:15:08 -0700

The SwiftKey X trial just expired on me on the Android phone loaner
I’m using from my work.  I still have the keyboard layout, but the
predictive text no longer works.  This makes it a real bummer to use
the machine.


I guess it serves me right for depending on proprietary software.  I
should have known better.

So it occurred to me to think about what would be needed to replace
it with free software.

The first thing to note is that SwiftKey X can result in major
disclosures of confidential information.  If you borrow someone else’s
Android phone to check your Facebook account or whatever, and you
start accepting the default suggestion over and over again, it will
frequently reproduce substantial chunks of text from their history.
If they used SwiftKey to enter confidential information into their
phone, you may end up seeing bits of it purely by accident if you use
some of the same words.

So, given that this is apparently acceptable, a quick-and-dirty input
method that works more or less the same way could be done as follows:

1. Use one of the n-gram datasets published by Google (e.g. the 2009
   books dataset from <http://books.google.com/ngrams/datasets>; in
   English, 2 gigabytes raw compressed for 1-grams, 26 gigabytes raw
   for 2-grams, 88 gigabytes raw for 3-grams; other languages
   available are French, Chinese, German, Hebrew, Russian, and
   Spanish, but not Arabic or Hindi; there’s also the smaller 2006 web
   dataset,
   <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13>
   linked from
   
<http://googleresearch.blogspot.com.ar/2006/08/all-our-n-gram-are-belong-to-you.html>
   for US$150) as a baseline.  A reasonable 1-gram dataset for
   English, like the British National Corpus frequency file I
   frequently use, might be half a megabyte compressed; the most
   common 2-grams might be another megabyte or two.  SwiftKey has
   apparently shown that you can get by with very few or no 3-grams.

2. Store the history of everything the user types --- maybe up to some
   limit, like 16 megabytes uncompressed.  (At 60wpm, that would be
   about 800 hours of typing.)  Maintain a prefix array --- a suffix
   array of the reversed text --- of this history.  The suffix array
   allows you to find the longest suffix of the text before the cursor
   that has previously occurred, and what followed it, in logarithmic
   time: 24 fetches from flash, say, or 0.5 ms.  An initial prototype
   could skip the suffix array and just do linear search of the text,
   which should scale to a few hundred kilobytes, while a version that
   does the search using a hash table indexed by three-character
   contexts (as described in the DEFLATE RFC) should scale to a
   megabyte or more.

From the n-grams, you can use e.g. Katz smoothing to derive enough of
a probability distribution to do Viterbi decoding of an observed
keystroke sequence.  SwiftKey does a little more than this, because it
can handle insertion and deletion errors as well, but that can be
handled with a simple dynamic-programming algorithm that’s very
similar to Viterbi in spirit.  Making the jump from there to
best-first search should give you the three possibilities displayed by
SwiftKey.  SwiftKey doesn’t seem to take into account how far from the
center of the key your finger is, but it does seem to take into
account how accurate your hitting of a particular key usually is.

The second item can be used not merely to update 3-gram-based language
models, but also to find the longest context immediately preceding the
cursor that has ever occurred before, then all of the places that
context has occurred, and what followed.  Probably the most recent
occurrence should be used for the #1 prediction, rather than the most
frequent; in general it will not be practical to retrieve all of the
occurrences, but perhaps enough could be retrieved to adjust the
language model substantially.

Kärkkäinen and Sanders’s “skew algorithm”, although not the first
linear-time suffix-array construction algorithm, nor the most
efficient, is probably the simplest.  The C++ implementation in their
paper is only 50 lines of code.

Another outcome of all of the above, apparently not used by SwiftKey
(but, I suspect, used by iOS) is a probability distribution for your
next keystroke.  This could be used to expand and contract the
hotspots for each key on the keyboard, so that key taps on the border
between two keys will be interpreted as the correct key.

You could do *much better* than SwiftKey with dimensionality reduction
techniques, which could allow much more context to be brought to bear
on the prediction problem using a much smaller model.  SwiftKey often
makes predictions that yield clearly ungrammatical 3-grams, which even
a simple 3-gram model over part-of-speech tags ought to be able to
reject.  There’s a lot of dimensionality-reduction work in the last
decade of statistical computational linguistics that could be brought
to bear.

-- 
To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-tol

how to do a better predictive text input method for Android

Reply via email to