Hi,

1. I have an implementation with some of the optimizations that you
mentioned. Even when keying on the first two words of an n-gram, we would
still have skewed sharding for unigrams, wouldn't we?

> You would, but it will be a lot less.

2. One of the nice things I would like to facilitate is daily *incremental*
updates to the LM. I have previously read your work on randomized storage
of LMs and found it very interesting. I will look through it again to jog
my memory and send the questions I have your way.

> We have this too, also in a randomised setting. Look at our "streaming"
> language model work, which allows for incremental updates to a
> precomputed LM. Although I say so myself, I like this a lot, since it
> effectively allows LMs to be trained on unbounded amounts of monolingual
> data:
>
> www.aclweb.org/anthology/D/D09/D09-1079.pdf

3. It would be great if you could elaborate on why HBase did not meet your
needs. Was this application specific?

> This may have been due to us using an early version of it. But it was
> just too slow and unreliable at the time. Also, we have a strong
> preference for code in C++, and having to deal with Java is just a pain.

thanks,
Mandar

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
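The two-word keying scheme discussed in point 1 can be sketched roughly as
below. This is only an illustration, not the actual implementation: the hash
function, shard count, and function names are all assumptions. It shows why
n-grams sharing a two-word prefix co-locate on one shard, while a unigram's
key degenerates to its single word, so Zipfian lookup traffic for frequent
words still concentrates on individual shards.

```python
import hashlib

def shard_for(ngram: str, num_shards: int = 8) -> int:
    """Assign an n-gram to a shard by hashing its first two words
    (or just the single word, for a unigram)."""
    key = " ".join(ngram.split()[:2])
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# N-grams with the same two-word prefix land on the same shard,
# so context lookups for a given history stay local:
assert shard_for("new york city") == shard_for("new york times")

# A unigram is keyed by its one word alone; every request touching
# a very frequent word like "the" hits the same single shard,
# which is the residual skew discussed in point 1.
assert shard_for("the") == shard_for("the")
```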
