Hi,

1. I have an implementation with some of the optimizations you
mentioned. Even when keying on the first two words of an ngram,
wouldn't we still have skewed sharding for unigrams?

2. One of the things I would like to support is daily *incremental*
updates to the LM. I have previously read your work on randomized
storage of LMs and found it very interesting. I will look through it
again to jog my memory and send any questions I have your way.

3. It would be great if you could elaborate on why HBase did not meet
your needs. Was this application-specific?

thanks,
Mandar

On Fri, Feb 5, 2010 at 12:46 AM, Miles Osborne <[email protected]> wrote:
>>
>> 1. I agree that I might not have to use any fancy smoothing, but even
>> at Google scale using simple smoothing seems to aid performance (at
>> least for Machine translation)
>> http://acl.ldc.upenn.edu/D/D07/D07-1090.pdf
>>
> I said "fancy smoothing", not no smoothing.  We actually do on-the-fly
> Witten-Bell smoothing (and sometimes Stupid Backoff, which is what you
> can do with large LMs).
>
> For smaller LMs we do Kneser-Ney.
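
Stupid Backoff itself is simple enough to sketch in a few lines of
Java.  The toy class below scores an ngram against a single in-memory
count table; the names (StupidBackoffScorer, counts, corpusSize) are
made up for the example, and the 0.4 backoff factor is the value
reported in the Brants et al. paper linked above.  It only illustrates
the recurrence, not the on-the-fly distributed setup described here.

    import java.util.Arrays;
    import java.util.Map;

    /** Toy Stupid Backoff scorer over a single in-memory count table. */
    public class StupidBackoffScorer {
        private static final double ALPHA = 0.4;  // backoff factor from Brants et al. (2007)
        private final Map<String, Long> counts;   // space-joined ngram -> count (illustrative)
        private final long corpusSize;            // total number of tokens

        public StupidBackoffScorer(Map<String, Long> counts, long corpusSize) {
            this.counts = counts;
            this.corpusSize = corpusSize;
        }

        /** Score of the last word given the preceding words, with recursive backoff. */
        public double score(String[] ngram) {
            if (ngram.length == 1) {
                return (double) counts.getOrDefault(ngram[0], 0L) / corpusSize;
            }
            String full = String.join(" ", ngram);
            String context = String.join(" ", Arrays.copyOfRange(ngram, 0, ngram.length - 1));
            long cFull = counts.getOrDefault(full, 0L);
            long cContext = counts.getOrDefault(context, 0L);
            if (cFull > 0 && cContext > 0) {
                return (double) cFull / cContext;  // plain relative frequency
            }
            // Unseen ngram: back off to the shorter history, discounted by ALPHA.
            return ALPHA * score(Arrays.copyOfRange(ngram, 1, ngram.length));
        }
    }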
>
>> 2. Is your code open source?
>
> My ngram code hasn't been released, but it is not hard to do yourself.
> Collecting ngrams and counts is really a generalisation of the
> standard word-counting problem.
> (To make it more efficient you would need to do in-mapper combining.)
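
As an illustration of that generalisation, here is a minimal Hadoop
mapper sketch with in-mapper combining: it extracts every ngram up to
order N from each input line, buffers partial counts in a HashMap, and
emits them once in cleanup().  The class name and the choice of N = 5
are assumptions for the example (this is not the unreleased code
mentioned above), and a production mapper would also flush the buffer
periodically to bound memory.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    /** Emits (ngram, partial count) pairs, combining counts inside the mapper. */
    public class NgramCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final int N = 5;                      // maximum ngram order (assumed)
        private final Map<String, Long> buffer = new HashMap<>();

        @Override
        protected void map(LongWritable offset, Text line, Context context) {
            String text = line.toString().trim();
            if (text.isEmpty()) return;
            String[] tokens = text.split("\\s+");
            for (int i = 0; i < tokens.length; i++) {
                StringBuilder gram = new StringBuilder();
                for (int n = 0; n < N && i + n < tokens.length; n++) {
                    if (n > 0) gram.append(' ');
                    gram.append(tokens[i + n]);
                    buffer.merge(gram.toString(), 1L, Long::sum);  // in-mapper combining
                }
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Emit the locally combined counts once per mapper; the reducer just sums.
            for (Map.Entry<String, Long> e : buffer.entrySet()) {
                context.write(new Text(e.getKey()), new LongWritable(e.getValue()));
            }
            buffer.clear();
        }
    }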
>
> One thing I have been meaning to do is deal with skewed sharding.
> Basically, high-frequency function words tend to get sent to the same
> shard, and this makes reducing not very well balanced.
> (To fix this you key on the first two words of an ngram, rather than
> just one.)
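
That trick can be written as a custom Hadoop partitioner, sketched
below (the class name is invented for the example): ngrams are routed
by their first two words, while unigrams necessarily fall back to
their single word, which is the residual skew raised in point 1 at the
top of this message.

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    /** Routes ngrams by their first two words so high-frequency single words
     *  do not all land on one reducer; unigrams can only use their one word. */
    public class FirstTwoWordsPartitioner extends Partitioner<Text, LongWritable> {
        @Override
        public int getPartition(Text ngram, LongWritable count, int numPartitions) {
            String[] tokens = ngram.toString().split(" ", 3);
            String shardKey = tokens.length >= 2 ? tokens[0] + " " + tokens[1]
                                                 : tokens[0];
            return (shardKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }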
>
>> 3. I was also looking to understand if there were any efforts to
>> store these large sets optimally for real-time access. Can you
>> please point me to the effort on hosting LMs using Hypertable?
>
> Currently we store very large LMs in a randomised manner.  Look here
> for our SourceForge release:
>
> https://sourceforge.net/projects/randlm/
>
> The associated papers can be found on my homepage, under randomised
> language modelling:
>
> http://www.iccs.informatics.ed.ac.uk/~miles/mt-papers.html
>
> The state of the art in large LMs is to use a cluster of machines
> (i.e. some kind of BigTable setup) along with a randomised
> representation.  If you store fingerprints for ngrams and quantise
> your probabilities, you can retrieve each gram in about three hash
> functions (or fewer).
> Over time I have been exploring how to do this.  My first attempt used
> Chord, but that didn't really work out.  We also looked at HBase
> (ditto).  Right now I have a student looking at Hypertable.  He has
> implemented non-blocking I/O (i.e. you can batch requests, send them
> off and do something else) and also some tricks to spot when bogus
> ngram requests are being made across the network.
> It turns out that for Machine Translation, the vast majority of ngram
> requests are for grams that don't exist.
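
To make the fingerprint-plus-quantisation idea concrete, here is a toy
single-machine sketch in Java.  It is not randlm's actual data
structure (randlm builds on Bloom-filter-style randomised encodings);
it just shows the trade-off: each ngram hashes to a bucket holding a
small fingerprint and a quantised log-probability, so storage is
compact and lookups are cheap, but a query can occasionally return a
value for an ngram that was never stored.

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    /** Toy randomised ngram store: each ngram hashes to a bucket holding a
     *  16-bit fingerprint plus a quantised log10 probability.  Compact, but a
     *  lookup can occasionally match a fingerprint for an ngram that was never
     *  inserted (a false positive), the usual price of randomised storage. */
    public class FingerprintLM {
        private final short[] fingerprints;  // 0 means "empty bucket"
        private final byte[] quantised;      // 8-bit quantised log10 probabilities
        private final int numBuckets;        // should comfortably exceed the ngram count

        public FingerprintLM(int numBuckets) {
            this.numBuckets = numBuckets;
            this.fingerprints = new short[numBuckets];
            this.quantised = new byte[numBuckets];
        }

        public void put(String ngram, double logProb) {
            int h = hash(ngram);
            int bucket = (h & Integer.MAX_VALUE) % numBuckets;
            fingerprints[bucket] = fingerprint(h);   // collisions simply overwrite (toy!)
            quantised[bucket] = quantise(logProb);
        }

        /** Quantised log10 probability, or NaN if the fingerprint does not match. */
        public double get(String ngram) {
            int h = hash(ngram);
            int bucket = (h & Integer.MAX_VALUE) % numBuckets;
            if (fingerprints[bucket] != fingerprint(h)) return Double.NaN;
            return dequantise(quantised[bucket]);
        }

        private static int hash(String ngram) {
            CRC32 crc = new CRC32();
            crc.update(ngram.getBytes(StandardCharsets.UTF_8));
            return (int) crc.getValue();
        }

        private static short fingerprint(int h) {
            return (short) ((h >>> 16) | 1);         // 16 bits, never zero
        }

        private static byte quantise(double logProb) {
            // Map log10 probabilities in roughly [-8, 0] onto 256 levels.
            long level = Math.round((logProb + 8.0) * 32.0);
            return (byte) Math.max(0, Math.min(255, level));
        }

        private static double dequantise(byte q) {
            return (q & 0xFF) / 32.0 - 8.0;
        }
    }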
>
>
> Miles
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
