Oh, yeah, of course. So to summarize:

Language model:
- quantize
- try 4-grams

Packing:
- pre-compute the dot product with the weight vector
- quantize the score
- prune out low-probability translation options
- pre-sort the grammar

This would make things both smaller and faster, and most of it wouldn't even require changing any code (pre-sorting might).

I am working on fixing up the language pack pages; I want to have a suite of test sets for each language pack, so we can know whether a new language pack is better.

matt


> On May 13, 2016, at 5:11 PM, kellen sunderland <kellen.sunderl...@gmail.com> wrote:
>
> That's a great idea, can we pre-sort the grammar as well?
>
> On Fri, May 13, 2016 at 1:47 PM, Matt Post <p...@cs.jhu.edu> wrote:
>
>> Quantization is also supported in the grammar packer.
>>
>> Another idea: since we know the model weights when we publish a language
>> pack, we should pre-compute the dot product of the weight vector against
>> the grammar weights and reduce it to a single (quantized) score.
>>
>> (This would reduce the ability for users to play with the individual
>> weights, but I don't think that's a huge loss, since the main weight is LM
>> vs. TM.)
>>
>> matt
>>
>>
>>> On May 13, 2016, at 4:45 PM, Matt Post <p...@cs.jhu.edu> wrote:
>>>
>>> Oh, yes, of course. That's in build_binary.
>>>
>>>
>>>> On May 13, 2016, at 4:39 PM, kellen sunderland <kellen.sunderl...@gmail.com> wrote:
>>>>
>>>> Could we also use quantization with the language model to reduce the
>>>> size? KenLM supports this, right?
>>>>
>>>> On Fri, May 13, 2016 at 1:19 PM, Matt Post <p...@cs.jhu.edu> wrote:
>>>>
>>>>> Great idea, hadn't thought of that.
>>>>>
>>>>> I think we could also get some leverage out of:
>>>>>
>>>>> - Reducing the language model to a 4-gram one
>>>>> - Filtering the phrase table to remove low-probability
>>>>>   translation options
>>>>>
>>>>> These would be a bit lossier, but I doubt it would matter much at all.
>>>>>
>>>>> matt
>>>>>
>>>>>
>>>>>> On May 13, 2016, at 4:02 PM, Tom Barber <t...@analytical-labs.com> wrote:
>>>>>>
>>>>>> Out of curiosity more than anything else, I tested XZ compression on a
>>>>>> model instead of Gzip: it takes the Spain pack down from 1.9GB to 1.5GB.
>>>>>> Not the biggest saving ever, but it obviously means 400MB+ less in remote
>>>>>> storage and in data going over the wire.
>>>>>>
>>>>>> Worth considering, I guess.
>>>>>>
>>>>>> Tom
>>>>>> --------------
>>>>>>
>>>>>> Director Meteorite.bi - Saiku Analytics Founder
>>>>>> Tel: +44(0)5603641316
>>>>>>
>>>>>> (Thanks to the Saiku community we reached our Kickstarter
>>>>>> <http://kickstarter.com/projects/2117053714/saiku-reporting-interactive-report-designer/>
>>>>>> goal, but you can always help by sponsoring the project
>>>>>> <http://www.meteorite.bi/products/saiku/sponsorship>)
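
---

[Editor's sketch] The pre-computed dot-product idea above can be sketched roughly as follows. The weight vector, rule names, and feature values here are invented for illustration; this is not how Joshua's grammar packer actually stores things, just the shape of the computation:

```python
import numpy as np

# Hypothetical model weights, fixed at language-pack build time.
weights = np.array([0.5, -0.2, 1.0, 0.3])

# Each grammar rule carries a vector of feature values (made-up examples).
rules = {
    "casa ||| house": np.array([-0.1, -2.3, -0.4, -1.7]),
    "casa ||| home":  np.array([-0.3, -1.9, -0.9, -2.1]),
}

# Collapse each rule's feature vector into a single dot-product score.
scores = {rule: float(weights @ feats) for rule, feats in rules.items()}

# Quantize the scores to 8 bits over the observed score range, so each
# rule stores one byte instead of a full feature vector.
lo = min(scores.values())
hi = max(scores.values())

def quantize(s, lo=lo, hi=hi, bits=8):
    levels = (1 << bits) - 1
    return round((s - lo) / (hi - lo) * levels) if hi > lo else 0

quantized = {rule: quantize(s) for rule, s in scores.items()}
```

As noted upthread, the trade-off is that users can no longer re-tune individual feature weights after packing, since only the collapsed score survives.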
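[Editor's sketch] Pruning low-probability translation options and pre-sorting the grammar, also discussed above, could look something like this. The phrase table, threshold, and top-k values are invented for illustration:

```python
# Hypothetical phrase table: source phrase -> list of (target, score),
# where higher scores are better (e.g. the collapsed dot-product score).
table = {
    "casa": [("house", -0.5), ("home", -1.3), ("shack", -7.2), ("casa", -9.0)],
}

def prune_and_sort(options, top_k=2, threshold=-5.0):
    # Drop low-probability options, then pre-sort best-first so the
    # decoder can read translation options in order at load time.
    kept = [(t, s) for t, s in options if s >= threshold]
    kept.sort(key=lambda ts: ts[1], reverse=True)
    return kept[:top_k]

pruned = {src: prune_and_sort(opts) for src, opts in table.items()}
```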
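[Editor's sketch] Tom's XZ-vs-Gzip comparison can be reproduced on any file with standard tools; the sample file below is illustrative, not the Spain pack (real language packs are multi-GB grammar and LM files):

```shell
# Generate some repetitive sample data standing in for a grammar file.
printf 'rule line %s\n' $(seq 1 1000) > /tmp/pack_sample.txt

# Compress the same data with gzip and xz at maximum level.
gzip -9 -c /tmp/pack_sample.txt > /tmp/pack_sample.gz
xz -9 -c /tmp/pack_sample.txt > /tmp/pack_sample.xz

# Compare the resulting sizes.
ls -l /tmp/pack_sample.gz /tmp/pack_sample.xz
```

The cost of xz is slower compression at pack-build time and slower decompression on the user's machine, which may matter for large packs.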