Hi,

What I think is going on is that the corpus has mostly short sentences, or
at least that's my stereotype of the GNOME and Ubuntu data from OPUS.  So
there are not many ways to extend a 5-gram, which confuses the Kneser-Ney
discount estimates.  You can always duct-tape it by passing
--discount_fallback to lmplz.
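
To make the failure concrete, here is a minimal sketch of the closed-form
modified Kneser-Ney discount estimate (Chen and Goodman), which, judging by
the condition printed in the quoted exception, is the kind of value being
range-checked. The function names and the count-of-count numbers are made up
for illustration; they are not lmplz code and not taken from the actual data.

```python
def mkn_discounts(n1, n2, n3, n4):
    """Closed-form modified Kneser-Ney discounts D_1, D_2, D_3+.

    n_k is the number of distinct n-grams (of one order) whose adjusted
    count is exactly k. All inputs here are hypothetical.
    """
    y = n1 / (n1 + 2 * n2)
    counts = [n1, n2, n3, n4]
    # D_j = j - (j + 1) * Y * n_{j+1} / n_j  for j = 1, 2, 3
    return [j - (j + 1) * y * counts[j] / counts[j - 1] for j in (1, 2, 3)]


def out_of_range(discounts):
    """Return (adjusted count, discount) pairs violating 0 <= D_j <= j."""
    return [(j, d) for j, d in enumerate(discounts, start=1)
            if d < 0.0 or d > j]


# Hypothetical count-of-counts that fall off smoothly: nothing is flagged.
healthy = mkn_discounts(1000, 400, 200, 120)
print(healthy, out_of_range(healthy))

# Hypothetical skewed counts: far more n-grams seen 3 times than 2 times.
skewed = mkn_discounts(1000, 100, 500, 100)
print(skewed, out_of_range(skewed))
```

With the skewed counts, the discount for adjusted count 2 comes out negative,
which is the same kind of value the exception complains about. As I understand
it, --discount_fallback substitutes default discounts in that situation rather
than aborting.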

Kenneth

On 12/3/18 12:52 PM, James Baker wrote:
> Strangely, if I take a random sample of 75% of that same data, it works
> just fine. I can use that for the time being, but it is a curious "feature"!
> 
> James
> 
> On Mon, 3 Dec 2018 at 12:34, James Baker <james.d.ba...@gmail.com> wrote:
> 
>     What would constitute duplicated in this context? The number of
>     duplicated lines in the document is relatively small, but it's
>     possible some of the lines have similar text.
> 
>         $ wc lm_data.en 
>          1876364 21359196 96962517 lm_data.en
>         $ sort lm_data.en | uniq > lm_data_uniq.en
>         $ wc lm_data_uniq.en 
>          1487703 15801025 71344598 lm_data_uniq.en
> 
>     I'd have thought there should be enough unique data in there though,
>     as the file is a combined version of the following datasets from OPUS:
> 
>     * GNOME
>     * OpenSubtitles 2018
>     * Tanzil
>     * Tatoeba
>     * Ubuntu
> 
>     Thanks,
>     James
> 
>     On Mon, 3 Dec 2018 at 11:58, Kenneth Heafield <mo...@kheafield.com> wrote:
> 
>         Hi,
> 
>             If I had to guess, you have a lot of duplicated text? 
> 
>         Kenneth
> 
>         On 12/3/18 11:23 AM, James Baker wrote:
>>         Morning,
>>
>>         I've been trying to train a language model using the following
>>         command:
>>
>>             /opt/model-builder/mosesdecoder/bin/lmplz -o 5 -S 80% -T /tmp < lm_data.en > model.lm
>>
>>         But I'm getting the following error:
>>
>>             === 1/5 Counting and sorting n-grams ===
>>             Reading /opt/model-builder/training/lm_data.en
>>             ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
>>             ****************************************************************************************************
>>             Unigram tokens 21187448 types 117756
>>             === 2/5 Calculating and sorting adjusted counts ===
>>             Chain sizes: 1:1413072 2:5151762432 3:9659554816 4:15455287296 5:22538960896
>>             terminate called after throwing an instance of 'lm::builder::BadDiscountException'
>>             what(): /opt/model-builder/mosesdecoder/lm/builder/adjust_counts.cc:61 in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `discounts_[i].amount[j] < 0.0 || discounts_[i].amount[j] > j'.
>>             ERROR: 5-gram discount out of range for adjusted count 2: -6.80247
>>
>>         The data I'm training on has come from the OPUS project. I
>>         found some references online to issues when there isn't enough
>>         training data, but I think I have sufficient data and have
>>         previously trained on a lot less (and even on a subset of my
>>         current data):
>>
>>             $ wc lm_data.en 
>>             1874495 21187448 96148754 lm_data.en
>>
>>         Any ideas what might be causing the problem?
>>
>>         James
>>
>>         _______________________________________________
>>         Moses-support mailing list
>>         Moses-support@mit.edu <mailto:Moses-support@mit.edu>
>>         http://mailman.mit.edu/mailman/listinfo/moses-support
> 