Hi,

If I had to guess, you have a lot of duplicated text?
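One quick way to check that hypothesis is to compare the raw line count against the de-duplicated line count. This is just a sketch: it uses a small demo file in place of your `lm_data.en`, and the sort-based approach assumes the corpus fits on disk for an external sort.

```shell
set -eu
# Demo corpus standing in for lm_data.en (hypothetical content with heavy duplication).
printf 'hello world\nhello world\nhello world\ngoodbye\n' > /tmp/lm_data_demo.en

# Total lines vs. unique lines: a large gap means lots of duplicated text.
wc -l < /tmp/lm_data_demo.en
sort /tmp/lm_data_demo.en | uniq | wc -l

# The most-repeated lines, most frequent first, to see what is being duplicated:
sort /tmp/lm_data_demo.en | uniq -c | sort -rn | head

# If duplicates dominate, de-duplicate before re-running lmplz:
sort -u /tmp/lm_data_demo.en > /tmp/lm_data_demo.dedup.en
```

Heavy duplication skews the adjusted counts that Kneser-Ney discount estimation relies on, which is one known way to end up with a discount outside the valid range.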
Kenneth

On 12/3/18 11:23 AM, James Baker wrote:
> Morning,
>
> I've been trying to train a language model using the following command:
>
> /opt/model-builder/mosesdecoder/bin/lmplz -o 5 -S 80% -T /tmp <
> lm_data.en > model.lm
>
> But I'm getting the following error:
>
> === 1/5 Counting and sorting n-grams ===
> Reading /opt/model-builder/training/lm_data.en
>
> ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
>
> ****************************************************************************************************
> Unigram tokens 21187448 types 117756
> === 2/5 Calculating and sorting adjusted counts ===
> Chain sizes: 1:1413072 2:5151762432 3:9659554816 4:15455287296
> 5:22538960896
> terminate called after throwing an instance of
> 'lm::builder::BadDiscountException'
> what():
> /opt/model-builder/mosesdecoder/lm/builder/adjust_counts.cc:61 in void
> lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const
> lm::builder::DiscountConfig&) threw BadDiscountException because
> `discounts_[i].amount[j] < 0.0 || discounts_[i].amount[j] > j'.
> ERROR: 5-gram discount out of range for adjusted count 2: -6.80247
>
> The data I'm training on has come from the OPUS project. I found some
> references online to issues when there isn't enough training data, but
> I think I have sufficient data and have previously trained on a lot
> less (and even on a subset of my current data):
>
> $ wc lm_data.en
> 1874495 21187448 96148754 lm_data.en
>
> Any ideas what might be causing the problem?
>
> James
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support