Strangely, if I take a random sample of 75% of that same data, it works just fine. I can use that for the time being, but it is a curious "feature"!
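
For anyone wanting to reproduce the workaround, here is a minimal sketch of one way to draw such a sample. It assumes GNU coreutils `shuf` is available. The helper name `sample_fraction` and the output file name are made up for illustration, and this is not necessarily how the sample above was produced.

```shell
# sample_fraction PERCENT INPUT OUTPUT
# Hypothetical helper: writes a random PERCENT% of INPUT's lines to OUTPUT.
# Lines are drawn without replacement, so no input line is emitted twice.
sample_fraction() {
    percent=$1
    input=$2
    output=$3
    # Count the lines, then work out how many to keep (integer arithmetic).
    total=$(wc -l < "$input")
    keep=$(( total * percent / 100 ))
    # shuf -n picks that many lines at random from the input.
    shuf -n "$keep" "$input" > "$output"
}

# Usage with the corpus name from this thread (the output name is an assumption):
#   sample_fraction 75 lm_data.en lm_data_75.en
```

The sampled file can then be passed to lmplz in place of lm_data.en.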
James

On Mon, 3 Dec 2018 at 12:34, James Baker <james.d.ba...@gmail.com> wrote:

> What would constitute duplicated in this context? The number of duplicated
> lines in the document is relatively small, but it's possible some of the
> lines have similar text.
>
> $ wc lm_data.en
> 1876364 21359196 96962517 lm_data.en
> $ sort lm_data.en | uniq > lm_data_uniq.en
> $ wc lm_data_uniq.en
> 1487703 15801025 71344598 lm_data_uniq.en
>
> I'd have thought there should be enough unique data in there though, as
> the file is a combined version of the following datasets from OPUS:
>
> * GNOME
> * OpenSubtitles 2018
> * Tanzil
> * Tatoeba
> * Ubuntu
>
> Thanks,
> James
>
> On Mon, 3 Dec 2018 at 11:58, Kenneth Heafield <mo...@kheafield.com> wrote:
>
>> Hi,
>>
>> If I had to guess, you have a lot of duplicated text?
>>
>> Kenneth
>>
>> On 12/3/18 11:23 AM, James Baker wrote:
>>
>> Morning,
>>
>> I've been trying to train a language model using the following command:
>>
>> /opt/model-builder/mosesdecoder/bin/lmplz -o 5 -S 80% -T /tmp < lm_data.en > model.lm
>>
>> But I'm getting the following error:
>>
>> === 1/5 Counting and sorting n-grams ===
>> Reading /opt/model-builder/training/lm_data.en
>> ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
>> ****************************************************************************************************
>> Unigram tokens 21187448 types 117756
>> === 2/5 Calculating and sorting adjusted counts ===
>> Chain sizes: 1:1413072 2:5151762432 3:9659554816 4:15455287296 5:22538960896
>> terminate called after throwing an instance of 'lm::builder::BadDiscountException'
>> what(): /opt/model-builder/mosesdecoder/lm/builder/adjust_counts.cc:61 in void
>> lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const
>> lm::builder::DiscountConfig&) threw BadDiscountException because
>> `discounts_[i].amount[j] < 0.0 || discounts_[i].amount[j] > j'.
>> ERROR: 5-gram discount out of range for adjusted count 2: -6.80247
>>
>> The data I'm training on has come from the OPUS project. I found some
>> references online to issues when there isn't enough training data, but I
>> think I have sufficient data and have previously trained on a lot less (and
>> even on a subset of my current data):
>>
>> $ wc lm_data.en
>> 1874495 21187448 96148754 lm_data.en
>>
>> Any ideas what might be causing the problem?
>>
>> James
>>
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
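
The condition quoted in the error (`discounts_[i].amount[j] < 0.0 || discounts_[i].amount[j] > j`) can be illustrated with a small sketch. It assumes the standard modified Kneser-Ney closed form from Chen and Goodman, Y = n1 / (n1 + 2*n2) and D_j = j - (j+1) * Y * n_(j+1) / n_j, where n_j is the number of distinct n-grams with adjusted count j. Whether lmplz computes exactly this internally is an assumption here, and the helper name `kn_discounts` and the example numbers are made up, not taken from the corpus in this thread.

```shell
# kn_discounts N1 N2 N3 N4
# Hypothetical helper: prints the discounts D1..D3 computed from the
# counts-of-counts N1..N4 and flags any that fall outside [0, j].
# All four arguments must be positive, otherwise awk divides by zero.
kn_discounts() {
    awk -v n1="$1" -v n2="$2" -v n3="$3" -v n4="$4" 'BEGIN {
        n[1] = n1; n[2] = n2; n[3] = n3; n[4] = n4
        # Y is shared by all three discounts.
        y = n[1] / (n[1] + 2 * n[2])
        for (j = 1; j <= 3; j++) {
            d = j - (j + 1) * y * n[j + 1] / n[j]
            verdict = (d < 0 || d > j) ? "out of range" : "ok"
            printf "D%d = %.5f (%s)\n", j, d, verdict
        }
    }'
}

# Smoothly decaying counts-of-counts give discounts inside [0, j]:
#   kn_discounts 1000 400 200 120
# Far more n-grams seen three times than twice pushes D2 below zero:
#   kn_discounts 1000 10 100 50
```

Under this formula, D2 turns negative whenever n3 is large relative to n2, which is the kind of skew that repeated text could plausibly produce.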
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support