Hi

Yes, that's true. From Liling's description it sounds like a
pathologically long sentence is causing Moses to blow up. However, he
states that it happens on random lines -- could it be that, with so
many threads running, the amount of data translated before the failure
varies, but it is the same problem line each time?

cheers - Barry
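One quick way to test that hypothesis is to look for outlier sentence
lengths in the input. A minimal sketch: both the file name corpus.src
and the 200-token threshold are placeholders, not anything from the
thread:

    # Print line number and token count of unusually long sentences.
    awk 'NF > 200 { print NR ": " NF " tokens" }' corpus.src

    # Or just inspect the largest token counts in the file.
    awk '{ print NF }' corpus.src | sort -n | tail -n 5

If the same line number turns up near every crash, splitting or
dropping that one sentence is the cheapest fix.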
On 12/12/17 09:24, Marcin Junczys-Dowmunt wrote:
> Hi,
> I think the important part is that Liling actually manages to
> translate several tens of thousands of sentences before that happens.
> A quick fix would be to break your corpus into pieces of 10K sentences
> each and loop over the files. I usually have bad experiences trying to
> translate large batches of text with Moses.
>
> Is it still trying to load the entire corpus into memory? It used to
> do that.
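Marcin's quick fix takes only a few lines of bash. A minimal sketch,
assuming GNU split; corpus.src, moses.ini and the output paths are
placeholders for your own setup:

    # Split the input into 10K-sentence chunks; -a 3 gives fixed-width
    # numeric suffixes and allows up to 1000 chunks.
    mkdir -p chunks out
    split -d -a 3 -l 10000 corpus.src chunks/part.

    # Decode each chunk separately: memory use is bounded per run, and
    # a crash costs only one chunk, which can be re-run on its own.
    for f in chunks/part.*; do
        moses -f moses.ini -threads 56 < "$f" > "out/$(basename "$f").trans"
    done

    # Fixed-width suffixes keep lexical order equal to chunk order.
    cat out/part.*.trans > corpus.trans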
> On 12.12.2017 at 10:16, Barry Haddow wrote:
>> Hi Liling
>>
>> The short answer is that you need to prune/filter your phrase table
>> prior to creating the compact phrase table. I don't mean "filter
>> model given input", because that won't make much difference if you
>> have a very large input; I mean getting rid of rare translations
>> which won't be used anyway.
>>
>> The compact phrase table does not do pruning, so it ends up being
>> done in memory: if you have 750,000 translations of the full stop in
>> your model, they all get loaded into memory before Moses selects the
>> top 20.
>>
>> You can use prunePhraseTable from Moses (which bizarrely needs to
>> load a phrase table in order to parse the config file, last time I
>> looked). You could also apply Johnson / entropic pruning, whatever
>> works for you.
>>
>> cheers - Barry
>>
>> On 11/12/17 09:20, liling tan wrote:
>>> Dear Moses community/developers,
>>>
>>> I have a question on how to handle large models created using Moses.
>>>
>>> I have a vanilla phrase-based model with:
>>>
>>> * PhraseDictionary num-features=4 input-factor=0 output-factor=0
>>> * LexicalReordering num-features=6 input-factor=0 output-factor=0
>>> * KENLM order=5 factor=0
>>>
>>> The size of the model is:
>>>
>>> * compressed phrase table: 5.4GB
>>> * compressed reordering table: 1.9GB
>>> * quantized LM: 600MB
>>>
>>> I'm running on a single 56-core machine with 256GB RAM. Whenever I
>>> decode, I use the -threads 56 parameter.
>>>
>>> It takes really long to load the table, and after loading, decoding
>>> breaks inconsistently at different lines. I notice that the RAM goes
>>> into swap before it breaks.
>>>
>>> I've tried the compact phrase table and get:
>>>
>>> * a 3.2GB .minphr
>>> * a 1.5GB .minlexr
>>>
>>> The same kind of random breakage happens when the RAM goes into swap
>>> after loading the phrase table.
>>>
>>> Strangely, it still manages to decode ~500K sentences before it
>>> breaks.
>>>
>>> Then I tried the ondisk phrase table, which is around 37GB
>>> uncompressed. Using the ondisk PT didn't cause breakage, but
>>> decoding time increased significantly: now it can only decode 15K
>>> sentences in an hour.
>>>
>>> The setup is a little different from the normal train/dev/test
>>> split: currently, my task is to decode the train set. I've tried
>>> filtering the table against the train set with
>>> filter-model-given-input.pl, but the size of the compressed table
>>> didn't really decrease much.
>>>
>>> The entire training set is made up of 5M sentence pairs, and it's
>>> taking 3+ days just to decode ~1.5M sentences with the ondisk PT.
>>>
>>> My questions are:
>>>
>>> - Are there best practices with regard to deploying large Moses
>>> models?
>>> - Why does the 5+GB phrase table take up >250GB of RAM when
>>> decoding?
>>> - How else should I filter/compress the phrase table?
>>> - Is it normal to decode only ~500K sentences a day given the
>>> machine specs and the model size?
>>>
>>> I understand that I could split the train set into two, train two
>>> models and then cross-decode, but if the training size is 10M
>>> sentence pairs, we'll face the same issues.
>>>
>>> Thank you for reading the long post, and thank you in advance for
>>> any answers, discussions and enlightenment on this issue =)
>>>
>>> Regards,
>>> Liling
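Returning to Barry's pruning advice above: since the decoder keeps at
most 20 translations per source phrase anyway (the table-limit
default), the same cut can be applied to the text phrase table before
building the compact one, so the 750,000 full-stop translations never
reach memory. A minimal sketch using standard Unix tools rather than
prunePhraseTable, ranking by the direct phrase probability p(e|f)
alone, which only approximates the decoder's selection. It assumes the
standard four-feature layout in which the third score is p(e|f); file
names are placeholders, and a table this size needs sort to be given
temp space (-T) and a generous buffer (-S):

    # 1. Prefix each rule with "source<TAB>p(e|f)".
    # 2. Sort by source phrase, then by p(e|f) descending within each
    #    source phrase (LC_ALL=C keeps the byte order Moses expects).
    # 3. Keep the first 20 rules per source and strip the prefix.
    zcat phrase-table.gz \
      | awk -F ' \\|\\|\\| ' '{ split($3, s, " "); printf "%s\t%s\t%s\n", $1, s[3], $0 }' \
      | LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2gr \
      | awk -F '\t' '$1 != prev { prev = $1; c = 0 } ++c <= 20 { print $3 }' \
      | gzip > phrase-table.pruned.gz

The pruned table stays sorted by source phrase, so it can then be fed
to processPhraseTableMin as usual to rebuild the .minphr file.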