Hi

Yes, that's true. From Liling's description it sounds like a
pathologically long sentence is causing Moses to blow up. However, he
states that it happens on random lines -- could it be that, with so
many threads, the amount of data translated before the crash varies,
but it is the same problem line each time?

cheers - Barry

On 12/12/17 09:24, Marcin Junczys-Dowmunt wrote:
> Hi,
> I think the important part is that Liling actually manages to translate
> several tens of thousands of sentences before that happens. A quick fix
> would be to break your corpus into pieces of 10K sentences each and loop
> over the files. I have usually had bad experiences trying to translate
> large batches of text with Moses.
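>
> A minimal shell sketch of that loop (untested; the names moses.ini
> and input.src are placeholders for your tuned config and source
> file):
>
>     # split the source into numbered 10K-line chunks
>     # (-a 3 allows up to 1000 chunks, e.g. 5M lines -> 500 chunks)
>     split -d -a 3 -l 10000 input.src chunk.
>     # decode each chunk in its own moses process so memory is
>     # released between runs
>     for f in chunk.???; do
>         moses -f moses.ini -threads 16 < "$f" > "$f.out"
>     done
>     # reassemble the translations in their original order
>     cat chunk.???.out > output.trans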
>
> Is it still trying to load the entire corpus into memory? It used to do that.
>
> On 12/12/17 10:16, Barry Haddow wrote:
>> Hi Liling
>>
>> The short answer is that you need to prune/filter your phrase table
>> prior to creating the compact phrase table. I don't mean "filter model
>> given input", because that won't make much difference if you have a
>> very large input; I mean getting rid of rare translations which won't
>> be used anyway.
>>
>> The compact phrase table does not do pruning; it ends up being done in
>> memory, so if you have 750,000 translations of the full stop in your
>> model then they all get loaded into memory before Moses selects the
>> top 20.
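>>
>> As a rough illustration (this is not Moses's own pruner, just an
>> offline top-N filter), something like the sketch below would keep the
>> 20 most probable translations per source phrase. It assumes the
>> standard "src ||| tgt ||| scores" text format, that the table is
>> grouped by source phrase (as train-model.perl leaves it), and that
>> the third score is the direct probability p(e|f) -- adjust the index
>> if your feature order differs:
>>
>>     zcat phrase-table.gz | awk -F' \\|\\|\\| ' '
>>         # flush the buffered group whenever the source phrase changes
>>         $1 != prev { flush(); prev = $1 }
>>         { n++; line[n] = $0; split($3, sc, " "); score[n] = sc[3] }
>>         END { flush() }
>>         function flush(    i, k, best, tmp) {
>>             # partial selection sort: print the top 20 by score
>>             for (k = 1; k <= n && k <= 20; k++) {
>>                 best = k
>>                 for (i = k + 1; i <= n; i++)
>>                     if (score[i] + 0 > score[best] + 0) best = i
>>                 tmp = line[k]; line[k] = line[best]; line[best] = tmp
>>                 tmp = score[k]; score[k] = score[best]; score[best] = tmp
>>                 print line[k]
>>             }
>>             n = 0
>>         }' | gzip > phrase-table.pruned.gz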
>>
>> You can use prunePhraseTable from Moses (which bizarrely needs to load
>> a phrase table in order to parse the config file, last time I looked).
>> You could also apply Johnson / entropic pruning, whatever works for you.
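>>
>> Whichever pruner you use, the compact tables then have to be rebuilt
>> from the pruned text tables. The usual invocations look roughly like
>> this (quoting from memory -- check the options against your Moses
>> build):
>>
>>     # rebuild the compact phrase table (-> phrase-table.minphr)
>>     processPhraseTableMin -in phrase-table.pruned.gz \
>>         -out phrase-table -nscores 4 -threads 16
>>     # rebuild the compact reordering table (-> reordering-table.minlexr)
>>     processLexicalTableMin -in reordering-table.gz \
>>         -out reordering-table -threads 16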
>>
>> cheers - Barry
>>
>> On 11/12/17 09:20, liling tan wrote:
>>> Dear Moses community/developers,
>>>
>>> I have a question on how to handle large models created using Moses.
>>>
>>> I have a vanilla phrase-based model with:
>>>
>>>    * PhraseDictionary num-features=4 input-factor=0 output-factor=0
>>>    * LexicalReordering num-features=6 input-factor=0 output-factor=0
>>>    * KENLM order=5 factor=0
>>>
>>> The sizes of the model files are:
>>>
>>>    * compressed phrase table is 5.4GB,
>>>    * compressed reordering table is 1.9GB and
>>>    * quantized LM is 600MB
>>>
>>>
>>> I'm running on a single 56-core machine with 256GB RAM. Whenever I'm
>>> decoding I use the -threads 56 parameter.
>>>
>>> It takes really long to load the table, and after loading, it breaks
>>> inconsistently at different lines when decoding. I notice that the
>>> RAM goes into swap before it breaks.
>>>
>>> I've tried the compact phrase table and get a
>>>
>>>    * 3.2GB .minphr
>>>    * 1.5GB .minlexr
>>>
>>> And the same kind of random breakage happens when the RAM goes into
>>> swap after loading the phrase table.
>>>
>>> Strangely, it still manages to decode ~500K sentences before it breaks.
>>>
>>> Then I tried the on-disk phrase table, which is around 37GB
>>> uncompressed. Using the on-disk PT didn't cause breakage, but the
>>> decoding time increased significantly; now it can only decode 15K
>>> sentences in an hour.
>>>
>>> The setup is a little different from the normal train/dev/test split:
>>> currently, my task is to decode the train set. I've tried filtering
>>> the table against the train set with filter-model-given-input.pl, but
>>> the size of the compressed table didn't really decrease much.
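>>>
>>> For reference, the invocation I used was roughly this (the directory
>>> and file names are placeholders):
>>>
>>>    filter-model-given-input.pl filtered-dir moses.ini train.src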
>>>
>>> The entire training set is made up of 5M sentence pairs, and it's
>>> taking 3+ days just to decode ~1.5M sentences with the on-disk PT.
>>>
>>>
>>> My questions are:
>>>
>>>   - Are there best practices with regard to deploying large Moses models?
>>>   - Why does the 5+GB phrase table take up > 250GB RAM when decoding?
>>>   - How else should I filter/compress the phrase table?
>>>   - Is it normal to decode only ~500K sentences a day given the
>>> machine specs and the model size?
>>>
>>> I understand that I could split the train set into two, train two
>>> models, and then cross-decode, but if the training size is 10M
>>> sentence pairs, we'll face the same issues.
>>>
>>> Thank you for reading the long post, and thank you in advance for any
>>> answers, discussion and enlightenment on this issue =)
>>>
>>> Regards,
>>> Liling


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
