[Moses-support] KenLM with 16GB of texts

2015-03-25 Thread liling tan
Dear Moses dev/users,

Has anyone tried to build a language model from 16 GB of texts?

What does "Last input should have been poison." mean?

Does anyone know how to estimate the size of the language model file
built from 16 GB of text with 8-grams? How about 5-grams, how big will
that get?


We tried to build an 8-gram model from 16 GB of text and ended up with:


=== 1/5 Counting and sorting n-grams ===
Reading /home/gillin/wmt15/corpus.truecase/train-lm.en
5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
tcmalloc: large alloc 21621391360 bytes == 0x1de6000 @
tcmalloc: large alloc 86485549056 bytes == 0x50ba5a000 @
=== 1/5 Counting and sorting n-grams ===
Reading /home/gillin/wmt15/corpus.truecase/train-lm.en
5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
tcmalloc: large alloc 14100905984 bytes == 0x2e6c000 @
tcmalloc: large alloc 94006026240 bytes == 0x34bec4000 @

Unigram tokens 3038737446 types 5924314
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:71091768 2:3162479872 3:5929649664 4:9487439872
5:13835849728 6:18974879744 7:24904527872 8:31624798208
tcmalloc: large alloc 31624798208 bytes == 0x34bec4000 @
tcmalloc: large alloc 3162480640 bytes == 0x2e6c000 @
tcmalloc: large alloc 5929656320 bytes == 0xbf666000 @
tcmalloc: large alloc 9487441920 bytes == 0xaa8e86000 @
tcmalloc: large alloc 13835853824 bytes == 0xcde674000 @
tcmalloc: large alloc 18974883840 bytes == 0x101715a000 @
tcmalloc: large alloc 24904531968 bytes == 0x1940db4000 @
Statistics:
1 5924314 D1=0.709218 D2=1.04888 D3+=1.33462
2 108520273 D1=0.723401 D2=1.06804 D3+=1.36804
3 543892823 D1=0.788765 D2=1.11107 D3+=1.35713
4 1204990660 D1=0.855434 D2=1.17274 D3+=1.36107
5 1716616322 D1=0.907776 D2=1.25272 D3+=1.39455
6 1966436508 D1=0.943121 D2=1.34991 D3+=1.45437
7 2029467690 D1=0.96405 D2=1.44994 D3+=1.5283
8 1997628560 D1=0.863904 D2=1.45784 D3+=1.59832
Memory estimate for binary LM:
type      GB
probing  202 assuming -p 1.5
probing  245 assuming -r models -p 1.5
trie     115 without quantization
trie      69 assuming -q 8 -b 8 quantization
trie      96 assuming -a 22 array pointer compression
trie      49 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
tcmalloc: large alloc 10877861888 bytes == 0x7265 @
tcmalloc: large alloc 28919783424 bytes == 0x34bec4000 @
tcmalloc: large alloc 48065257472 bytes == 0xa07ad2000 @
tcmalloc: large alloc 62925971456 bytes == 0x34bec4000 @
tcmalloc: large alloc 73060843520 bytes == 0x34bec4000 @
tcmalloc: large alloc 79905144832 bytes == 0x34bec4000 @
Chain sizes: 1:71091768 2:1736324368 3:6017972736 4:9628755968
5:14041935872 6:19257511936 7:25275484160 8:32095852544
tcmalloc: large alloc 9628762112 bytes == 0x19349e6000 @
tcmalloc: large alloc 14041939968 bytes == 0x1b7289a000 @
tcmalloc: large alloc 19257516032 bytes == 0x34bec4000 @
tcmalloc: large alloc 25275490304 bytes == 0x7c7c2a000 @
tcmalloc: large alloc 32095854592 bytes == 0xdaa4c @
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:71091768 2:1736324368 3:5881222144 4:9409955840
5:13722852352 6:18819911680 7:24701134848 8:31366518784
tcmalloc: large alloc 9409961984 bytes == 0x19349e6000 @
tcmalloc: large alloc 13722853376 bytes == 0x1b657f @
tcmalloc: large alloc 18819915776 bytes == 0x34bec4000 @
tcmalloc: large alloc 24701140992 bytes == 0x7adad6000 @
tcmalloc: large alloc 31366520832 bytes == 0xd6dfae000 @
Last input should have been poison.
util/file.cc:274 in void util::ErsatzPWrite(int, const void*, std::size_t,
uint64_t) threw FDException'.
No space left on device in /tmp/TuM5Ow (deleted) while writing 13586550656
bytes at offset 49146486784


Regards,
Liling
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] KenLM with 16GB of texts

2015-03-25 Thread Marcin Junczys-Dowmunt
 

Hi, 

You do not have enough space in /tmp; see "No space left on device in
/tmp/TuM5Ow". The poison message is just another echo of that. You can
use the -T option to point lmplz at a temporary-file path where you have
more space. You probably need something around 100-200 GB. (Is that
16 GB of compressed or uncompressed text? If compressed, then probably
more.)
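
As a rough illustration of the advice above, here is a minimal Python
sketch that checks free space on a candidate temporary directory and
then builds an lmplz command line using -T. The directory /data/lm-tmp
and the output name train-lm.arpa are hypothetical; the 200 GB threshold
is simply the upper end of the estimate above, not a measured
requirement. The sketch only prints the command; it does not run lmplz.

```python
import os
import shlex
import shutil

GB = 10 ** 9  # decimal gigabytes


def free_gb(path):
    """Return the free space, in GB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / GB


def lmplz_command(order, text_path, arpa_path, temp_dir):
    """Build an lmplz argument list that keeps temporary files under `temp_dir`."""
    # Add a trailing slash so the -T prefix unambiguously names a directory.
    prefix = temp_dir.rstrip("/") + "/"
    return ["lmplz", "-o", str(order), "-T", prefix,
            "--text", text_path, "--arpa", arpa_path]


if __name__ == "__main__":
    temp_dir = "/data/lm-tmp"  # hypothetical directory on a larger disk
    needed_gb = 200            # assumption: upper end of the 100-200 GB estimate

    if os.path.isdir(temp_dir) and free_gb(temp_dir) >= needed_gb:
        cmd = lmplz_command(
            8,
            "/home/gillin/wmt15/corpus.truecase/train-lm.en",
            "train-lm.arpa",   # hypothetical output file name
            temp_dir,
        )
        print(shlex.join(cmd))
    else:
        print(f"{temp_dir} is missing or has less than {needed_gb} GB free")
```

Checking before starting is worthwhile here because the run in the log
only failed at step 4 of 5, after the counting and sorting passes had
already completed.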

Best, 

Marcin 
