Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model and Train Model
Hi Eric,

many thanks for your reply. That worked I think? mert-moses-run.out said:

    After default: -l mem_free=0.5G -hard
    Using SCRIPTS_ROOTDIR: /usr/share/moses/scripts
    SYNC distortionchecking weight-count for ttable-file
    checking weight-count for lmodel-file
    checking weight-count for distortion-file
    Executing: mkdir -p europarl/tuning
    Executing: /usr/share/moses/scripts/training/filter-model-given-input.pl ./filtered /home/llio/MOSES/model/moses.ini /home/llio/MOSES/europarl/tuning/input
    filtering the phrase tables... Sat Aug 23 16:16:48 BST 2008
    Executing: mkdir -p /home/llio/MOSES/europarl/tuning/filtered
    Considering factor 0
    Considering factor 0

It took a few seconds, which surprised me because the tutorial said: 'Note that this step can take many hours, even days, to run.' But I've ended up with a filtered folder containing

    -rw-r--r-- 1 llio llio 1048 2008-08-23 16:16 moses.ini
    -rw-r--r-- 1 llio llio 204201984 2008-08-23 16:23 phrase-table.0-0.1

(the date and time are wrong on this machine).

Llio

On Wed, Aug 20, 2008 at 1:12 AM, Eric Nichols <[EMAIL PROTECTED]> wrote:
> Greetings,
>
> That line in mert-moses.pl is checking to see if moses is executable
> but can't find it.
> Replace moses in your command with /usr/bin/moses :
>
> mert-moses.pl europarl/tuning/input europarl/tuning/reference /usr/bin/moses
> model/moses.ini --working-dir europarl/tuning --rootdir
> /usr/share/moses/scripts >&mert-moses-run.out
>
> Eric Nichols
>
> On Wed, Aug 20, 2008 at 12:40 AM, Llio Humphreys <[EMAIL PROTECTED]> wrote:
>> Dear Eric/Moses Support Group,
>>
>> I am using Ubuntu with 3.5GB RAM and finally got
>> train-factored-phrase-model.perl to run!
>> I am now on the tuning part of the tutorial, and I'm still using the
>> Baseline data to test out the system on my machine.
Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model and Train Model
Greetings,

That line in mert-moses.pl checks whether the decoder you passed is an executable file. It is given the bare name moses, which it treats as a path relative to the current directory rather than looking it up in your PATH, so it cannot find it.
Replace moses in your command with /usr/bin/moses :

    mert-moses.pl europarl/tuning/input europarl/tuning/reference /usr/bin/moses model/moses.ini --working-dir europarl/tuning --rootdir /usr/share/moses/scripts >&mert-moses-run.out

Eric Nichols

On Wed, Aug 20, 2008 at 12:40 AM, Llio Humphreys <[EMAIL PROTECTED]> wrote:
> Dear Eric/Moses Support Group,
>
> I am using Ubuntu with 3.5GB RAM and finally got
> train-factored-phrase-model.perl to run!
> I am now on the tuning part of the tutorial, and I'm still using the
> Baseline data to test out the system on my machine.
> I adapted the command for tuning from:
>
> bin/moses-scripts/scripts-MMDD-HHMM/training/mert-moses.pl
> working-dir/tuning/input working-dir/tuning/reference
> moses/moses-cmd/src/moses working-dir/model/moses.ini --working-dir
> working-dir/tuning --rootdir bin/moses-scripts/scripts-MMDD-HHMM
>
> to
>
> mert-moses.pl europarl/tuning/input europarl/tuning/reference moses
> model/moses.ini --working-dir europarl/tuning --rootdir
> /usr/share/moses/scripts >&mert-moses-run.out
>
> I get the error message:
>
> After default: -l mem_free=0.5G -hard
> Using SCRIPTS_ROOTDIR: /usr/share/moses/scripts
> Not executable: moses at /usr/bin/mert-moses.pl line 297.
>
> mert-moses.pl line 297 is empty but the previous line says:
>
> die "Not executable: $___DECODER" if ! -x $___DECODER;
>
> Grateful for your advice.
>
> Thanks,
> Llio Humphreys
>
> On Thu, Aug 14, 2008 at 12:52 PM, Eric Nichols <[EMAIL PROTECTED]> wrote:
>> Greetings,
>>
>> In the moses package, I install everything into /usr/share/moses and
>> symlink the scripts and moses command into /usr/bin.
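The check discussed in this message can be sketched in shell. Perl's -x, like the shell's [ -x ], tests a file path and does not search PATH, so the bare name moses only passes if a file of that name sits in the current directory. The helper below is a minimal sketch under that assumption; resolve_decoder is a hypothetical name and is not part of Moses.

```shell
#!/bin/sh
# Hypothetical helper (not part of Moses): turn a decoder name into a path
# that would pass a Perl-style "-x" file test.
resolve_decoder() {
    name=$1
    case $name in
        */*) path=$name ;;                              # already a path: use it as given
        *)   path=$(command -v "$name") || return 1 ;;  # bare name: look it up in PATH
    esac
    [ -x "$path" ] || return 1                          # same test mert-moses.pl applies
    printf '%s\n' "$path"
}

# Possible usage (paths taken from the thread):
#   decoder=$(resolve_decoder moses) &&
#   mert-moses.pl europarl/tuning/input europarl/tuning/reference "$decoder" \
#       model/moses.ini --working-dir europarl/tuning \
#       --rootdir /usr/share/moses/scripts >mert-moses-run.out 2>&1
```

Passing the resolved path is equivalent to typing /usr/bin/moses by hand, but it keeps working if the package installs the decoder somewhere else.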
Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model and Train Model
Dear Eric/Moses Support Group,

I am using Ubuntu with 3.5GB RAM and finally got train-factored-phrase-model.perl to run!
I am now on the tuning part of the tutorial, and I'm still using the Baseline data to test out the system on my machine.
I adapted the command for tuning from:

    bin/moses-scripts/scripts-MMDD-HHMM/training/mert-moses.pl working-dir/tuning/input working-dir/tuning/reference moses/moses-cmd/src/moses working-dir/model/moses.ini --working-dir working-dir/tuning --rootdir bin/moses-scripts/scripts-MMDD-HHMM

to

    mert-moses.pl europarl/tuning/input europarl/tuning/reference moses model/moses.ini --working-dir europarl/tuning --rootdir /usr/share/moses/scripts >&mert-moses-run.out

I get the error message:

    After default: -l mem_free=0.5G -hard
    Using SCRIPTS_ROOTDIR: /usr/share/moses/scripts
    Not executable: moses at /usr/bin/mert-moses.pl line 297.

mert-moses.pl line 297 is empty but the previous line says:

    die "Not executable: $___DECODER" if ! -x $___DECODER;

Grateful for your advice.

Thanks,
Llio Humphreys

On Thu, Aug 14, 2008 at 12:52 PM, Eric Nichols <[EMAIL PROTECTED]> wrote:
> Greetings,
>
> In the moses package, I install everything into /usr/share/moses and
> symlink the scripts and moses command into /usr/bin.
> You can see a list of installed files by running the following command:
>
> # dpkg -L moses
>
> When you call a command like ngram-count or
> train-factored-phrase-model.perl, you do not need to specify the full
> path;
> the system will be able to find it. I do not know if it is strictly
> necessary to set -scripts-root-dir, but the value
> /usr/share/moses/scripts works fine.
>
> Eric Nichols
>
> On Thu, Aug 14, 2008 at 8:02 PM, Llio Humphreys <[EMAIL PROTECTED]> wrote:
>> Dear Murat, Anung, Hieu, Josh, Eric, Miles, Sara, Amittai,
>> thank you all for your help. It is very, very much appreciated. I
>> decided to try Eric's packages, and it looks like the installation
>> worked.
Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model and Train Model
Dear Josh,

Ok, I ran the command, with order 3 as I'm just testing if this system works on this machine, and it produced europarl.lm in seconds and output:

    [1] 15789

n-gram.out said:

    nohup: ignoring input

nohup.out said:

    Warning: ngram-count option "-text" needs an argument
    one of required modified KneserNey count-of-counts is zero
    error in discount estimator for order 1

But I've looked at europarl.lm and it looks fine to me, it even ends with \end\ so it obviously finished the process. I guess if there's anything wrong, I'll find out in the next step?

Llio

On Thu, Aug 14, 2008 at 12:35 PM, Josh Schroeder <[EMAIL PROTECTED]> wrote:
> ngram-count is outputting an LM file specified by the -lm argument.
> "working-dir/lm/europarl.lm" in your case.
>
> I think it counts all ngrams first and then writes the file once at the end,
> so you probably didn't corrupt the output by accidentally starting a new
> process.
>
> If you want it to train quicker/don't have enough memory, try an order of 4
> or even 3. Higher order LM models take more time to calculate and more RAM
> to hold in memory. The "-lm 0:5:working-dir/lm/europarl.lm:0" arg to
> train-factored-phrase-model includes the LM order, so change that 5 to the
> appropriate number when you run that step.
>
> You mentioned having trouble getting stderr from train-factored-phrase-model
> in another email, and it seems like ngram-count is making your system
> unresponsive. Do a web search and learn about the unix 'nohup' and 'nice'
> commands, as well as redirecting stderr and stdout to a file, and running
> processes in the background.
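Checking that europarl.lm ends with \end\ can be done mechanically. The sketch below is a cheap structural test only, assuming the file is in the plain-text ARPA format that ngram-count writes; lm_looks_complete is a hypothetical helper name. Passing it shows that the file has a \data\ header and a closing \end\ marker, not that the discounts were estimated correctly, so the warnings in nohup.out still deserve a look.

```shell
#!/bin/sh
# Hypothetical helper: cheap structural check of an ARPA-format LM file.
# It does NOT validate probabilities or n-gram counts.
lm_looks_complete() {
    lm=$1
    [ -s "$lm" ] || return 1                 # file must exist and be non-empty
    grep -q '^\\data\\' "$lm" || return 1    # header section is present
    # the last non-blank line must be the closing \end\ marker
    [ "$(grep -v '^[[:space:]]*$' "$lm" | tail -n 1)" = '\end\' ]
}

# Possible usage:
#   lm_looks_complete europarl/lm/europarl.lm && echo "structure looks complete"
```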
Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model and Train Model
building language models (using for example ngram-count) is computationally expensive. from what you tell the list, it seems that you don't have enough physical memory to run it properly.

you have a number of options:

--specify a lower order model (eg 4 rather than 5, or even 3); depending upon how much monolingual training material you have, this may not produce worse results and it will certainly run faster and will require less space.

--divide your language model training material into chunks and run ngram-count on each chunk. this is one strategy for building LMs using all of the Giga word corpus (when you don't have access to a 64 bit machine). here you would create multiple LMs.

--use a disk-based method of creating them. we have done this, and basically it trades speed for memory.

--take the radical option and simply don't bother smoothing at all (ie use Google's "stupid backoff"). this makes training LMs trivial --just compute the counts of ngrams and work out how to store them. i reckon it should be possible to do this and create an ARPA file suitable for loading into the SRILM.

--buy more machines.

Miles

2008/8/14 Llio Humphreys <[EMAIL PROTECTED]>
> Dear Murat, Anung, Hieu, Josh, Eric, Miles, Sara, Amittai,
> thank you all for your help. It is very, very much appreciated. I
> decided to try Eric's packages, and it looks like the installation
> worked. I typed some of the
> commands in the Baseline instructions without arguments, and the
> program either output to the screen that I missed some arguments or
> gave a description of the program. Thank you Eric!!!
>
> Following the Baseline instructions
> (http://www.statmt.org/wmt08/baseline.html) I have now got to the
> following step:
>
> Use SRILM to build language model:
> /path-to-srilm/bin/i686/ngram-count -order 5 -interpolate -kndiscount
> -text working-dir/lm/europarl.lowercased -lm
> working-dir/lm/europarl.lm
>
> In my case, I was in folder home/llio/MOSESMTDATA.
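The chunking option can be sketched as follows. This is a minimal illustration, assuming one sentence per line in the corpus; split_corpus, the chunk size, and the chunks/ directory are hypothetical choices, and how the per-chunk models are combined afterwards is left open here.

```shell
#!/bin/sh
# Hypothetical helper: split a corpus into chunks of a fixed number of lines
# and list the chunk files that were written.
split_corpus() {
    corpus=$1
    lines_per_chunk=$2
    outdir=$3
    mkdir -p "$outdir" || return 1
    split -l "$lines_per_chunk" "$corpus" "$outdir/chunk." || return 1
    ls "$outdir"/chunk.*
}

# Possible usage, building one LM per chunk with the flags used in the thread
# (the chunk size is an arbitrary example value):
#   for c in $(split_corpus europarl/lm/europarl.lowercased 200000 chunks); do
#       ngram-count -order 4 -interpolate -kndiscount -text "$c" -lm "$c.lm"
#   done
```

Each run then only has to hold the counts for one chunk in memory, at the price of ending up with several smaller models.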
Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model and Train Model
Hi Miles/Josh,

thanks for your replies. Looking at the options, the first one seems to be the easiest one to try first:

> --specify a lower order model (eg 4 rather than 5, or even 3); depending
> upon how much monolingual training material you have, this may not produce
> worse results and it will certainly run faster and will require less space.

I take it that it won't be a problem to stop what it's doing and run this command in the same terminal with order 4. So, I'll proceed with Josh's suggestion:

    nohup nice ngram-count -order 4 -interpolate -kndiscount -text europarl/lm/europarl.lowercased -lm europarl/lm/europarl.lm &> ngram-run.out &

Many thanks,

Llio

On Thu, Aug 14, 2008 at 12:29 PM, Miles Osborne <[EMAIL PROTECTED]> wrote:
> (my message bounced as it was too long ... here is a truncated version)
>
> Miles
>
> -- Forwarded message --
> From: Miles Osborne <[EMAIL PROTECTED]>
> Date: 2008/8/14
> Subject: Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model
> and Train Model
> To: Llio Humphreys <[EMAIL PROTECTED]>
> Cc: moses-support
>
> building language models (using for example ngram-count) is computationally
> expensive. from what you tell the list, it seems that you don't have enough
> physical memory to run it properly.
>
> you have a number of options:
>
> --specify a lower order model (eg 4 rather than 5, or even 3); depending
> upon how much monolingual training material you have, this may not produce
> worse results and it will certainly run faster and will require less space.
>
> --divide your language model training material into chunks and run
> ngram-count on each chunk. this is one strategy for building LMs using all
> of the Giga word corpus (when you don't have access to a 64 bit machine).
> here you would create multiple LMs.
>
> --use a disk-based method of creating them. we have done this, and
> basically it trades speed for memory.
>
> --take the radical option and simply don't bother smoothing at all (ie use
> Google's "stupid backoff"). this makes training LMs trivial --just compute
> the counts of ngrams and work-out how to store them. i reckon it should be
> possible to do this and create an ARPA file suitable for loading into the
> SRILM.
>
> --buy more machines.
>
> Miles

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
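The nohup/nice command above can be wrapped so that the process ID is kept for later checks. This is a minimal sketch; run_detached is a hypothetical helper name. It uses "> file 2>&1" rather than "&>" because the latter is a bash extension, while the former also works in a plain POSIX sh.

```shell
#!/bin/sh
# Hypothetical helper: start a long job in the background at low priority,
# immune to the terminal closing, with stdout and stderr sent to one log file.
# Prints the PID of the job.
run_detached() {
    log=$1
    shift
    nohup nice "$@" > "$log" 2>&1 &
    echo $!
}

# Possible usage with the command from the thread:
#   pid=$(run_detached ngram-run.out ngram-count -order 4 -interpolate \
#         -kndiscount -text europarl/lm/europarl.lowercased \
#         -lm europarl/lm/europarl.lm)
#   tail -f ngram-run.out                    # watch progress; Ctrl-C stops only tail
#   kill -0 "$pid" && echo "still running"   # check without touching the job
```

With the PID saved, there is no need to hunt for the original terminal to find out whether the job is still alive.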
Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model and Train Model
Greetings,

In the moses package, I install everything into /usr/share/moses and symlink the scripts and moses command into /usr/bin.
You can see a list of installed files by running the following command:

    # dpkg -L moses

When you call a command like ngram-count or train-factored-phrase-model.perl, you do not need to specify the full path; the system will be able to find it. I do not know if it is strictly necessary to set -scripts-root-dir, but the value /usr/share/moses/scripts works fine.

Eric Nichols

On Thu, Aug 14, 2008 at 8:02 PM, Llio Humphreys <[EMAIL PROTECTED]> wrote:
> Dear Murat, Anung, Hieu, Josh, Eric, Miles, Sara, Amittai,
> thank you all for your help. It is very, very much appreciated. I
> decided to try Eric's packages, and it looks like the installation
> worked. I typed some of the
> commands in the Baseline instructions without arguments, and the
> program either output to the screen that I missed some arguments or
> gave a description of the program. Thank you Eric!!!
>
> Following the Baseline instructions
> (http://www.statmt.org/wmt08/baseline.html) I have now got to the
> following step:
>
> Use SRILM to build language model:
> /path-to-srilm/bin/i686/ngram-count -order 5 -interpolate -kndiscount
> -text working-dir/lm/europarl.lowercased -lm
> working-dir/lm/europarl.lm
>
> In my case, I was in folder home/llio/MOSESMTDATA. I didn't know the
> path to ngram-count, but it was possible to invoke it without the
> path:
>
> ngram-count -order 5 -interpolate -kndiscount -text
> europarl/lm/europarl.lowercased -lm europarl/lm/europarl.lm
>
> I'm concerned about two things:
> 1) this ngram-count step is taking a very long time. I think I started
> it off around 6pm yesterday, but it's still going. It's very
> resource-intensive, and it's difficult to get to other windows open.
> I went to check up on it around 9pm, and couldn't find that particular
> terminal. I thought I had closed that terminal by mistake, so I stupidly
> opened another one, and entered the same command. I subsequently
> found that the original terminal was still open, so I closed the
> second one. I'm not sure if issuing this command a second time on the
> same program and files on a different terminal would corrupt the
> original ngram-count step, and whether I should start it off again, or
> whether starting it off again would make things worse? I looked up
> ngram-count
> (http://www.speech.sri.com/projects/srilm/manpages/ngram-count.1.html)
> and I don't think it outputs to any file, so I guess you have to be in
> the same terminal to do the next step? I opened
> another terminal and typed 'top' to see what processes are running,
> and I know that ngram-count is doing something, but whether it's doing
> well or stuck in a loop, I can't say. What I do find strange is that
> the time for ngram-count is said to be 00:58:20, and it's been going
> for hours.. I searched this problem in previous Moses Group emails and
> I understand that if I run this with order 4 instead of 5 it will run
> quicker with very similar results? So, can I just stop what it's
> doing, and run this command in the same terminal with order 4? Are
> there any files I need to 'touch' to ensure that it doesn't leave any
> stone unturned?
>
> 2) how to do the next step:
>
> bin/moses-scripts/scripts-MMDD-HHMM/training/train-factored-phrase-model.perl
> -scripts-root-dir bin/moses-scripts/scripts-MMDD-HHMM -root-dir
> working-dir -corpus working-dir/corpus/europarl.lowercased -f fr -e en
> -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm
> 0:5:working-dir/lm/europarl.lm:0
>
> I assume that like ngram-count, I can just type in
> train-factored-phrase-model.perl without the full path...Do I need to
> set the -scripts-root-dir parameter? Are all the scripts in the same
> place?
>
> Thank you,
>
> Llio
>
> On 8/14/08, Murat ALPEREN <[EMAIL PROTECTED]> wrote:
> > Dear Llio,
> >
> > You should be okay with installing moses finally if you have installed all
> > the dependent packages before. I am not aware of the 'whereis' command, but
> > once you train your model, your moses.ini file which is created by training
> > script will take care of the paths. However, you should carefully supply
> > paths while training your model. Before training your model, you should have
> > two separate corpus files which are lowercased, sentence aligned and
> > accordingly tokenized (there are supplementary tools for this). Once you
> > have your corpus in two separate files such as corpus.en, and corpus.fr you
> > will run a training perl script: train-factored-phrase-model.perl with various
> > parameters. If you need further help with this command after installing
> > moses and all training scripts, send me a reply including your exact path
> > for your corpus files and I will try to figure out the training command for
> > your paths.
> >
> > Cheers
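Because the packaged commands are symlinks in /usr/bin, it can help to see where a command really lives. The sketch below assumes GNU readlink (standard on Ubuntu); real_location is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: find a command on PATH and follow any symlinks to the
# file it finally points to.
real_location() {
    path=$(command -v "$1") || return 1   # fails if the command is not on PATH
    readlink -f "$path"                   # GNU readlink: resolve all symlinks
}

# Possible usage:
#   real_location train-factored-phrase-model.perl
#   real_location ngram-count
```

If the printed path for a training script lies under /usr/share/moses/scripts, that directory is the natural value to pass as -scripts-root-dir.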
Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model and Train Model
ngram-count writes its LM to the file given by the -lm argument ("working-dir/lm/europarl.lm" in the Baseline instructions, "europarl/lm/europarl.lm" in the command you ran). I think it counts all the n-grams first and then writes the file once at the end, so you probably didn't corrupt the output by accidentally starting a second process.

If you want it to train quicker, or you don't have enough memory, try an order of 4 or even 3. Higher-order LMs take more time to estimate and more RAM to hold in memory. The "-lm 0:5:working-dir/lm/europarl.lm:0" arg to train-factored-phrase-model includes the LM order, so change that 5 to the appropriate number when you run that step.

You mentioned having trouble getting stderr from train-factored-phrase-model in another email, and it seems like ngram-count is making your system unresponsive. Do a web search and learn about the Unix 'nohup' and 'nice' commands, as well as redirecting stderr and stdout to a file and running processes in the background. You'll end up with something like this, which might not thrash your system as much and won't require you to leave a terminal window open the whole time the process runs:

nohup nice ngram-count -order 4 -interpolate -kndiscount -text europarl/lm/europarl.lowercased -lm europarl/lm/europarl.lm &> ngram-run.out &

Someone familiar with the Ubuntu packages will have to answer whether the moses installation is added to the path, how to call the training scripts, and whether the moses/scripts directory is made & released.

-Josh

On 14 Aug 2008, at 12:02, Llio Humphreys wrote:
> Dear Murat, Anung, Hieu, Josh, Eric, Miles, Sara, Amittai,
> thank you all for your help. It is very, very much appreciated. I
> decided to try Eric's packages, and it looks like the installation
> worked. I typed some of the commands in the Baseline instructions
> without arguments, and the program either output to the screen that
> I missed some arguments or gave a description of the program.
> Thank you Eric!!!
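A footnote on checking up on a long-running ngram-count job like the one discussed in this thread. The TIME column in top reports the CPU time a process has consumed, not the time since it was launched, so a figure such as 00:58:20 after many hours of running is not contradictory; it may simply mean the process has spent most of its life waiting, for example on swap. The sketch below is only an illustration under stated assumptions: the helper names job_times and lm_written are hypothetical (they are not part of SRILM or Moses), and the second helper assumes the LM is written as an uncompressed ARPA-format text file.

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical, not part of SRILM or Moses.

# Print elapsed wall-clock time followed by accumulated CPU time for a PID.
# top's TIME column corresponds to the second figure, which can lag far
# behind the first when a job mostly waits on swap or disk.
job_times() {
    ps -o etime= -o time= -p "$1"
}

# Succeed if the file exists, is non-empty, and contains the \data\ line
# that opens an ARPA-format language model within its first five lines.
# Assumption: the LM is plain text, not gzip-compressed.
lm_written() {
    [ -s "$1" ] && head -n 5 "$1" | grep -q '^\\data\\$'
}
```

After starting the job in the background, something like `job_times "$!"` shows both times side by side, and `lm_written europarl/lm/europarl.lm` returns success only once the LM file has actually been written. A large gap between elapsed and CPU time is a hint, not proof, that the machine is short of memory for the chosen order.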
[Moses-support] Fwd: Moses: Prepare Data, Build Language Model and Train Model
Dear Murat, Anung, Hieu, Josh, Eric, Miles, Sara, Amittai,

thank you all for your help. It is very, very much appreciated. I decided to try Eric's packages, and it looks like the installation worked. I typed some of the commands in the Baseline instructions without arguments, and the program either output to the screen that I missed some arguments or gave a description of the program. Thank you Eric!!!

Following the Baseline instructions (http://www.statmt.org/wmt08/baseline.html) I have now got to the following step:

Use SRILM to build language model:
/path-to-srilm/bin/i686/ngram-count -order 5 -interpolate -kndiscount -text working-dir/lm/europarl.lowercased -lm working-dir/lm/europarl.lm

In my case, I was in folder home/llio/MOSESMTDATA. I didn't know the path to ngram-count, but it was possible to invoke it without the path:

ngram-count -order 5 -interpolate -kndiscount -text europarl/lm/europarl.lowercased -lm europarl/lm/europarl.lm

I'm concerned about two things:

1) this ngram-count step is taking a very long time. I think I started it off around 6pm yesterday, but it's still going. It's very resource-intensive, and it's difficult to get to other open windows. I went to check up on it around 9pm, and couldn't find that particular terminal. I thought I had closed that terminal by mistake, so I stupidly opened another one, and entered the same command. I subsequently found that the original terminal was still open, so I closed the second one. I'm not sure if issuing this command a second time on the same program and files on a different terminal would corrupt the original ngram-count step, and whether I should start it off again, or whether starting it off again would make things worse? I looked up ngram-count (http://www.speech.sri.com/projects/srilm/manpages/ngram-count.1.html) and I don't think it outputs to any file, so I guess you have to be in the same terminal to do the next step?
I opened another terminal and typed 'top' to see what processes are running, and I know that ngram-count is doing something, but whether it's doing well or stuck in a loop, I can't say. What I do find strange is that the time for ngram-count is said to be 00:58:20, and it's been going for hours. I searched for this problem in previous Moses Group emails and I understand that if I run this with order 4 instead of 5 it will run quicker with very similar results? So, can I just stop what it's doing, and run this command in the same terminal with order 4? Are there any files I need to 'touch' to ensure that it doesn't leave any stone unturned?

2) how to do the next step:

bin/moses-scripts/scripts-MMDD-HHMM/training/train-factored-phrase-model.perl -scripts-root-dir bin/moses-scripts/scripts-MMDD-HHMM -root-dir working-dir -corpus working-dir/corpus/europarl.lowercased -f fr -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:5:working-dir/lm/europarl.lm:0

I assume that, like ngram-count, I can just type in train-factored-phrase-model.perl without the full path... Do I need to set the -scripts-root-dir parameter? Are all the scripts in the same place?

Thank you,

Llio

On 8/14/08, Murat ALPEREN <[EMAIL PROTECTED]> wrote:
> Dear Llio,
>
> You should finally be okay with installing moses if you have installed all
> the dependent packages before. I am not aware of the 'whereis' command, but
> once you train your model, your moses.ini file, which is created by the
> training script, will take care of the paths. However, you should carefully
> supply paths while training your model. Before training your model, you
> should have two separate corpus files which are lowercased, sentence aligned
> and accordingly tokenized (there are supplementary tools for this). Once you
> have your corpus in two separate files such as corpus.en and corpus.fr, you
> will run a training perl script, train-factored-phrase-model.perl, with
> various parameters.
> If you need further help with this command after installing
> moses and all training scripts, send me a reply including your exact path
> for your corpus files and I will try to figure out the training command for
> your paths.
>
> Cheers
>
>
> On 8/13/08, Llio Humphreys <[EMAIL PROTECTED]> wrote:
> > Hi Murat,
> > thanks for this. I've got Ubuntu 8.04 so the Hardy Heron packages are
> > what I need also
> > (http://cl.naist.jp/~eric-n/ubuntu-nlp/dists/hardy/all/).
> >
> > I think I already got the order wrong...(sign of panic maybe?)
> > I clicked on mkcls deb and the package installer said it was already
> > installed.
> > I clicked on srilm deb and the package installer said it was already
> > installed, so I clicked Reinstall package.
> >
> > I can't find anything that says the order of installation, but note
> > that the workshop baseline model requires installing giza before mkcls.
> > Do I need to uninstall mkcls (if so how? is it just a matter of
> > deleting the .