Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model and Train Model

2008-08-19 Thread Llio Humphreys
Hi Eric,
many thanks for your reply.  That worked I think?  mert-moses-run.out said:

After default: -l mem_free=0.5G -hard
Using SCRIPTS_ROOTDIR: /usr/share/moses/scripts
SYNC distortion
checking weight-count for ttable-file
checking weight-count for lmodel-file
checking weight-count for distortion-file
Executing: mkdir -p europarl/tuning
Executing: /usr/share/moses/scripts/training/filter-model-given-input.pl
./filtered /home/llio/MOSES/model/moses.ini
/home/llio/MOSES/europarl/tuning/input
filtering the phrase tables... Sat Aug 23 16:16:48 BST 2008
Executing: mkdir -p /home/llio/MOSES/europarl/tuning/filtered
Considering factor 0
Considering factor 0

It took a few seconds, which surprised me because the tutorial said:
'Note that this step can take many hours, even days, to run.'
But I've ended up with a filtered folder containing

-rw-r--r-- 1 llio llio  1048 2008-08-23 16:16 moses.ini
-rw-r--r-- 1 llio llio 204201984 2008-08-23 16:23 phrase-table.0-0.1

(the date and time are wrong on this machine).

Llio




On Wed, Aug 20, 2008 at 1:12 AM, Eric Nichols <[EMAIL PROTECTED]> wrote:
> Greetings,
>
> That line in mert-moses.pl is checking to see if moses is executable
> but can't find it.
> Replace moses in your command with /usr/bin/moses :
>
> mert-moses.pl europarl/tuning/input europarl/tuning/reference /usr/bin/moses
> model/moses.ini --working-dir europarl/tuning --rootdir
> /usr/share/moses/scripts >&mert-moses-run.out
>
> Eric Nichols
>
> On Wed, Aug 20, 2008 at 12:40 AM, Llio Humphreys <[EMAIL PROTECTED]> wrote:
>> Dear Eric/Moses Support Group,
>>
>> I am using Ubuntu with 3.5GB RAM and finally got
>> train-factored-phrase-model.perl to run!
>> I am now on the tuning part of the tutorial, and I'm still using the
>> Baseline data to test out the system on my machine.
>> I adapted the command for tuning from:
>>
>> bin/moses-scripts/scripts-MMDD-HHMM/training/mert-moses.pl
>> working-dir/tuning/input working-dir/tuning/reference
>> moses/moses-cmd/src/moses working-dir/model/moses.ini --working-dir
>> working-dir/tuning --rootdir bin/moses-scripts/scripts-MMDD-HHMM
>>
>> to
>>
>> mert-moses.pl europarl/tuning/input europarl/tuning/reference moses
>> model/moses.ini --working-dir europarl/tuning --rootdir
>> /usr/share/moses/scripts >&mert-moses-run.out
>>
>> I get the error message:
>>
>> After default: -l mem_free=0.5G -hard
>> Using SCRIPTS_ROOTDIR: /usr/share/moses/scripts
>> Not executable: moses at /usr/bin/mert-moses.pl line 297.
>>
>> mert-moses.pl line 297 is empty but the previous line says:
>>
>> die "Not executable: $___DECODER" if ! -x $___DECODER;
>>
>> Grateful for your advice.
>>
>> Thanks,
>> Llio Humphreys
>>
>>
>>
>>
>> On Thu, Aug 14, 2008 at 12:52 PM, Eric Nichols <[EMAIL PROTECTED]> wrote:
>>> Greetings,
>>>
>>> In the moses package, I install everything into /usr/share/moses and
>>> symlink the scripts and moses command into /usr/bin.
>>> You can see a list of installed files by running the following command:
>>>
>>> # dpkg -L moses
>>>
>>> When you call a command like ngram-count or
>>> train-factored-phrase-model.perl, you do not need to specify the full
>>> path;
>>> the system will be able to find it. I do not know if it is strictly
>>> necessary to set -scripts-root-dir, but the value
>>> /usr/share/moses/scripts works fine.
>>>
>>> Eric Nichols
>>>

Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model and Train Model

2008-08-19 Thread Eric Nichols
Greetings,

That line in mert-moses.pl is checking to see if moses is executable
but can't find it.
Replace moses in your command with /usr/bin/moses :

mert-moses.pl europarl/tuning/input europarl/tuning/reference /usr/bin/moses
model/moses.ini --working-dir europarl/tuning --rootdir
/usr/share/moses/scripts >&mert-moses-run.out
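
A minimal sketch of why the bare name fails, assuming (as the quoted die line suggests) that the script applies a file test to the decoder argument as a path rather than looking it up on the PATH. Here cat is a stand-in for the decoder so the snippet runs anywhere, and the variable names are illustrative.

```shell
# Stand-in for the decoder; on a real install this would be "moses".
decoder=cat

# Test the bare name from an empty directory: [ -x NAME ] checks a path
# relative to the current directory and does not search PATH.
workdir=$(mktemp -d)
bare=$(cd "$workdir" && if [ -x "$decoder" ]; then echo yes; else echo no; fi)

# command -v does search PATH and prints the resolved location.
resolved=$(command -v "$decoder")
if [ -x "$resolved" ]; then full=yes; else full=no; fi

echo "bare name executable: $bare"
echo "resolved path executable: $full ($resolved)"
rmdir "$workdir"
```

Passing the resolved path, as in the command above, sidesteps the difference.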

Eric Nichols

On Wed, Aug 20, 2008 at 12:40 AM, Llio Humphreys <[EMAIL PROTECTED]> wrote:
> Dear Eric/Moses Support Group,
>
> I am using Ubuntu with 3.5GB RAM and finally got
> train-factored-phrase-model.perl to run!
> I am now on the tuning part of the tutorial, and I'm still using the
> Baseline data to test out the system on my machine.
> I adapted the command for tuning from:
>
> bin/moses-scripts/scripts-MMDD-HHMM/training/mert-moses.pl
> working-dir/tuning/input working-dir/tuning/reference
> moses/moses-cmd/src/moses working-dir/model/moses.ini --working-dir
> working-dir/tuning --rootdir bin/moses-scripts/scripts-MMDD-HHMM
>
> to
>
> mert-moses.pl europarl/tuning/input europarl/tuning/reference moses
> model/moses.ini --working-dir europarl/tuning --rootdir
> /usr/share/moses/scripts >&mert-moses-run.out
>
> I get the error message:
>
> After default: -l mem_free=0.5G -hard
> Using SCRIPTS_ROOTDIR: /usr/share/moses/scripts
> Not executable: moses at /usr/bin/mert-moses.pl line 297.
>
> mert-moses.pl line 297 is empty but the previous line says:
>
> die "Not executable: $___DECODER" if ! -x $___DECODER;
>
> Grateful for your advice.
>
> Thanks,
> Llio Humphreys
>
>
>
>
> On Thu, Aug 14, 2008 at 12:52 PM, Eric Nichols <[EMAIL PROTECTED]> wrote:
>> Greetings,
>>
>> In the moses package, I install everything into /usr/share/moses and
>> symlink the scripts and moses command into /usr/bin.
>> You can see a list of installed files by running the following command:
>>
>> # dpkg -L moses
>>
>> When you call a command like ngram-count or
>> train-factored-phrase-model.perl, you do not need to specify the full
>> path;
>> the system will be able to find it. I do not know if it is strictly
>> necessary to set -scripts-root-dir, but the value
>> /usr/share/moses/scripts works fine.
>>
>> Eric Nichols
>>
>> On Thu, Aug 14, 2008 at 8:02 PM, Llio Humphreys <[EMAIL PROTECTED]> wrote:
>>> Dear Murat, Anung, Hieu, Josh, Eric, Miles, Sara, Amittai,
>>> thank you all for your help.  It is very, very much appreciated. I
>>> decided to try Eric's packages, and it looks like the installation
>>> worked.  I typed some of the
>>>  commands in the Baseline instructions without arguments, and the
>>>  program either output to the screen that I missed some arguments or
>>>  gave a description of the program.  Thank you Eric!!!
>>>
>>>  Following the Baseline instructions
>>>  (http://www.statmt.org/wmt08/baseline.html) I have now got to the
>>>  following step:
>>>
>>>  Use SRILM to build language model:
>>>  /path-to-srilm/bin/i686/ngram-count -order 5 -interpolate -kndiscount
>>>  -text working-dir/lm/europarl.lowercased -lm
>>>  working-dir/lm/europarl.lm
>>>
>>>  In my case, I was in folder home/llio/MOSESMTDATA.  I didn't know the
>>>  path to ngram-count, but it was possible to invoke it without the
>>>  path:
>>>
>>>  ngram-count -order 5 -interpolate -kndiscount -text
>>>  europarl/lm/europarl.lowercased -lm europarl/lm/europarl.lm
>>>
>>>  I'm concerned about two things:
>>>  1) this ngram-count step is taking a very long time.  I think I started
>>>  it off around 6pm yesterday, but it's still going.  It's very
>>>  resource-intensive, and it's difficult to get to  other windows open.
>>>  I went to check up on it around 9pm, and couldn't find that particular
>>>  terminal.  I thought I had closed that terminal by mistake, so I stupidly
>>>  opened another one, and entered the same command.  I subsequently
>>>  found that the original terminal was still open, so I closed the
>>>  second one.  I'm not sure if issuing this command a second time on the
>>>  same program and files on a different terminal would corrupt the
>>>  original ngramcount step, and whether I should start it off again, or
>>>  whether starting it off again would make things worse?   I looked up
>>>  ngram-count 
>>> (http://www.speech.sri.com/projects/srilm/manpages/ngram-count.1.html)
>>>  and I don't think it outputs to any file, so I guess you have to be in
>>>  the same terminal to do the next step?  I opened
>>>  another terminal and typed 'top' to see what processes are running,
>>>  and I know that ngram-count is doing something, but whether it's doing
>>>  well or stuck in a loop, I can't say.  What I do find strange is that
>>> the time for ngram-count is said to be 00:58:20, and it's been going
>>> for hours.. I searched this problem in previous Moses Group emails and
>>> I understand that if I run this with order 4 instead of 5 it will run
>>> quicker with very similar results?  So, can I just stop what it's
>>> doing, and run this command in the same terminal with order 4?  Are
>>> there any files 

Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model and Train Model

2008-08-19 Thread Llio Humphreys
Dear Eric/Moses Support Group,

I am using Ubuntu with 3.5GB RAM and finally got
train-factored-phrase-model.perl to run!
I am now on the tuning part of the tutorial, and I'm still using the
Baseline data to test out the system on my machine.
I adapted the command for tuning from:

bin/moses-scripts/scripts-MMDD-HHMM/training/mert-moses.pl
working-dir/tuning/input working-dir/tuning/reference
moses/moses-cmd/src/moses working-dir/model/moses.ini --working-dir
working-dir/tuning --rootdir bin/moses-scripts/scripts-MMDD-HHMM

to

mert-moses.pl europarl/tuning/input europarl/tuning/reference moses
model/moses.ini --working-dir europarl/tuning --rootdir
/usr/share/moses/scripts >&mert-moses-run.out

I get the error message:

After default: -l mem_free=0.5G -hard
Using SCRIPTS_ROOTDIR: /usr/share/moses/scripts
Not executable: moses at /usr/bin/mert-moses.pl line 297.

mert-moses.pl line 297 is empty but the previous line says:

die "Not executable: $___DECODER" if ! -x $___DECODER;

Grateful for your advice.

Thanks,
Llio Humphreys




On Thu, Aug 14, 2008 at 12:52 PM, Eric Nichols <[EMAIL PROTECTED]> wrote:
> Greetings,
>
> In the moses package, I install everything into /usr/share/moses and
> symlink the scripts and moses command into /usr/bin.
> You can see a list of installed files by running the following command:
>
> # dpkg -L moses
>
> When you call a command like ngram-count or
> train-factored-phrase-model.perl, you do not need to specify the full
> path;
> the system will be able to find it. I do not know if it is strictly
> necessary to set -scripts-root-dir, but the value
> /usr/share/moses/scripts works fine.
>
> Eric Nichols
>
> On Thu, Aug 14, 2008 at 8:02 PM, Llio Humphreys <[EMAIL PROTECTED]> wrote:
>> Dear Murat, Anung, Hieu, Josh, Eric, Miles, Sara, Amittai,
>> thank you all for your help.  It is very, very much appreciated. I
>> decided to try Eric's packages, and it looks like the installation
>> worked.  I typed some of the
>>  commands in the Baseline instructions without arguments, and the
>>  program either output to the screen that I missed some arguments or
>>  gave a description of the program.  Thank you Eric!!!
>>
>>  Following the Baseline instructions
>>  (http://www.statmt.org/wmt08/baseline.html) I have now got to the
>>  following step:
>>
>>  Use SRILM to build language model:
>>  /path-to-srilm/bin/i686/ngram-count -order 5 -interpolate -kndiscount
>>  -text working-dir/lm/europarl.lowercased -lm
>>  working-dir/lm/europarl.lm
>>
>>  In my case, I was in folder home/llio/MOSESMTDATA.  I didn't know the
>>  path to ngram-count, but it was possible to invoke it without the
>>  path:
>>
>>  ngram-count -order 5 -interpolate -kndiscount -text
>>  europarl/lm/europarl.lowercased -lm europarl/lm/europarl.lm
>>
>>  I'm concerned about two things:
>>  1) this ngram-count step is taking a very long time.  I think I started
>>  it off around 6pm yesterday, but it's still going.  It's very
>>  resource-intensive, and it's difficult to get to  other windows open.
>>  I went to check up on it around 9pm, and couldn't find that particular
>>  terminal.  I thought I had closed that terminal by mistake, so I stupidly
>>  opened another one, and entered the same command.  I subsequently
>>  found that the original terminal was still open, so I closed the
>>  second one.  I'm not sure if issuing this command a second time on the
>>  same program and files on a different terminal would corrupt the
>>  original ngramcount step, and whether I should start it off again, or
>>  whether starting it off again would make things worse?   I looked up
>>  ngram-count 
>> (http://www.speech.sri.com/projects/srilm/manpages/ngram-count.1.html)
>>  and I don't think it outputs to any file, so I guess you have to be in
>>  the same terminal to do the next step?  I opened
>>  another terminal and typed 'top' to see what processes are running,
>>  and I know that ngram-count is doing something, but whether it's doing
>>  well or stuck in a loop, I can't say.  What I do find strange is that
>> the time for ngram-count is said to be 00:58:20, and it's been going
>> for hours.. I searched this problem in previous Moses Group emails and
>> I understand that if I run this with order 4 instead of 5 it will run
>> quicker with very similar results?  So, can I just stop what it's
>> doing, and run this command in the same terminal with order 4?  Are
>> there any files I need to 'touch' to ensure that it doesn't leave any
>> stone unturned?
>>
>>  2) how to do the next step:
>>
>>  
>> bin/moses-scripts/scripts-MMDD-HHMM/training/train-factored-phrase-model.perl
>>  -scripts-root-dir bin/moses-scripts/scripts-MMDD-HHMM -root-dir
>>  working-dir -corpus working-dir/corpus/europarl.lowercased -f fr -e en
>>  -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm
>>  0:5:working-dir/lm/europarl.lm:0
>>
>> I assume that like ngram-count, I can just type in
>> train-factored-phrase-model.perl wi

Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model and Train Model

2008-08-14 Thread Llio Humphreys
Dear Josh,
Ok, I ran the command, with order 3 as I'm just testing if this system
works on this machine, and it produced europarl.lm in seconds and
output:
[1] 15789

n-gram.out said:
nohup: ignoring input

nohup.out said:
Warning: ngram-count option "-text" needs an argument
one of required modified KneserNey count-of-counts is zero
error in discount estimator for order 1

But I've looked at europarl.lm and it looks fine to me, it even ends
with \end\ so it obviously finished the process.

I guess if there's anything wrong, I'll find out in the next step?
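
One possible reading of the -text warning, offered as an assumption rather than a diagnosis: if a command that wraps over several lines is pasted into the shell without line continuations, each line runs as a separate command, so the first one ends at -text with no argument. The sketch below uses a hypothetical count_args function in place of ngram-count to show how a trailing backslash keeps the wrapped lines together.

```shell
# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

# With trailing backslashes the three lines form one command.
joined=$(count_args -order 3 -text \
  europarl/lm/europarl.lowercased -lm \
  europarl/lm/europarl.lm)

# Without them, the shell would see only the first line.
first_line_only=$(count_args -order 3 -text)

echo "joined: $joined arguments"
echo "first line only: $first_line_only arguments"
```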

Llio

On Thu, Aug 14, 2008 at 12:35 PM, Josh Schroeder <[EMAIL PROTECTED]> wrote:
> ngram-count is outputting an LM file specified by the -lm argument.
> "working-dir/lm/europarl.lm" in your case.
>
> I think it counts all ngrams first and then writes the file once at the end,
> so you probably didn't corrupt the output by accidentally starting a new
> process.
>
> If you want it to train quicker/don't have enough memory, try an order of 4
> or even 3. Higher order LM models take more time to calculate and more RAM
> to hold in memory. The "-lm 0:5:working-dir/lm/europarl.lm:0" arg to
> train-factored-phrase-model includes the LM order, so change that 5 to the
> appropriate number when you run that step.
>
> You mentioned having trouble getting stderr from train-factored-phrase-model
> in another email, and it seems like ngram-count is making your system
> unresponsive. Do a web search and learn about the unix 'nohup' and 'nice'
> commands, as well as redirecting stderr and stdout to a file, and running
> processes in the background. You'll end up with something like this, which
> might not thrash your system as much, and won't require that you leave a
> terminal window open the whole time a process runs:
>
> nohup nice ngram-count -order 4 -interpolate -kndiscount -text
> europarl/lm/europarl.lowercased -lm europarl/lm/europarl.lm &> ngram-run.out
> &
>
> Someone familiar with the Ubuntu packages will have to answer whether the
> moses installation is added to the path, how to call the training scripts,
> and if the moses/scripts directory is made & released.
>
> -Josh
>
> On 14 Aug 2008, at 12:02, Llio Humphreys wrote:
>
>> Dear Murat, Anung, Hieu, Josh, Eric, Miles, Sara, Amittai,
>> thank you all for your help.  It is very, very much appreciated. I
>> decided to try Eric's packages, and it looks like the installation
>> worked.  I typed some of the
>> commands in the Baseline instructions without arguments, and the
>> program either output to the screen that I missed some arguments or
>> gave a description of the program.  Thank you Eric!!!
>>
>> Following the Baseline instructions
>> (http://www.statmt.org/wmt08/baseline.html) I have now got to the
>> following step:
>>
>> Use SRILM to build language model:
>> /path-to-srilm/bin/i686/ngram-count -order 5 -interpolate -kndiscount
>> -text working-dir/lm/europarl.lowercased -lm
>> working-dir/lm/europarl.lm
>>
>> In my case, I was in folder home/llio/MOSESMTDATA.  I didn't know the
>> path to ngram-count, but it was possible to invoke it without the
>> path:
>>
>> ngram-count -order 5 -interpolate -kndiscount -text
>> europarl/lm/europarl.lowercased -lm europarl/lm/europarl.lm
>>
>> I'm concerned about two things:
>> 1) this ngram-count step is taking a very long time.  I think I started
>> it off around 6pm yesterday, but it's still going.  It's very
>> resource-intensive, and it's difficult to get to  other windows open.
>> I went to check up on it around 9pm, and couldn't find that particular
>> terminal.  I thought I had closed that terminal by mistake, so I stupidly
>> opened another one, and entered the same command.  I subsequently
>> found that the original terminal was still open, so I closed the
>> second one.  I'm not sure if issuing this command a second time on the
>> same program and files on a different terminal would corrupt the
>> original ngramcount step, and whether I should start it off again, or
>> whether starting it off again would make things worse?   I looked up
>> ngram-count
>> (http://www.speech.sri.com/projects/srilm/manpages/ngram-count.1.html)
>> and I don't think it outputs to any file, so I guess you have to be in
>> the same terminal to do the next step?  I opened
>> another terminal and typed 'top' to see what processes are running,
>> and I know that ngram-count is doing something, but whether it's doing
>> well or stuck in a loop, I can't say.  What I do find strange is that
>> the time for ngram-count is said to be 00:58:20, and it's been going
>> for hours.. I searched this problem in previous Moses Group emails and
>> I understand that if I run this with order 4 instead of 5 it will run
>> quicker with very similar results?  So, can I just stop what it's
>> doing, and run this command in the same terminal with order 4?  Are
>> there any files I need to 'touch' to ensure that it doesn't leave any
>> stone unturned?
>>
>> 2) how to do the next step:
>>
>>
>> bin/

Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model and Train Model

2008-08-14 Thread Miles Osborne
building language models (using for example ngram-count) is computationally
expensive.  from what you tell the list, it seems that you don't have enough
physical memory to run it properly.

you have a number of options:

--specify a lower order model (eg 4 rather than 5, or even 3);  depending
upon how much monolingual training material you have, this may not produce
worse results  and it will certainly run faster and will require less space.

--divide your language model training material into chunks and run
ngram-count on each chunk.  this is one strategy for building LMs using all
of the Giga word corpus (when you don't have access to a 64 bit machine).
here you would create multiple LMs.

--use a disk-based method of creating them.  we have done this, and
basically it trades speed for memory.

--take the radical option and simply don't bother smoothing at all (ie use
Google's "stupid backoff").  this makes training LMs trivial --just compute
the counts of ngrams and work-out how to store them.  i reckon it should be
possible to do this and create an ARPA file suitable for loading into the
SRILM.

--buy more machines.
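
The chunking option can be sketched as follows. This is an illustration under stated assumptions: the six-line corpus and the chunk size are made up, and the ngram-count commands are only printed, not run, so the snippet works without SRILM installed. The flags are the ones used elsewhere in this thread.

```shell
# Hypothetical stand-in corpus: six one-sentence lines.
workdir=$(mktemp -d)
printf 'line %s\n' 1 2 3 4 5 6 > "$workdir/corpus.lowercased"

# Split into chunks of 2 lines each: chunk.aa, chunk.ab, chunk.ac
split -l 2 "$workdir/corpus.lowercased" "$workdir/chunk."

# Print (rather than run) one ngram-count command per chunk.
n=0
for chunk in "$workdir"/chunk.*; do
  echo "ngram-count -order 5 -interpolate -kndiscount -text $chunk -lm $chunk.lm"
  n=$((n + 1))
done
echo "chunks: $n"
```

Each chunk would yield its own LM file; how those are then used together is a separate step not shown here.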

Miles

2008/8/14 Llio Humphreys <[EMAIL PROTECTED]>

> Dear Murat, Anung, Hieu, Josh, Eric, Miles, Sara, Amittai,
> thank you all for your help.  It is very, very much appreciated. I
> decided to try Eric's packages, and it looks like the installation
> worked.  I typed some of the
>  commands in the Baseline instructions without arguments, and the
>  program either output to the screen that I missed some arguments or
>  gave a description of the program.  Thank you Eric!!!
>
>  Following the Baseline instructions
>  (http://www.statmt.org/wmt08/baseline.html) I have now got to the
>  following step:
>
>  Use SRILM to build language model:
>  /path-to-srilm/bin/i686/ngram-count -order 5 -interpolate -kndiscount
>  -text working-dir/lm/europarl.lowercased -lm
>  working-dir/lm/europarl.lm
>
>  In my case, I was in folder home/llio/MOSESMTDATA.  I didn't know the
>  path to ngram-count, but it was possible to invoke it without the
>  path:
>
>  ngram-count -order 5 -interpolate -kndiscount -text
>  europarl/lm/europarl.lowercased -lm europarl/lm/europarl.lm
>
>  I'm concerned about two things:
>  1) this ngram-count step is taking a very long time.  I think I started
>  it off around 6pm yesterday, but it's still going.  It's very
>  resource-intensive, and it's difficult to get to  other windows open.
>  I went to check up on it around 9pm, and couldn't find that particular
>  terminal.  I thought I had closed that terminal by mistake, so I stupidly
>  opened another one, and entered the same command.  I subsequently
>  found that the original terminal was still open, so I closed the
>  second one.  I'm not sure if issuing this command a second time on the
>  same program and files on a different terminal would corrupt the
>  original ngramcount step, and whether I should start it off again, or
>  whether starting it off again would make things worse?   I looked up
>  ngram-count (
> http://www.speech.sri.com/projects/srilm/manpages/ngram-count.1.html)
>  and I don't think it outputs to any file, so I guess you have to be in
>  the same terminal to do the next step?  I opened
>  another terminal and typed 'top' to see what processes are running,
>  and I know that ngram-count is doing something, but whether it's doing
>  well or stuck in a loop, I can't say.  What I do find strange is that
> the time for ngram-count is said to be 00:58:20, and it's been going
> for hours.. I searched this problem in previous Moses Group emails and
> I understand that if I run this with order 4 instead of 5 it will run
> quicker with very similar results?  So, can I just stop what it's
> doing, and run this command in the same terminal with order 4?  Are
> there any files I need to 'touch' to ensure that it doesn't leave any
> stone unturned?
>
>  2) how to do the next step:
>
>
>  
> bin/moses-scripts/scripts-MMDD-HHMM/training/train-factored-phrase-model.perl
>  -scripts-root-dir bin/moses-scripts/scripts-MMDD-HHMM -root-dir
>  working-dir -corpus working-dir/corpus/europarl.lowercased -f fr -e en
>  -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm
>  0:5:working-dir/lm/europarl.lm:0
>
> I assume that like ngram-count, I can just type in
> train-factored-phrase-model.perl without the full path... Do I need to
> set the -scripts-root-dir parameter?  Are all the scripts in the same
> place?
>
> Thank you,
>
> Llio
>
>
>
>
>  On 8/14/08, Murat ALPEREN <[EMAIL PROTECTED]> wrote:
>  > Dear Llio,
>  >
>  > You should be okay with installing moses finally if you have installed all
>  > the dependent packages before. I am not aware of the 'whereis' command, but
>  > once you train your model, your moses.ini file which is created by the training
>  > script will take care of the paths. However, you should carefully supply
>  > paths while training your model. Before training your model, y

Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model and Train Model

2008-08-14 Thread Llio Humphreys
Hi Miles/Josh,
thanks for your replies.  Looking at the options, the first one seems
to be the easiest one to try first:-

> --specify a lower order model (eg 4 rather than 5, or even 3);  depending
> upon how much monolingual training material you have, this may not produce
> worse results  and it will certainly run faster and will require less space.

I take it that it won't be a problem to stop what it's doing and run
this command in the same terminal with order 4.
So, I'll proceed with Josh's suggestion:

nohup nice ngram-count -order 4 -interpolate -kndiscount -text
europarl/lm/europarl.lowercased -lm europarl/lm/europarl.lm &>
ngram-run.out &
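
A sketch of the same pattern with a harmless stand-in job, so it can be tried before committing to a long ngram-count run. The stand-in command and the variable names are illustrative. It uses the portable redirection form (> file 2>&1), which in bash behaves like &> file, and records the job's process ID so that its completion can be checked later.

```shell
log=$(mktemp)

# Stand-in for the long-running job: writes to stdout and stderr.
nohup nice sh -c 'echo started; echo "a warning" >&2; sleep 1; echo finished' \
  > "$log" 2>&1 &
pid=$!
echo "running in background as PID $pid"

# Block until the job exits and keep its exit status.
wait "$pid"
status=$?

echo "exit status: $status"
cat "$log"
```

Because both streams go to the log file, any warning the job prints ends up there rather than on the terminal, so the log is worth reading even when the output file looks complete.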

Many thanks,
Llio



On Thu, Aug 14, 2008 at 12:29 PM, Miles Osborne <[EMAIL PROTECTED]> wrote:
> (my message bounced as it was too long ... here is a truncated  version)
>
> Miles
>
> -- Forwarded message --
> From: Miles Osborne <[EMAIL PROTECTED]>
> Date: 2008/8/14
> Subject: Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model
> and Train Model
> To: Llio Humphreys <[EMAIL PROTECTED]>
> Cc: moses-support 
>
>
> building language models (using for example ngram-count) is computationally
> expensive.  from what you tell the list, it seems that you don't have enough
> physical memory to run it properly.
>
> you have a number of options:
>
> --specify a lower order model (eg 4 rather than 5, or even 3);  depending
> upon how much monolingual training material you have, this may not produce
> worse results  and it will certainly run faster and will require less space.
>
> --divide your language model training material into chunks and run
> ngram-count on each chunk.  this is one strategy for building LMs using all
> of the Giga word corpus (when you don't have access to a 64 bit machine).
> here you would create multiple LMs.
>
> --use a disk-based method of creating them.  we have done this, and
> basically it trades speed for memory.
>
> --take the radical option and simply don't bother smoothing at all (ie use
> Google's "stupid backoff").  this makes training LMs trivial --just compute
> the counts of ngrams and work-out how to store them.  i reckon it should be
> possible to do this and create an ARPA file suitable for loading into the
> SRILM.
>
> --buy more machines.
>
> Miles
>
>
>
>
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model and Train Model

2008-08-14 Thread Eric Nichols
Greetings,

In the moses package, I install everything into /usr/share/moses and
symlink the scripts and moses command into /usr/bin.
You can see a list of installed files by running the following command:

# dpkg -L moses

When you call a command like ngram-count or
train-factored-phrase-model.perl, you do not need to specify the full
path;
the system will be able to find it. I do not know if it is strictly
necessary to set -scripts-root-dir, but the value
/usr/share/moses/scripts works fine.
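
To check where a command actually lives, something like the following may help. It is a sketch, not part of the package: sh is used as a stand-in so that it runs on any system, and readlink -f is a GNU coreutils option that may be missing elsewhere.

```shell
# Stand-in for an installed command; on the packaged system this could
# be ngram-count or train-factored-phrase-model.perl.
name=sh

# Where the shell finds it on PATH.
found=$(command -v "$name")

# Follow any symlinks to the real file.
real=$(readlink -f "$found")

echo "$name found at: $found"
echo "resolves to: $real"
```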

Eric Nichols

On Thu, Aug 14, 2008 at 8:02 PM, Llio Humphreys <[EMAIL PROTECTED]> wrote:
> Dear Murat, Anung, Hieu, Josh, Eric, Miles, Sara, Amittai,
> thank you all for your help.  It is very, very much appreciated. I
> decided to try Eric's packages, and it looks like the installation
> worked.  I typed some of the
>  commands in the Baseline instructions without arguments, and the
>  program either output to the screen that I missed some arguments or
>  gave a description of the program.  Thank you Eric!!!
>
>  Following the Baseline instructions
>  (http://www.statmt.org/wmt08/baseline.html) I have now got to the
>  following step:
>
>  Use SRILM to build language model:
>  /path-to-srilm/bin/i686/ngram-count -order 5 -interpolate -kndiscount
>  -text working-dir/lm/europarl.lowercased -lm
>  working-dir/lm/europarl.lm
>
>  In my case, I was in folder home/llio/MOSESMTDATA.  I didn't know the
>  path to ngram-count, but it was possible to invoke it without the
>  path:
>
>  ngram-count -order 5 -interpolate -kndiscount -text
>  europarl/lm/europarl.lowercased -lm europarl/lm/europarl.lm
>
>  I'm concerned about two things:
>  1) this ngram-count step is taking a very long time.  I think I started
>  it off around 6pm yesterday, but it's still going.  It's very
>  resource-intensive, and it's difficult to get to  other windows open.
>  I went to check up on it around 9pm, and couldn't find that particular
>  terminal.  I thought I had closed that terminal by mistake, so I stupidly
>  opened another one, and entered the same command.  I subsequently
>  found that the original terminal was still open, so I closed the
>  second one.  I'm not sure if issuing this command a second time on the
>  same program and files on a different terminal would corrupt the
>  original ngramcount step, and whether I should start it off again, or
>  whether starting it off again would make things worse?   I looked up
>  ngram-count 
> (http://www.speech.sri.com/projects/srilm/manpages/ngram-count.1.html)
>  and I don't think it outputs to any file, so I guess you have to be in
>  the same terminal to do the next step?  I opened
>  another terminal and typed 'top' to see what processes are running,
>  and I know that ngram-count is doing something, but whether it's doing
>  well or stuck in a loop, I can't say.  What I do find strange is that
> the time for ngram-count is said to be 00:58:20, and it's been going
> for hours.. I searched this problem in previous Moses Group emails and
> I understand that if I run this with order 4 instead of 5 it will run
> quicker with very similar results?  So, can I just stop what it's
> doing, and run this command in the same terminal with order 4?  Are
> there any files I need to 'touch' to ensure that it doesn't leave any
> stone unturned?
>
>  2) how to do the next step:
>
>  
> bin/moses-scripts/scripts-MMDD-HHMM/training/train-factored-phrase-model.perl
>  -scripts-root-dir bin/moses-scripts/scripts-MMDD-HHMM -root-dir
>  working-dir -corpus working-dir/corpus/europarl.lowercased -f fr -e en
>  -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm
>  0:5:working-dir/lm/europarl.lm:0
>
> I assume that like ngram-count, I can just type in
> train-factored-phrase-model.perl without the full path... Do I need to
> set the -scripts-root-dir parameter?  Are all the scripts in the same
> place?
>
> Thank you,
>
> Llio
>
>
>
>
>  On 8/14/08, Murat ALPEREN <[EMAIL PROTECTED]> wrote:
>  > Dear Llio,
>  >
>  > You should finally be okay with installing moses if you have installed all
>  > the dependent packages before. I am not aware of the 'whereis' command, but
>  > once you train your model, your moses.ini file which is created by training
>  > script will take care of the paths. However, you should carefully supply
>  > paths while training your model. Before training your model, you should have
>  > two separate corpus files which are lowercased, sentence aligned and
>  > accordingly tokenized (there are supplementary tools for this). Once you
>  > have your corpus in two separate files such as corpus.en, and corpus.fr you
>  > will run a training perl script: train-factored-phrase-model.perl with various
>  > parameters. If you need further help with this command after installing
>  > moses and all training scripts, send me a reply including your exact path
>  > for your corpus files and I will try to figure out the training command for
>  > your paths.
>  >
>  > Cheers
>  

Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model and Train Model

2008-08-14 Thread Josh Schroeder
ngram-count is outputting an LM file specified by the -lm argument:
"europarl/lm/europarl.lm" in your case ("working-dir/lm/europarl.lm" in
the Baseline instructions).

I think it counts all ngrams first and then writes the file once at  
the end, so you probably didn't corrupt the output by accidentally  
starting a new process.

If you want it to train quicker or don't have enough memory, try an order
of 4 or even 3. Higher-order LMs take more time to calculate and
more RAM to hold in memory. The "-lm 0:5:working-dir/lm/europarl.lm:0"
arg to train-factored-phrase-model includes the LM order, so change
that 5 to the appropriate number when you run that step.
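
As a rough sketch of keeping the two steps in sync, the order can live in one shell variable that feeds both the ngram-count arguments and the -lm value. The variable names here are made up for illustration, and the reading of the -lm fields as factor:order:file:type is my understanding of the training script rather than something stated in this thread:

```shell
#!/bin/sh
# Sketch only: ORDER, LM_FILE, NGRAM_ARGS and LM_ARG are illustrative names.
ORDER=4                              # the Baseline instructions use 5
LM_FILE=europarl/lm/europarl.lm      # LM path used earlier in this thread

# Arguments for building the language model with ngram-count.
NGRAM_ARGS="-order $ORDER -interpolate -kndiscount -text europarl/lm/europarl.lowercased -lm $LM_FILE"

# Matching -lm value for the training step
# (assumed field layout: factor:order:file:type).
LM_ARG="0:$ORDER:$LM_FILE:0"

# Print both so they can be checked before anything slow is started.
echo "ngram-count $NGRAM_ARGS"
echo "-lm $LM_ARG"
```

Changing ORDER in one place then changes both commands, which avoids training against an LM built with a different order.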

You mentioned having trouble getting stderr from
train-factored-phrase-model in another email, and it seems like
ngram-count is making your system unresponsive. Do a web search and learn
about the unix 'nohup' and 'nice' commands, as well as redirecting stderr
and stdout to a file, and running processes in the background. You'll end
up with something like this, which might not thrash your system as much,
and won't require that you leave a terminal window open the whole time a
process runs:

nohup nice ngram-count -order 4 -interpolate -kndiscount \
  -text europarl/lm/europarl.lowercased -lm europarl/lm/europarl.lm \
  &> ngram-run.out &
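
To try the same pattern without waiting hours, here is a small sketch that uses a stand-in command (sleep 1) in place of ngram-count. LONG_JOB, LOG and JOB_PID are illustrative names, and the log path is just any writable location:

```shell
#!/bin/sh
# Stand-in for the long-running job; swap in the real ngram-count call.
LONG_JOB="sleep 1"
LOG="${TMPDIR:-/tmp}/ngram-run.out"   # any writable path will do

# nice lowers the CPU priority; nohup keeps the job alive if the terminal
# is closed. '> file 2>&1' is the portable spelling of bash's '&> file'.
nohup nice $LONG_JOB > "$LOG" 2>&1 &
JOB_PID=$!

# kill -0 sends no signal; it only tests whether the process still exists.
if kill -0 "$JOB_PID" 2>/dev/null; then
    echo "job $JOB_PID is running; follow it with: tail -f $LOG"
fi

# Only for this demo: wait for the job and report its exit status.
wait "$JOB_PID"
STATUS=$?
echo "job finished with exit status $STATUS"
```

Note that nice only affects CPU scheduling. If the machine is slow because the job has run out of RAM and is swapping, a lower LM order is more likely to help than a lower priority.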

Someone familiar with the Ubuntu packages will have to answer whether  
the moses installation is added to the path, how to call the training  
scripts, and if the moses/scripts directory is made & released.
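
In the meantime, a generic way to check whether a command can be called without its full path is to ask the shell where it would find it. This is only a sketch: locate_cmd is a made-up helper name, and it says nothing about where the Ubuntu packages actually put the scripts:

```shell
#!/bin/sh
# Report where a command would be found on PATH, if anywhere.
# locate_cmd is an illustrative helper, not part of moses or SRILM.
locate_cmd() {
    if found=$(command -v "$1" 2>/dev/null); then
        echo "$1 -> $found"
    else
        echo "$1 is not on PATH; call it with its full path"
        return 1
    fi
}

# '|| true' keeps the script going when a command is missing.
locate_cmd ngram-count || true
locate_cmd train-factored-phrase-model.perl || true
```

If a script is reported as missing, it has to be called with its full path, and -scripts-root-dir would then point at the directory that contains the training/ folder.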

-Josh

On 14 Aug 2008, at 12:02, Llio Humphreys wrote:

> Dear Murat, Anung, Hieu, Josh, Eric, Miles, Sara, Amittai,
> thank you all for your help.  It is very, very much appreciated. I
> decided to try Eric's packages, and it looks like the installation
> worked.  I typed some of the
> commands in the Baseline instructions without arguments, and the
> program either output to the screen that I missed some arguments or
> gave a description of the program.  Thank you Eric!!!
>
> Following the Baseline instructions
> (http://www.statmt.org/wmt08/baseline.html) I have now got to the
> following step:
>
> Use SRILM to build language model:
> /path-to-srilm/bin/i686/ngram-count -order 5 -interpolate -kndiscount
> -text working-dir/lm/europarl.lowercased -lm
> working-dir/lm/europarl.lm
>
> In my case, I was in folder home/llio/MOSESMTDATA.  I didn't know the
> path to ngram-count, but it was possible to invoke it without the
> path:
>
> ngram-count -order 5 -interpolate -kndiscount -text
> europarl/lm/europarl.lowercased -lm europarl/lm/europarl.lm
>
> I'm concerned about two things:
> 1) this ngram-count step is taking a very long time.  I think I started
> it off around 6pm yesterday, but it's still going.  It's very
> resource-intensive, and it's difficult to get to other open windows.
> I went to check up on it around 9pm, and couldn't find that particular
> terminal.  I thought I had closed that terminal by mistake, so I stupidly
> opened another one, and entered the same command.  I subsequently
> found that the original terminal was still open, so I closed the
> second one.  I'm not sure if issuing this command a second time on the
> same program and files on a different terminal would corrupt the
> original ngramcount step, and whether I should start it off again, or
> whether starting it off again would make things worse?   I looked up
> ngram-count
> (http://www.speech.sri.com/projects/srilm/manpages/ngram-count.1.html)
> and I don't think it outputs to any file, so I guess you have to be in
> the same terminal to do the next step?  I opened
> another terminal and typed 'top' to see what processes are running,
> and I know that ngram-count is doing something, but whether it's doing
> well or stuck in a loop, I can't say.  What I do find strange is that
> the time for ngram-count is said to be 00:58:20, and it's been going
> for hours.. I searched this problem in previous Moses Group emails and
> I understand that if I run this with order 4 instead of 5 it will run
> quicker with very similar results?  So, can I just stop what it's
> doing, and run this command in the same terminal with order 4?  Are
> there any files I need to 'touch' to ensure that it doesn't leave any
> stone unturned?
>
> 2) how to do the next step:
>
> bin/moses-scripts/scripts-MMDD-HHMM/training/train-factored-phrase-model.perl
> -scripts-root-dir bin/moses-scripts/scripts-MMDD-HHMM -root-dir
> working-dir -corpus working-dir/corpus/europarl.lowercased -f fr -e en
> -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm
> 0:5:working-dir/lm/europarl.lm:0
>
> I assume that like ngram-count, I can just type in
> train-factored-phrase-model.perl without the full path...Do I need to
> set the -scripts-root-dir parameter?  Are all the scripts in the same
> place?
>
> Thank you,
>
> Llio



__

[Moses-support] Fwd: Moses: Prepare Data, Build Language Model and Train Model

2008-08-14 Thread Llio Humphreys
Dear Murat, Anung, Hieu, Josh, Eric, Miles, Sara, Amittai,
thank you all for your help.  It is very, very much appreciated. I
decided to try Eric's packages, and it looks like the installation
worked.  I typed some of the
 commands in the Baseline instructions without arguments, and the
 program either output to the screen that I missed some arguments or
 gave a description of the program.  Thank you Eric!!!

 Following the Baseline instructions
 (http://www.statmt.org/wmt08/baseline.html) I have now got to the
 following step:

 Use SRILM to build language model:
 /path-to-srilm/bin/i686/ngram-count -order 5 -interpolate -kndiscount
 -text working-dir/lm/europarl.lowercased -lm
 working-dir/lm/europarl.lm

 In my case, I was in folder home/llio/MOSESMTDATA.  I didn't know the
 path to ngram-count, but it was possible to invoke it without the
 path:

 ngram-count -order 5 -interpolate -kndiscount -text
 europarl/lm/europarl.lowercased -lm europarl/lm/europarl.lm

 I'm concerned about two things:
 1) this ngram-count step is taking a very long time.  I think I started
 it off around 6pm yesterday, but it's still going.  It's very
 resource-intensive, and it's difficult to get to other open windows.
 I went to check up on it around 9pm, and couldn't find that particular
 terminal.  I thought I had closed that terminal by mistake, so I stupidly
 opened another one, and entered the same command.  I subsequently
 found that the original terminal was still open, so I closed the
 second one.  I'm not sure if issuing this command a second time on the
 same program and files on a different terminal would corrupt the
 original ngramcount step, and whether I should start it off again, or
 whether starting it off again would make things worse?   I looked up
 ngram-count 
(http://www.speech.sri.com/projects/srilm/manpages/ngram-count.1.html)
 and I don't think it outputs to any file, so I guess you have to be in
 the same terminal to do the next step?  I opened
 another terminal and typed 'top' to see what processes are running,
 and I know that ngram-count is doing something, but whether it's doing
 well or stuck in a loop, I can't say.  What I do find strange is that
the time for ngram-count is said to be 00:58:20, and it's been going
for hours.. I searched this problem in previous Moses Group emails and
I understand that if I run this with order 4 instead of 5 it will run
quicker with very similar results?  So, can I just stop what it's
doing, and run this command in the same terminal with order 4?  Are
there any files I need to 'touch' to ensure that it doesn't leave any
stone unturned?

 2) how to do the next step:

 
bin/moses-scripts/scripts-MMDD-HHMM/training/train-factored-phrase-model.perl
 -scripts-root-dir bin/moses-scripts/scripts-MMDD-HHMM -root-dir
 working-dir -corpus working-dir/corpus/europarl.lowercased -f fr -e en
 -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm
 0:5:working-dir/lm/europarl.lm:0

I assume that like ngram-count, I can just type in
train-factored-phrase-model.perl without the full path...Do I need to
set the -scripts-root-dir parameter?  Are all the scripts in the same
place?

Thank you,

Llio




 On 8/14/08, Murat ALPEREN <[EMAIL PROTECTED]> wrote:
 > Dear Llio,
 >
 > You should finally be okay with installing moses if you have installed all
 > the dependent packages before. I am not aware of the 'whereis' command, but
 > once you train your model, your moses.ini file which is created by training
 > script will take care of the paths. However, you should carefully supply
 > paths while training your model. Before training your model, you should have
 > two separate corpus files which are lowercased, sentence aligned and
 > accordingly tokenized (there are supplementary tools for this). Once you
 > have your corpus in two separate files such as corpus.en, and corpus.fr you
 > will run a training perl script: train-factored-phrase-model.perl with various
 > parameters. If you need further help with this command after installing
 > moses and all training scripts, send me a reply including your exact path
 > for your corpus files and I will try to figure out the training command for
 > your paths.
 >
 > Cheers
 >
 >
 > On 8/13/08, Llio Humphreys <[EMAIL PROTECTED]> wrote:
 > > Hi Murat,
 > > thanks for this.  I've got Ubuntu 8.04 so the Hardy Heron packages are
 > > what I need also
 > > (http://cl.naist.jp/~eric-n/ubuntu-nlp/dists/hardy/all/).
 > >
 > > I think I already got the order wrong...(sign of panic maybe?)
 > > I clicked on mkcls deb and the package installer said it was already
 > > installed.
 > > I clicked on srilm deb and the package installer said it was already
 > > installed, so I clicked Reinstall package.
 > >
 > > I can't find anything that says the order of installation, but note
 > > that the workshop baseline model requires installing giza before mkcls.
 > > Do I need to uninstall mkcls (if so how? is it just a matter of
 > > deleting the .