Thanks for the pointer. No apologies necessary. I'm betting it's just one of those things that another set of eyeballs will see immediately. Here's the sequence of events that replicates (I hope) the demo described on the web site.

start: Tokenize French
gzip -cd corpora/wmt08/training/news-commentary08.fr-en.fr.gz \
  | bin/tokenizer.perl -l fr > demo/corpus/news-commentary.tok.fr
finish: Tokenize French

start: Tokenize English
gzip -cd corpora/wmt08/training/news-commentary08.fr-en.en.gz \
  | bin/tokenizer.perl -l en > demo/corpus/news-commentary.tok.en
finish: Tokenize English

start: Limit sentence length
moses-scripts/scripts-20090913-1332/training/clean-corpus-n.perl \
  demo/corpus/news-commentary.tok fr en demo/corpus/news-commentary.clean 1 40
finish: Limit sentence length

start: Lowercase French training data
bin/lowercase.perl < demo/corpus/news-commentary.clean.fr \
  > demo/corpus/news-commentary.clean.lowercased.fr
finish: Lowercase French training data

start: Lowercase English training data
bin/lowercase.perl < demo/corpus/news-commentary.clean.en \
  > demo/corpus/news-commentary.clean.lowercased.en
finish: Lowercase English training data

start: Lowercase all English training data
bin/lowercase.perl < demo/corpus/news-commentary.tok.en \
  > demo/lm/news-commentary.lowercased.en
finish: Lowercase all English training data

start: Lowercase all French training data
bin/lowercase.perl < demo/corpus/news-commentary.tok.fr \
  > demo/lm/news-commentary.lowercased.fr
finish: Lowercase all French training data

start: Build trigram model for English
bin/i686/ngram-count -order 3 -interpolate -kndiscount -unk \
  -text demo/lm/news-commentary.lowercased.en -lm demo/lm/news-commentary.lm
finish: Build trigram model for English

start: Build trigram model for French
bin/i686/ngram-count -order 3 -interpolate -kndiscount -unk \
  -text demo/lm/news-commentary.lowercased.fr -lm demo/lm/news-commentary.lm
finish: Build trigram model for French

start: Build the language model
moses-scripts/scripts-20090913-1332/training/train-factored-phrase-model.perl \
  -scripts-root-dir /home/jkolen/trans/moses-scripts/scripts-20090913-1332/ \
  -root-dir demo -corpus demo/lm/news-commentary.lowercased \
  -f fr -e en -alignment grow-diag-final-and \
  -reordering msd-bidirectional-fe \
  -lm 0:3:/home/jkolen/trans/demo/lm/news-commentary.lm
finish: Build the language model

On Mon, Sep 14, 2009 at 7:07 AM, John Burger <[email protected]> wrote:

> John Kolen wrote:
>
>> Yes, the output log is reporting many zero length sentences. I must have
>> something misconfigured up stream.
>
> I find the clean-corpus-n.perl script included with the Moses distribution
> to be useful here. I have a target in my Makefile that looks like this:
>
>   LENGTHLIMIT=40
>   %.clean.fr %.clean.en: %.en %.fr
>           ./moses-scripts/scripts/training/clean-corpus-n.perl $* fr en $*.clean \
>                   1 $(LENGTHLIMIT)
>
> If you don't use Makefiles, this might be something like this:
>
>   clean-corpus-n.perl data fr en data.clean 1 40
>
> This creates data.clean.en and .fr from data.en and .fr, filtering out
> pairs if either segment has length less than 1 (which solves your problem)
> or more than 40. The script will also optionally take care of lowercasing
> the data, although we do that elsewhere.
>
> (Apologies if you already know about this.)
>
> - JB
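
For reference, the length filter described in the quoted message can be sketched in a few lines of Python. This is only an illustration of the rule as stated there (drop a pair if either segment has fewer than 1 or more than 40 tokens), not the actual logic of clean-corpus-n.perl, which may apply further checks. The function names keep_pair and filter_pairs are made up for this sketch, and tokens are assumed to be whitespace-separated.

```python
def keep_pair(src_line, tgt_line, min_len=1, max_len=40):
    """Return True if both segments have between min_len and max_len tokens.

    Hypothetical helper: tokens are assumed to be whitespace-separated,
    as they would be after running the tokenizer.
    """
    for line in (src_line, tgt_line):
        n_tokens = len(line.split())
        if n_tokens < min_len or n_tokens > max_len:
            return False
    return True


def filter_pairs(src_lines, tgt_lines, min_len=1, max_len=40):
    """Keep only the aligned pairs that pass keep_pair, preserving order."""
    return [
        (src, tgt)
        for src, tgt in zip(src_lines, tgt_lines)
        if keep_pair(src, tgt, min_len, max_len)
    ]


if __name__ == "__main__":
    fr = ["bonjour le monde", "", "oui"]
    en = ["hello world", "empty source", ""]
    # Only the first pair survives: the other two have an empty side.
    print(filter_pairs(fr, en))
```

Filtering both sides together is what keeps the two files aligned line by line; dropping empty lines from one file alone would shift every later pair.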
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
