Hi,

life is easier with factored models if you use the experiment.perl set-up, where you just have to specify the factor set-up and the scripts that generate the factors.
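To make "scripts that generate factors" concrete, here is a minimal Python sketch of what such a script does to one line of tokenized text. Everything in it is an assumption for illustration: the names toy_tag, factor_line and combine are hypothetical, the tiny lexicon stands in for a real tagger, and the calling convention that experiment.perl expects from a factor script is not shown here.

```python
from typing import Callable, Dict, List

# Hypothetical toy lexicon standing in for a real POS tagger (assumption).
TOY_LEXICON: Dict[str, str] = {
    "the": "DET",
    "house": "NN",
    "is": "VB",
    "small": "ADJ",
}


def toy_tag(tokens: List[str]) -> List[str]:
    """Look each token up in the toy lexicon; unknown words get 'UNK'."""
    return [TOY_LEXICON.get(tok.lower(), "UNK") for tok in tokens]


def factor_line(line: str,
                tagger: Callable[[List[str]], List[str]] = toy_tag) -> str:
    """Replace every token of a tokenized line with its factor (here a POS tag)."""
    tokens = line.split()
    tags = tagger(tokens)
    if len(tags) != len(tokens):
        raise ValueError("tagger must return exactly one tag per token")
    return " ".join(tags)


def combine(words_line: str, factors_line: str) -> str:
    """Join surface words and factors token by token as word|factor."""
    words = words_line.split()
    factors = factors_line.split()
    if len(words) != len(factors):
        raise ValueError("word and factor lines differ in length")
    return " ".join(f"{w}|{f}" for w, f in zip(words, factors))


if __name__ == "__main__":
    sentence = "the house is small"
    tags = factor_line(sentence)
    print(tags)                     # DET NN VB ADJ
    print(combine(sentence, tags))  # the|DET house|NN is|VB small|ADJ
```

Running factor_line over every line of the target side gives a tag-only corpus of the kind a POS LM can be trained on, while combine gives the word|factor layout used for the factored training corpus. A real set-up would swap toy_tag for an actual tagger for the language in question.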
These scripts take the tokenized text and replace each word with a factor
(e.g., replace each word with its POS tag). The POS LM is trained on such a
corpus: each word is replaced by its POS tag, and then the standard LM
training process is run over it.

See $MOSES/scripts/ems/example/config.factored for an example.

-phi

On Wed, May 4, 2016 at 3:30 PM, Sašo Kuntaric <saso.kunta...@gmail.com> wrote:
> Hello again,
>
> I believe I can wrap my head around the theoretical part, but the English
> and German corpora in the Moses factored model tutorial
> (http://www.statmt.org/moses/?n=Moses.FactoredTutorial) look beautifully
> factored, so my question is how were the original corpora processed? Was a
> specific tagger used and was there any manual/script postprocessing done?
>
> And since I am already bugging everyone, how is the language model pos.lm
> created? Is it extracted from a file, created manually or in another way?
>
> Thank you in advance for all the replies.
>
> Best regards,
>
> Sašo
>
> 2016-05-02 19:45 GMT+02:00 Marwa Refaie <basmal...@hotmail.com>:
>>
>> The corpus for the translation model should be in 2 parallel files, in the
>> format Word | pos | Lemma ...., for example, with one file for each
>> language. You can prepare the files using WordNet, Stanford, or any tagger
>> & stemmer that can deal with your language pair. Maybe before you feed the
>> files to Moses you should adjust the text files with a Python script
>> (write it yourself).
>>
>> For the language model ... you must build the corpus as follows:
>> Verb noun noun
>> Noun Det adj
>> .......
>> depending on the target language only, then build it as a usual n-gram LM.
>>
>> Sent from my iPad
>>
>> > On May 2, 2016, at 10:11, Sašo Kuntaric <saso.kunta...@gmail.com> wrote:
>> >
>> > Hi all,
>> >
>> > I am having some issues producing the corpora in the correct format for
>> > Moses to execute factored training.
>> >
>> > I am looking at the factored tutorial on the Moses website and I am
>> > wondering how to get such consistent corpora for two languages. What tools
>> > are being used and can they be trained for specific languages (Slovenian in
>> > my example)? Are such tools available for download or is such data produced
>> > with custom scripts?
>> >
>> > --
>> > Best regards,
>> >
>> > Sašo
>> > _______________________________________________
>> > Moses-support mailing list
>> > Moses-support@mit.edu
>> > http://mailman.mit.edu/mailman/listinfo/moses-support
>
> --
> Best regards,
>
> Sašo