Re: [Moses-support] Data for building a factored model

Philipp Koehn Thu, 05 May 2016 15:10:44 -0700

Hi,

Life is easier with factored models if you use the experiment.perl set-up,
where you just have to specify the factor set-up and the scripts that
generate the factors.


These scripts take the tokenized text and replace each word with a factor
(e.g., replace each word with the POS tag).
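
As a rough illustration of what such a factor script does, here is a
minimal Python sketch. The lexicon TOY_TAGS and the function names are made
up for this example and only stand in for a real tagger; how experiment.perl
actually calls the script is not shown here. The one property that matters
is that the output has exactly one factor per input token, on the same line.

```python
# Hypothetical toy lexicon; a real script would call a POS tagger instead.
TOY_TAGS = {"the": "DT", "house": "NN", "is": "VBZ", "small": "JJ"}


def tag_line(line, lexicon=TOY_TAGS, unknown="UNK"):
    """Replace each whitespace-separated token with its tag.

    The number of output tokens always equals the number of input tokens,
    so the factor stays aligned with the tokenized text.
    """
    return " ".join(lexicon.get(tok.lower(), unknown) for tok in line.split())


def tag_corpus(lines):
    """Tag every line of a tokenized corpus, keeping the line count."""
    return [tag_line(line) for line in lines]


if __name__ == "__main__":
    for tagged in tag_corpus(["the house is small", "the garden is small"]):
        print(tagged)
```

Words missing from the toy lexicon come out as UNK here; a real tagger
would assign a proper tag to every token.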

The POS LM is trained on such a corpus - each word is replaced by a
POS tag, and then the standard LM training process is run over it.
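
If you already have a factored corpus in the word|pos style of the
tutorial, the POS-only training text can be obtained by keeping just one
factor per token. The following is a sketch under that assumption; the
function name extract_factor and the factor order (surface form first, POS
second) are chosen for the example, so adjust the index to your own set-up.

```python
def extract_factor(line, index=1, sep="|"):
    """Keep only the factor at position `index` from every token.

    Assumes tokens look like word|pos or word|pos|lemma, with no spaces
    around the separator.
    """
    kept = []
    for token in line.split():
        factors = token.split(sep)
        if index >= len(factors):
            raise ValueError(f"token {token!r} has no factor {index}")
        kept.append(factors[index])
    return " ".join(kept)


if __name__ == "__main__":
    factored = "the|DT house|NN is|VBZ small|JJ"
    print(extract_factor(factored))           # POS sequence for the LM
    print(extract_factor(factored, index=0))  # plain surface words
```

The resulting file, one tag sequence per line, is what the usual n-gram LM
training is then run on.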

See $MOSES/scripts/ems/example/config.factored for an example.

-phi

On Wed, May 4, 2016 at 3:30 PM, Sašo Kuntaric <saso.kunta...@gmail.com> wrote:
> Hello again,
>
> I believe I can wrap my head around the theoretical part, but the English
> and German corpora in the Moses factored model tutorial
> (http://www.statmt.org/moses/?n=Moses.FactoredTutorial) look beautifully
> factored, so my question is how were the original corpora processed? Was a
> specific tagger used and was there any manual/script postprocessing done?
>
> And since I am already bugging everyone, how is the language model pos.lm
> created? Is it extracted from a file, created manually or in another way?
>
> Thank you in advance for all the replies.
>
> Best regards,
>
> Sašo
>
> 2016-05-02 19:45 GMT+02:00 Marwa Refaie <basmal...@hotmail.com>:
>>
>> The corpus for the translation model should be in two parallel files, one
>> per language, with each token written as factors, for example
>> word|pos|lemma. You can prepare the files using WordNet, Stanford, or any
>> tagger and stemmer that can deal with your language pair. Before passing
>> the files to Moses, you may need to adjust the text files with a Python
>> script (which you write yourself).
>>
>> For the language model, build the training text from tag sequences such as:
>> Verb noun noun
>> Noun Det adj
>> ...
>> This depends on the target language only. Then train it as a usual
>> n-gram LM.
>>
>> Sent from my iPad
>>
>> > On May 2, 2016, at 10:11, Sašo Kuntaric <saso.kunta...@gmail.com> wrote:
>> >
>> > Hi all,
>> >
>> > I am having some issues producing the corpora in the correct format for
>> > Moses to execute factored training.
>> >
>> > I am looking at the factored tutorial on the Moses website and I am
>> > wondering how to get such consistent corpora for two languages. What tools
>> > are being used, and can they be trained for specific languages (Slovenian in
>> > my case)? Are such tools available for download, or is such data produced
>> > with custom scripts?
>> >
>> > --
>> > Best regards,
>> >
>> > Sašo
>> > _______________________________________________
>> > Moses-support mailing list
>> > Moses-support@mit.edu
>> > http://mailman.mit.edu/mailman/listinfo/moses-support
>
> --
> Best regards,
>
> Sašo
