Hi Hieu,

let me try to explain. The mxpost program tags the text by joining each word and its tag with an underscore, for example:

We_PRP collect_VBP information_NN ,_, with_IN a_DT view_NN to_TO improve_VBG our_PRP$ website_NN and_CC provide_VBG users_NNS with_IN better_JJR experience_NN ._.

Moses, however, only accepts text in which factors are separated by the pipe symbol, for example:

We|PRP collect|VBP information|NN ,|, with|IN a|DT view|NN to|TO improve|VBG our|PRP$ website|NN and|CC provide|VBG users|NNS with|IN better|JJR experience|NN .|.
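
For reference, the substitution described above can be sketched in Python. This is only a minimal sketch under assumptions: the function name mxpost_to_moses is made up for illustration, the tagger output is assumed to be whitespace-separated word_TAG tokens, and replacing a literal | inside a word with &#124; is assumed to match the escaping Moses expects.

```python
def mxpost_to_moses(line: str) -> str:
    """Convert one line of word_TAG tokens into word|TAG tokens.

    Hypothetical helper, not part of mxpost or Moses.
    """
    factored = []
    # split() with no argument also collapses repeated whitespace,
    # so no empty tokens (and no standalone |TAG entries) are produced.
    for token in line.split():
        # Split on the LAST underscore so words containing '_' survive.
        word, sep, tag = token.rpartition("_")
        if not sep or not word or not tag:
            raise ValueError(f"token is not of the form word_TAG: {token!r}")
        # Assumption: a literal pipe in the surface form must be escaped,
        # because Moses treats '|' as the factor separator.
        word = word.replace("|", "&#124;")
        factored.append(f"{word}|{tag}")
    return " ".join(factored)


if __name__ == "__main__":
    print(mxpost_to_moses("We_PRP collect_VBP information_NN ,_, ._."))
```

Raising an error on malformed tokens, rather than silently passing them through, is a deliberate choice here: it flags the bad line before it reaches training.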
My question is: can a parameter be set in mxpost so that it produces the second output directly? I realize it's only a simple substitution, but it is an extra step, and one has to be careful or errors like the one mentioned earlier occur.

The second part of the question is: can mxpost tag text with additional factors, like lemmas, so that instead of *surface form|POS* my text would be in the format *surface form|POS|lemma*?

And two more general questions. After doing the factored training, should I tune the model, or is that not necessary in factored training?

In the factored training tutorial there is the command

    train-model.perl --root-dir pos --corpus factored-corpus/proj-syndicate.1000 --f de --e en --lm 0:3:factored-corpus/surface.lm --lm 2:3:factored-corpus/pos.lm --translation-factors 0-0,2 --external-bin-dir .../tools

What is the first parameter in the lm specification, namely the 2 in --lm 2:3:factored-corpus/pos.lm? The 3 stands for the 3-gram model, but I am not sure about the first parameter.

Sorry for the long e-mail.

Best regards,

Sašo

2016-06-13 12:12 GMT+02:00 Hieu Hoang <hieuho...@gmail.com>:

>
>
> Hieu Hoang
> http://www.hoang.co.uk/hieu
>
> On 13 June 2016 at 07:51, Sašo Kuntaric <saso.kunta...@gmail.com> wrote:
>
>> Thanks for the tip, however in my case the problem was that after tagging
>> the files with mxpost and post-processing I had some standalone |PRP tags
>> in the source file.
>>
> that suggest the corpus file has not been cleaned. eg. there may be
> multiple white spaces ' '
>
>
>> Once I removed those, training resumed.
>>
>> Which leads me to another question. Since mxpost was used for the Moses
>> tutorial, I was wondering how did you create the input files for Moses
>> after tagging? Was there any post-processing done or can mxpost use the
>> pipes (|) instead of underlines? And one more thing, how can lemmas be
>> added, was a custom tagger project made or is there a parameter which tells
>> mxpost to do it?
>>
> not sure what you mean
>
>>
>> Best regards,
>>
>> Sašo
>>
>> 2016-06-12 21:08 GMT+02:00 Hieu Hoang <hieuho...@gmail.com>:
>>
>>> judging by the source code in mgiza's getSentence.cpp line 366,
>>>
>>>     cerr << "ERROR: Forbidden zero sentence length " <<
>>> sent.sentenceNo << endl;
>>> the 0 in your output is the line number.
>>>
>>> It may be that your corpora was produced on windows and has a BOM at the
>>> beginning
>>>
>>>
>>> On 12/06/2016 10:40, Sašo Kuntaric wrote:
>>>
>>>> Forbidden zero sentence
>>>>
>>>
>>>
>>
>>
>> --
>> lp,
>>
>> Sašo
>>
>
>

--
lp,

Sašo
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support