Hi Hieu,

Let me try to explain. The mxpost program tags the text by joining each word
to its tag with an underscore, for example:

We_PRP collect_VBP information_NN ,_, with_IN a_DT view_NN to_TO improve_VBG
our_PRP$ website_NN and_CC provide_VBG users_NNS with_IN better_JJR
experience_NN ._.

Moses, however, only accepts text in which the factors are separated by the
pipe symbol, for example:

We|PRP collect|VBP information|NN ,|, with|IN a|DT view|NN to|TO improve|VBG
our|PRP$ website|NN and|CC provide|VBG users|NNS with|IN better|JJR
experience|NN .|.

My question is: can a parameter be set in mxpost so that it produces the
second format directly? I realize it is only a simple substitution, but it is
an extra step, and one has to be careful or errors like the stray |PRP tags I
mentioned earlier occur.
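
To make the substitution concrete, here is a minimal Python sketch of it. It is not part of mxpost or Moses; the function name mxpost_to_moses is made up for illustration. It assumes each token has the form word_TAG, splits on the last underscore only, and rejects tokens with an empty word or tag, since those would produce a bare factor such as |PRP. Escaping a literal pipe in the word as &#124; is also an assumption about how the rest of the pipeline expects pipes to be encoded.

```python
def mxpost_to_moses(line: str) -> str:
    """Convert one line of word_TAG tokens into word|TAG tokens."""
    converted = []
    # split() with no argument also collapses repeated whitespace,
    # so double spaces cannot produce empty tokens.
    for token in line.split():
        # Split on the LAST underscore so words containing "_" survive.
        surface, sep, tag = token.rpartition("_")
        if not sep or not surface or not tag:
            # A missing word or tag would end up as a broken factor.
            raise ValueError(f"malformed token: {token!r}")
        # Assumption: a literal pipe in the word must be escaped because
        # the pipe is the factor separator.
        surface = surface.replace("|", "&#124;")
        converted.append(f"{surface}|{tag}")
    return " ".join(converted)


if __name__ == "__main__":
    print(mxpost_to_moses("We_PRP collect_VBP information_NN ,_, ._."))
```

Raising an error on a malformed token, rather than silently dropping it, makes a problem in the tagged file visible before training starts.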

The second part of the question is whether mxpost can tag text with
additional factors, such as lemmas, so that instead of *surface form|POS* my
text would be in the format *surface form|POS|lemma*.
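
For illustration, this is roughly how a third factor could be appended if the lemmas came from a separate tool. I do not know whether mxpost itself can produce lemmas, so the sketch assumes a hypothetical lemmatizer that writes exactly one lemma per token, in the same order and with the same line breaks as the tagged file. The function name add_lemma_factor is made up.

```python
def add_lemma_factor(factored_line: str, lemma_line: str) -> str:
    """Append one lemma to each surface|POS token of a line."""
    tokens = factored_line.split()
    lemmas = lemma_line.split()
    # Both inputs must be tokenized identically, otherwise the factors
    # would be attached to the wrong words.
    if len(tokens) != len(lemmas):
        raise ValueError(
            f"token count mismatch: {len(tokens)} tokens, {len(lemmas)} lemmas"
        )
    return " ".join(f"{tok}|{lem}" for tok, lem in zip(tokens, lemmas))


if __name__ == "__main__":
    print(add_lemma_factor("provide|VBG users|NNS", "provide user"))
```

The length check matters because a single tokenization difference between the two tools would shift every following lemma by one position.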

And two more general questions. After factored training, should I tune the
model, or is tuning not necessary for factored models?

In the factored training tutorial there is the command:

train-model.perl --root-dir pos --corpus factored-corpus/proj-syndicate.1000
--f de --e en --lm 0:3:factored-corpus/surface.lm
--lm 2:3:factored-corpus/pos.lm --translation-factors 0-0,2
--external-bin-dir .../tools

What is the first field of the --lm argument, namely the 2 in
--lm 2:3:factored-corpus/pos.lm? The 3 stands for a 3-gram model, but I am
not sure about the first field.

Sorry for the long e-mail.

Best regards,

Sašo

2016-06-13 12:12 GMT+02:00 Hieu Hoang <hieuho...@gmail.com>:

>
>
> Hieu Hoang
> http://www.hoang.co.uk/hieu
>
> On 13 June 2016 at 07:51, Sašo Kuntaric <saso.kunta...@gmail.com> wrote:
>
>> Thanks for the tip, however in my case the problem was that after tagging
>> the files with mxpost and post-processing I had some standalone |PRP tags
>> in the source file.
>>
> That suggests the corpus file has not been cleaned, e.g. there may be
> multiple consecutive whitespace characters '   '
>
>
>> Once I removed those, training resumed.
>>
>> Which leads me to another question. Since mxpost was used for the Moses
>> tutorial, I was wondering how you created the input files for Moses
>> after tagging. Was there any post-processing done, or can mxpost use
>> pipes (|) instead of underscores? And one more thing: how can lemmas be
>> added? Was a custom tagger project made, or is there a parameter which
>> tells mxpost to do it?
>>
> not sure what you mean
>
>>
>> Best regards,
>>
>> Sašo
>>
>> 2016-06-12 21:08 GMT+02:00 Hieu Hoang <hieuho...@gmail.com>:
>>
>>> Judging by the source code in mgiza's getSentence.cpp, line 366,
>>>
>>>        cerr << "ERROR: Forbidden zero sentence length " << sent.sentenceNo << endl;
>>>
>>> the 0 in your output is the line number.
>>>
>>> It may be that your corpus was produced on Windows and has a BOM at the
>>> beginning.
>>>
>>>
>>> On 12/06/2016 10:40, Sašo Kuntaric wrote:
>>>
>>>> Forbidden zero sentence
>>>>
>>>
>>>
>>
>>
>> --
>> lp,
>>
>> Sašo
>>
>
>


-- 
lp,

Sašo
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
