Hi,

I am experimenting with factored training and I have a question about
decoding performance. While decoding some sentences I experience hangups -
probably not real hangups, just excessively long computation.

My models are trained on text annotated in the following way:

SURF|LEMM|POS|OTHER

Training is done with the following parameters:

    --lm 0:3:news-commentary-v7.cs-en.blm.en:8 \
    --lm 2:3:pos.blm.en:8 \
    --translation-factors 1-1+3-2+0-0,2 \
    --generation-factors 1-2+1,2-0 \
    --decoding-steps t0,g0,t1,g1:t2
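
Spelled out in rough Python, this is how I read my own setup (the factor
names SURF/LEMM/POS/OTHER, the indices 0-3 and the step numbering are just
my shorthand, nothing that Moses itself prints):

    # My reading of the factor setup above; names and indices are my shorthand.
    FACTORS = {0: "SURF", 1: "LEMM", 2: "POS", 3: "OTHER"}

    # --translation-factors 1-1+3-2+0-0,2  (source factors -> target factors)
    TRANSLATION_STEPS = {
        "t0": ("1", "1"),    # LEMM -> LEMM
        "t1": ("3", "2"),    # OTHER -> POS
        "t2": ("0", "0,2"),  # SURF -> SURF + POS
    }

    # --generation-factors 1-2+1,2-0  (target factors -> target factors)
    GENERATION_STEPS = {
        "g0": ("1", "2"),    # LEMM -> POS
        "g1": ("1,2", "0"),  # LEMM + POS -> SURF
    }

    # --decoding-steps t0,g0,t1,g1:t2  (two alternative decoding paths)
    DECODING_PATHS = [["t0", "g0", "t1", "g1"], ["t2"]]

    def names(spec):
        return "+".join(FACTORS[int(i)] for i in spec.split(","))

    ALL_STEPS = {**TRANSLATION_STEPS, **GENERATION_STEPS}
    for path in DECODING_PATHS:
        parts = []
        for step in path:
            src, tgt = ALL_STEPS[step]
            parts.append(f"{step}: {names(src)} -> {names(tgt)}")
        print(" ; ".join(parts))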

I suspect a few things that could be causing the issue:


- the POS tags are numbers in the range 1-10; when a tag is ambiguous, the
possible values are separated by commas, so a POS value can be e.g.:
1
3,5
7,0,2,6,1

POS tags are more ambiguous on the target-language side (en), where roughly
half of the tokens carry an ambiguous tag. I can see this causing a sparsity
problem, but I believe it is not a fundamental issue (I try to quantify it
with the count sketch after this list).


- sometimes I do not get values for some factors, so I introduce a
universal placeholder '_' which I put in place of the unknown factors

I imagine this could create a "sink" problem, in the sense of one overly
common token (placeholders are not at all rare in my data); the count sketch
below also reports how often they occur.
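
To quantify both points I run a quick count over the annotated target side,
something like the sketch below ('corpus.factored.en' is just a placeholder
for my actual corpus file):

    # Rough sanity check: how ambiguous is POS, and how often does '_' occur?
    # 'corpus.factored.en' is a placeholder for my real annotated target corpus.
    from collections import Counter

    CORPUS = "corpus.factored.en"
    N_FACTORS = 4   # SURF|LEMM|POS|OTHER
    POS = 2         # index of the POS factor

    tokens = 0
    ambiguous_pos = 0
    placeholders = Counter()

    with open(CORPUS, encoding="utf-8") as f:
        for line in f:
            for token in line.split():
                factors = token.split("|")
                if len(factors) != N_FACTORS:
                    continue  # skip malformed tokens
                tokens += 1
                if "," in factors[POS]:
                    ambiguous_pos += 1
                for i, value in enumerate(factors):
                    if value == "_":
                        placeholders[i] += 1

    if tokens:
        print(f"tokens: {tokens}")
        print(f"ambiguous POS: {ambiguous_pos} ({ambiguous_pos / tokens:.1%})")
        for i in sorted(placeholders):
            print(f"factor {i}: {placeholders[i]} '_' placeholders "
                  f"({placeholders[i] / tokens:.1%})")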

Here is a sample annotated sentence:
the|the|3|Node government|government|1|Gsub in|in,inly|7,2,6,1|Node
Washington|Washington|1|Gsub has|have|5|Node published|publish|5|Pred
a|a|3|Node prohibition|prohibition|1|Obje to|to|6|Advi that|that|8|Node
effect|effect|5,1|Node _-_|_|_|Node thereby|thereby|6|Node
definitively|definitively|6|Node scrapping|scrap|5|Node
earlier|earlier,early|2,6|Node plans|plan|1|Gsub
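
For illustration, splitting that sentence token by token shows where the
ambiguous POS tags and the placeholder sit (again, the factor names are just
my shorthand):

    # Illustration only: split the sample sentence above into its four factors
    # and flag ambiguous POS tags and '_' placeholders.
    SAMPLE = (
        "the|the|3|Node government|government|1|Gsub in|in,inly|7,2,6,1|Node "
        "Washington|Washington|1|Gsub has|have|5|Node published|publish|5|Pred "
        "a|a|3|Node prohibition|prohibition|1|Obje to|to|6|Advi that|that|8|Node "
        "effect|effect|5,1|Node _-_|_|_|Node thereby|thereby|6|Node "
        "definitively|definitively|6|Node scrapping|scrap|5|Node "
        "earlier|earlier,early|2,6|Node plans|plan|1|Gsub"
    )

    for token in SAMPLE.split():
        surf, lemm, pos, other = token.split("|")
        flags = []
        if "," in pos:
            flags.append("ambiguous POS")
        if "_" in (lemm, pos):
            flags.append("placeholder")
        print(f"{surf:15} {lemm:15} {pos:10} {other:5} {' '.join(flags)}")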


Could you please point me to possible problems with this setup?

Thanks in advance and regards,

Michal Krajňanský