Hi,
I am experimenting with factored training and I've got a question about
decoding performance.
While decoding some sentences I experience hangups - probably not real
hangups, just excessively long computation.
My models are trained on text annotated in the following way:
SURF|LEMM|POS|OTHER
Training is done with the following parameters:
--lm 0:3:news-commentary-v7.cs-en.blm.en:8 \
--lm 2:3:pos.blm.en:8 \
--translation-factors 1-1+3-2+0-0,2 \
--generation-factors 1-2+1,2-0 \
--decoding-steps t0,g0,t1,g1:t2
I suspect a few things that could be causing the issue:
- the POS tags are numbers (1-10); when a tag is ambiguous, the
possibilities are separated by commas, so e.g. POS =
1
3,5
7,0,2,6,1
POS tags are more ambiguous on the target-language side (en), where about
half of the tokens are affected.
I can see this causing a sparsity problem, but I believe it is not a
fundamental issue.
- sometimes I do not get values for some factors, so I introduce a
universal placeholder '_' that I put in place of the unknown factors.
I imagine this could cause a "sink" problem, in the sense of an overly
common token (placeholders are not really rare in my data).
Here is a sample sentence:
the|the|3|Node government|government|1|Gsub in|in,inly|7,2,6,1|Node
Washington|Washington|1|Gsub has|have|5|Node published|publish|5|Pred
a|a|3|Node prohibition|prohibition|1|Obje to|to|6|Advi that|that|8|Node
effect|effect|5,1|Node _-_|_|_|Node thereby|thereby|6|Node
definitively|definitively|6|Node scrapping|scrap|5|Node
earlier|earlier,early|2,6|Node plans|plan|1|Gsub
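For reference, both suspects (comma-ambiguous factors and the '_' placeholder) can be counted in a factored corpus with a few lines of plain Python; the function name is mine, not part of the Moses tooling:

```python
def scan_factored_line(line, placeholder="_"):
    """For one line of SURF|LEMM|POS|OTHER text, count tokens that have
    any comma-ambiguous factor and tokens that use the placeholder."""
    ambiguous = placeholders = 0
    for token in line.split():
        factors = token.split("|")
        if any("," in f for f in factors):
            ambiguous += 1
        if placeholder in factors:
            placeholders += 1
    return ambiguous, placeholders

# excerpt of the sample sentence above
sample = ("the|the|3|Node government|government|1|Gsub in|in,inly|7,2,6,1|Node "
          "effect|effect|5,1|Node _-_|_|_|Node earlier|earlier,early|2,6|Node")
print(scan_factored_line(sample))  # (3, 1)
```

Running this over the whole training corpus would show how common each phenomenon actually is.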
Could you please point out the possible problems with this setup?
Thanks in advance and regards,
Michal Krajňanský
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support