Hi, I am experimenting with factored training and I have a question about decoding performance. I am experiencing hangups while decoding some sentences - probably not real hangups, just computation that takes too long.
My models are trained on text annotated in the following way:

    SURF|LEMM|POS|OTHER

Training is done with the following parameters:

    --lm 0:3:news-commentary-v7.cs-en.blm.en:8 \
    --lm 2:3:pos.blm.en:8 \
    --translation-factors 1-1+3-2+0-0,2 \
    --generation-factors 1-2+1,2-0 \
    --decoding-steps t0,g0,t1,g1:t2

I suspect a few things that could be causing the issue:

- The POS tags are numbers in the range 1-10; when a tag is ambiguous, the possibilities are separated by commas, e.g. POS = "1", "3,5", "7,0,2,6,1". The POS tags are more ambiguous on the target-language side (en), where this happens in about half of the cases. I can see this causing a sparsity problem, but I believe it is not a fundamental issue.

- Sometimes I do not get values for some factors, so I introduce a universal placeholder '_' which I put in place of the unknown factors. I imagine this could cause a "sink" problem, in the sense of a token that is too common (this is not really uncommon in my data). A rough way to measure both of these over the training data is sketched in the P.S. below.

Here is a sample sentence:

the|the|3|Node government|government|1|Gsub in|in,inly|7,2,6,1|Node Washington|Washington|1|Gsub has|have|5|Node published|publish|5|Pred a|a|3|Node prohibition|prohibition|1|Obje to|to|6|Advi that|that|8|Node effect|effect|5,1|Node _-_|_|_|Node thereby|thereby|6|Node definitively|definitively|6|Node scrapping|scrap|5|Node earlier|earlier,early|2,6|Node plans|plan|1|Gsub

Could you please point me to the possible problems with this setup?

Thanks in advance and regards,
Michal Krajňanský
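P.S. To put some numbers on the two suspicions above, I imagine a rough check along these lines could be run over the factored training data. It assumes one sentence per line with space-separated tokens and '|'-separated factors; the file name 'corpus.factored.en' is only a placeholder.

    from collections import Counter

    POS = 2  # index of the POS factor in SURF|LEMM|POS|OTHER

    stats = Counter()
    with open("corpus.factored.en", encoding="utf-8") as f:
        for line in f:
            for token in line.split():
                factors = token.split("|")
                stats["tokens"] += 1
                # ambiguous POS tag, e.g. "3,5" or "7,0,2,6,1"
                if len(factors) > POS and "," in factors[POS]:
                    stats["ambiguous_pos"] += 1
                # the universal '_' placeholder in any factor
                if "_" in factors:
                    stats["placeholder"] += 1

    total = stats["tokens"] or 1
    print("tokens:          ", stats["tokens"])
    print("ambiguous POS:   ", stats["ambiguous_pos"],
          "(%.1f%%)" % (100.0 * stats["ambiguous_pos"] / total))
    print("'_' placeholder: ", stats["placeholder"],
          "(%.1f%%)" % (100.0 * stats["placeholder"] / total))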