Hi Zai,

In the pre-made models we released with version 1.0 of Moses
(http://www.statmt.org/moses/RELEASE-1.0/models/fr-en/),
the phrase 'vous êtes' appears to be aligned correctly. There are 330
translations of the phrase, but the most probable translation is:

vous êtes ||| you are ||| 0.219726 0.00598177 0.516811 0.201352 2.718 ||| 0-0 1-0 1-1 ||| 4897 2082 1076

Are you sure you have tokenized your data? Is your input file encoded in
UTF-8? (A quick way to check both is sketched at the bottom of this mail.)

On 20 April 2013 19:59, Zai Sarkar <zaisar...@ymail.com> wrote:
> This is a basic question, as I am relatively new to Moses.
> Can someone tell me why the alignment of texts is not picking up many
> (common) words and phrases from the input? The decoder shows many UNKs
> that should not be there.
>
> I am experimenting with a factored model using EMS and parallel corpora
> from Europarl (fr-en) and UNdoc (es-en). The decoder results show a high
> incidence of UNKs in both language experiments. I reverted to a model
> without factors to see if factoring was the issue, but the incidence of
> UNKs in the decoder results is very much the same. I checked the parallel
> input corpus and the cleaned corpus for common terms like 'vous êtes'
> (French for 'you are'). There are many instances of words and terms
> contained in the parallel input texts that the decoder shows as UNK
> (e.g. 'êtes'). I checked the parallel data sentences visually by
> sampling, and the parallel corpus seems reasonably good. I tried with
> different sizes (100,000, 500,000 and 1.5 million parallel sentences).
> The decoder results are similar for both fr-en and es-en: many
> unexpected UNKs.
>
> I ran the LM independently (without EMS) as below and saw a high
> incidence of OOVs:
> /apps/moses/mosesInstalls/irstlm/bin/compile-lm --text yes
> /apps/moses/mosesInstalls/en-es/undoc.2000.en-es.lm.es.gz
> /apps/moses/mosesInstalls/en-es/undoc.2000.en-es.arpa.es
> CHECK FOR : ...../en-es/undoc.2000.en-es.arpa.es
> OOV code is 641175
>
> My EMS script uses IRSTLM as below.
> # irstlm
> lm-training = "$moses-script-dir/generic/trainlm-irst.perl -cores $cores
> -irst-dir $irstlm-dir -temp-dir $working-dir/lm"
> settings = ""
>
> lm-binarizer = $irstlm-dir/compile-lm
> order = 5
> # kenlm, also set type to 8 --- Zai added --text yes
> lm-binarizer = "$moses-bin-dir/build_binary -i"
> type = 8
>
> The training settings are as below:
> ### symmetrization method for word alignments from giza output
> alignment-symmetrization-method = grow-diag-final-and
> ### lexicalized reordering: specify orientation type
> lexicalized-reordering = msd-bidirectional-fe
>
> Thanks for the help!
>
> Zai
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>

--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu
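P.S. In case it is useful, here is a rough sketch of the checks I mean.
The file name corpus.fr, the Moses install path and phrase-table.gz are
placeholders; point them at your own input file and at the phrase table
from the downloaded model.

# check the input really is valid UTF-8 (iconv fails on bad bytes)
iconv -f UTF-8 -t UTF-8 corpus.fr > /dev/null && echo "valid UTF-8"

# tokenize the input with the standard Moses tokenizer before decoding
/path/to/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr < corpus.fr > corpus.tok.fr

# confirm the phrase really is in the phrase table of the released model
zcat phrase-table.gz | grep -m 5 '^vous êtes |||'

If the phrase is in the table but the decoder still reports 'êtes' as
UNK, the usual causes are a tokenization mismatch or the input being in
Latin-1 rather than UTF-8.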
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support