Hi Zai,

In the pre-made models we released with version 1.0 of Moses
(http://www.statmt.org/moses/RELEASE-1.0/models/fr-en/),
the phrase 'vous êtes' appears to be aligned correctly. There are 330
translations of the phrase, but the most probable translation is:

vous êtes ||| you are ||| 0.219726 0.00598177 0.516811 0.201352 2.718 ||| 0-0 1-0 1-1 ||| 4897 2082 1076

Are you sure you have tokenized your data? Is your input file encoded in
UTF-8? (A quick way to check both is sketched at the bottom of this mail.)

On 20 April 2013 19:59, Zai Sarkar <zaisar...@ymail.com> wrote:
> This is a basic question, as I am relatively new to Moses.
> Can someone tell me why the alignment of texts is not picking up many
> (common) words and phrases from the input? The decoder shows many UNKs
> that should not be there.
>
> I am experimenting with a factored model using EMS and parallel corpora
> from Europarl (fr-en) and UNdoc (es-en). The decoder results show a high
> incidence of UNKs in both language experiments. I reverted to a model
> without factors to see if factoring was the issue, but the incidence of
> UNKs in the decoder results is very much the same. I checked the parallel
> input corpus and the cleaned corpus for common terms like 'vous êtes'
> (French for 'you are'). There are many instances of words and terms
> contained in the parallel input texts that the decoder shows as UNK
> (e.g. 'êtes'). I checked the parallel data sentences visually by
> sampling, and the parallel corpus seems reasonably good. I tried with
> different sizes (100,000, 500,000 and 1.5 million parallel sentences).
> The decoder results are similar for both fr-en and es-en: many
> unexpected UNKs.
>
> I ran the LM independently (without EMS) as below and saw a high
> incidence of OOVs:
> /apps/moses/mosesInstalls/irstlm/bin/compile-lm --text yes
> /apps/moses/mosesInstalls/en-es/undoc.2000.en-es.lm.es.gz
> /apps/moses/mosesInstalls/en-es/undoc.2000.en-es.arpa.es
> CHECK FOR : ...../en-es/undoc.2000.en-es.arpa.es
> OOV code is 641175
>
> My EMS script uses IRSTLM as below.
> # irstlm
> lm-training = "$moses-script-dir/generic/trainlm-irst.perl -cores $cores
> -irst-dir $irstlm-dir -temp-dir $working-dir/lm"
> settings = ""
>
> lm-binarizer = $irstlm-dir/compile-lm
> order = 5
> # kenlm, also set type to 8 --- Zai added --text yes
> lm-binarizer = "$moses-bin-dir/build_binary -i"
> type = 8
>
> The training settings are as below:
> ### symmetrization method for word alignments from giza output
> alignment-symmetrization-method = grow-diag-final-and
> ### lexicalized reordering: specify orientation type
> lexicalized-reordering = msd-bidirectional-fe
>
> Thanks for the help!
>
> Zai
>
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>

--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu
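P.S. In case it is useful, here is a rough sketch of the checks I mean.
The file name corpus.fr, the Moses install path and phrase-table.gz are
placeholders; point them at your own input file and at the phrase table
from the downloaded model.

# check the input really is valid UTF-8 (iconv fails on bad bytes)
iconv -f UTF-8 -t UTF-8 corpus.fr > /dev/null && echo "valid UTF-8"

# tokenize the input with the standard Moses tokenizer before decoding
/path/to/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr < corpus.fr > corpus.tok.fr

# confirm the phrase really is in the phrase table of the released model
zcat phrase-table.gz | grep -m 5 '^vous êtes |||'

If the phrase is in the table but the decoder still reports 'êtes' as
UNK, the usual causes are a tokenization mismatch or the input being in
Latin-1 rather than UTF-8.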
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support