Hi Barry,

The domains.1 file contains the correct line numbers; however, the file names it records (news and other) are suspect.
My [CORPUS] section defines

[CORPUS:in]
clean-stem = $training-in-domain-corpus

[CORPUS:out]
clean-stem = $training-out-domain-corpus

and before it there are

input-extension = fr
output-extension = en
$training-in-domain-corpus = /home/corpus/in-domain-fr-en/news.fr-en.tc.cl
$training-out-domain-corpus = /home/corpus/out-domain-fr-en/other.fr-en.tc.cl

However, when running TRAINING_mml-score.1, $FILTER_DOMAIN first loads the filtering domain name, which is "out" (defined by mml-filter-corpora = out). The script then reads the available domains from the domains.1 file in the next while (<DOMAIN>) {...} loop, where news and other are loaded. Because the filtering domain name matches none of the available domains, $DOMAIN_FILTERED{$line_number} is null every time. As a result, the subroutine check_sentence_filtered always returns false, and every sentence is treated as in-domain (score 99999).

After I changed the short names "in" and "out" to "news" and "others", TRAINING_mml-filter-before-wa no longer reported any error.

Thanks again.

Jian

On Sun, Jan 26, 2014 at 12:37 PM, Barry Haddow <bhad...@staffmail.ed.ac.uk> wrote:

> Hi Jian
>
> The logic looks correct to me. If the domains file has been provided, we
> then need to check if the sentence is in-domain. If the domains file is not
> provided, then all sentences are considered out-of-domain.
>
> The fact that all scores are 99999 means that the MML filter is seeing all
> your sentences as in-domain. It could be that something went wrong during
> corpus preprocessing, or during the creation of the domains file
> (/home/mml/mml-test/experiment/model/domains.1). Do the lengths in the
> domains file match the lengths of your in and out corpora?
>
> cheers - Barry
>
>
> On 25/01/14 03:29, jian zhang wrote:
>
> Hi Barry,
>
> I don't understand the line
> *if (defined($filter_domains) && !&check_sentence_filtered($i))*
> in mml-score.perl, before computing the bilingual cross-entropy difference.
> Should it not be
> *if (!defined($filter_domains) && !&check_sentence_filtered($i))*?
>
> Regards,
>
> Jian Zhang
>
>
> On Fri, Jan 24, 2014 at 10:27 PM, jian zhang <jianzhan...@gmail.com> wrote:
>
>> Hi Barry,
>>
>> All the scores are 99999 in that file.
>>
>> Thanks,
>>
>> Jian
>>
>>
>> On Fri, Jan 24, 2014 at 3:51 PM, Barry Haddow <bhad...@staffmail.ed.ac.uk> wrote:
>>
>>> Hi Jian
>>>
>>> This is a bit suspect:
>>>
>>> 2014-01-24 14:17:26,276 Retaining at least 0 entries and ignoring 2075137
>>>
>>> Are the scores in this file sensible (or are they all the same?)
>>>
>>> /home/mml/mml-test/experiment/training/corpus-mml-score.1
>>>
>>> cheers - Barry
>>>
>>>
>>> On 24/01/14 14:53, jian zhang wrote:
>>>
>>>> Hi,
>>>>
>>>> I got the error "IndexError: list index out of range" at the
>>>> TRAINING_mml-filter-before-wa step.
>>>>
>>>> I had read the post at
>>>> https://www.mail-archive.com/moses-support@mit.edu/msg08767.html,
>>>> however I still cannot figure out what is wrong.
>>>>
>>>> The full error is
>>>>
>>>> general:strategy = Score
>>>> general:source_language = fr
>>>> general:target_language = en
>>>> general:input_stem = /home/mml/mml-test/experiment/training/corpus.1
>>>> general:output_stem = /home/mml/mml-test/experiment/training/corpus-mml.1
>>>> general:domain_file = /home/mml/mml-test/experiment/model/domains.1
>>>> general:domain_file_out = /home/mml/mml-test/experiment/training/corpus-mml.1
>>>> score:score_file = /home/mml/mml-test/experiment/training/corpus-mml-score.1
>>>> score:proportion = 0.9
>>>>
>>>> 2014-01-24 14:17:26,276 Retaining at least 0 entries and ignoring 2075137
>>>> Traceback (most recent call last):
>>>>   File "/home/tools/mosesdecoder/scripts/ems/support/mml-filter.py", line 156, in <module>
>>>>     main()
>>>>   File "/home/tools/mosesdecoder/scripts/ems/support/mml-filter.py", line 111, in main
>>>>     strategy = strategy_class(config)
>>>>   File "/home/tools/mosesdecoder/scripts/ems/support/mml-filter.py", line 72, in __init__
>>>>     [float(line[:-1]) for line in open(self.score_file)], reverse=True)[ignore_count + count]
>>>> IndexError: list index out of range
>>>>
>>>> And my ems configuration file has:
>>>>
>>>> #################################################################
>>>> # PARALLEL CORPUS PREPARATION:
>>>> # create a tokenized, sentence-aligned corpus, ready for training
>>>>
>>>> [CORPUS]
>>>>
>>>> #in-domain parallel corpus
>>>> [CORPUS:in]
>>>> clean-stem = $training-in-domain-corpus
>>>>
>>>> [CORPUS:out]
>>>> #out-domain parallel corpus
>>>> clean-stem = $training-out-domain-corpus
>>>>
>>>> #################################################################
>>>> # LANGUAGE MODEL TRAINING
>>>> [LM]
>>>> [LM:lm]
>>>> type = 8
>>>> lm = $language-model
>>>>
>>>> #################################################################
>>>> # MODIFIED MOORE LEWIS FILTERING
>>>>
>>>> [MML]
>>>>
>>>> lm-training = $srilm-dir/ngram-count
>>>> lm-settings = "-interpolate -kndiscount -unk"
>>>> lm-binarizer = $moses-src-dir/bin/build_binary
>>>> lm-query = $moses-src-dir/bin/query
>>>> order = 5
>>>>
>>>> ### in-/out-of-domain source/target corpora to train the 4 language model
>>>> #
>>>> # in-domain parallel corpus
>>>> indomain-stem = [CORPUS:in:clean-split-stem]
>>>>
>>>> # out-of-domain parallel corpus
>>>> outdomain-stem = [CORPUS:out:clean-split-stem]
>>>>
>>>> # settings: number of lines sampled from the corpora to train each language model on
>>>> settings = "--line-count 100000"
>>>>
>>>> #################################################################
>>>> # TRANSLATION MODEL TRAINING
>>>> [TRAINING]
>>>> script = $moses-script-dir/training/train-model.perl
>>>> training-options = "-mgiza -mgiza-cpus 12 -sort-buffer-size 16G -sort-compress gzip -sort-parallel 12 -cores 12"
>>>> parallel = yes
>>>> alignment-symmetrization-method = grow-diag-final-and
>>>> lexicalized-reordering = msd-bidirectional-fe
>>>> score-settings = "--GoodTuring"
>>>> include-word-alignment-in-rules = yes
>>>>
>>>> #space separated all out-of domain corpora to be filtered
>>>> mml-filter-corpora = out
>>>> mml-before-wa = "-proportion 0.9"
>>>>
>>>> #####################################################
>>>>
>>>> Thanks.
>>>>
>>>> Jian Zhang
>>>>
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> Moses-support@mit.edu
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>> --
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.

--
Jian Zhang
Centre for Next Generation Localisation (CNGL) <http://www.cngl.ie/index.html>
Dublin City University <http://www.dcu.ie/>
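
For anyone following the thread, the name mismatch described in the top message can be sketched in a few lines of Python. This is only an illustration of the lookup logic as described, not the actual mml-score.perl code. The "<last line number> <domain name>" layout of the domains file, the sample line numbers, and the helper names load_domain_filtered and score_sentence are assumptions made for the example.

```python
# Sketch only: mirrors the domain lookup described in the thread, not the
# actual mml-score.perl code. The "<last line number> <domain name>" layout
# of the domains file and the sample numbers are assumptions.

IN_DOMAIN_SCORE = 99999  # score the thread reports for sentences treated as in-domain


def load_domain_filtered(domain_lines, filter_domains):
    """Map each domain's last line number to True if that domain is filtered."""
    domain_filtered = {}
    for line in domain_lines:
        last_line, name = line.split()
        domain_filtered[int(last_line)] = name in filter_domains
    return domain_filtered


def check_sentence_filtered(sentence_number, domain_filtered):
    """Return True if the sentence belongs to a domain selected for filtering."""
    for last_line in sorted(domain_filtered):
        if sentence_number <= last_line:
            return domain_filtered[last_line]
    return False


def score_sentence(sentence_number, domain_filtered, real_score):
    """Give the real score only to sentences in a filtered domain."""
    if check_sentence_filtered(sentence_number, domain_filtered):
        return real_score
    return IN_DOMAIN_SCORE


domains_file = ["100 news", "250 other"]  # hypothetical contents of domains.1

mismatched = load_domain_filtered(domains_file, {"out"})   # name absent from domains.1
matched = load_domain_filtered(domains_file, {"other"})    # name present in domains.1

print(score_sentence(150, mismatched, -0.42))  # 99999
print(score_sentence(150, matched, -0.42))     # -0.42
```

With the mismatched name no domain is ever marked as filtered, so every sentence receives 99999, which is the symptom reported in the thread. Once the filter name matches a name in the domains file, sentences from that domain get their real score.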
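
The IndexError in the quoted traceback is consistent with a score file in which every entry is 99999: the log reports a count of 0 and 2075137 ignored entries, so the index ignore_count + count presumably points one past the end of the sorted list. The Python sketch below reproduces that situation with a small list and adds a guard. It is not the code of mml-filter.py; the way ignore_count and count are derived here, and the function name pick_threshold, are assumptions for illustration.

```python
# Sketch only: reproduces the indexing pattern from the traceback with a
# guard added. How ignore_count and count are computed is an assumption.

SENTINEL = 99999.0  # score given to sentences treated as in-domain


def pick_threshold(scores, proportion):
    """Return the score at position ignore_count + count in descending order."""
    ignore_count = sum(1 for s in scores if s == SENTINEL)
    count = int((len(scores) - ignore_count) * proportion)
    ranked = sorted(scores, reverse=True)
    index = ignore_count + count
    if index >= len(ranked):
        # Without this check, ranked[index] raises IndexError as in the thread.
        raise ValueError(
            "threshold index %d is past the end of %d scores" % (index, len(ranked))
        )
    return ranked[index]


mixed = [SENTINEL, SENTINEL, 0.5, 0.1, -0.2, -0.9]
print(pick_threshold(mixed, 0.5))  # -0.2

all_in_domain = [SENTINEL] * 5
try:
    pick_threshold(all_in_domain, 0.9)
except ValueError as err:
    print("cannot filter:", err)
```

A check of this kind would turn the bare IndexError into a message that points at the score file, which is where the real problem (all scores equal to 99999) becomes visible.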