Hi Barry,

The domains.1 file contains correct line numbers; however, the file names
(news and other) are suspect.

My [CORPUS] section defines
[CORPUS:in]
clean-stem = $training-in-domain-corpus
[CORPUS:out]
clean-stem = $training-out-domain-corpus

and before it, the following are set:

input-extension = fr
output-extension = en
$training-in-domain-corpus = /home/corpus/in-domain-fr-en/news.fr-en.tc.cl
$training-out-domain-corpus = /home/corpus/out-domain-fr-en/other.fr-en.tc.cl

However, when running TRAINING_mml-score.1, $FILTER_DOMAIN is first loaded
with the filtering domain name, which is "out" (from mml-filter-corpora =
out). The next while (<DOMAIN>) {...} loop then reads the available domains
from the domains.1 file, which are news and other. Because the filtering
domain name matches none of the available domain names,
$DOMAIN_FILTERED{$line_number} is always null. As a result, the subroutine
check_sentence_filtered always returns false, and every sentence is treated
as in-domain (score 99999).
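
To make the mismatch concrete, here is a minimal Python sketch of the lookup
as I understand it. It is not the code from mml-score.perl: the function name
load_filtered_lines, the sample sizes, and the assumed domains file layout
("<last line number of the domain> <domain name>" per line) are all
assumptions for illustration.

```python
# Hypothetical stand-in for the domain lookup; NOT the actual mml-score.perl
# code. Assumes each domains-file line is "<last line number> <domain name>".

def load_filtered_lines(domain_lines, filter_domains):
    """Return the 1-based sentence numbers that belong to a domain
    whose name appears in filter_domains."""
    filtered = set()
    previous_end = 0
    for entry in domain_lines:
        end_str, name = entry.split()
        end = int(end_str)
        if name in filter_domains:
            # Mark every sentence of this domain as subject to filtering.
            filtered.update(range(previous_end + 1, end + 1))
        previous_end = end
    return filtered


# Made-up sizes: 3 "news" sentences followed by 5 "other" sentences.
domains = ["3 news", "8 other"]

# Short name from the EMS config ("out") matches nothing in the domains file,
# so no sentence is ever marked for filtering.
print(len(load_filtered_lines(domains, {"out"})))

# With a matching name, the sentences of that domain are marked.
print(sorted(load_filtered_lines(domains, {"other"})))
```

With the mismatched name the returned set is empty, which corresponds to
check_sentence_filtered returning false for every sentence.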

After I changed the short names "in" and "out" to "news" and "others",
TRAINING_mml-filter-before-wa no longer reported any error.
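
For reference, a small Python sketch of why a score file that contains only
99999 would produce the IndexError at the filtering step. Only the indexing
expression [ignore_count + count] on the reverse-sorted scores comes from the
traceback; how the two counts are computed is my assumption, chosen to be
consistent with the log line "Retaining at least 0 entries and ignoring
2075137". The names IGNORE_SCORE and pick_threshold are hypothetical.

```python
# Sketch only; NOT the actual mml-filter.py code.
# Assumption: scores equal to the sentinel are counted as "ignored" and the
# proportion is applied to the remaining scores.
IGNORE_SCORE = 99999.0


def pick_threshold(scores, proportion):
    """Pick the score threshold by indexing into the reverse-sorted scores."""
    ignore_count = sum(1 for s in scores if s == IGNORE_SCORE)
    count = int((len(scores) - ignore_count) * proportion)
    ranked = sorted(scores, reverse=True)
    # Same indexing pattern as in the traceback.
    return ranked[ignore_count + count]


# Every score is the sentinel: ignore_count equals the list length and
# count is 0, so the index is one past the end of the list.
try:
    pick_threshold([IGNORE_SCORE] * 5, 0.9)
except IndexError as err:
    print("IndexError:", err)

# With some real scores present, the index stays inside the list.
print(pick_threshold([IGNORE_SCORE, IGNORE_SCORE, 0.5, 0.2, -0.1, -0.3], 0.5))
```

Under these assumptions, once some sentences are actually scored the index
falls inside the list and the step can complete.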

Thanks again.

Jian





On Sun, Jan 26, 2014 at 12:37 PM, Barry Haddow
<bhad...@staffmail.ed.ac.uk> wrote:

>  Hi Jian
>
> The logic looks correct to me. If the domains file has been provided, we
> then need to check if the sentence is in-domain. If the domains file is not
> provided, then all sentences are considered out-of-domain.
>
> The fact that all scores are 99999 means that the MML filter is seeing all
> your sentences as in-domain. It could be that something went wrong during
> corpus preprocessing, or during the creation of the domains file
> (/home/mml/mml-test/experiment/model/domains.1). Do the lengths in the
> domains file match the lengths of your in and out corpora?
>
> cheers - Barry
>
>
> On 25/01/14 03:29, jian zhang wrote:
>
>  Hi Barry, I do not understand the line *if (defined($filter_domains) &&
> !&check_sentence_filtered($i))* in mml-score.perl, just before the
> bilingual cross-entropy difference is computed.
>  Should it not be *if (!defined($filter_domains) &&
> !&check_sentence_filtered($i))*?
>
>  Regards,
>
>  Jian Zhang
>
>
>
>
> On Fri, Jan 24, 2014 at 10:27 PM, jian zhang <jianzhan...@gmail.com> wrote:
>
>> Hi Barry,
>>
>>  All the scores are 99999 in that file.
>>
>>  Thanks,
>>
>>
>>  Jian
>>
>>
>>  On Fri, Jan 24, 2014 at 3:51 PM, Barry Haddow <
>> bhad...@staffmail.ed.ac.uk> wrote:
>>
>>>  Hi Jian
>>>
>>> This is a bit suspect:
>>>
>>>
>>> 2014-01-24 14:17:26,276 Retaining at least 0 entries and ignoring 2075137
>>>
>>>  Are the scores in this file sensible (or are they all the same?)
>>>
>>> /home/mml/mml-test/experiment/training/corpus-mml-score.1
>>>
>>> cheers - Barry
>>>
>>>
>>> On 24/01/14 14:53, jian zhang wrote:
>>>
>>>>  Hi,
>>>>
>>>> I got an "IndexError: list index out of range" at the
>>>> TRAINING_mml-filter-before-wa step.
>>>>
>>>> I have read the post at
>>>> https://www.mail-archive.com/moses-support@mit.edu/msg08767.html,
>>>> but I still cannot figure out what is wrong.
>>>>
>>>> The full error is
>>>>
>>>> general:strategy = Score
>>>> general:source_language = fr
>>>> general:target_language = en
>>>> general:input_stem = /home/mml/mml-test/experiment/training/corpus.1
>>>> general:output_stem = /home/mml/mml-test/experiment/training/corpus-mml.1
>>>> general:domain_file = /home/mml/mml-test/experiment/model/domains.1
>>>> general:domain_file_out = /home/mml/mml-test/experiment/training/corpus-mml.1
>>>> score:score_file = /home/mml/mml-test/experiment/training/corpus-mml-score.1
>>>> score:proportion = 0.9
>>>>
>>>> 2014-01-24 14:17:26,276 Retaining at least 0 entries and ignoring 2075137
>>>> Traceback (most recent call last):
>>>>   File "/home/tools/mosesdecoder/scripts/ems/support/mml-filter.py", line 156, in <module>
>>>>     main()
>>>>   File "/home/tools/mosesdecoder/scripts/ems/support/mml-filter.py", line 111, in main
>>>>     strategy = strategy_class(config)
>>>>   File "/home/tools/mosesdecoder/scripts/ems/support/mml-filter.py", line 72, in __init__
>>>>     [float(line[:-1]) for line in open(self.score_file)], reverse=True)[ignore_count + count]
>>>> IndexError: list index out of range
>>>>
>>>> And my ems configuration file has:
>>>>
>>>> #################################################################
>>>> # PARALLEL CORPUS PREPARATION:
>>>> # create a tokenized, sentence-aligned corpus, ready for training
>>>>
>>>> [CORPUS]
>>>>
>>>> #in-domain parallel corpus
>>>> [CORPUS:in]
>>>> clean-stem = $training-in-domain-corpus
>>>>
>>>> [CORPUS:out]
>>>> #out-domain parallel corpus
>>>> clean-stem = $training-out-domain-corpus
>>>>
>>>>
>>>> #################################################################
>>>> # LANGUAGE MODEL TRAINING
>>>> [LM]
>>>> [LM:lm]
>>>> type = 8
>>>> lm = $language-model
>>>> #################################################################
>>>> # MODIFIED MOORE LEWIS FILTERING
>>>>
>>>> [MML]
>>>>
>>>> lm-training = $srilm-dir/ngram-count
>>>> lm-settings = "-interpolate -kndiscount -unk"
>>>> lm-binarizer = $moses-src-dir/bin/build_binary
>>>> lm-query = $moses-src-dir/bin/query
>>>> order = 5
>>>>
>>>> ### in-/out-of-domain source/target corpora to train the 4 language model
>>>> #
>>>> # in-domain parallel corpus
>>>> indomain-stem = [CORPUS:in:clean-split-stem]
>>>>
>>>> # out-of-domain parallel corpus
>>>> outdomain-stem = [CORPUS:out:clean-split-stem]
>>>>
>>>> # settings: number of lines sampled from the corpora to train each language model on
>>>> settings = "--line-count 100000"
>>>>
>>>> #################################################################
>>>> # TRANSLATION MODEL TRAINING
>>>> [TRAINING]
>>>> script = $moses-script-dir/training/train-model.perl
>>>> training-options = "-mgiza -mgiza-cpus 12 -sort-buffer-size 16G -sort-compress gzip -sort-parallel 12 -cores 12"
>>>> parallel = yes
>>>> alignment-symmetrization-method = grow-diag-final-and
>>>> lexicalized-reordering = msd-bidirectional-fe
>>>> score-settings = "--GoodTuring"
>>>> include-word-alignment-in-rules = yes
>>>>
>>>> #space separated all out-of domain corpora to be filtered
>>>> mml-filter-corpora = out
>>>> mml-before-wa = "-proportion 0.9"
>>>>
>>>> #####################################################
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> Jian Zhang
>>>>
>>>>
>>>>  _______________________________________________
>>>> Moses-support mailing list
>>>> Moses-support@mit.edu
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>
>>>
>>>
>>> --
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.
>>>
>>>   --
>>> Jian Zhang
>>> Centre for Next Generation Localisation 
>>> (CNGL)<http://www.cngl.ie/index.html>
>>> Dublin City University <http://www.dcu.ie/>
>>>
>>>
>>>
>>>


-- 
Jian Zhang
Centre for Next Generation Localisation (CNGL)<http://www.cngl.ie/index.html>
Dublin City University <http://www.dcu.ie/>