[Moses-support] -print-alignment-info conflicted with the -mbr option

2019-05-01 Thread Ergun Bicici
The -print-alignment-info flag conflicted with the "-mbr" option and did not
add the alignment information. It worked as usual without "-mbr".
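
For reference, a minimal repro sketch (model and file names are placeholders;
the flags are the ones named above):

# alignment information missing when -mbr is also given
mosesdecoder/bin/moses -f moses.ini -mbr -print-alignment-info < test.in > out.mbr
# alignment information printed as expected without -mbr
mosesdecoder/bin/moses -f moses.ini -print-alignment-info < test.in > out.noMbr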

Additionally, when transliteration is used (e.g. for Russian), the "|||"
separator before the alignment information is still lost after running
moses/mosesdecoder/scripts/Transliteration/post-decoding-transliteration.pl
--input-file test.cleaned.* --output-file test.transliterated.*

-- 

Regards,
Ergun
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] About Bilingual LM in Moses

2019-04-18 Thread Ergun Bicici
I tried the bilingual LM model on the German-English baseline dataset (wget
http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz) and it did not
improve the scores. I obtained the same score of 0.2266 BLEU.

Thanks for your help.

Ergun

On Mon, Apr 15, 2019 at 5:52 PM Ergun Bicici  wrote:

>
> Hi Rico,
>
> Thanks for the links. Accordingly, I tried decreasing the learning rate to
> 0.25 and started seeing numbers instead of nan in the log-likelihood.
> Vocabulary files are not needed when using train_nplm.py.
>
> I restarted tuning and 'nan' scores for bilingual lm disappeared as well
> in the N-best lists. I'll post the new scores on the German-English
> baseline.
>
> Ergun
>
> On Mon, Apr 15, 2019 at 3:43 PM Rico Sennrich 
> wrote:
>
>> Hello Ergun,
>>
>> we've had the 'nan' issue reported before (see
>>
>> https://moses-support.mit.narkive.com/hs8LwsnT/blingual-neural-lm-log-likelihood-nan
>> https://moses-support.mit.narkive.com/fklzlBiW/bilingual-lm-nan-nan-nan
>> ).
>>
>> You can follow Nick's recommendation of lowering the learning rate, or
>> try to enable gradient clipping (which is commented out in the code).
>>
>> I'm afraid nplm is no longer heavily used, so it's unlikely that somebody
>> has fresh experience.
>>
>> best wishes,
>> Rico
>>
>> On 15/04/2019 12:44, Ergun Bicici wrote:
>>
>>
>> I found that training also produced 'nan' scores:
>> Training NCE log-likelihood: nan.
>>
>> I used EMS training:
>> [LM:comb]
>> nplm-dir = "Programs/nplm/"
>> order = 5
>> source-window = 4
>> bilingual-lm = yes
>> bilingual-lm-settings = "--prune-source-vocab 10 --prune-target-vocab 10"
>>
>> I am re-running train_nplm.py.
>>
>> Ergun
>>
>> On Mon, Apr 15, 2019 at 2:26 PM Ergun Bicici  wrote:
>>
>>>
>>> Dear moses-support,
>>>
>>> I tried the nplm model on the German-English baseline dataset ( wget
>>> http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz) and it
>>> improved the scores from 0.2266 to 0.2317 BLEU.
>>>
>>> I tried the bilingual LM:
>>>
>>> http://www.statmt.org/moses/?n=FactoredTraining.BuildingLanguageModel#ntoc37
>>> However:
>>> - vocab files were not written in the end and I used extract_training.py
>>> to obtain them.
>>> - I still obtained 'nan' scores from the bilingual lm model.
>>> Error: "Not a label, not a score 'nan'. Failed to parse the scores
>>> string:
>>> 0 ||| ... айта ... болатын .  ||| LexicalReordering0= -11.3723 -15.4848
>>> -26.5152 -17.8301 -6.95664 -16.8553 -29.4425 -22.5538 OpSequenceModel0=
>>> -403.825 99 22 45 5 Distortion0= -146 LM0= -685.828 BLMcomb= nan
>>> WordPenalty0= -76 PhrasePenalty0= 53 TranslationModel0= -242.874 -179.189
>>> -291.623 -342.085 ||| nan
>>>
>>> KENLM name=LM0 factor=0 path=en-kk/lm.corpus.tok.kk.6.blm.bin order=6
>>> BilingualNPLM name=BLMcomb order=5 source_window=4
>>> path=wmt19_en-kk/lm/comb.blm.2/train.10
>>> source_vocab=wmt19_en-kk/lm/comb.blm.2/vocab.source
>>> target_vocab=wmt19_en-kk/lm/comb.blm.2/vocab.target
>>>
>>> Therefore, this may be due to some bug in moses C++ code and not the
>>> input data / configuration.
>>>
>>> The documentation also appears out of sync regarding the "average the <null>
>>> word embedding as per the instructions here
>>> <http://www.statmt.org/moses/?n=FactoredTraining.BuildingLanguageModel#anchorNULL>"
>>> part, since averageNullEmbedding.py asks for -i, -o, and -t.
>>>
>>> I found a related note in a paper by Barry Haddow at WMT'15 saying
>>> that the model was not used in the final submission due to insignificant
>>> differences.
>>>
>>> Do you have any recent results on the bilingual LM model?
>>>
>>> --
>>>
>>> Regards,
>>> Ergun
>>>
>>>
>>>
>>
>> --
>>
>> Regards,
>> Ergun
>>
>>
>>
>> ___
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>
>> ___
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>
>
> --
>
> Regards,
> Ergun
>
>
>

-- 

Regards,
Ergun
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] About Bilingual LM in Moses

2019-04-15 Thread Ergun Bicici
Hi Rico,

Thanks for the links. Accordingly, I tried decreasing the learning rate to
0.25 and started seeing numbers instead of nan in the log-likelihood.
Vocabulary files are not needed when using train_nplm.py.

I restarted tuning and 'nan' scores for bilingual lm disappeared as well in
the N-best lists. I'll post the new scores on the German-English baseline.
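
For the record, the retraining went along these lines (a sketch only: the
option names, including --learning-rate, are assumptions here; check
train_nplm.py --help for the exact spelling in your checkout):

mosesdecoder/scripts/training/bilingual-lm/train_nplm.py \
  --working-dir lm/comb.blm.2 --corpus train \
  --nplm-home Programs/nplm --learning-rate 0.25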

Ergun

On Mon, Apr 15, 2019 at 3:43 PM Rico Sennrich  wrote:

> Hello Ergun,
>
> we've had the 'nan' issue reported before (see
>
> https://moses-support.mit.narkive.com/hs8LwsnT/blingual-neural-lm-log-likelihood-nan
> https://moses-support.mit.narkive.com/fklzlBiW/bilingual-lm-nan-nan-nan ).
>
> You can follow Nick's recommendation of lowering the learning rate, or try
> to enable gradient clipping (which is commented out in the code).
>
> I'm afraid nplm is no longer heavily used, so it's unlikely that somebody
> has fresh experience.
>
> best wishes,
> Rico
>
> On 15/04/2019 12:44, Ergun Bicici wrote:
>
>
> I found that training also produced 'nan' scores:
> Training NCE log-likelihood: nan.
>
> I used EMS training:
> [LM:comb]
> nplm-dir = "Programs/nplm/"
> order = 5
> source-window = 4
> bilingual-lm = yes
> bilingual-lm-settings = "--prune-source-vocab 10 --prune-target-vocab 10"
>
> I am re-running train_nplm.py.
>
> Ergun
>
> On Mon, Apr 15, 2019 at 2:26 PM Ergun Bicici  wrote:
>
>>
>> Dear moses-support,
>>
>> I tried the nplm model on the German-English baseline dataset ( wget
>> http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz) and it improved
>> the scores from 0.2266 to 0.2317 BLEU.
>>
>> I tried the bilingual LM:
>>
>> http://www.statmt.org/moses/?n=FactoredTraining.BuildingLanguageModel#ntoc37
>> However:
>> - vocab files were not written in the end and I used extract_training.py
>> to obtain them.
>> - I still obtained 'nan' scores from the bilingual lm model.
>> Error: "Not a label, not a score 'nan'. Failed to parse the scores string:
>> 0 ||| ... айта ... болатын .  ||| LexicalReordering0= -11.3723 -15.4848
>> -26.5152 -17.8301 -6.95664 -16.8553 -29.4425 -22.5538 OpSequenceModel0=
>> -403.825 99 22 45 5 Distortion0= -146 LM0= -685.828 BLMcomb= nan
>> WordPenalty0= -76 PhrasePenalty0= 53 TranslationModel0= -242.874 -179.189
>> -291.623 -342.085 ||| nan
>>
>> KENLM name=LM0 factor=0 path=en-kk/lm.corpus.tok.kk.6.blm.bin order=6
>> BilingualNPLM name=BLMcomb order=5 source_window=4
>> path=wmt19_en-kk/lm/comb.blm.2/train.10
>> source_vocab=wmt19_en-kk/lm/comb.blm.2/vocab.source
>> target_vocab=wmt19_en-kk/lm/comb.blm.2/vocab.target
>>
>> Therefore, this may be due to some bug in moses C++ code and not the
>> input data / configuration.
>>
>> The documentation also appears out of sync regarding the "average the <null>
>> word embedding as per the instructions here
>> <http://www.statmt.org/moses/?n=FactoredTraining.BuildingLanguageModel#anchorNULL>"
>> part, since averageNullEmbedding.py asks for -i, -o, and -t.
>>
>> I found a related note in a paper by Barry Haddow at WMT'15 saying
>> that the model was not used in the final submission due to insignificant
>> differences.
>>
>> Do you have any recent results on the bilingual LM model?
>>
>> --
>>
>> Regards,
>> Ergun
>>
>>
>>
>
> --
>
> Regards,
> Ergun
>
>
>
> ___
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
> ___
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>


-- 

Regards,
Ergun
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] About Bilingual LM in Moses

2019-04-15 Thread Ergun Bicici
I found that training also produced 'nan' scores:
Training NCE log-likelihood: nan.

I used EMS training:
[LM:comb]
nplm-dir = "Programs/nplm/"
order = 5
source-window = 4
bilingual-lm = yes
bilingual-lm-settings = "--prune-source-vocab 10 --prune-target-vocab 10"

I am re-running train_nplm.py.

Ergun

On Mon, Apr 15, 2019 at 2:26 PM Ergun Bicici  wrote:

>
> Dear moses-support,
>
> I tried the nplm model on the German-English baseline dataset ( wget
> http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz) and it improved
> the scores from 0.2266 to 0.2317 BLEU.
>
> I tried the bilingual LM:
>
> http://www.statmt.org/moses/?n=FactoredTraining.BuildingLanguageModel#ntoc37
> However:
> - vocab files were not written in the end and I used extract_training.py
> to obtain them.
> - I still obtained 'nan' scores from the bilingual lm model.
> Error: "Not a label, not a score 'nan'. Failed to parse the scores string:
> 0 ||| ... айта ... болатын .  ||| LexicalReordering0= -11.3723 -15.4848
> -26.5152 -17.8301 -6.95664 -16.8553 -29.4425 -22.5538 OpSequenceModel0=
> -403.825 99 22 45 5 Distortion0= -146 LM0= -685.828 BLMcomb= nan
> WordPenalty0= -76 PhrasePenalty0= 53 TranslationModel0= -242.874 -179.189
> -291.623 -342.085 ||| nan
>
> KENLM name=LM0 factor=0 path=en-kk/lm.corpus.tok.kk.6.blm.bin order=6
> BilingualNPLM name=BLMcomb order=5 source_window=4
> path=wmt19_en-kk/lm/comb.blm.2/train.10
> source_vocab=wmt19_en-kk/lm/comb.blm.2/vocab.source
> target_vocab=wmt19_en-kk/lm/comb.blm.2/vocab.target
>
> Therefore, this may be due to some bug in moses C++ code and not the input
> data / configuration.
>
> The documentation also appears out of sync regarding the "average the <null>
> word embedding as per the instructions here
> <http://www.statmt.org/moses/?n=FactoredTraining.BuildingLanguageModel#anchorNULL>"
> part, since averageNullEmbedding.py asks for -i, -o, and -t.
>
> I found a related note in a paper by Barry Haddow at WMT'15 saying that
> the model was not used in the final submission due to insignificant
> differences.
>
> Do you have any recent results on the bilingual LM model?
>
> --
>
> Regards,
> Ergun
>
>
>

-- 

Regards,
Ergun
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] About Bilingual LM in Moses

2019-04-15 Thread Ergun Bicici
Dear moses-support,

I tried the nplm model on the German-English baseline dataset ( wget
http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz) and it improved
the scores from 0.2266 to 0.2317 BLEU.

I tried the bilingual LM:
http://www.statmt.org/moses/?n=FactoredTraining.BuildingLanguageModel#ntoc37
However:
- vocab files were not written in the end and I used extract_training.py to
obtain them.
- I still obtained 'nan' scores from the bilingual lm model.
Error: "Not a label, not a score 'nan'. Failed to parse the scores string:
0 ||| ... айта ... болатын .  ||| LexicalReordering0= -11.3723 -15.4848
-26.5152 -17.8301 -6.95664 -16.8553 -29.4425 -22.5538 OpSequenceModel0=
-403.825 99 22 45 5 Distortion0= -146 LM0= -685.828 BLMcomb= nan
WordPenalty0= -76 PhrasePenalty0= 53 TranslationModel0= -242.874 -179.189
-291.623 -342.085 ||| nan

KENLM name=LM0 factor=0 path=en-kk/lm.corpus.tok.kk.6.blm.bin order=6
BilingualNPLM name=BLMcomb order=5 source_window=4
path=wmt19_en-kk/lm/comb.blm.2/train.10
source_vocab=wmt19_en-kk/lm/comb.blm.2/vocab.source
target_vocab=wmt19_en-kk/lm/comb.blm.2/vocab.target

Therefore, this may be due to some bug in moses C++ code and not the input
data / configuration.

The documentation also appears out of sync regarding the "average the <null>
word embedding as per the instructions here
<http://www.statmt.org/moses/?n=FactoredTraining.BuildingLanguageModel#anchorNULL>"
part, since averageNullEmbedding.py asks for -i, -o, and -t.
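
For reference, it can be driven along these lines (a sketch: the script path
is the usual bilingual-lm location, the argument values are placeholders,
and only the -i/-o/-t option letters are taken from its help output):

mosesdecoder/scripts/training/bilingual-lm/averageNullEmbedding.py \
  -i train.10.model -o train.10.model.averaged -t train.numberized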

I found a related note in a paper by Barry Haddow at WMT'15 saying that
the model was not used in the final submission due to insignificant
differences.

Do you have any recent results on the bilingual LM model?

-- 

Regards,
Ergun
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] CfP: Shared Task: Parallel Corpus Filtering for Low-Resource Conditions

2019-02-16 Thread Ergun Bicici
The audience for this question might be search engines and not the WMT'19
task. But it is still relevant if/when the UN Corpus is ranked lower in
internet searches for parallel corpora, to deliberately hide the searchers'
intended corpora, and there may be places and search engines where internet
access is diverted. For those cases, internet tunneling might work. Some
legal clarification can also be added to specify that WMT'19 is not funded
by or working for search engines. Yandex did provide some data before and
was thanked for that in WMT'18, but the acknowledgments for this year appear
not to be finished yet. The license for the provided data also matters, as
it can protect against misuse cases or make the data useless for some
audiences.

Ergun

On Fri, Feb 15, 2019 at 11:40 PM Philipp Koehn  wrote:

> Hi,
>
> the identity of the languages that we are tackling here does not matter so
> much as solving this problem for low-resource languages. We did not prepare
> this for Arabic, so this would not be possible at the moment.
>
> By the way, there is a massive parallel corpus for Arabic here:
> https://cms.unov.org/UNCorpus/
>
> -phi
>
> On Fri, Feb 15, 2019 at 1:43 PM Marwa Refaie 
> wrote:
>
>> Dear All
>>
>> I can’t find enough resources even for English-Arabic ... can this call
>> include Arabic with the mentioned languages ??
>>
>> Thanks in Advance
>>
>> Marwa N. Refaie
>>
>> On Feb 15, 2019, at 18:44, Paco Guzman  wrote:
>>
>> [Apologies for cross-posting]
>>
>> CALL FOR PARTICIPATION
>> *Shared Task: Parallel Corpus Filtering for Low-Resource Conditions*
>> at the Fourth Conference on Machine Translation (WMT19)
>> http://statmt.org/wmt19/parallel-corpus-filtering.html
>>
>> This new shared task tackles the problem of cleaning noisy parallel
>> corpora. Following the WMT18 shared task on parallel corpus filtering,
>> we now
>> pose the problem under more challenging low-resource conditions. Instead of
>> German-English, this year there are two low-resource language pairs:
>> Nepali-English and Sinhala-English.
>> Otherwise, the shared task follows the same set-up: given a noisy
>> parallel corpus (crawled from the web), participants develop methods to
>> filter it to a smaller size of high quality sentence pairs.
>>
>> *DETAILS*
>> We provide a very noisy 35.5 million-word (English token count)
>> Nepali-English corpus and a 59.6 million-word Sinhala-English corpus
>> crawled from the web as part of the Paracrawl 
>> project. We ask participants to provide scores for each sentence in each of
>> the noisy parallel sets. The scores will be used to subsample sentence
>> pairs that amount to 5 million English words. The quality of the resulting
>> subsets is determined by the quality of a statistical machine translation
>> (Moses, phrase-based) and neural machine translation system (FAIRseq)
>> trained on this data. The quality of the machine translation system is
>> measured by BLEU score (sacrebleu) on a held-out test set of Wikipedia
>> translations for
>> Sinhala-English and Nepali-English.
>>
>> We also provide links to training data for the two language pairs. This
>> existing data comes from a variety of sources and is of mixed quality and
>> relevance. We provide a script to fetch and compose the training data.
>>
>> Note that the task addresses the challenge of *data quality* and *not
>> domain-relatedness* of the data for a particular use case. While we
>> provide a development and development test set that are also drawn from
>> Wikipedia articles, these may be very different from the final official
>> test set in terms of topics.
>> The provided raw parallel corpora are the outcome of a processing
>> pipeline that aimed for high recall at the cost of precision, so they are
>> very noisy. They exhibit noise of all kinds (wrong language in source and
>> target, sentence pairs that are not translations of each other, bad
>> language, incomplete or bad translations, etc.).
>>
>>
>> *IMPORTANT DATES*
>> Release of raw parallel data: February 8, 2019
>> Submission deadline for subsampled sets: May 10, 2019
>> System descriptions due: May 17, 2019
>> Announcement of results: June 3, 2019
>> Paper notification: June 7, 2019
>> Camera-ready for system descriptions: June 17, 2019
>>
>>
>> * ORGANIZERS*
>> Philipp Koehn (Johns Hopkins University / University of Edinburgh)
>> Francisco (Paco) Guzmán (Facebook)
>> Vishrav Chaudhary (Facebook)
>> Juan Pino (Facebook)
>>
>> More information is available at
>> http://statmt.org/wmt19/parallel-corpus-filtering.html
>>
>> Similarly to other WMT tasks, intending participants are encouraged to
>> register to https://groups.google.com/forum/#!forum/wmt-tasks for
>> discussions and announcements.
>>
>>
>>
>>
>>
>> -- Francisco (Paco) Guzman
>>
>> ___
>> Moses-support mailing list
>> Moses-support@mit.edu
>> 

Re: [Moses-support] German tokenizer may fail with numeric endings

2018-11-07 Thread Ergun Bicici
Could there be a workaround with a "(?!$)" negative lookahead added to the
prefix/suffix rules, so that if not at the end of a sentence, they will not
be split?
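
A sketch of that idea (illustrative perl only, not the actual
tokenizer.perl rule): split a trailing number-plus-period only when it ends
the sentence.

$ echo 'Änderungsanträge 2 und 3.' | perl -pe 's/(\d+)\.$/$1 ./'
Änderungsanträge 2 und 3 .
$ echo '1. one microsoft way from 9 to 1.' | perl -pe 's/(\d+)\.$/$1 ./'
1. one microsoft way from 9 to 1 .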

Ergun

On Wed, Nov 7, 2018 at 1:20 PM Ozan Çağlayan  wrote:

> Hi Hieu,
>
> Here is it with some test cases in the commit message:
> https://github.com/moses-smt/mosesdecoder/pull/204
>
> Thanks.
> ___
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>


-- 

Regards,
Ergun
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] German tokenizer may fail with numeric endings

2018-11-06 Thread Ergun Bicici
The funny part is trying all of 1-99 :)

"The prefix is actually a suffix of the sentence": this need not be true,
since there can be itemized lists: "1. one microsoft way from 9 to 1." Such
sentences can frequently be found in Europarl.

On Wed, Nov 7, 2018 at 1:46 AM Ozan Çağlayan  wrote:

> Yes, the rules are coming from the nonbreaking_prefixes files, which are
> text files listing which prefixes, when followed by a '.', should not be
> tokenized. But I think this rule should not be applied if the prefix is
> actually a suffix of the sentence. Similar situations arise for French and
> other languages as well. For French, "sec." is a non-breaking prefix which
> is the abbreviation for "seconds", but sec also means "dry". So if a
> sentence ends with the "dry" meaning of "sec.", the '.' is also not
> tokenized.
>
> When the size of the corpora goes to infinity, this means that all
> nonbreaking_prefixes for a language will end up in the model vocabulary for
> NMT.
> ___
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>


-- 

Regards,
Ergun
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] German tokenizer may fail with numeric endings

2018-11-06 Thread Ergun Bicici
There might be some rule that prevents this. The scripts contain
language-specific tokenization rules, and they are checked in sequence.

Did you try all 1-99? :)
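
The entries can be inspected directly (path from a standard mosesdecoder
checkout; this assumes the ordinals are listed one per line as bare
numbers, without the trailing dot):

$ grep -E '^[0-9]+$' mosesdecoder/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.de | head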

On Mon, Nov 5, 2018 at 9:15 PM Ozan Çağlayan  wrote:

> Hello,
>
> I just discovered that the German tokenizer does not split the final '.'
> if preceded by a number. This is because of the nonbreaking prefixes file,
> which lists ordinals in the form '<number>.'. Since the list only covers
> 1-99, the tokenizer works correctly for numbers > 99. Here's a sentence
> from europarl:
>
> $ echo 'Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll
> die Änderungsanträge 2 und *3.*' | tokenizer.perl -q -l de
> Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll die
> Änderungsanträge 2 und *3.*
>
> $ echo 'Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll
> die Änderungsanträge 2 und *100.*' | tokenizer.perl -q -l de
> Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll die
> Änderungsanträge 2 und *100 .*
>
>
>
> --
> Ozan Caglayan
> PhD student @ University of Le Mans
> Team LST -- Language and Speech Technology
> http://www.ozancaglayan.com
> ___
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>


-- 

Regards,
Ergun
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] mert error

2018-09-06 Thread Ergun Bicici
since PRO outputs more robust results.
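
For example, the tuning command quoted below would become something like
this (a sketch; --pairwise-ranked is the mert-moses.pl switch that selects
PRO, with the paths taken from the log below):

nice /dependencies/normalisation_demo/moses_kenlm10/scripts/training/mert-moses.pl \
  dev/twe_tok.dev.ori dev/twe_tok.dev.tgt \
  /opt/mosesdecoder/bin/moses model/moses.ini --pairwise-ranked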

On Thu, Sep 6, 2018 at 12:32 PM Ergun Bicici  wrote:

>
> Aside from the error, I recommend using PRO for tuning.
>
> On Thu, Sep 6, 2018 at 12:28 PM Claudia Matos Veliz <
> claudia.matosve...@ugent.be> wrote:
>
>> Hello,
>> I have trained an SMT model using Moses; everything was right with the
>> training, but when I tried to tune the model I got the following error
>> with SRILM:
>>
>> Using SCRIPTS_ROOTDIR:
>> /dependencies/normalisation_demo/moses_kenlm10/scripts
>> Assuming --mertdir=/dependencies/normalisation_demo/moses_kenlm10/bin
>> Assuming the tables are already filtered, reusing filtered/moses.ini
>> Asking moses for feature names and values from filtered/moses.ini
>> Executing: /opt/mosesdecoder/bin/moses  -config
>> filtered/moses.ini  -inputtype 0 -show-weights > ./features.list
>> Defined parameters (per moses.ini or switch):
>> config: filtered/moses.ini
>> distortion-limit: 6
>> feature: UnknownWordPenalty WordPenalty PhraseDictionaryMemory
>> name=TranslationModel0 table-limit=20 num-features=5
>> path=/home/claudia/NeuralMoses/twe_token_test/mert-work/filtered/phrase-table.0-0.1.1.gz
>> input-factor=0 output-factor=0 LexicalReordering name=LexicalReordering0
>> num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0
>> output-factor=0
>> path=/home/claudia/NeuralMoses/twe_token_test/mert-work/filtered/reordering-table.wbe-msd-bidirectional-fe
>> Distortion SRILM name=LM0 factor=0
>> path=/home/claudia/NeuralMoses/lm/model.5 order=5
>> input-factors: 0
>> inputtype: 0
>> mapping: 0 T 0
>> show-weights:
>> weight: UnknownWordPenalty0= 1 WordPenalty0= -1 TranslationModel0= 0.2
>> 0.2 0.2 0.2 0.2 LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3 Distortion0=
>> 0.3 LM0= 0.5
>> /opt/mosesdecoder/bin
>> line=UnknownWordPenalty
>> FeatureFunction: UnknownWordPenalty0 start: 0 end: 1
>> WEIGHT UnknownWordPenalty0=1.000,
>> line=WordPenalty
>> FeatureFunction: WordPenalty0 start: 1 end: 2
>> WEIGHT WordPenalty0=-1.000,
>> line=PhraseDictionaryMemory name=TranslationModel0 table-limit=20
>> num-features=5
>> path=/home/claudia/NeuralMoses/twe_token_test/mert-work/filtered/phrase-table.0-0.1.1.gz
>> input-factor=0 output-factor=0
>> FeatureFunction: TranslationModel0 start: 2 end: 7
>> WEIGHT TranslationModel0=0.200,0.200,0.200,0.200,0.200,
>> line=LexicalReordering name=LexicalReordering0 num-features=6
>> type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0
>> path=/home/claudia/NeuralMoses/twe_token_test/mert-work/filtered/reordering-table.wbe-msd-bidirectional-fe
>> FeatureFunction: LexicalReordering0 start: 7 end: 13
>> Initializing LexicalReordering..
>> Loading table into memory...done.
>> WEIGHT LexicalReordering0=0.300,0.300,0.300,0.300,0.300,0.300,
>> line=Distortion
>> FeatureFunction: Distortion0 start: 13 end: 14
>> WEIGHT Distortion0=0.300,
>> line=SRILM name=LM0 factor=0 path=/home/claudia/NeuralMoses/lm/model.5
>> order=5
>> ERROR:Unknown feature function:SRILM
>> Exit code: 1
>> Failed to run moses with the config filtered/moses.ini at
>> /dependencies/normalisation_demo/moses_kenlm10/scripts/training/
>> mert-moses.pl line 1271.
>>
>> The command I used was this:
>>
>> nice /dependencies/normalisation_demo/moses_kenlm10/scripts/training/
>> mert-moses.pl dev/twe_tok.dev.ori dev/twe_tok.dev.tgt
>> /opt/mosesdecoder/bin/moses model/moses.ini
>>
>> Any help???
>> Thanks!!
>> Claudia
>>
>> ___
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>
>
> --
>
> Regards,
> Ergun
>
>
>

-- 

Regards,
Ergun
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Some one E-MAIL TO U Re: Fwd: Different translations are obtained from the same decoder without alignment information

2018-08-24 Thread Ergun Bicici
Ok, thank you.

On Fri, Aug 24, 2018 at 7:22 PM Hieu Hoang  wrote:

> If you want an email to the mailing list, the archives are here
>https://www.mail-archive.com/moses-support@mit.edu/
>
> Hieu Hoang
> http://statmt.org/hieu
>
>
> On Fri, 24 Aug 2018 at 17:04, Ergun Bicici  wrote:
>
>>
>> Dear Hieu,
>>
>> I did not receive any further message from metaseby...@gmail.com. Can
>> you forward it to me if you received any other message from him until
>> now (7:03 pm)?
>>
>> In the meantime, I am waiting for the new results according to your
>> suggestion.
>>
>> Thanks,
>> Ergun
>>
>> -- Forwarded message -
>> From: Ergun Bicici 
>> Date: Fri, Aug 24, 2018 at 6:47 PM
>> Subject: Re: Some one E-MAIL TO U Re: [Moses-support] Fwd: Different
>> translations are obtained from the same decoder without alignment
>> information
>> To: 
>>
>>
>>
>> I received only this:
>>
>> Bereketab Birhnu 
>> 5:53 PM (53 minutes ago)
>> to me
>>
>> Thanks
>>
>> I am not expecting any more, nor did I want any.
>>
>> Do you have any further message?
>>
>> Ergun
>>
>> On Fri, Aug 24, 2018 at 6:45 PM Ergun Bicici  wrote:
>>
>>> Why?
>>>
>>> On Fri, Aug 24, 2018 at 6:42 PM Bereketab Birhnu 
>>> wrote:
>>>
>>>> Check your emails
>>>>
>>>
>>>
>>> --
>>>
>>> Regards,
>>> Ergun
>>>
>>>
>>>
>>
>> --
>>
>> Regards,
>> Ergun
>>
>>
>>
>>
>> --
>>
>> Regards,
>> Ergun
>>
>>
>>

-- 

Regards,
Ergun
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] Fwd: Some one E-MAIL TO U Re: Fwd: Different translations are obtained from the same decoder without alignment information

2018-08-24 Thread Ergun Bicici
Dear Hieu,

I did not receive any further message from metaseby...@gmail.com. Can you
forward it to me if you received any other message from him until now
(7:03 pm)?

In the meantime, I am waiting for the new results according to your
suggestion.

Thanks,
Ergun

-- Forwarded message -
From: Ergun Bicici 
Date: Fri, Aug 24, 2018 at 6:47 PM
Subject: Re: Some one E-MAIL TO U Re: [Moses-support] Fwd: Different
translations are obtained from the same decoder without alignment
information
To: 



I received only this:

Bereketab Birhnu 
5:53 PM (53 minutes ago)
to me

Thanks

I am not expecting any more, nor did I want any.

Do you have any further message?

Ergun

On Fri, Aug 24, 2018 at 6:45 PM Ergun Bicici  wrote:

> Why?
>
> On Fri, Aug 24, 2018 at 6:42 PM Bereketab Birhnu 
> wrote:
>
>> Check your emails
>>
>
>
> --
>
> Regards,
> Ergun
>
>
>

-- 

Regards,
Ergun




-- 

Regards,
Ergun
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Fwd: Different translations are obtained from the same decoder without alignment information

2018-08-24 Thread Ergun Bicici
I am still waiting for the new results.

Ergun

On Fri, Aug 24, 2018 at 5:53 PM Bereketab Birhnu 
wrote:

> Thanks
>
> On Friday, August 24, 2018, Ergun Bicici  wrote:
>
>>
>> ok.
>>
>> On Fri, Aug 24, 2018 at 5:31 PM Hieu Hoang  wrote:
>>
>>> could you run with alignments, but WITHOUT -unknown-word-prefix UNK.
>>>
>>> alignments shouldn't change the translation but the OOV prefix may do
>>>
>>> Hieu Hoang
>>> http://statmt.org/hieu
>>>
>>>
>>> On Fri, 24 Aug 2018 at 15:29, Ergun Bicici  wrote:
>>>
>>>>
>>>> ok, thank you. I'll upload and send you a link.
>>>>
>>>> On Fri, Aug 24, 2018 at 5:27 PM Hieu Hoang  wrote:
>>>>
>>>>> that would be a bug.
>>>>>
>>>>> could you please make the model and input files available for
>>>>> download. I'll check it out
>>>>>
>>>>> Hieu Hoang
>>>>> http://statmt.org/hieu
>>>>>
>>>>>
>>>>> On Fri, 24 Aug 2018 at 15:15, Ergun Bicici  wrote:
>>>>>
>>>>>>
>>>>>> only the evaluation decoding steps are repeated that are steps 10, 9,
>>>>>> and 7 in the following steps in EMS output:
>>>>>> 48 TRAINING:consolidate ->  re-using (1)
>>>>>> 47 TRAINING:prepare-data -> re-using (1)
>>>>>> 46 TRAINING:run-giza -> re-using (1)
>>>>>> 45 TRAINING:run-giza-inverse -> re-using (1)
>>>>>> 44 TRAINING:symmetrize-giza ->  re-using (1)
>>>>>> 43 TRAINING:build-lex-trans ->  re-using (1)
>>>>>> 40 TRAINING:build-osm ->re-using (1)
>>>>>> 39 TRAINING:extract-phrases ->  re-using (1)
>>>>>> 38 TRAINING:build-reordering -> re-using (1)
>>>>>> 37 TRAINING:build-ttable -> re-using (1)
>>>>>> 34 TRAINING:create-config ->re-using (1)
>>>>>> 28 TUNING:truecase-input -> re-using (1)
>>>>>> 24 TUNING:truecase-reference -> re-using (1)
>>>>>> 21 TUNING:filter -> re-using (1)
>>>>>> 20 TUNING:apply-filter ->   re-using (1)
>>>>>> 19 TUNING:tune ->   re-using (1)
>>>>>> 18 TUNING:apply-weights ->  re-using (1)
>>>>>> 15 EVALUATION:test:truecase-input ->re-using (1)
>>>>>> 12 EVALUATION:test:filter ->re-using (1)
>>>>>> 11 EVALUATION:test:apply-filter ->  re-using (1)
>>>>>>
>>>>>>
>>>>>>
>>>>>> 10 EVALUATION:test:decode -> run
>>>>>> 9 EVALUATION:test:remove-markup -> run
>>>>>> 7 EVALUATION:test:detruecase-output -> run
>>>>>> 3 EVALUATION:test:multi-bleu-c -> run
>>>>>> 2 EVALUATION:test:analysis-coverage ->  re-using (1)
>>>>>> 1 EVALUATION:test:analysis-precision -> run
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 24, 2018 at 4:39 PM Hieu Hoang 
>>>>>> wrote:
>>>>>>
>>>>>>> are you rerunning tuning for each case? Or are you using exactly the
>>>>>>> same moses.ini file for the with and with alignment experiments?
>>>>>>>
>>>>>>> Hieu Hoang
>>>>>>> http://statmt.org/hieu
>>>>>>>
>>>>>>>
>>>>>>> On Fri, 24 Aug 2018 at 14:34, Ergun Bicici  wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Dear Moses maintainers,
>>>>>>>>
>>>>>>>> I discovered that the translations obtained differ when alignment
>>>>>>>> flags (--mark-unknown --unknown-word-prefix UNK
>>>>>>>> --print-alignment-info) are used. Comparison table is attached
>>>>>>>> (en-ru and ru-en are being recomputed). We expect them to be the same 
>>>>>>>> since
>>>>>>>> alignment flags only print additional information and they are not 
>>>>>>>> supposed
>>>>>>>> to alter decoding. In both, the same EMS system was re-run with the
>>>>>>>> alignment information flags or not.
>>>>>>>>
>>>>>>>>

Re: [Moses-support] Fwd: Different translations are obtained from the same decoder without alignment information

2018-08-24 Thread Ergun Bicici
Dear Tom,

Thank you for sharing your finding. This does not apply in this case, since
I re-compiled the code to build the initial Moses 4.0 model; the moses
binary was not changed afterwards. Even though I am observing different
scores, they are better when the alignment flags are included. I am waiting
for the de-en results with the "-print-alignment-info" flag.

I previously tried to debug a decentralized Moses server-client setup that
showed similar symptoms, where the error could come from additional sources
such as the network being interrupted, issues with the syncing of buffers,
etc. With a binarized version you get a translation, but the translation
options are somewhat fixed. Could Moses provide a better translation? It
turns out that truecasing before detruecasing improves the scores by 0.002
BLEU, for instance, on average over 8 translation directions in WMT'18.
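
As a sketch, that post-processing variant looks like this (recaser scripts
from the Moses scripts/ tree; the truecasing model and file names are
placeholders):

mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en \
  < test.cleaned > test.tc
mosesdecoder/scripts/recaser/detruecase.perl < test.tc > test.detruecased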

Regards,
Ergun
bicici.github.com

On Fri, Aug 24, 2018 at 5:55 PM Tom Hoar  wrote:

> I remember 3 years ago, I reported a similar (same?) problem with
> --print-alignment-inf flag, without EMS. The time, I was using the legacy
> binarized translation and reordering table and everything was great. Then,
> I started testing the compact binarized format. The flag caused
> translations to change and some were even lost (blank lines). No one on the
> support list knew of any reason and I didn't have bandwidth to
> troubleshoot. Instead, I continued using the legacy binarized files. Maybe
> try changing to the legacy binarized files and see if the problem
> disappears. This could help you narrow-down where to look.
>
> Best regards,
> Tom Hoar
> *Slate Rocks, LLC*
> Web: https://www.slate.rocks
> Thailand Mobile: +66 87 345-1875 <+66873451875>
> Skype: tahoar
>
> On 8/24/2018 9:31 PM, moses-support-requ...@mit.edu wrote:
>
> Date: Fri, 24 Aug 2018 15:31:14 +0100
> From: Hieu Hoang  
> Subject: Re: [Moses-support] Fwd: Different translations are obtained
>   from the same decoder without alignment information
> To: Ergun Bicici  
> Cc: moses-support  
> Message-ID:
>
> 
> Content-Type: text/plain; charset="utf-8"
>
> could you run with alignments, but WITHOUT -unknown-word-prefix UNK.
>
> alignments shouldn't change the translation but the OOV prefix may do
>
Hieu Hoang
http://statmt.org/hieu
>
>
> On Fri, 24 Aug 2018 at 15:29, Ergun Bicici  
>  wrote:
>
>
> ok, thank you. I'll upload and send you a link.
>
> On Fri, Aug 24, 2018 at 5:27 PM Hieu Hoang  
>  wrote:
>
>
> that would be a bug.
>
> could you please make the model and input files available for download.
> I'll check it out
>
Hieu Hoang
http://statmt.org/hieu
>
>
> On Fri, 24 Aug 2018 at 15:15, Ergun Bicici  
>  wrote:
>
>
> only the evaluation decoding steps are repeated that are steps 10, 9,
> and 7 in the following steps in EMS output:
> 48 TRAINING:consolidate ->  re-using (1)
> 47 TRAINING:prepare-data -> re-using (1)
> 46 TRAINING:run-giza -> re-using (1)
> 45 TRAINING:run-giza-inverse -> re-using (1)
> 44 TRAINING:symmetrize-giza ->  re-using (1)
> 43 TRAINING:build-lex-trans ->  re-using (1)
> 40 TRAINING:build-osm ->re-using (1)
> 39 TRAINING:extract-phrases ->  re-using (1)
> 38 TRAINING:build-reordering -> re-using (1)
> 37 TRAINING:build-ttable -> re-using (1)
> 34 TRAINING:create-config ->re-using (1)
> 28 TUNING:truecase-input -> re-using (1)
> 24 TUNING:truecase-reference -> re-using (1)
> 21 TUNING:filter -> re-using (1)
> 20 TUNING:apply-filter ->   re-using (1)
> 19 TUNING:tune ->   re-using (1)
> 18 TUNING:apply-weights ->  re-using (1)
> 15 EVALUATION:test:truecase-input ->re-using (1)
> 12 EVALUATION:test:filter ->re-using (1)
> 11 EVALUATION:test:apply-filter ->  re-using (1)
>
>
>
> 10 EVALUATION:test:decode -> run
> 9 EVALUATION:test:remove-markup -> run
> 7 EVALUATION:test:detruecase-output -> run
> 3 EVALUATION:test:multi-bleu-c -> run
> 2 EVALUATION:test:analysis-coverage ->  re-using (1)
> 1 EVALUATION:test:analysis-precision -> run
>
>
> On Fri, Aug 24, 2018 at 4:39 PM Hieu Hoang  
>  wrote:
>
>
> are you rerunning tuning for each case? Or are you using exactly the
> same moses.ini file for the with and with alignment experiments?
>
Hieu Hoang
http://statmt.org/hieu
>
>
> On Fri, 24 Aug 2018 at 14:34, Ergun Bicici  
>  wrote:
>
>
> Dear Moses maintainers,
>
> I discovered that the translations obtained differ when alignment
> flags (--mark-unknown --unknown-word-prefix UNK --print-alignment-info)
> are used. Comparison table is attached (en-ru and r

Re: [Moses-support] Fwd: Different translations are obtained from the same decoder without alignment information

2018-08-24 Thread Ergun Bicici
ok.

On Fri, Aug 24, 2018 at 5:31 PM Hieu Hoang  wrote:

> could you run with alignments, but WITHOUT -unknown-word-prefix UNK.
>
> alignments shouldn't change the translation but the OOV prefix may do
>
> Hieu Hoang
> http://statmt.org/hieu
>
>
> On Fri, 24 Aug 2018 at 15:29, Ergun Bicici  wrote:
>
>>
>> ok, thank you. I'll upload and send you a link.
>>
>> On Fri, Aug 24, 2018 at 5:27 PM Hieu Hoang  wrote:
>>
>>> that would be a bug.
>>>
>>> could you please make the model and input files available for download.
>>> I'll check it out
>>>
>>> Hieu Hoang
>>> http://statmt.org/hieu
>>>
>>>
>>> On Fri, 24 Aug 2018 at 15:15, Ergun Bicici  wrote:
>>>
>>>>
>>>> only the evaluation decoding steps are repeated that are steps 10, 9,
>>>> and 7 in the following steps in EMS output:
>>>> 48 TRAINING:consolidate ->  re-using (1)
>>>> 47 TRAINING:prepare-data -> re-using (1)
>>>> 46 TRAINING:run-giza -> re-using (1)
>>>> 45 TRAINING:run-giza-inverse -> re-using (1)
>>>> 44 TRAINING:symmetrize-giza ->  re-using (1)
>>>> 43 TRAINING:build-lex-trans ->  re-using (1)
>>>> 40 TRAINING:build-osm ->re-using (1)
>>>> 39 TRAINING:extract-phrases ->  re-using (1)
>>>> 38 TRAINING:build-reordering -> re-using (1)
>>>> 37 TRAINING:build-ttable -> re-using (1)
>>>> 34 TRAINING:create-config ->re-using (1)
>>>> 28 TUNING:truecase-input -> re-using (1)
>>>> 24 TUNING:truecase-reference -> re-using (1)
>>>> 21 TUNING:filter -> re-using (1)
>>>> 20 TUNING:apply-filter ->   re-using (1)
>>>> 19 TUNING:tune ->   re-using (1)
>>>> 18 TUNING:apply-weights ->  re-using (1)
>>>> 15 EVALUATION:test:truecase-input ->re-using (1)
>>>> 12 EVALUATION:test:filter ->re-using (1)
>>>> 11 EVALUATION:test:apply-filter ->  re-using (1)
>>>>
>>>>
>>>>
>>>> 10 EVALUATION:test:decode -> run
>>>> 9 EVALUATION:test:remove-markup -> run
>>>> 7 EVALUATION:test:detruecase-output -> run
>>>> 3 EVALUATION:test:multi-bleu-c -> run
>>>> 2 EVALUATION:test:analysis-coverage ->  re-using (1)
>>>> 1 EVALUATION:test:analysis-precision -> run
>>>>
>>>>
>>>> On Fri, Aug 24, 2018 at 4:39 PM Hieu Hoang  wrote:
>>>>
>>>>> are you rerunning tuning for each case? Or are you using exactly the
>>>>> same moses.ini file for the with and with alignment experiments?
>>>>>
>>>>> Hieu Hoang
>>>>> http://statmt.org/hieu
>>>>>
>>>>>
>>>>> On Fri, 24 Aug 2018 at 14:34, Ergun Bicici  wrote:
>>>>>
>>>>>>
>>>>>> Dear Moses maintainers,
>>>>>>
>>>>>> I discovered that the translations obtained differ when alignment
>>>>>> flags (--mark-unknown --unknown-word-prefix UNK --print-alignment-info)
>>>>>> are used. Comparison table is attached (en-ru and ru-en are being
>>>>>> recomputed). We expect them to be the same since alignment flags only 
>>>>>> print
>>>>>> additional information and they are not supposed to alter decoding. In
>>>>>> both, the same EMS system was re-run with the alignment information flags
>>>>>> or not.
>>>>>>
>>>>>>- Average of the absolute difference is 0.0094 BLEU (about 1 BLEU
>>>>>>point).
>>>>>>- Average of the difference is 0.0051 BLEU (about 0.5 BLEU
>>>>>>points, results are better with alignment flags).
>>>>>>
>>>>>> 
>>>>>>
>>>>>> /opt/Programs/SMT/moses/mosesdecoder/bin/moses --version
>>>>>>
>>>>>> Moses code version (git tag or commit hash):
>>>>>>   mmt-mvp-v0.12.1-2775-g65c75ff07-dirty
>>>>>> Libraries used:
>>>>>>  Boost  version 1.62.0
>>>>>>
>>>>>> git status
>>>>>> On branch RELEASE-4.0
>>>>>> Your branch is up to date with 'origin/RELEASE-4.0'.
>>>>>>
>>>>>>
>>>>>> Note: Using alignment information to recase tokens was tried in [1]
>>>>>> for en-fi and en-tr to claim positive results. We tried this method in 
>>>>>> all
>>>>>> translation directions we considered and, as can be seen in the align row,
>>>>>> this only improves the performance for tr-en and en-tr; for tr-en, Moses
>>>>>> provides better translations without the alignment flags.
>>>>>> [1]The JHU Machine Translation Systems for WMT 2016
>>>>>> Shuoyang Ding, Kevin Duh, Huda Khayrallah, Philipp Koehn and Matt Post
>>>>>> http://www.statmt.org/wmt16/pdf/W16-2310.pdf
>>>>>>
>>>>>>
>>>>>> Best Regards,
>>>>>> Ergun
>>>>>>
>>>>>> Ergun Biçici
>>>>>> http://bicici.github.com/ <http://ergunbicici.blogspot.com/>
>>>>>>
>>>>>> ___
>>>>>> Moses-support mailing list
>>>>>> Moses-support@mit.edu
>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>
>>>>>
>>>>
>>>> --
>>>>
>>>> Regards,
>>>> Ergun
>>>>
>>>>
>>>>
>>
>> --
>>
>> Regards,
>> Ergun
>>
>>
>>

-- 

Regards,
Ergun
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Fwd: Different translations are obtained from the same decoder without alignment information

2018-08-24 Thread Ergun Bicici
The tuning step is not repeated. Decoding uses the same moses.ini and the
same input, but different parameters:
moses/mosesdecoder/65c75ff/bin/moses -search-algorithm 1
-cube-pruning-pop-limit 5000 -s 5000 -threads 8 -text-type "test" -v 0 -f
wmt18_en-de/evaluation/test.filtered.ini.7 <
wmt18_en-de/evaluation/test.input.tc.1 >
wmt18_en-de/evaluation/test.output.7

vs. with alignment:
moses/mosesdecoder/65c75ff/bin/moses -search-algorithm 1
-cube-pruning-pop-limit 5000 -s 5000 -threads 8 --mark-unknown
--unknown-word-prefix UNK_ --print-alignment-info -text-type "test" -v 0 -f
wmt18_en-de/evaluation/test.filtered.ini.7  <
wmt18_en-de/evaluation/test.input.tc.1 >
wmt18_en-de/evaluation/test.output.9

Both runs are followed by the same post-processing steps:
moses/mosesdecoder/scripts/ems/support/remove-segmentation-markup.perl <
wmt18_en-de/evaluation/test.output.7 > wmt18_en-de/evaluation/test.cleaned.7
moses/mosesdecoder/scripts/recaser/detruecase.perl <
wmt18_en-de/evaluation/test.cleaned.7 >
wmt18_en-de/evaluation/test.truecased.7
and equivalently with:
moses/mosesdecoder/scripts/ems/support/remove-segmentation-markup.perl <
wmt18_en-de/evaluation/test.output.9 > wmt18_en-de/evaluation/test.cleaned.9
moses/mosesdecoder/scripts/recaser/detruecase.perl <
wmt18_en-de/evaluation/test.cleaned.9 >
wmt18_en-de/evaluation/test.truecased.9

The scoring step uses test.truecased.7 and test.truecased.9.
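
A quick way to confirm that the two runs really differ (file names as
above):

diff wmt18_en-de/evaluation/test.truecased.7 \
  wmt18_en-de/evaluation/test.truecased.9 | head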

Ergun

On Fri, Aug 24, 2018 at 5:15 PM Ergun Bicici  wrote:

>
> only the evaluation decoding steps are repeated, which are steps 10, 9, and
> 7 in the following list of steps in the EMS output:
> 48 TRAINING:consolidate ->  re-using (1)
> 47 TRAINING:prepare-data -> re-using (1)
> 46 TRAINING:run-giza -> re-using (1)
> 45 TRAINING:run-giza-inverse -> re-using (1)
> 44 TRAINING:symmetrize-giza ->  re-using (1)
> 43 TRAINING:build-lex-trans ->  re-using (1)
> 40 TRAINING:build-osm ->re-using (1)
> 39 TRAINING:extract-phrases ->  re-using (1)
> 38 TRAINING:build-reordering -> re-using (1)
> 37 TRAINING:build-ttable -> re-using (1)
> 34 TRAINING:create-config ->re-using (1)
> 28 TUNING:truecase-input -> re-using (1)
> 24 TUNING:truecase-reference -> re-using (1)
> 21 TUNING:filter -> re-using (1)
> 20 TUNING:apply-filter ->   re-using (1)
> 19 TUNING:tune ->   re-using (1)
> 18 TUNING:apply-weights ->  re-using (1)
> 15 EVALUATION:test:truecase-input ->re-using (1)
> 12 EVALUATION:test:filter ->re-using (1)
> 11 EVALUATION:test:apply-filter ->  re-using (1)
>
>
>
> 10 EVALUATION:test:decode -> run
> 9 EVALUATION:test:remove-markup -> run
> 7 EVALUATION:test:detruecase-output -> run
> 3 EVALUATION:test:multi-bleu-c -> run
> 2 EVALUATION:test:analysis-coverage ->  re-using (1)
> 1 EVALUATION:test:analysis-precision -> run
>
>
> On Fri, Aug 24, 2018 at 4:39 PM Hieu Hoang  wrote:
>
>> are you rerunning tuning for each case? Or are you using exactly the same
>> moses.ini file for the with and with alignment experiments?
>>
>> Hieu Hoang
>> http://statmt.org/hieu
>>
>>
>> On Fri, 24 Aug 2018 at 14:34, Ergun Bicici  wrote:
>>
>>>
>>> Dear Moses maintainers,
>>>
>>> I discovered that the translations obtained differ when alignment flags 
>>> (--mark-unknown
>>> --unknown-word-prefix UNK --print-alignment-info) are used. Comparison
>>> table is attached (en-ru and ru-en are being recomputed). We expect them to
>>> be the same since alignment flags only print additional information and
>>> they are not supposed to alter decoding. In both, the same EMS system was
>>> re-run with the alignment information flags or not.
>>>
>>>- Average of the absolute difference is 0.0094 BLEU (about 1 BLEU
>>>point).
>>>- Average of the difference is 0.0051 BLEU (about 0.5 BLEU points,
>>>results are better with alignment flags).
>>>
>>> 
>>>
>>> /opt/Programs/SMT/moses/mosesdecoder/bin/moses --version
>>>
>>> Moses code version (git tag or commit hash):
>>>   mmt-mvp-v0.12.1-2775-g65c75ff07-dirty
>>> Libraries used:
>>>  Boost  version 1.62.0
>>>
>>> git status
>>> On branch RELEASE-4.0
>>> Your branch is up to date with 'origin/RELEASE-4.0'.
>>>
>>>
>>> Note: Using alignment information to recase tokens was tried in [1] for
>>> en-fi and en-tr to claim positive results. We tried this method in all
>>> translation directions we considered as as can be seen in the align row,
>>> this only 

Re: [Moses-support] Fwd: Different translations are obtained from the same decoder without alignment information

2018-08-24 Thread Ergun Bicici
ok, thank you. I'll upload and send you a link.

On Fri, Aug 24, 2018 at 5:27 PM Hieu Hoang  wrote:

> that would be a bug.
>
> could you please make the model and input files available for download.
> I'll check it out
>
> Hieu Hoang
> http://statmt.org/hieu
>
>
> On Fri, 24 Aug 2018 at 15:15, Ergun Bicici  wrote:
>
>>
>> only the evaluation decoding steps are repeated that are steps 10, 9, and
>> 7 in the following steps in EMS output:
>> 48 TRAINING:consolidate ->  re-using (1)
>> 47 TRAINING:prepare-data -> re-using (1)
>> 46 TRAINING:run-giza -> re-using (1)
>> 45 TRAINING:run-giza-inverse -> re-using (1)
>> 44 TRAINING:symmetrize-giza ->  re-using (1)
>> 43 TRAINING:build-lex-trans ->  re-using (1)
>> 40 TRAINING:build-osm ->re-using (1)
>> 39 TRAINING:extract-phrases ->  re-using (1)
>> 38 TRAINING:build-reordering -> re-using (1)
>> 37 TRAINING:build-ttable -> re-using (1)
>> 34 TRAINING:create-config ->re-using (1)
>> 28 TUNING:truecase-input -> re-using (1)
>> 24 TUNING:truecase-reference -> re-using (1)
>> 21 TUNING:filter -> re-using (1)
>> 20 TUNING:apply-filter ->   re-using (1)
>> 19 TUNING:tune ->   re-using (1)
>> 18 TUNING:apply-weights ->  re-using (1)
>> 15 EVALUATION:test:truecase-input ->re-using (1)
>> 12 EVALUATION:test:filter ->re-using (1)
>> 11 EVALUATION:test:apply-filter ->  re-using (1)
>>
>>
>>
>> 10 EVALUATION:test:decode -> run
>> 9 EVALUATION:test:remove-markup -> run
>> 7 EVALUATION:test:detruecase-output -> run
>> 3 EVALUATION:test:multi-bleu-c -> run
>> 2 EVALUATION:test:analysis-coverage ->  re-using (1)
>> 1 EVALUATION:test:analysis-precision -> run
>>
>>
>> On Fri, Aug 24, 2018 at 4:39 PM Hieu Hoang  wrote:
>>
>>> are you rerunning tuning for each case? Or are you using exactly the
>>> same moses.ini file for the with and with alignment experiments?
>>>
>>> Hieu Hoang
>>> http://statmt.org/hieu
>>>
>>>
>>> On Fri, 24 Aug 2018 at 14:34, Ergun Bicici  wrote:
>>>
>>>>
>>>> Dear Moses maintainers,
>>>>
>>>> I discovered that the translations obtained differ when alignment flags
>>>> (--mark-unknown --unknown-word-prefix UNK --print-alignment-inf) are
>>>> used. Comparison table is attached (en-ru and ru-en are being recomputed).
>>>> We expect them to be the same since alignment flags only print additional
>>>> information and they are not supposed to alter decoding. In both, the same
>>>> EMS system was re-run with the alignment information flags or not.
>>>>
>>>>- Average of the absolute difference is 0.0094 BLEU (about 1 BLEU
>>>>point).
>>>>- Average of the difference is 0.0051 BLEU (about 0.5 BLEU points,
>>>>results are better with alignment flags).
>>>>
>>>> 
>>>>
>>>> /opt/Programs/SMT/moses/mosesdecoder/bin/moses --version
>>>>
>>>> Moses code version (git tag or commit hash):
>>>>   mmt-mvp-v0.12.1-2775-g65c75ff07-dirty
>>>> Libraries used:
>>>>  Boost  version 1.62.0
>>>>
>>>> git status
>>>> On branch RELEASE-4.0
>>>> Your branch is up to date with 'origin/RELEASE-4.0'.
>>>>
>>>>
>>>> Note: Using alignment information to recase tokens was tried in [1] for
>>>> en-fi and en-tr to claim positive results. We tried this method in all
>>>> translation directions we considered and, as can be seen in the align row,
>>>> this only improves the performance for tr-en and en-tr; for tr-en, Moses
>>>> provides better translations without the alignment flags.
>>>> [1]The JHU Machine Translation Systems for WMT 2016
>>>> Shuoyang Ding, Kevin Duh, Huda Khayrallah, Philipp Koehn and Matt Post
>>>> http://www.statmt.org/wmt16/pdf/W16-2310.pdf
>>>>
>>>>
>>>> Best Regards,
>>>> Ergun
>>>>
>>>> Ergun Biçici
>>>> http://bicici.github.com/
>>>>
>>>> ___
>>>> Moses-support mailing list
>>>> Moses-support@mit.edu
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>
>>>
>>
>> --
>>
>> Regards,
>> Ergun
>>
>>
>>

-- 

Regards,
Ergun
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Fwd: Different translations are obtained from the same decoder without alignment information

2018-08-24 Thread Ergun Bicici
only the evaluation decoding steps are repeated, which are steps 10, 9, and 7
in the following list of steps in the EMS output:
48 TRAINING:consolidate ->  re-using (1)
47 TRAINING:prepare-data -> re-using (1)
46 TRAINING:run-giza -> re-using (1)
45 TRAINING:run-giza-inverse -> re-using (1)
44 TRAINING:symmetrize-giza ->  re-using (1)
43 TRAINING:build-lex-trans ->  re-using (1)
40 TRAINING:build-osm ->re-using (1)
39 TRAINING:extract-phrases ->  re-using (1)
38 TRAINING:build-reordering -> re-using (1)
37 TRAINING:build-ttable -> re-using (1)
34 TRAINING:create-config ->re-using (1)
28 TUNING:truecase-input -> re-using (1)
24 TUNING:truecase-reference -> re-using (1)
21 TUNING:filter -> re-using (1)
20 TUNING:apply-filter ->   re-using (1)
19 TUNING:tune ->   re-using (1)
18 TUNING:apply-weights ->  re-using (1)
15 EVALUATION:test:truecase-input ->re-using (1)
12 EVALUATION:test:filter ->re-using (1)
11 EVALUATION:test:apply-filter ->  re-using (1)



10 EVALUATION:test:decode -> run
9 EVALUATION:test:remove-markup -> run
7 EVALUATION:test:detruecase-output -> run
3 EVALUATION:test:multi-bleu-c -> run
2 EVALUATION:test:analysis-coverage ->  re-using (1)
1 EVALUATION:test:analysis-precision -> run


On Fri, Aug 24, 2018 at 4:39 PM Hieu Hoang  wrote:

> are you rerunning tuning for each case? Or are you using exactly the same
> moses.ini file for the with and with alignment experiments?
>
> Hieu Hoang
> http://statmt.org/hieu
>
>
> On Fri, 24 Aug 2018 at 14:34, Ergun Bicici  wrote:
>
>>
>> Dear Moses maintainers,
>>
>> I discovered that the translations obtained differ when alignment flags 
>> (--mark-unknown
>> --unknown-word-prefix UNK --print-alignment-info) are used. Comparison
>> table is attached (en-ru and ru-en are being recomputed). We expect them to
>> be the same since alignment flags only print additional information and
>> they are not supposed to alter decoding. In both cases, the same EMS system
>> was re-run, with and without the alignment information flags.
>>
>>- Average of the absolute difference is 0.0094 BLEU (about 1 BLEU
>>point).
>>- Average of the difference is 0.0051 BLEU (about 0.5 BLEU points,
>>results are better with alignment flags).
>>
>> 
>>
>> /opt/Programs/SMT/moses/mosesdecoder/bin/moses --version
>>
>> Moses code version (git tag or commit hash):
>>   mmt-mvp-v0.12.1-2775-g65c75ff07-dirty
>> Libraries used:
>>  Boost  version 1.62.0
>>
>> git status
>> On branch RELEASE-4.0
>> Your branch is up to date with 'origin/RELEASE-4.0'.
>>
>>
>> Note: Using alignment information to recase tokens was tried in [1] for
>> en-fi and en-tr to claim positive results. We tried this method in all
>> translation directions we considered and, as can be seen in the align row,
>> this only improves the performance for tr-en and en-tr; for tr-en, Moses
>> provides better translations without the alignment flags.
>> [1]The JHU Machine Translation Systems for WMT 2016
>> Shuoyang Ding, Kevin Duh, Huda Khayrallah, Philipp Koehn and Matt Post
>> http://www.statmt.org/wmt16/pdf/W16-2310.pdf
>>
>>
>> Best Regards,
>> Ergun
>>
>> Ergun Biçici
>> http://bicici.github.com/
>>
>> ___
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>

-- 

Regards,
Ergun
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] Fwd: Different translations are obtained from the same decoder without alignment information

2018-08-24 Thread Ergun Bicici
Dear Moses maintainers,

I discovered that the translations obtained differ when alignment
flags (--mark-unknown
--unknown-word-prefix UNK --print-alignment-info) are used. Comparison table
is attached (en-ru and ru-en are being recomputed). We expect them to be
the same since alignment flags only print additional information and they
are not supposed to alter decoding. In both cases, the same EMS system was
re-run, with and without the alignment information flags.

   - Average of the absolute difference is 0.0094 BLEU (about 1 BLEU
   point).
   - Average of the difference is 0.0051 BLEU (about 0.5 BLEU points,
   results are better with alignment flags).
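
The per-direction scores can be reproduced by scoring each pair of runs with
multi-bleu.perl from the Moses scripts tree (a sketch; the hypothesis and
reference file names are placeholders):

mosesdecoder/scripts/generic/multi-bleu.perl test.ref < test.truecased.with-align
mosesdecoder/scripts/generic/multi-bleu.perl test.ref < test.truecased.without-align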



/opt/Programs/SMT/moses/mosesdecoder/bin/moses --version

Moses code version (git tag or commit hash):
  mmt-mvp-v0.12.1-2775-g65c75ff07-dirty
Libraries used:
 Boost  version 1.62.0

git status
On branch RELEASE-4.0
Your branch is up to date with 'origin/RELEASE-4.0'.


Note: Using alignment information to recase tokens was tried in [1] for
en-fi and en-tr to claim positive results. We tried this method in all
translation directions we considered and, as can be seen in the align row,
this only improves the performance for tr-en and en-tr; for tr-en, Moses
provides better translations without the alignment flags.
[1]The JHU Machine Translation Systems for WMT 2016
Shuoyang Ding, Kevin Duh, Huda Khayrallah, Philipp Koehn and Matt Post
http://www.statmt.org/wmt16/pdf/W16-2310.pdf


Best Regards,
Ergun

Ergun Biçici
http://bicici.github.com/ 


Moses4.0_translation_comparisonwith_alignment.pdf
Description: Adobe PDF document
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] https://www.mail-archive.com/moses-support@mit.edu/ is still not reached from Turkey after 3 years

2018-04-23 Thread Ergun Bicici
Dear Hieu,

Thank you for your message. Is there a mirror website to access, or could
the archives and the current group be moved to another domain that may be
more secure against impersonation or internet access cutoff? I found that
http://www.mail-archive.com/ is blocked for some reason. For instance,
Google Groups might be a possibility, as it may be more focused on personal
communication rather than forums/blogs, and the raised issue might stem from
this. My main concern is that I would like to be able to send messages to
moses-support and see that they are there at least for some time, so that
even if there is some man-in-the-middle attack, the messages can still find
their way.

Anyway, I did the reporting part from my side to gain access to
moses-support from Turkey without internet tunnelling. I don't know the
exact reason for blocking Wikipedia, but I heard that it is due to some
content; my suggestion would be to block only that part of Wikipedia, since
the full block is damaging internet search, where Wikipedia entries usually
come up.

Also, when I tried to download the CzEng dataset, we encountered a
difficulty with Ondřej Bojar; to address it, we talked about using md5sums
to verify downloaded files. Maybe there was some interception of the network
download. Therefore, md5sums could be provided with shared datasets.
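
As a sketch of that practice (file name reused from the WMT baseline
download mentioned earlier in this list):

# on the publisher's side
md5sum training-parallel-nc-v8.tgz > training-parallel-nc-v8.tgz.md5
# on the downloader's side
md5sum -c training-parallel-nc-v8.tgz.md5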

Regards,
Ergun

On Mon, Apr 23, 2018 at 1:43 PM, Hieu Hoang <hieuho...@gmail.com> wrote:

> sorry to hear that. I was in Turkey a few weeks ago and saw they blocked
> innocuous sites such as wikipedia, so I am not surprised mail-archive is
> also blocked for whatever reason.
>
> I'll PM you the link to mit.edu's internal archive. I don't share this
> publicly as it has people's raw emails; I don't want it to be harvested by
> spammers.
>
> On 23/04/18 11:28, Ergun Bicici wrote:
>
>
> Dear Moses mailing list,
>
> I am not able to reach https://www.mail-archive.com/moses-support@mit.edu/
> from Turkey for the last 3 years, and why this is so is still a mystery to
> me. I am able to post/send messages but can only check them through
> internet tunneling.
>
> Do you know why / have some explanation? Thank you.
>
> Here are steps:
> - I verified that is is blocked
> - I asked about this to BTK (https://www.btk.gov.tr/, but they did not
> solve a more specific issue I asked for)
> - I reported to https://turkeyblocks.org
>
> https://www.comparitech.com/privacy-security-tools/blockedinturkey/
>
> DOMAIN TO CHECK
>
> Istanbul -  www.mail-archive.com/moses-support@mit.edu/ *Not Working* in
> Turkey.
>
> Ankara -  www.mail-archive.com/moses-support@mit.edu/ *Not Working* in
> Turkey.
> --
> This URL appears to be blocked in Turkey.
>
>
>
> Best Regards,
> Ergun
>
> Ergun Biçici
> http://bicici.github.com/ <http://ergunbicici.blogspot.com/>
>
> --
> You received this message because you are subscribed to the Google Groups
> "Workshop on Statistical Machine Translation" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to wmt-tasks+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>
>
> --
> Hieu Hoang, http://moses-smt.org/
>
>
> ___
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>


-- 

Regards,
Ergun
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] EMS for the neural age?

2017-11-26 Thread Ergun Bicici
Dear Marcin,

I have uploaded my EMS files for WMT'16:
https://github.com/bicici/ParFDAWMT16

Text processing steps can be language-dependent, might require domain
knowledge and expertise, and can distinguish you from others and elevate
your results. I suggest reading the relevant sections from the papers of WMT
participants to get a feel for the computational requirements, which are not
necessarily made obvious, such as the use of unsupervised learning of
classes in language models and alignment. Text processing helps the datasets
take the form you would like them to have, even if you consider it evil. If
removing punctuation from some dataset helps, then this may be considered
ingenious as well.

Barry Haddow has prepared preprocessed WMT'17 datasets:
http://data.statmt.org/wmt17/translation-task/preprocessed/
http://www.statmt.org/wmt17/translation-task.html


Regards,
Ergun


On Sun, Nov 26, 2017 at 12:41 PM, Marcin Junczys-Dowmunt  wrote:

> Hi list,
>
> I am preparing a couple of usage examples for my NMT toolkit and got hung
> up on all the preprocessing and other evil stuff. I am wondering: is there
> now anything decent around for doing preprocessing, running experiments
> and evaluation? Or is the best thing still GNU make (isn't that
> embarrassing)?
>
> Best,
>
> Marcin
>
> ___
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>



-- 

Regards,
Ergun
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] ParFDA WMT'17 and WMT'16 Datasets

2017-09-25 Thread Ergun Bicici
Dear moses-list,

We make the English, Czech, Finnish, German, Latvian, Romanian, Russian,
Turkish, and Chinese datasets used for the WMT'16
(http://www.statmt.org/wmt16/) and WMT'17 (http://www.statmt.org/wmt17/)
translation tasks when building ParFDA Moses SMT models available on the
web, downloadable from:

[WMT'17]
https://drive.google.com/drive/folders/0B2k8ISN7gmi1SnA1d1gxcTQ5TTg?usp=sharing
[WMT'16]
https://drive.google.com/drive/folders/0B2k8ISN7gmi1NHNTSGFrMGhfaVU?usp=sharing

WMT'16 results are in the following paper:

Ergun Bicici. *ParFDA for Instance Selection for Statistical Machine
Translation*. In *Proc. of the First Conference on Statistical Machine
Translation (WMT16)*, Berlin, Germany, August 2016. Association for
Computational Linguistics.

The datasets are selected by ParFDA for the WMT'16 and WMT'17 translation
tasks from among the pool of sentences made available by the WMT
organization, and the ParFDA Moses SMT results can serve as a benchmark for
SMT research. The language model corpora used contain ~15M sentences, and
the language models were built using kenlm (https://kheafield.com/code/kenlm/).

LICENSE Note: BSD license. We also inherit the characteristics of the WMT
conference organization's license, which allows use for research purposes,
in making the datasets available.

ParFDA WMT SMT datasets:

   - ParFDA WMT'17 Datasets (https://github.com/bicici/ParFDAWMT17)
   - ParFDA WMT'16 Datasets (https://github.com/bicici/ParFDAWMT16)
   - ParFDA WMT'15 Datasets (https://github.com/bicici/ParFDAWMT15)
   - ParFDA WMT'14 Datasets (https://github.com/bicici/ParFDAWMT14)


Best Regards,
Ergun

TUBITAK BILGEM B3LAB Cloud Computing Laboratory
bicici.github.com
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] TRAINING_extract-phrases ERROR: malformed XML

2017-05-13 Thread Ergun Bicici
Ok, thank you. Turns out that a dataset I used was not tokenized.

You already mentioned that these characters are escaped in a previous
thread:

https://www.mail-archive.com/moses-support@mit.edu/msg10412.html
> Also, it does not do tokenization, so if you want your data tokenized,
> you should use the tokenizer instead, which also escapes special
> characters.
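
For anyone hitting the same error, the standalone escaper can be run as a
plain filter (a sketch; the path is relative to the Moses root, and input /
output file names are made up):

  scripts/tokenizer/escape-special-chars.perl < corpus.tok.de > corpus.esc.de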


Regards,
Ergun


On Fri, May 12, 2017 at 9:06 PM, Philipp Koehn <p...@jhu.edu> wrote:

> Hi,
>
> you should replace the "<" and ">" with "&lt;" and "&gt;"
>
> scripts/tokenizer/escape-special-chars.perl does that for you.
>
> -phi
>
> On Thu, May 11, 2017 at 3:12 PM, Ergun Bicici <bic...@gmail.com> wrote:
>
>>
>> clean-corpus-n.perl can clean XML tags before tokenization:
>>
>> sub word_count {
>>  my ($line) = @_;
>>  if ($ignore_xml) {
>>$line =~ s/<\S[^>]*\S>/ /g;
>>$line =~ s/\s+/ /g;
>>$line =~ s/^ //g;
>>    $line =~ s/ $//g;
>>  }
>>  my @w = split(/ /,$line);
>>  return scalar @w;
>> }
>>
>> Ergun
>>
>> On Thu, May 11, 2017 at 10:33 AM, Ergun Bicici <bic...@gmail.com> wrote:
>>
>>>
>>> Similarly:
>>> ERROR: some opened tags were never closed: it shares some features in
>>> common with the SGML < ! [ CDATA [ ] ] > construct , in that it declares a
>>> block of text which is not for parsing .
>>>
>>>
>>> On Thu, May 11, 2017 at 10:32 AM, Ergun Bicici <bic...@gmail.com> wrote:
>>>
>>>>
>>>> TRAINING_extract-phrases is giving
>>>> ERROR: malformed XML: Wirtschaftsjahr Betriebsgrösse < 50.000 kg
>>>> 120.000 kg
>>>> ERROR: malformed XML: < ! -- / * Font Definitions *
>>>>
>>>> etc.
>>>>
>>>> this appears to be due to the tokenization of html tags.
>>>>
>>>> Is there an option of Moses to handle these?
>>>>
>>>> --
>>>>
>>>> Regards,
>>>> Ergun
>>>>
>>>> Ergun Biçici
>>>> http://bicici.github.com/ <http://ergunbicici.blogspot.com/>
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> Regards,
>>> Ergun
>>>
>>> Ergun Biçici
>>> http://bicici.github.com/ <http://ergunbicici.blogspot.com/>
>>>
>>
>>
>>
>> --
>>
>> Regards,
>> Ergun
>>
>> Ergun Biçici
>> http://bicici.github.com/ <http://ergunbicici.blogspot.com/>
>>
>> ___
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>
>


-- 

Regards,
Ergun

Ergun Biçici
http://bicici.github.com/ <http://ergunbicici.blogspot.com/>
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] TRAINING_extract-phrases ERROR: malformed XML

2017-05-11 Thread Ergun Bicici
clean-corpus-n.perl can clean XML tags before tokenization:

sub word_count {
 my ($line) = @_;
 if ($ignore_xml) {
   # strip anything that looks like an XML/HTML tag before counting
   $line =~ s/<\S[^>]*\S>/ /g;
   # collapse runs of whitespace and trim both ends
   $line =~ s/\s+/ /g;
   $line =~ s/^ //g;
   $line =~ s/ $//g;
 }
 # count the remaining space-separated tokens
 my @w = split(/ /,$line);
 return scalar @w;
}
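
A typical invocation would then be along these lines (a sketch; I am
assuming the switch is named -ignore-xml, matching the $ignore_xml variable,
and the min/max sentence lengths 1 and 80 are illustrative):

  scripts/training/clean-corpus-n.perl -ignore-xml corpus de en corpus.clean 1 80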

Ergun

On Thu, May 11, 2017 at 10:33 AM, Ergun Bicici <bic...@gmail.com> wrote:

>
> Similarly:
> ERROR: some opened tags were never closed: it shares some features in
> common with the SGML < ! [ CDATA [ ] ] > construct , in that it declares a
> block of text which is not for parsing .
>
>
> On Thu, May 11, 2017 at 10:32 AM, Ergun Bicici <bic...@gmail.com> wrote:
>
>>
>> TRAINING_extract-phrases is giving
>> ERROR: malformed XML: Wirtschaftsjahr Betriebsgrösse < 50.000 kg 120.000
>> kg
>> ERROR: malformed XML: < ! -- / * Font Definitions *
>>
>> etc.
>>
>> this appears to be due to the tokenization of html tags.
>>
>> Is there an option of Moses to handle these?
>>
>> --
>>
>> Regards,
>> Ergun
>>
>> Ergun Biçici
>> http://bicici.github.com/ <http://ergunbicici.blogspot.com/>
>>
>
>
>
> --
>
> Regards,
> Ergun
>
> Ergun Biçici
> http://bicici.github.com/ <http://ergunbicici.blogspot.com/>
>



-- 

Regards,
Ergun

Ergun Biçici
http://bicici.github.com/ <http://ergunbicici.blogspot.com/>
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] TRAINING_extract-phrases ERROR: malformed XML

2017-05-11 Thread Ergun Bicici
Similarly:
ERROR: some opened tags were never closed: it shares some features in
common with the SGML < ! [ CDATA [ ] ] > construct , in that it declares a
block of text which is not for parsing .


On Thu, May 11, 2017 at 10:32 AM, Ergun Bicici <bic...@gmail.com> wrote:

>
> TRAINING_extract-phrases is giving
> ERROR: malformed XML: Wirtschaftsjahr Betriebsgrösse < 50.000 kg 120.000
> kg
> ERROR: malformed XML: < ! -- / * Font Definitions *
>
> etc.
>
> this appears to be due to the tokenization of html tags.
>
> Is there an option of Moses to handle these?
>
> --
>
> Regards,
> Ergun
>
> Ergun Biçici
> http://bicici.github.com/ <http://ergunbicici.blogspot.com/>
>



-- 

Regards,
Ergun

Ergun Biçici
http://bicici.github.com/ <http://ergunbicici.blogspot.com/>
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] TRAINING_extract-phrases ERROR: malformed XML

2017-05-11 Thread Ergun Bicici
TRAINING_extract-phrases is giving
ERROR: malformed XML: Wirtschaftsjahr Betriebsgrösse < 50.000 kg 120.000 kg
ERROR: malformed XML: < ! -- / * Font Definitions *

etc.

this appears to be due to the tokenization of html tags.

Is there an option of Moses to handle these?

-- 

Regards,
Ergun

Ergun Biçici
http://bicici.github.com/ 
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] MERT: process n-best list before mert

2017-04-22 Thread Ergun Bicici
Dear Jorg,

I encountered a similar issue for my character SMT experiments while
concurrently looking at your paper (
http://aclanthology.info/papers/combining-word-level-and-character-level-models-for-machine-translation-between-closely-related-languages)
and added a line to call a script before the gzip of the n-best file within
the MERT main loop.
At the end of section 2, you mention that: [inline image not preserved in
the archive]

Preslav Nakov and Jörg Tiedemann. Combining Word-Level and Character-Level
Models for Machine Translation Between Closely-Related Languages. In
Proceedings of the 50th Annual Meeting of the Association for Computational
Linguistics (Volume 2: Short Papers), pages 301–305, Jeju Island, Korea,
July 2012. http://aclweb.org/anthology/P12-2059

Regards,
Ergun

On Sat, Apr 22, 2017 at 8:41 PM, Hieu Hoang  wrote:

> sounds good. You're always welcome to check it in yourself as long as you
> look after it. Send me your github username or create a git pull request
>
> * Looking for MT/NLP opportunities *
> Hieu Hoang
> http://moses-smt.org/
>
>
> On 22 April 2017 at 15:44, Jorg Tiedemann  wrote:
>
>>
>> Great - thanks. I modified a little to make it possible to pass a command
>> to the new option.
>> I hope I didn’t break anything. Maybe this modification could become part
>> of the official moses package?
>>
>> Jörg
>>
>> Jörg Tiedemann
>> tiede...@gmail.com
>>
>>
>>
>>
>>
>>
>>
>> On 22 Apr 2017, at 15:50, Anoop (അനൂപ്) 
>> wrote:
>>
>> Hi Jorg,
>>
>> I had made changes to mert-moses.perl to achieve exactly what you are
>> looking for. Please find the script attached.
>>
>> To enable the character-level to word-level transformation, you have to
>> pass the option '--transform-decoded-file' to mert-moses.pl.
>> The script assumes that a caret token '^' has been added between words
>> while preprocessing the corpora, so all subwords between two carets are
>> merged to create a single word.  The changes are on lines 826--834.
>>
>> Regards,
>> Anoop.
>>
>>
>>
>> On Sat, Apr 22, 2017 at 5:10 PM, Jorg Tiedemann 
>> wrote:
>>
>>> Hi,
>>>
>>> Is there an easy way to integrate a small script to process n-best lists
>>> in mert-moses.perl before running mert at each iteration? An example would
>>> be to merge character-level translations to run mert on word-level
>>> segmentations. It’s probably rather straightforward to add an option to
>>> specify a script for filtering but it may already exist and I just don’t
>>> see it?
>>>
>>> Thanks!
>>> Jörg
>>>
>>> 
>>> **
>>> Jörg Tiedemann
>>> Department of Modern Languages, http://blogs.helsinki.fi/tiedeman/
>>> University of Helsinki
>>> http://blogs.helsinki.fi/language-technology/
>>>
>>>
>>>
>>> ___
>>> Moses-support mailing list
>>> Moses-support@mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>>
>>
>>
>> --
>> I claim to be a simple individual liable to err like any other fellow
>> mortal. I own, however, that I have humility enough to confess my errors
>> and to retrace my steps.
>>
>> http://flightsofthought.blogspot.com
>> ___
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>
>>
>> ___
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>
>


-- 

Regards,
Ergun
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] OSM and lmplz are both using -T as a parameter directive which causes error

2016-01-24 Thread Ergun Bicici
Dear Moses Support,

An OSM training script invocation like the following:

mosesdecoder/scripts/OSM/OSM-Train.perl --corpus-f SMT_de-en/training/corpus.1.de --corpus-e SMT_de-en/training/corpus.1.en --alignment SMT_de-en/model/aligned.1.grow-diag-final-and --order 4 --out-dir SMT_de-en/model/OSM.1 --moses-src-dir mosesdecoder --input-extension de --output-extension en -lmplz 'mosesdecoder/bin/lmplz -S 40% -T SMT_de-en/model/tmp'

calls lmplz like the following:

Executing: mosesdecoder/bin/lmplz -S 40% -T SMT_de-en/model/tmp -T SMT_de-en/model/OSM.1 --order 4 --text SMT_de-en/model/OSM.1//opCorpus --arpa SMT_de-en/model/OSM.1//operationLM --prune 0 0 1

causing the following error:
option '--temp_prefix' cannot be specified more than once

This works ok (without the additional -T directive to lmplz):

mosesdecoder/scripts/OSM/OSM-Train.perl --corpus-f SMT_de-en/training/corpus.1.de --corpus-e SMT_de-en/training/corpus.1.en --alignment SMT_de-en/model/aligned.1.grow-diag-final-and --order 4 --out-dir SMT_de-en/model/OSM.1 --moses-src-dir mosesdecoder --input-extension de --output-extension en -lmplz 'mosesdecoder/bin/lmplz -S 40% '

Regards,
Ergun
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] OSM and lmplz are both using -T as a parameter directive which causes error

2016-01-24 Thread Ergun Bicici
This works ok (without the additional -T directive to lmplz):

mosesdecoder/scripts/OSM/OSM-Train.perl --corpus-f SMT_de-en/training/corpus.1.de --corpus-e SMT_de-en/training/corpus.1.en --alignment SMT_de-en/model/aligned.1.grow-diag-final-and --order 4 --out-dir SMT_de-en/model/OSM.1 --moses-src-dir mosesdecoder --input-extension de --output-extension en -lmplz 'mosesdecoder/bin/lmplz -S 40% '
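
A defensive fix inside OSM-Train.perl could strip a user-supplied temp
prefix before the script appends its own -T (a hypothetical one-line patch;
the variable name below is illustrative, not the actual one in the script):

  # drop any user-supplied "-T <dir>" from the lmplz command string,
  # since OSM-Train.perl appends its own -T <out-dir> later
  $lmplz_cmd =~ s/\s+-T\s+\S+//g;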

Ergun

On Sun, Jan 24, 2016 at 3:50 PM, Ergun Bicici <ergunbic...@yahoo.com> wrote:

>
> Dear Moses Support,
>
> An OSM training script invocation like the following:
>
> mosesdecoder/scripts/OSM/OSM-Train.perl --corpus-f SMT_de-en/training/corpus.1.de --corpus-e SMT_de-en/training/corpus.1.en --alignment SMT_de-en/model/aligned.1.grow-diag-final-and --order 4 --out-dir SMT_de-en/model/OSM.1 --moses-src-dir mosesdecoder --input-extension de --output-extension en -lmplz 'mosesdecoder/bin/lmplz -S 40% -T SMT_de-en/model/tmp'
>
> calls lmplz like the following:
>
> Executing: mosesdecoder/bin/lmplz -S 40% -T SMT_de-en/model/tmp -T SMT_de-en/model/OSM.1 --order 4 --text SMT_de-en/model/OSM.1//opCorpus --arpa SMT_de-en/model/OSM.1//operationLM --prune 0 0 1
>
> causing the following error:
> option '--temp_prefix' cannot be specified more than once
>
> This works ok (without the additional -T directive to lmplz):
>
> mosesdecoder/scripts/OSM/OSM-Train.perl --corpus-f SMT_de-en/training/corpus.1.de --corpus-e SMT_de-en/training/corpus.1.en --alignment SMT_de-en/model/aligned.1.grow-diag-final-and --order 4 --out-dir SMT_de-en/model/OSM.1 --moses-src-dir mosesdecoder --input-extension de --output-extension en -lmplz 'mosesdecoder/bin/lmplz -S 40% '
>
> Regards,
> Ergun
>
>
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Skip OOV when computing Language Model score

2016-01-15 Thread Ergun Bicici
Dear Kenneth,

In the Moses manual, the -drop-unknown switch is mentioned (section 4.7.2,
Handling Unknown Words):

Unknown words are copied verbatim to the output. They are also scored by
the language model, and may be placed out of order. Alternatively, you may
want to drop unknown words. To do so add the switch -drop-unknown.

Alternatively, you can write a script that replaces all OOV tokens with
some OOV-token identifier such as <unk> before sending the input for
translation.
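
A minimal sketch of such a script (assuming a plain-text vocabulary file
with one token per line; the <unk> identifier must match what the language
model expects):

  #!/usr/bin/perl -w
  # usage: replace-oov.perl vocab.txt < input.txt > output.txt
  use strict;
  my %vocab;
  open(my $v, '<', $ARGV[0]) or die "cannot open vocab file: $!";
  while (<$v>) { chomp; $vocab{$_} = 1; }
  close($v);
  while (my $line = <STDIN>) {
      chomp $line;
      # replace every token missing from the vocabulary with <unk>
      my @out = map { exists $vocab{$_} ? $_ : '<unk>' } split /\s+/, $line;
      print join(' ', @out), "\n";
  }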


*Best Regards,*
Ergun

Ergun Biçici
DFKI Projektbüro Berlin


On Fri, Jan 15, 2016 at 12:22 AM, Kenneth Heafield 
wrote:

> Hi,
>
> I think oov-feature=1 just activates the OOV count feature while
> leaving the LM score unchanged.  So it would still include p(<unk> | in).
>
> One might try setting the OOV feature weight to -weight_LM *
> weird_moses_internal_constant * log p(<unk>) in an attempt to cancel out
> the log p(<unk>) terms.  However that won't work either because:
>
> 1) It will still charge backoff penalties, b(the)b(house) in the example.
>
> 2) The context will be lost each time so it's p(house) not p(house | the).
>
> If the <unk>s follow a pattern, such as appearing every other word, one
> could insert them into the ARPA file though that would waste memory.
>
> I don't think there's any way to accomplish exactly what OP asked for
> without coding (though it wouldn't be that hard once one understands how
> the LM infrastructure works).
>
> Kenneth
>
> On 01/14/2016 11:07 PM, Philipp Koehn wrote:
> > Hi,
> >
> > You may get the behavior you want by adding
> >   "oov-feature=1"
> > to your LM specification line in moses.ini
> > and also add a second weight with value "0" to the corresponding LM
> > weight setting.
> >
> > This will then only use the scores
> > p(the|<s>)
> > p(house|<s>,the,<unk>) ---> backoff to p(house)
> > p(in|<s>,the,<unk>,house,<unk>) ---> backoff to p(in)
> >
> > -phi
> >
> > On Thu, Jan 14, 2016 at 8:25 AM, LUONG NGOC Quang
> > > wrote:
> >
> > Dear All,
> >
> > I am currently using a SRILM Language Model (LM) in my Moses
> > decoder. Does anyone know how I can ask the decoder, at decoding
> > time, to skip all out-of-vocabulary words when computing the LM score
> > (instead of doing back-off)?
> >
> > For instance, with the n-gram: "the <unk> house <unk> in", I would
> > like the decoder to assign it the probability of the phrase: "the
> > house in" (existing in the LM).
> >
> > Do I need more options/declarations in moses.ini file?
> >
> > Any help is very much appreciated,
> >
> > Best,
> > Quang
> >
> >
> >
> > ___
> > Moses-support mailing list
> > Moses-support@mit.edu 
> > http://mailman.mit.edu/mailman/listinfo/moses-support
> >
> >
> >
> >
> > ___
> > Moses-support mailing list
> > Moses-support@mit.edu
> > http://mailman.mit.edu/mailman/listinfo/moses-support
> >
> ___
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Skip OOV when computing Language Model score

2016-01-15 Thread Ergun Bicici
No comment.



*Best Regards,*
Ergun

Ergun Biçici
DFKI Projektbüro Berlin


On Fri, Jan 15, 2016 at 4:20 PM, Jie Jiang <mail.jie.ji...@gmail.com> wrote:

> Hi Ergun:
>
> I think the -skipoovs option would just drop all the n-gram scores that
> have an OOV in them, rather than using a skip-ngram LM model.
>
> An easy way to test it is just to run it with that option to calculate the
> log prob on a sentence with an OOV, and it should result in a rather high
> score.
>
> Please correct me if I'm wrong...
>
> 2016-01-15 14:07 GMT+00:00 Ergun Bicici <ergun.bic...@dfki.de>:
>
>>
>> Dear Jie,
>>
>> There may be some option from SRILM:
>> - http://www.speech.sri.com/pipermail/srilm-user/2013q2/001509.html
>> - http://www.speech.sri.com/projects/srilm/manpages/ngram.1.html:
>> *-skipoovs*
>> Instruct the LM to skip over contexts that contain out-of-vocabulary
>> words, instead of using a backoff strategy in these cases.
>>
>> If it is not there, maybe it is for a reason...
>>
>> Bing appears to be fast at indexing this thread:
>> http://comments.gmane.org/gmane.comp.nlp.moses.user/14570
>>
>>
>> *Best Regards,*
>> Ergun
>>
>> Ergun Biçici
>> DFKI Projektbüro Berlin
>>
>>
>> On Fri, Jan 15, 2016 at 2:37 PM, Jie Jiang <mail.jie.ji...@gmail.com>
>> wrote:
>>
>>> Hi Ergun:
>>>
>>> The original request in Quang's post was:
>>>
>>> *For instance, with the n-gram: "the <unk> house <unk> in", I would like
>>> the decoder to assign it the probability of the phrase: "the house in"
>>> (existing in the LM).*
>>>
>>> so each time there is a <unk> when calculating the LM score, you need to
>>> look another word further.
>>>
>>> I believe that it cannot be achieved on current LM tools without
>>> modifying the source code, which has already been clarified by Kenneth.
>>>
>>>
>>> 2016-01-15 13:20 GMT+00:00 Ergun Bicici <ergun.bic...@dfki.de>:
>>>
>>>>
>>>> Dear Kenneth,
>>>>
>>>> In the Moses manual, the -drop-unknown switch is mentioned (section
>>>> 4.7.2, Handling Unknown Words):
>>>>
>>>> Unknown words are copied verbatim to the output. They are also scored
>>>> by the language model, and may be placed out of order. Alternatively,
>>>> you may want to drop unknown words. To do so add the switch
>>>> -drop-unknown.
>>>>
>>>> Alternatively, you can write a script that replaces all OOV tokens
>>>> with some OOV-token identifier such as <unk> before sending the input
>>>> for translation.
>>>>
>>>>
>>>> *Best Regards,*
>>>> Ergun
>>>>
>>>> Ergun Biçici
>>>> DFKI Projektbüro Berlin
>>>>
>>>>
>>>> On Fri, Jan 15, 2016 at 12:22 AM, Kenneth Heafield <mo...@kheafield.com
>>>> > wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I think oov-feature=1 just activates the OOV count feature while
>>>>> leaving the LM score unchanged.  So it would still include
>>>>> p(<unk> | in).
>>>>>
>>>>> One might try setting the OOV feature weight to -weight_LM *
>>>>> weird_moses_internal_constant * log p(<unk>) in an attempt to cancel
>>>>> out the log p(<unk>) terms.  However that won't work either because:
>>>>>
>>>>> 1) It will still charge backoff penalties, b(the)b(house) in the
>>>>> example.
>>>>>
>>>>> 2) The context will be lost each time so it's p(house) not p(house |
>>>>> the).
>>>>>
>>>>> If the <unk>s follow a pattern, such as appearing every other word, one
>>>>> could insert them into the ARPA file though that would waste memory.
>>>>>
>>>>> I don't think there's any way to accomplish exactly what OP asked for
>>>>> without coding (though it wouldn't be that hard once one understands
>>>>> how
>>>>> the LM infrastructure works).
>>>>>
>>>>> Kenneth
>>>>>
>>>>> On 01/14/2016 11:07 PM, Philipp Koehn wrote:
>>>>> > Hi,
>>>>> >
>>>>> > You may get the behavior you want by adding
>>>>> >   "oov-feature=1"
>>>>> > to your LM specification line in moses.ini
>>>>> > a

Re: [Moses-support] Which symal?

2016-01-12 Thread Ergun Bicici
So, you can sublicense Moses (LGPL) but not mgiza (GPL):
http://choosealicense.com/licenses/

Which I guess may mean something like this: you can re-license a modified
version of Moses without symal, but not with symal. In other words, to
license a modified version of Moses under your own terms, you need to take
out the symal from mgiza, or any GPL code or code with a more restrictive
license, and keep the symal that comes with Moses.


*Best Regards,*
Ergun

Ergun Biçici
DFKI Projektbüro Berlin


On Tue, Jan 12, 2016 at 3:09 AM, Tom Hoar <
tah...@precisiontranslationtools.com> wrote:

> Don't have experience with EMS, but our Slate packages (Linux & Windows)
> include both MGIZA and Moses binaries the same `$bin` folder and we set
> the train-model.perl --external-bin-dir=$bin value and all other binary
> path references to that one folder. The symal binary is the only
> conflict file name. So, we use the one from Moses. From what we could
> tell, they are essentially the same source but the Moses copy seems to
> have been maintained/updated more recently. Also be careful because the
> two versions are published under different open source licenses.
>
> I don't know if this approach will work with EMS, but it is a simple
> solution.
>
> Tom
>
>
> On 1/12/2016 3:56 AM, moses-support-requ...@mit.edu wrote:
> > Date: Mon, 11 Jan 2016 20:56:49 +
> > From: Hieu Hoang<hieuho...@gmail.com>
> > Subject: Re: [Moses-support] Which symal?
> > To: Ergun Bicici<ergun.bic...@dfki.de>
> > Cc: moses-support<moses-support@mit.edu>
> >
> > I'm not sure if the bjam argument
> > --with-giza
> > is actually used during compilation. Where did you see this mentioned? The
> > bad thing about bjam is it doesn't tell you if an argument is invalid,
> > it simply ignores it.
> >
> > It would be nice to have 1 directory for all your MT tools. If you wanna
> > make it happen, be my guest.
> >
> > I suppose we should be mindful of people who use mgiza but don't use
> > Moses, they'll still need symal so we can't just delete it from mgiza.
>
> ___
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Which symal?

2016-01-11 Thread Ergun Bicici
Warnings from bjam would be nice. The mention is maybe from a previous set
of installation instructions. If it is not used by bjam, it makes sense to
remove the remaining references:
https://github.com/moses-smt/mosesdecoder/blob/master/cruise-control/test_all_new_commits.sh


*Best Regards,*
Ergun

Ergun Biçici
DFKI Projektbüro Berlin


On Mon, Jan 11, 2016 at 9:56 PM, Hieu Hoang <hieuho...@gmail.com> wrote:

> I'm not sure if the bjam argument
>   --with-giza
> is actually used during compilation. Where did you see this mentioned? The
> bad thing about bjam is it doesn't tell you if an argument is invalid, it
> simply ignores it.
>
> It would be nice to have 1 directory for all your MT tools. If you wanna
> make it happen, be my guest.
>
> I suppose we should be mindful of people who use mgiza but don't use
> Moses, they'll still need symal so we can't just delete it from mgiza.
>
On 11/01/16 18:04, Ergun Bicici wrote:


Hi Hieu,

Since the path to mgiza is provided during compilation, it could be nice if
Moses knew where to look for mgiza afterwards without additional path
directives and environment variable settings (e.g. export
BINDIR=~/workspace/bin/training-tools in
http://www.statmt.org/moses/?n=Moses.ExternalTools#ntoc3; instead, I am
using the following for external-bin-dir in EMS: external-bin-dir =
$moses-install-dir/bin).
>
> This also did not appear to work for tools in opt/ compiled with make -f
> contrib/Makefiles/install-dependencies.gmake. I still added opt/lib and
> opt/bin to LD_LIBRARY_PATH and PATH respectively.
>
> Having all binaries and libraries in the same place may decrease confusion
> as well and can prevent further confusion if a file with the same name
> appears in both paths. A compilation procedure that puts these directives
> together (dependency compilation input to bjam) while reducing path
> directives may help simplify installation.
>
> *Best Regards,*
> Ergun
>
> Ergun Biçici
> DFKI Projektbüro Berlin
>
>
> On Mon, Jan 11, 2016 at 12:02 PM, Hieu Hoang <hieuho...@gmail.com> wrote:
>
>> you shouldn't copy anything into the moses/bin directory.
>>
>> The mgiza files should have its own directory. When you run Moses'
>> train-model.perl you can refer to that directory using
>>.../train-model.perl external-bin-dir=[directory with mgiza]
>>
>>
>> On 10/01/16 17:20, Ergun Bicici wrote:
>>
>>
>> Hi Hieu,
>>
>> First a compile:
>> ./bjam --max-kenlm-order=10 --git --prefix=/path/moses/mosesdecoder/
>> --with-giza=/path/mgiza/mgizapp/inst/ 
>> --with-xmlrpc-c=/path/moses/mosesdecoder/opt/
>> --with-boost=/path/moses/mosesdecoder/opt/ 
>> --with-cmph=/path/moses/mosesdecoder/opt/
>> -j 20
>>
>> then, a copy:
>> cp mgiza/mgizapp/inst/bin/* moses/mosesdecoder/instdir/bin/
>> cp mgiza/mgizapp/inst/lib/* moses/mosesdecoder/instdir/lib/
>> cp mgiza/mgizapp/inst/scripts/* moses/mosesdecoder/instdir/bin/
>>
>> With which another copy appears to be needed to use Moses' symal:
>> cp
>> moses/mosesdecoder/symal/bin/gcc-4.8/release/link-static/threading-multi/symal
>> moses/mosesdecoder/bin/symal
>>
>> Therefore, even if the path to mgiza is provided
>> (--with-giza=/path/mgiza/mgizapp/inst/), some copying and updating appear
>> to be needed (see also
>> http://www.statmt.org/moses/?n=Moses.ExternalTools#ntoc3).
>>
>>
>> *Best Regards,*
>> Ergun
>>
>> Ergun Biçici
>> DFKI Projektbüro Berlin
>>
>>
>> On Sun, Jan 10, 2016 at 3:34 PM, Hieu Hoang < <hieuho...@gmail.com>
>> hieuho...@gmail.com> wrote:
>>
>>> What the exact commands u used to compile moses and mgiza? I'm pretty
>>> sure they don't overwrite each other unless you ask them too. They're
>>> independent projects
>>> On 10 Jan 2016 14:07, "Ergun Bicici" < <ergun.bic...@dfki.de>
>>> ergun.bic...@dfki.de> wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> I compiled another Moses instance and symal appears to be copied from
>>>> mgiza still to mosesdecoder/bin/. ​
>>>>
>>>>
>>>> *Best Regards,*
>>>> Ergun
>>>>
>>>> Ergun Biçici
>>>> DFKI Projektbüro Berlin
>>>>
>>>>
>>>> On Sun, May 17, 2015 at 2:47 PM, Ergun Bicici <
>>>> <ergun.bic...@computing.dcu.ie>ergun.bic...@comput

Re: [Moses-support] Which symal?

2016-01-11 Thread Ergun Bicici
Hi Hieu,

Since the path to mgiza is provided during compilation, it could be nice if
Moses knew where to look for mgiza afterwards without additional path
directives and environment variable settings (e.g. export
BINDIR=~/workspace/bin/training-tools in
http://www.statmt.org/moses/?n=Moses.ExternalTools#ntoc3; instead, I am
using the following for external-bin-dir in EMS: external-bin-dir =
$moses-install-dir/bin).

This also did not appear to work for tools in opt/ compiled with make -f
contrib/Makefiles/install-dependencies.gmake. I still added opt/lib and
opt/bin to LD_LIBRARY_PATH and PATH respectively.
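
For reference, the workaround amounts to something like this (the prefix is
assumed to match the opt/ directory under the Moses root):

  export PATH=/path/moses/mosesdecoder/opt/bin:$PATH
  export LD_LIBRARY_PATH=/path/moses/mosesdecoder/opt/lib:$LD_LIBRARY_PATH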

Having all binaries and libraries in the same place may decrease confusion
as well and can prevent further confusion if a file with the same name
appears in both paths. A compilation procedure that puts these directives
together (dependency compilation input to bjam) while reducing path
directives may help simplify installation.

*Best Regards,*
Ergun

Ergun Biçici
DFKI Projektbüro Berlin


On Mon, Jan 11, 2016 at 12:02 PM, Hieu Hoang <hieuho...@gmail.com> wrote:

> you shouldn't copy anything into the moses/bin directory.
>
> The mgiza files should have its own directory. When you run Moses'
> train-model.perl you can refer to that directory using
>.../train-model.perl external-bin-dir=[directory with mgiza]
>
>
> On 10/01/16 17:20, Ergun Bicici wrote:
>
>
> Hi Hieu,
>
> First a compile:
> ./bjam --max-kenlm-order=10 --git --prefix=/path/moses/mosesdecoder/
> --with-giza=/path/mgiza/mgizapp/inst/ 
> --with-xmlrpc-c=/path/moses/mosesdecoder/opt/
> --with-boost=/path/moses/mosesdecoder/opt/ 
> --with-cmph=/path/moses/mosesdecoder/opt/
> -j 20
>
> then, a copy:
> cp mgiza/mgizapp/inst/bin/* moses/mosesdecoder/instdir/bin/
> cp mgiza/mgizapp/inst/lib/* moses/mosesdecoder/instdir/lib/
> cp mgiza/mgizapp/inst/scripts/* moses/mosesdecoder/instdir/bin/
>
> With which another copy appears to be needed to use Moses' symal:
> cp
> moses/mosesdecoder/symal/bin/gcc-4.8/release/link-static/threading-multi/symal
> moses/mosesdecoder/bin/symal
>
> Therefore, even if the path to mgiza is provided
> (--with-giza=/path/mgiza/mgizapp/inst/), some copying and updating appear
> to be needed (see also
> http://www.statmt.org/moses/?n=Moses.ExternalTools#ntoc3).
>
>
> *Best Regards,*
> Ergun
>
> Ergun Biçici
> DFKI Projektbüro Berlin
>
>
> On Sun, Jan 10, 2016 at 3:34 PM, Hieu Hoang <hieuho...@gmail.com> wrote:
>
>> What are the exact commands you used to compile moses and mgiza? I'm
>> pretty sure they don't overwrite each other unless you ask them to.
>> They're independent projects
>> On 10 Jan 2016 14:07, "Ergun Bicici" <ergun.bic...@dfki.de> wrote:
>>
>>>
>>> Hi,
>>>
>>> I compiled another Moses instance and symal still appears to be copied
>>> from mgiza to mosesdecoder/bin/.
>>>
>>>
>>> *Best Regards,*
>>> Ergun
>>>
>>> Ergun Biçici
>>> DFKI Projektbüro Berlin
>>>
>>>
>>> On Sun, May 17, 2015 at 2:47 PM, Ergun Bicici <
>>> <ergun.bic...@computing.dcu.ie>ergun.bic...@computing.dcu.ie> wrote:
>>>
>>>>
>>>> Moses' symal:
>>>> http://article.gmane.org/gmane.comp.nlp.moses.user/11544
>>>>
>>>>
>>>> Best Regards,
>>>> Ergun
>>>>
>>>> Ergun Biçici, CNGL, School of Computing, DCU, <http://www.cngl.ie>
>>>> www.cngl.ie
>>>> <http://www.computing.dcu.ie/%7Eebicici/>
>>>> http://www.computing.dcu.ie/~ebicici/
>>>>
>>>>
>>>> On Sun, May 17, 2015 at 1:15 PM, Jeroen Vermeulen <
>>>> <j...@precisiontranslationtools.com>j...@precisiontranslationtools.com>
>>>> wrote:
>>>>
>>>>> The symal source code is duplicated between the moses-smt and mgiza
>>>>> repositories.  Does it make sense to have both?  They're quietly
>>>>> diverging, which is probably a bad thing.
>>>>>
>>>>> Here's the differences that I can see:
>>>>>  * I modernized the code in moses-smt.  Big diff, no functional change.
>>>>>  * The moses-smt version supports longer source and target strings.
>>>>>  * The mgiza version has what looks like some extra debug output.
>>>>>  * The moses-smt version avoids non-portable use of /dev/stdout.
>>>>>  * One builds through bjam, the other through cmake.
>>>>>
>>>>> Could we perhaps just delete the mgiza one, and tell people to use the
>>>>> one from moses instead?
>>>>>
>>>>>
>>>>> Jeroen
>>>>> ___
>>>>> Moses-support mailing list
>>>>> Moses-support@mit.edu
>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>
>>>>
>>>>
>>>
>>> ___
>>> Moses-support mailing list
>>> Moses-support@mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>>
>
> --
> Hieu Hoang, http://www.hoang.co.uk/hieu
>
>
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Which symal?

2016-01-10 Thread Ergun Bicici
Hi,

I compiled another Moses instance and symal still appears to be copied from
mgiza to mosesdecoder/bin/.


*Best Regards,*
Ergun

Ergun Biçici
DFKI Projektbüro Berlin


On Sun, May 17, 2015 at 2:47 PM, Ergun Bicici <ergun.bic...@computing.dcu.ie
> wrote:

>
> Moses' symal:
> http://article.gmane.org/gmane.comp.nlp.moses.user/11544
>
>
> Best Regards,
> Ergun
>
> Ergun Biçici, CNGL, School of Computing, DCU, www.cngl.ie
> http://www.computing.dcu.ie/~ebicici/
>
>
> On Sun, May 17, 2015 at 1:15 PM, Jeroen Vermeulen <
> j...@precisiontranslationtools.com> wrote:
>
>> The symal source code is duplicated between the moses-smt and mgiza
>> repositories.  Does it make sense to have both?  They're quietly
>> diverging, which is probably a bad thing.
>>
>> Here's the differences that I can see:
>>  * I modernized the code in moses-smt.  Big diff, no functional change.
>>  * The moses-smt version supports longer source and target strings.
>>  * The mgiza version has what looks like some extra debug output.
>>  * The moses-smt version avoids non-portable use of /dev/stdout.
>>  * One builds through bjam, the other through cmake.
>>
>> Could we perhaps just delete the mgiza one, and tell people to use the
>> one from moses instead?
>>
>>
>> Jeroen
>> ___
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>
>
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Which symal?

2016-01-10 Thread Ergun Bicici
Hi Hieu,

First a compile:
./bjam --max-kenlm-order=10 --git --prefix=/path/moses/mosesdecoder/
--with-giza=/path/mgiza/mgizapp/inst/
--with-xmlrpc-c=/path/moses/mosesdecoder/opt/
--with-boost=/path/moses/mosesdecoder/opt/
--with-cmph=/path/moses/mosesdecoder/opt/
-j 20

then, a copy:
cp mgiza/mgizapp/inst/bin/* moses/mosesdecoder/instdir/bin/
cp mgiza/mgizapp/inst/lib/* moses/mosesdecoder/instdir/lib/
cp mgiza/mgizapp/inst/scripts/* moses/mosesdecoder/instdir/bin/

With which another copy appears to be needed to use Moses' symal:
cp
moses/mosesdecoder/symal/bin/gcc-4.8/release/link-static/threading-multi/symal
moses/mosesdecoder/bin/symal

Therefore, even if the path to mgiza is provided
(--with-giza=/path/mgiza/mgizapp/inst/), some copying and updating appear
to be needed (see also
http://www.statmt.org/moses/?n=Moses.ExternalTools#ntoc3).


*Best Regards,*
Ergun

Ergun Biçici
DFKI Projektbüro Berlin


On Sun, Jan 10, 2016 at 3:34 PM, Hieu Hoang <hieuho...@gmail.com> wrote:

> What are the exact commands you used to compile moses and mgiza? I'm pretty
> sure they don't overwrite each other unless you ask them to. They're
> independent projects
> On 10 Jan 2016 14:07, "Ergun Bicici" <ergun.bic...@dfki.de> wrote:
>
>>
>> Hi,
>>
>> I compiled another Moses instance and symal still appears to be copied
>> from mgiza to mosesdecoder/bin/.
>>
>>
>> *Best Regards,*
>> Ergun
>>
>> Ergun Biçici
>> DFKI Projektbüro Berlin
>>
>>
>> On Sun, May 17, 2015 at 2:47 PM, Ergun Bicici <
>> ergun.bic...@computing.dcu.ie> wrote:
>>
>>>
>>> Moses' symal:
>>> http://article.gmane.org/gmane.comp.nlp.moses.user/11544
>>>
>>>
>>> Best Regards,
>>> Ergun
>>>
>>> Ergun Biçici, CNGL, School of Computing, DCU, www.cngl.ie
>>> http://www.computing.dcu.ie/~ebicici/
>>>
>>>
>>> On Sun, May 17, 2015 at 1:15 PM, Jeroen Vermeulen <
>>> j...@precisiontranslationtools.com> wrote:
>>>
>>>> The symal source code is duplicated between the moses-smt and mgiza
>>>> repositories.  Does it make sense to have both?  They're quietly
>>>> diverging, which is probably a bad thing.
>>>>
>>>> Here's the differences that I can see:
>>>>  * I modernized the code in moses-smt.  Big diff, no functional change.
>>>>  * The moses-smt version supports longer source and target strings.
>>>>  * The mgiza version has what looks like some extra debug output.
>>>>  * The moses-smt version avoids non-portable use of /dev/stdout.
>>>>  * One builds through bjam, the other through cmake.
>>>>
>>>> Could we perhaps just delete the mgiza one, and tell people to use the
>>>> one from moses instead?
>>>>
>>>>
>>>> Jeroen
>>>> ___
>>>> Moses-support mailing list
>>>> Moses-support@mit.edu
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>
>>>
>>>
>>
>> ___
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Moses Compilation Error

2015-10-27 Thread Ergun Bicici
Ok. Nice...it compiled.

Regards,
Ergun

Ergun Bicici
Koc University

On Tue, Oct 27, 2015 at 4:34 PM, Ulrich Germann <ulrich.germ...@gmail.com>
wrote:

> in mosesdecoder root directory:
>
> make -f contrib/Makefiles/install-dependencies.gmake xmlrpc
>
> then
>
> ./bjam --with-xmlrpc-c=./opt .
>
> - Again
>
> On Tue, Oct 27, 2015 at 2:48 PM, Hieu Hoang <hieuho...@gmail.com> wrote:
>
>> what version of xmlrpc-c are you using? The compilation says:
>> While building mosesserver ...
>> !!! You are linking the XMLRPC-C library; Must be v.1.32 (September 2012)
>> or higher !!!
>>
>> Hieu Hoang http://www.hoang.co.uk/hieu
>> <http://www.hoang.co.uk/hieu>
>>
>> On 27 October 2015 at 14:44, Ergun Bicici <ebic...@ku.edu.tr> wrote:
>>
>>>
>>> Hi,
>>>
>>> I get the following error when compiling Moses:
>>>
>>> ./bjam -q --max-kenlm-order=10 --git 
>>> --prefix=/project/qtleap/software/moses/latest
>>>
>>> XMLRPC-C: USING VERSION 1.16.33 FROM /usr
>>>
>>> gcc.compile.c++ contrib/server/bin/gcc-4.6/
>>> release/link-static/threading-multi/mosesserver.o
>>> contrib/server/mosesserver.cpp: In function ‘int main(int, char**)’:
>>> contrib/server/mosesserver.cpp:748:6: error: ‘class
>>> xmlrpc_c::serverAbyss::constrOpt’ has no member named ‘allowOrigin’
>>>
>>> Thank you.
>>>
>>> Regards,
>>> Ergun
>>>
>>>
>>>
>>> ___
>>> Moses-support mailing list
>>> Moses-support@mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>>
>>
>> ___
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>
>
>
> --
> Ulrich Germann
> Senior Researcher
> School of Informatics
> University of Edinburgh
>
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] Moses Compilation Error

2015-10-27 Thread Ergun Bicici
Hi,

I get the following error when compiling Moses:

./bjam -q --max-kenlm-order=10 --git
--prefix=/project/qtleap/software/moses/latest
XMLRPC-C: USING VERSION 1.16.33 FROM /usr

gcc.compile.c++
contrib/server/bin/gcc-4.6/release/link-static/threading-multi/mosesserver.o

contrib/server/mosesserver.cpp: In function ‘int main(int, char**)’:
contrib/server/mosesserver.cpp:748:6: error: ‘class
xmlrpc_c::serverAbyss::constrOpt’ has no member named ‘allowOrigin’

Thank you.

Regards,
Ergun
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] ParFDA WMT'15 Datasets

2015-08-09 Thread Ergun Bicici
ParFDA WMT'15 Datasets

Dear moses-list,

We make available, for research purposes, the English, Czech, Finnish,
French, German, and Russian datasets used when building ParFDA Moses SMT
systems. They are downloadable from:

https://drive.google.com/a/dcu.ie/folderview?id=0B6Jae6trZb1afjJ1T0ZOZlZFZUk0S2R3Z0U3eVdxN2tpQlVwTUgyX0tteVk4TnlhRHVJR2Musp=sharing

Results are presented in the following citation from WMT'15 (
http://www.statmt.org/wmt15/).

Citation:

Ergun Biçici, Qun Liu, and Andy Way. ParFDA for Fast Deployment of Accurate
Statistical Machine Translation Systems, Benchmarks, and Statistics. In
Proceedings of the EMNLP 2015 Tenth Workshop on Statistical Machine
Translation, Lisbon, Portugal, September 2015.

The datasets and the SMT results can serve as a benchmark for SMT research
where further linguistic processing can be performed. The datasets allow
fast deployment of accurate SMT systems and can be used for benchmarking
the performance of SMT systems.

Language models were built using SRILM (
http://www.speech.sri.com/projects/srilm/). Language model corpora used
contain 15M sentences, some of which are selected from LDC Gigaword corpora
by the Parallel FDA5 algorithm:

[the following 5 directions use the LDC English Gigaword 5th edition]

- Czech - English
- Finnish - English
- French - English
- German - English
- Russian - English

[the following direction uses the LDC French Gigaword 3rd edition]

- English - French

LICENSE: Dublin City University License for Open Data allowing use for
research and academic purposes.


Best Regards,
Ergun

Ergun Biçici, School of Computing, DCU, www.cngl.ie
http://www.computing.dcu.ie/~ebicici/
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] When to truecase

2015-05-21 Thread Ergun Bicici
recaser: builds a Moses model for word translation from lowercased to cased
text and also uses a language model. Input to the recaser is lowercased.

truecaser: builds a casing model based on the number of times each version
appears in the text (e.g. rivet (4/8), Rivet (3), RIVET (1)). Input to the
truecaser is left as it is, not lowercased.

Therefore, if the text is noisy, such as Tweets, the recaser may perform
better.
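
For reference, the standard invocations are along these lines (scripts under
scripts/recaser/; exact options may differ across Moses versions, and the
file names are made up):

  # truecaser: learn per-word casing statistics, then apply them
  scripts/recaser/train-truecaser.perl --model truecase-model.en --corpus cased.corpus.en
  scripts/recaser/truecase.perl --model truecase-model.en < input.en > input.tc.en

  # recaser: train a lowercase-to-cased Moses model, then apply it
  scripts/recaser/train-recaser.perl --dir recaser-model --corpus cased.corpus.en
  scripts/recaser/recase.perl --model recaser-model/moses.ini --in lowercased.en --moses bin/moses > recased.en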


Best Regards,
Ergun

Ergun Biçici, CNGL, School of Computing, DCU, www.cngl.ie
http://www.computing.dcu.ie/~ebicici/


On Wed, May 20, 2015 at 8:07 PM, Philipp Koehn p...@jhu.edu wrote:

 Hi,

 yes, this is what the RECASER section in EMS enables.

 -phi

 On Wed, May 20, 2015 at 2:50 PM, Lane Schwartz dowob...@gmail.com wrote:

  Got it. So then, how was casing handled in the mbr/mp column? Was all
 of the data lowercased, then models trained, then recasing applied after
 decoding? Or something else?

 On Wed, May 20, 2015 at 1:30 PM, Philipp Koehn p...@jhu.edu wrote:

 Hi,

  no, the changes are made incrementally.

 So the recased baseline is the previous mbr/mp column.

  -phi

 On Wed, May 20, 2015 at 2:01 PM, Lane Schwartz dowob...@gmail.com
 wrote:

  Philipp,

  In Table 2 of the WMT 2009 paper, are the baseline and truecased
 columns directly comparable? In other words, do the two columns indicate
 identical conditions other than a single variable (how and/or when casing
 was handled)?

  In the baseline condition, how and when was casing handled?

  Thanks,
 Lane


 On Wed, May 20, 2015 at 12:43 PM, Philipp Koehn p...@jhu.edu wrote:

 Hi,

  see Section 2.2 in our WMT 2009 submission:
 http://www.statmt.org/wmt09/pdf/WMT-0929.pdf

  One practical reason to avoid recasing is the need
 for a second large cased language model.

 But there is of course also the practical issue of
 having a unique truecasing scheme for each data
 condition, handling of headlines, all-caps emphasis,
 etc.

 It would be worth revisiting this issue again under
 different data conditions / language pairs. Both
 options are readily available in EMS.

  Each of the two alternative methods could be
 improved as well. See for instance:
 http://www.aclweb.org/anthology/N06-1001

  -phi

  -phi


  On Wed, May 20, 2015 at 12:31 PM, Lane Schwartz dowob...@gmail.com
 wrote:

   Philipp (and others),

  I'm wondering what people's experience is regarding when truecasing
 is applied.

  One option is to truecase the training data, then train your TM and
 LM using that truecased data. Another option would be to lowercase the
 data, train TM and LM on the lowercased data, and then perform truecasing
 after decoding.

  I assume that the former gives better results, but the latter
 approach has an advantage in terms of extensibility (namely if you get 
 more
 data and update your truecase model, you don't have to re-train all of 
 your
 TMs and LMs).

  Does anyone have any insights they would care to share on this?

  Thanks,
 Lane


  ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support





  --
 When a place gets crowded enough to require ID's, social collapse is not
 far away.  It is time to go elsewhere.  The best thing about space
 travel
 is that it made it possible to go elsewhere.
 -- R.A. Heinlein, Time Enough For Love

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support





  --
 When a place gets crowded enough to require ID's, social collapse is not
 far away.  It is time to go elsewhere.  The best thing about space travel
 is that it made it possible to go elsewhere.
 -- R.A. Heinlein, Time Enough For Love



 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support


___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Which symal?

2015-05-17 Thread Ergun Bicici
Moses' symal:
http://article.gmane.org/gmane.comp.nlp.moses.user/11544


Best Regards,
Ergun

Ergun Biçici, CNGL, School of Computing, DCU, www.cngl.ie
http://www.computing.dcu.ie/~ebicici/


On Sun, May 17, 2015 at 1:15 PM, Jeroen Vermeulen 
j...@precisiontranslationtools.com wrote:

 The symal source code is duplicated between the moses-smt and mgiza
 repositories.  Does it make sense to have both?  They're quietly
 diverging, which is probably a bad thing.

 Here's the differences that I can see:
  * I modernized the code in moses-smt.  Big diff, no functional change.
  * The moses-smt version supports longer source and target strings.
  * The mgiza version has what looks like some extra debug output.
  * The moses-smt version avoids non-portable use of /dev/stdout.
  * One builds through bjam, the other through cmake.

 Could we perhaps just delete the mgiza one, and tell people to use the
 one from moses instead?


 Jeroen
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Transliteration model is using processPhraseTable, which is not found in Moses version 3.0

2015-05-10 Thread Ergun Bicici
The transliteration config file is not copying the LM order (order=5):
evaluation/Transliteration-Module/test.transliterated.3/evaluation/moses.filtered.ini

and appends the following to the SRILM binary LM:
KENLM lazyken=0

which gives the following:

Exception: lm/read_arpa.cc:65 in void lm::ReadARPACounts(util::FilePiece &,
std::vector<unsigned long, std::allocator<unsigned long> > &) threw
'FormatLoadException'.
first non-empty line was SRILM_BINARY_NGRAM_002 not \data\. Byte: 23

I replaced this with SRILM and
obtained the Transliteration-Module/test.transliterated.3 file.
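
In other words, the LM feature line in the filtered ini changed along these
lines (paths are shortened here, and the feature name is illustrative; the
order value must match the trained model):

before (as generated):
  KENLM lazyken=0 name=LM0 factor=0 path=.../lm.binary
after (what worked):
  SRILM name=LM0 factor=0 path=.../lm.binary order=5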



Best Regards,
Ergun

Ergun Biçici, CNGL, School of Computing, DCU, www.cngl.ie
http://www.computing.dcu.ie/~ebicici/


On Wed, May 6, 2015 at 1:45 PM, Ergun Bicici ergun.bic...@computing.dcu.ie
wrote:


 Dear Nadir,

 Thank you very much for explaining transliteration. I have yes for both
 transliteration-module and post-decoding-transliteration in the EMS
 configuration file used for en-ru.

 Best Regards,
 Ergun

 Ergun Biçici, CNGL, School of Computing, DCU, www.cngl.ie
 http://www.computing.dcu.ie/~ebicici/


 -- Forwarded message --
 From: Nadir Durrani nadir.durr...@nu.edu.pk
 Date: Wed, May 6, 2015 at 11:17 AM
 Subject: Re: Transliteration model is using processPhraseTable, which is
 not found in Moses version 3.0
 To: Ergun Bicici ergun.bic...@computing.dcu.ie


 Hi Ergun,

 If you are only going to do

  transliteration-module = yes

 Moses will train the transliteration system but is not going to do
 anything with it. You have to select whether you want to use
 post-decoding or in-decoding transliteration.

 In post-decoding method, transliteration is done in the post-decoding
 step i.e. the decoder has translated all the sentences and now you
 just need to replace OOV words with their best transliteration given
 the context. This is Method 2 as described in the following paper

 http://aclweb.org/anthology//E/E14/E14-4029.pdf

 you can enable it by using

 post-decoding-transliteration = yes


 Using in-decoding method (Method 3 in the paper), you do
 transliteration inside the decoder on the fly. The advantage of this
 over Method 2 in theory is that you can also reorder the OOV word and
 make use of other features. But it does not give any clear-cut gains.

 More details here:

 http://www.statmt.org/moses/?n=Advanced.OOVs

 Nadir

  On Tue, May 5, 2015 at 5:33 PM, Ergun Bicici
  ergun.bic...@computing.dcu.ie wrote:
  
   Hi Nadir,
  
   I am using Moses 3.0 and for transliteration to work, I copied
   scripts/Transliteration/ from latest onto Moses 3.0 path, re-ran, and
   obtained translation results.
  
  
   Best Regards,
   Ergun
  
   Ergun Biçici, CNGL, School of Computing, DCU, www.cngl.ie
   http://www.computing.dcu.ie/~ebicici/
  
  
   On Mon, May 4, 2015 at 7:32 AM, Nadir Durrani 
 nadir.durr...@nu.edu.pk
   wrote:
  
   Hi Ergun,
  
   processPhraseTable is no longer supported by Moses. But I see that
   Phil Williams has already fixed this problem in transliteration
   module, by changing
  
`$MOSES_SRC/scripts/training/filter-model-given-input.pl
   $TRANSLIT_MODEL/evaluation/$eval_file.filtered
   $TRANSLIT_MODEL/evaluation/$eval_file.moses.table.ini
   $TRANSLIT_MODEL/evaluation/$eval_file  -Binarizer
   $MOSES_SRC/bin/processPhraseTable`;
  
   to
  
   `$MOSES_SRC/scripts/training/filter-model-given-input.pl
   $TRANSLIT_MODEL/evaluation/$eval_file.filtered
   $TRANSLIT_MODEL/evaluation/$eval_file.moses.table.ini
   $TRANSLIT_MODEL/evaluation/$eval_file -Binarizer
   $MOSES_SRC/bin/CreateOnDiskPt 1 1 4 100 2`;
  
   in
  
   path-to-moses/scripts/Transliteration/in-decoding-transliteration.pl
  
   Here's the commit
  
  
  
  
 https://github.com/moses-smt/mosesdecoder/commit/7e54e23fe234ac48f44b0e473d09a5b4d5f6
  
   May be you pulled and in between version where the processPhraseTable
   was removed but transliteration scripts were not fixed.
  
   Cheers,
   Nadir
  
  
   On Mon, May 4, 2015 at 7:46 AM,  moses-support-requ...@mit.edu
 wrote:
Send Moses-support mailing list submissions to
moses-support@mit.edu
   
To subscribe or unsubscribe via the World Wide Web, visit
http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
moses-support-requ...@mit.edu
   
You can reach the person managing the list at
moses-support-ow...@mit.edu
   
When replying, please edit your Subject line so it is more specific
than Re: Contents of Moses-support digest...
   
   
Today's Topics:
   
   1. Re: 12-gram language model ARPA file for 16GB (liling tan)
   2. Transliteration model is using processPhraseTable, which is
  not found in Moses version 3.0 (Ergun Bicici)
   3. Re: Transliteration model is using processPhraseTable, which
  is not found

[Moses-support] Fwd: Transliteration model is using processPhraseTable, which is not found in Moses version 3.0

2015-05-06 Thread Ergun Bicici
Dear Nadir,

Thank you very much for explaining transliteration. I have yes for both
transliteration-module and post-decoding-transliteration in the EMS
configuration file used for en-ru.
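
For reference, the relevant switches in the EMS configuration look like this
(a sketch; in my file they sit in the TRAINING section):

[TRAINING]
transliteration-module = yes
post-decoding-transliteration = yes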

Best Regards,
Ergun

Ergun Biçici, CNGL, School of Computing, DCU, www.cngl.ie
http://www.computing.dcu.ie/~ebicici/


-- Forwarded message --
From: Nadir Durrani nadir.durr...@nu.edu.pk
Date: Wed, May 6, 2015 at 11:17 AM
Subject: Re: Transliteration model is using processPhraseTable, which is
not found in Moses version 3.0
To: Ergun Bicici ergun.bic...@computing.dcu.ie


Hi Ergun,

If you are only going to do

 transliteration-module = yes

 Moses will train the transliteration system but is not going to do
 anything with it. You have to select whether you want to use
 post-decoding or in-decoding transliteration.

In post-decoding method, transliteration is done in the post-decoding
step i.e. the decoder has translated all the sentences and now you
just need to replace OOV words with their best transliteration given
the context. This is Method 2 as described in the following paper

http://aclweb.org/anthology//E/E14/E14-4029.pdf

you can enable it by using

post-decoding-transliteration = yes


Using in-decoding method (Method 3 in the paper), you do
transliteration inside the decoder on the fly. The advantage of this
over Method 2 in theory is that you can also reorder the OOV word and
make use of other features. But it does not give any clear-cut gains.

More details here:

http://www.statmt.org/moses/?n=Advanced.OOVs

Nadir

 On Tue, May 5, 2015 at 5:33 PM, Ergun Bicici
 ergun.bic...@computing.dcu.ie wrote:
 
  Hi Nadir,
 
  I am using Moses 3.0 and for transliteration to work, I copied
  scripts/Transliteration/ from latest onto Moses 3.0 path, re-ran, and
  obtained translation results.
 
 
  Best Regards,
  Ergun
 
  Ergun Biçici, CNGL, School of Computing, DCU, www.cngl.ie
  http://www.computing.dcu.ie/~ebicici/
 
 
  On Mon, May 4, 2015 at 7:32 AM, Nadir Durrani nadir.durr...@nu.edu.pk
  wrote:
 
  Hi Ergun,
 
  processPhraseTable is no longer supported by Moses. But I see that
  Phil Williams has already fixed this problem in transliteration
  module, by changing
 
   `$MOSES_SRC/scripts/training/filter-model-given-input.pl
  $TRANSLIT_MODEL/evaluation/$eval_file.filtered
  $TRANSLIT_MODEL/evaluation/$eval_file.moses.table.ini
  $TRANSLIT_MODEL/evaluation/$eval_file  -Binarizer
  $MOSES_SRC/bin/processPhraseTable`;
 
  to
 
  `$MOSES_SRC/scripts/training/filter-model-given-input.pl
  $TRANSLIT_MODEL/evaluation/$eval_file.filtered
  $TRANSLIT_MODEL/evaluation/$eval_file.moses.table.ini
  $TRANSLIT_MODEL/evaluation/$eval_file -Binarizer
  $MOSES_SRC/bin/CreateOnDiskPt 1 1 4 100 2`;
 
  in
 
  path-to-moses/scripts/Transliteration/in-decoding-transliteration.pl
 
  Here's the commit
 
 
 
 
https://github.com/moses-smt/mosesdecoder/commit/7e54e23fe234ac48f44b0e473d09a5b4d5f6
 
  Maybe you pulled an in-between version where processPhraseTable
  had been removed but the transliteration scripts were not yet fixed.
 
  Cheers,
  Nadir
 
 
  On Mon, May 4, 2015 at 7:46 AM,  moses-support-requ...@mit.edu
wrote:
   Send Moses-support mailing list submissions to
   moses-support@mit.edu
  
   To subscribe or unsubscribe via the World Wide Web, visit
   http://mailman.mit.edu/mailman/listinfo/moses-support
   or, via email, send a message with subject or body 'help' to
   moses-support-requ...@mit.edu
  
   You can reach the person managing the list at
   moses-support-ow...@mit.edu
  
   When replying, please edit your Subject line so it is more specific
   than Re: Contents of Moses-support digest...
  
  
   Today's Topics:
  
  1. Re: 12-gram language model ARPA file for 16GB (liling tan)
  2. Transliteration model is using processPhraseTable, which is
 not found in Moses version 3.0 (Ergun Bicici)
  3. Re: Transliteration model is using processPhraseTable, which
 is not found in Moses version 3.0 (Hieu Hoang)
  4. Europarl monolingual corpus (Hieu Hoang)
  
  
  
  
--
  
   Message: 1
   Date: Sun, 3 May 2015 19:44:12 +0200
   From: liling tan alvati...@gmail.com
   Subject: Re: [Moses-support] 12-gram language model ARPA file for
   16GB
   To: moses-support moses-support@mit.edu
   Message-ID:
  
   CAKzPaJJ7fY=9C89POact542vu32d+H3=0i_Dnaj=yfizbfa...@mail.gmail.com
   Content-Type: text/plain; charset=utf-8
  
   Dear Moses devs/users,
  
   For now, I only know that it takes more than 250GB. I've 250GB of free
   space and KenLM got poisoned by insufficient space...
  
   Does anyone have an idea how big a 12-gram language model ARPA file
   trained on 16GB of text would become?
  
   STDERR:
  
   === 1/5 Counting and sorting n-grams ===
   Reading /media/2tb/wmt15/corpus.truecase/train-lm.en
  
  
  
5---10---15---20---25---30---35---40---45---50---55---60---65

Re: [Moses-support] Transliteration model is using processPhraseTable, which is not found in Moses version 3.0

2015-05-05 Thread Ergun Bicici
Hi Nadir,

I am using Moses 3.0 and for transliteration to work, I copied
scripts/Transliteration/ from latest onto Moses 3.0 path, re-ran, and
obtained translation results.


Best Regards,
Ergun

Ergun Biçici, CNGL, School of Computing, DCU, www.cngl.ie
http://www.computing.dcu.ie/~ebicici/


On Mon, May 4, 2015 at 7:32 AM, Nadir Durrani nadir.durr...@nu.edu.pk
wrote:

 Hi Ergun,

 processPhraseTable is no longer supported by Moses. But I see that
 Phil Williams has already fixed this problem in transliteration
 module, by changing

  `$MOSES_SRC/scripts/training/filter-model-given-input.pl
 $TRANSLIT_MODEL/evaluation/$eval_file.filtered
 $TRANSLIT_MODEL/evaluation/$eval_file.moses.table.ini
 $TRANSLIT_MODEL/evaluation/$eval_file  -Binarizer
 $MOSES_SRC/bin/processPhraseTable`;

 to

 `$MOSES_SRC/scripts/training/filter-model-given-input.pl
 $TRANSLIT_MODEL/evaluation/$eval_file.filtered
 $TRANSLIT_MODEL/evaluation/$eval_file.moses.table.ini
 $TRANSLIT_MODEL/evaluation/$eval_file -Binarizer
 $MOSES_SRC/bin/CreateOnDiskPt 1 1 4 100 2`;

 in

 path-to-moses/scripts/Transliteration/in-decoding-transliteration.pl

 Here's the commit


 https://github.com/moses-smt/mosesdecoder/commit/7e54e23fe234ac48f44b0e473d09a5b4d5f6

 Maybe you pulled an in-between version where processPhraseTable
 had been removed but the transliteration scripts were not yet fixed.

 Cheers,
 Nadir


 On Mon, May 4, 2015 at 7:46 AM,  moses-support-requ...@mit.edu wrote:
  Send Moses-support mailing list submissions to
  moses-support@mit.edu
 
  To subscribe or unsubscribe via the World Wide Web, visit
  http://mailman.mit.edu/mailman/listinfo/moses-support
  or, via email, send a message with subject or body 'help' to
  moses-support-requ...@mit.edu
 
  You can reach the person managing the list at
  moses-support-ow...@mit.edu
 
  When replying, please edit your Subject line so it is more specific
  than Re: Contents of Moses-support digest...
 
 
  Today's Topics:
 
 1. Re: 12-gram language model ARPA file for 16GB (liling tan)
 2. Transliteration model is using processPhraseTable, which is
not found in Moses version 3.0 (Ergun Bicici)
 3. Re: Transliteration model is using processPhraseTable, which
is not found in Moses version 3.0 (Hieu Hoang)
 4. Europarl monolingual corpus (Hieu Hoang)
 
 
  --
 
  Message: 1
  Date: Sun, 3 May 2015 19:44:12 +0200
  From: liling tan alvati...@gmail.com
  Subject: Re: [Moses-support] 12-gram language model ARPA file for 16GB
  To: moses-support moses-support@mit.edu
  Message-ID:
  CAKzPaJJ7fY=9C89POact542vu32d+H3=0i_Dnaj=
 yfizbfa...@mail.gmail.com
  Content-Type: text/plain; charset=utf-8
 
  Dear Moses devs/users,
 
  For now, I only know that it takes more than 250GB. I've 250GB of free
  space and KenLM got poisoned by insufficient space...
 
  Does anyone have an idea how big a 12-gram language model ARPA file
  trained on 16GB of text would become?
 
  STDERR:
 
  === 1/5 Counting and sorting n-grams ===
  Reading /media/2tb/wmt15/corpus.truecase/train-lm.en
 
 5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
  tcmalloc: large alloc 7846035456 bytes == 0x10f4000 @
  tcmalloc: large alloc 73229664256 bytes == 0x1d542e000 @
 
 
  Unigram tokens 3038737446 types 5924314
  === 2/5 Calculating and sorting adjusted counts ===
  Chain sizes: 1:71091768 2:804524736 3:1508483968 4:2413574144
 5:3519795968
  6:4827148288 7:6335632384 8:8045247488 9:9955993600 10:12067871744
  11:14380880896 12:16895020032
  tcmalloc: large alloc 16895025152 bytes == 0x1d542e000 @
  tcmalloc: large alloc 2413576192 bytes == 0x8f2a @
  tcmalloc: large alloc 3519799296 bytes == 0x5c4488000 @
  tcmalloc: large alloc 4827152384 bytes == 0x696146000 @
  tcmalloc: large alloc 6335635456 bytes == 0x7b5cce000 @
  tcmalloc: large alloc 8045248512 bytes == 0x92f6f @
  tcmalloc: large alloc 9955999744 bytes == 0xb0ef7c000 @
  tcmalloc: large alloc 12067872768 bytes == 0xd60644000 @
  tcmalloc: large alloc 14380883968 bytes == 0x12f616e000 @
  Last input should have been poison.
  Last input should have been poison.
  util/file.cc:196 in void util::WriteOrThrow(int, const void*,
  std::size_t) threw FDException because `ret < 1'.
  No space left on device in /tmp/PC2o3z (deleted) while writing
  5301120368 bytes

  Last input should have been poison.
  util/file.cc:196 in void util::WriteOrThrow(int, const void*,
  std::size_t) threw FDException because `ret < 1'.
  No space left on device in /tmp/PftXeo (deleted) while writing
  1941075872 bytes
  Last input should have been poison.

  util/file.cc:196 in void util::WriteOrThrow(int, const void*,
  std::size_t) threw FDException because `ret < 1'.
  No space left on device in /tmp/CuZcPM

[Moses-support] Transliteration model is using processPhraseTable, which is not found in Moses version 3.0

2015-05-03 Thread Ergun Bicici
binarizing...gzip -cd
en-ru_path/model/Transliteration.8/tuning/filtered/phrase-table.0-0.1.1.gz
| LC_ALL=C sort -T en-ru_path/model/Transliteration.8/tuning/filtered |
moses_3.0/mosesdecoder/bin/processPhraseTable -ttable 0 0 - -nscores 4 -out
en-ru_path/model/Transliteration.8/tuning/filtered/phrase-table.0-0.1.1
sh: moses_3.0/mosesdecoder/bin/processPhraseTable: No such file or directory
sort: write failed: standard output: Broken pipe
sort: write error

How can I have processPhraseTable built?

Best Regards,
Ergun

Ergun Biçici, CNGL, School of Computing, DCU, www.cngl.ie
http://www.computing.dcu.ie/~ebicici/
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Support for per-sentence language model

2015-04-27 Thread Ergun Bicici
Can you use Google n-grams through some API? How about word2vec
(https://code.google.com/p/word2vec/)?


Best Regards,
Ergun

Ergun Biçici, CNGL, School of Computing, DCU, www.cngl.ie
http://www.computing.dcu.ie/~ebicici/


On Sat, Apr 25, 2015 at 2:47 PM, Kenneth Heafield mo...@kheafield.com
wrote:

 Hi,

 We know how to produce filtered models.  The problem is StaticData
 enforces one feature set per process.  Lane could theoretically run
 single-threaded and hack StaticData in between each sentence.  The real
 answer is that StaticData needs to die.

 Kenneth

 On 04/25/2015 07:19 AM, Ergun Bicici wrote:
 
  From man ngram:
 -limit-vocab
    Discard LM parameters on reading that do not pertain to the
    words specified in the vocabulary. The default is that words
    used in the LM are automatically added to the vocabulary. This
    option can be used to reduce the memory requirements for large
    LMs that are going to be evaluated only on a small vocabulary
    subset.
 
  Best Regards,
  Ergun
 
  Ergun Biçici, CNGL, School of Computing, DCU, www.cngl.ie
  http://www.computing.dcu.ie/~ebicici/
 
 
  On Fri, Apr 24, 2015 at 9:12 PM, Lane Schwartz dowob...@gmail.com wrote:
 
  To answer my own question...
 
  After talking with Hieu and Kenneth, it appears that the answer, at
  present, is no. But if anyone would be interested in working on this
  as an MT Marathon project, this would be great.
 
  On Fri, Apr 24, 2015 at 10:25 AM, Lane Schwartz dowob...@gmail.com wrote:
   Does moses (and particularly EMS) have a mechanism to allow for
 each
   test sentence to have its own LM file that should be used when
   translating just that sentence?
  
   This is in the context of taking a large LM and filtering it for a
   single sentence.
  
   Thanks,
   Lane
 
 
 
  --
  When a place gets crowded enough to require ID's, social collapse is
 not
  far away.  It is time to go elsewhere.  The best thing about space
  travel
  is that it made it possible to go elsewhere.
  -- R.A. Heinlein, Time Enough For Love
  ___
  Moses-support mailing list
  Moses-support@mit.edu
  http://mailman.mit.edu/mailman/listinfo/moses-support
 
 
 
 
  ___
  Moses-support mailing list
  Moses-support@mit.edu
  http://mailman.mit.edu/mailman/listinfo/moses-support
 
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Support for per-sentence language model

2015-04-25 Thread Ergun Bicici
From man ngram:
   -limit-vocab
      Discard LM parameters on reading that do not pertain to the
      words specified in the vocabulary. The default is that words
      used in the LM are automatically added to the vocabulary. This
      option can be used to reduce the memory requirements for large
      LMs that are going to be evaluated only on a small vocabulary
      subset.

Best Regards,
Ergun

Ergun Biçici, CNGL, School of Computing, DCU, www.cngl.ie
http://www.computing.dcu.ie/~ebicici/


On Fri, Apr 24, 2015 at 9:12 PM, Lane Schwartz dowob...@gmail.com wrote:

 To answer my own question...

 After talking with Hieu and Kenneth, it appears that the answer, at
 present, is no. But if anyone would be interested in working on this
 as an MT Marathon project, this would be great.

 On Fri, Apr 24, 2015 at 10:25 AM, Lane Schwartz dowob...@gmail.com
 wrote:
  Does moses (and particularly EMS) have a mechanism to allow for each
  test sentence to have its own LM file that should be used when
  translating just that sentence?
 
  This is in the context of taking a large LM and filtering it for a
  single sentence.
 
  Thanks,
  Lane



 --
 When a place gets crowded enough to require ID's, social collapse is not
 far away.  It is time to go elsewhere.  The best thing about space travel
 is that it made it possible to go elsewhere.
 -- R.A. Heinlein, Time Enough For Love
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] Post-doctoral Research Opportunity in Ireland for Brazilians Through Science without Borders scheme and EAMT 2015 Summer Internships Fund

2014-10-31 Thread Ergun Bicici
We have a postdoctoral research opportunity for researchers from Brazil
through the Science without Borders scheme. For more information:
https://www.dcu.ie/research/fellowship-opportunities-brazil-Post-Doc/ICTs.shtml
https://www4.dcu.ie/sites/default/files/research/pdfs/SWB%20Qun%20Liu.pdf

EAMT 2015 Summer Internships Fund:
http://www.eamt.org/news/news_summer_internships_2015.php

Best regards,
Ergun

Ergun Biçici, CNGL, School of Computing, DCU, www.cngl.ie, Phone:
+353-1-700-6711
http://www.computing.dcu.ie/~ebicici/
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] Post-doctoral Researcher Job Advertisement

2014-09-05 Thread Ergun Bicici
We have a post-doctoral researcher position for the following project:

   - *Monolingual and Bilingual Text Quality Judgments with Translation
   Performance Prediction, 2014-2015*
   http://www.computing.dcu.ie/~ebicici/Projects/TIDA_RTM.html
   An SFI project funded by a Technology Innovation Development Award (TIDA).

The position is for 6 months; the salary is 37,750 Euros on a yearly basis.
The application deadline is 19 September 2014.

http://www.dcu.ie/sites/default/files/hr/Postdoctoral%20Researcher%20CNGL.pdf

Regards,
Ergun

Ergun Bicici
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] Parallel FDA5 WMT'14 Datasets

2014-09-05 Thread Ergun Bicici
Parallel FDA5 WMT'14 Datasets

Dear moses-list,

We are making the English, Czech, French, German, and Russian datasets we
used when building Parallel FDA5 Moses SMT systems available for research
purposes at:
https://github.com/bicici/ParFDA5WMT

Results are presented in the citation provided below.

Citation:

Ergun Biçici, Qun Liu, and Andy Way. Parallel FDA5 for Fast Deployment of
Accurate Statistical Machine Translation Systems. In Proceedings of the
Ninth Workshop on Statistical Machine Translation, Baltimore, USA, June
2014. Association for Computational Linguistics.


The datasets and the SMT results can serve as a benchmark for SMT research
where further linguistic processing can be applied to see whether the
results improve. The datasets also allow fast deployment of accurate SMT
systems.

Language models were built using SRILM (
http://www.speech.sri.com/projects/srilm/); a sample invocation is
sketched after the list below. The language model corpora contain 15M
sentences, some of which were selected from the LDC Gigaword corpora by
the Parallel FDA5 algorithm:

[4 use the LDC English Gigaword 5th edition]
- Czech - English: 2.13 million sentences from LDC English Gigaword, ~1.69%
- French - English: 2.49 million sentences from LDC English Gigaword, ~1.97%
- German - English: 2.57 million sentences from LDC English Gigaword, ~2.03%
- Russian - English: 3.34 million sentences from LDC English Gigaword, ~2.64%

[1 use the LDC French Gigaword 3rd edition]
- English - French: 0.47 million sentences from LDC French Gigaword, ~1.93%


Work using the datasets:

- Ergun Biçici, Qun Liu, and Andy Way. Parallel FDA5 for Fast Deployment of
Accurate Statistical Machine Translation Systems. In Proceedings of the
Ninth Workshop on Statistical Machine Translation, Baltimore, USA, June
2014. Association for Computational Linguistics.
- Ergun Biçici (contributor), “Quality Estimation for Extending Good
Translations”, QTLaunchPad Deliverable:
http://www.qt21.eu/launchpad/deliverable/quality-estimation-extending-good-translations
- Ergun Biçici, High Quality Machine Translation with ITERPE, 2014. Note:
Dublin City University Invention Disclosure.


LICENSE:
CNGL License for Open Data allowing use for research and academic purposes.


Best Regards,
Ergun

Ergun Biçici, CNGL, School of Computing, DCU, www.cngl.ie, Phone:
+353-1-700-6711
http://www.computing.dcu.ie/~ebicici/
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] Moses RELEASE-1.0: Segmentation fault with gcc-4.8.2

2013-11-27 Thread Ergun Bicici
Hi,

I compiled Moses RELEASE-1.0 with gcc-4.8.2 and received a segmentation
fault during decoding:

(from TUNING_tune.9.STDERR)
...
binary file loaded, default OFF_T: -1
binary phrasefile loaded, default OFF_T: -1
binary file loaded, default OFF_T: -1
Translating line 8  in thread id 140727747983104
Translating line 9  in thread id 140727706019584
Translating line 10  in thread id 140727747983104
Translating line 11  in thread id 140727714412288
Translating line 12  in thread id 140727706019584
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted
Exit code: 134
The decoder died. CONFIG WAS -w -0.217391 -tm 0.043478 0.043478 0.043478
0.043478 0.043478 -d 0.065217 0.065217 0.065217 0.065217 0.065217 0.065217
0.065217 -lm 0.108696
cp: cannot stat ‘/path/tuning/tmp.9/moses.ini’: No such file or directory

I then re-compiled with gcc-4.7 and was able to finish training and testing.

Regards,
Ergun

Ergun Bicici
Koc University
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Moses RELEASE-1.0: Segmentation fault with gcc-4.8.2

2013-11-27 Thread Ergun Bicici
Hi Hieu,

Thank you for the explanation.


Regards,
Ergun

Ergun Bicici
Koc University


On Wed, Nov 27, 2013 at 1:52 PM, Hieu Hoang hieu.ho...@ed.ac.uk wrote:

 hi ergun,

 i noticed this problem a few weeks ago and reported it here
http://article.gmane.org/gmane.comp.nlp.moses.user/9730
 It's specific to gcc-4.8

 I've provided a workaround in the github master source code, but i haven't
 patched RELEASE-1.0

 Perhaps you should continue to work with gcc 4.7 until 4.8 has been fixed



 On 27 November 2013 11:36, Ergun Bicici ebic...@ku.edu.tr wrote:


 Hi,

 I compiled Moses RELEASE-1.0 with gcc-4.8.2 and received a segmentation
 fault during decoding:

 (from TUNING_tune.9.STDERR)
 ...
 binary file loaded, default OFF_T: -1
 binary phrasefile loaded, default OFF_T: -1
 binary file loaded, default OFF_T: -1
 Translating line 8  in thread id 140727747983104
 Translating line 9  in thread id 140727706019584
 Translating line 10  in thread id 140727747983104
 Translating line 11  in thread id 140727714412288
 Translating line 12  in thread id 140727706019584
 terminate called after throwing an instance of 'std::bad_alloc'
   what():  std::bad_alloc
 Aborted
 Exit code: 134
 The decoder died. CONFIG WAS -w -0.217391 -tm 0.043478 0.043478 0.043478
 0.043478 0.043478 -d 0.065217 0.065217 0.065217 0.065217 0.065217 0.065217
 0.065217 -lm 0.108696
 cp: cannot stat ‘/path/tuning/tmp.9/moses.ini’: No such file or directory

 I then re-compiled with gcc-4.7 and was able to finish training and
 testing.

 Regards,
 Ergun

 Ergun Bicici
 Koc University

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support




 --
 Hieu Hoang
 Research Associate
 University of Edinburgh
 http://www.hoang.co.uk/hieu


___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Cannot find -lboost_thread-mt

2013-04-19 Thread Ergun Bicici
 For the Boost program options error, are you sure you have only one copy
 of Boost installed?

Explicitly specifying the Boost path caused similar "undefined reference
to" problems in my case:
./bjam --with-boost=...

Yet, this, without specifying the boost install dir, worked:
./bjam -a -q


On Sat, Dec 10, 2011 at 1:31 PM, Kenneth Heafield mo...@kheafield.comwrote:

 For the SRILM error, please follow step 9 of SRILM's install
 instructions.  Do a clean untar of SRILM, fix the broken machine-type as
 instructed in BUILD-INSTRUCTIONS.txt then build with:

 make MAKE_PIC=yes

 For the Boost program options error, are you sure you have only one copy
 of Boost installed?

 On 12/10/11 13:17, ra...@rszk.net wrote:
  Hi Hieu
 
   Thanks for the reply. Using link=shared sorted out that error. However,
   there are some new errors, for which I couldn't find any reference by
   searching the list and the web. For example:
 
 
 /usr/lib64/gcc/x86_64-suse-linux/4.3/../../../../x86_64-suse-linux/bin/ld:
  /home/me/moses/tools/srilm/lib/i686-m64/liboolm.a(Vocab.o): relocation
  R_X86_64_32 against `Vocab::compare(unsigned int, unsigned int)' can not
  be used when making a shared object; recompile with -fPIC
  /home/me/moses/tools/srilm/lib/i686-m64/liboolm.a: could not read
  symbols: Bad value
  collect2: ld returned 1 exit status
 
 
 /home/me/moses/tools/boost/include/boost/program_options/detail/value_semantic.hpp:58:
  undefined reference to
 
  `boost::program_options::validation_error::validation_error(boost::program_options::validation_error::kind_t,
  std::basic_string<char, std::char_traits<char>, std::allocator<char> >
  const&, std::basic_string<char, std::char_traits<char>,
  std::allocator<char> > const&)'
 
  Best Regards,
  Rasul.
  
   *From:* Hieu Hoang kingofclap...@gmail.com
   *To:* moses-support@mit.edu
  *Sent:* Saturday, December 10, 2011 3:33 AM
  *Subject:* Re: [Moses-support] Cannot find -lboost_thread-mt
 
  hi rasul
 
  try using shared linking
./bjam link=shared ...
   it may be that the static library for thread-mt isn't installed on your
   computer.
 
  The static version has the name
  libboost_thread-mt.a
  the shared version is
  libboost_thread-mt.so
 
  or if you don't need threading, turn it off
./bjam threading=single ...
 
   On 10/12/2011 05:13, ra...@rszk.net wrote:
  Hi all,
 
   I have been trying to install the latest version of Moses with the
   latest versions of SRILM, GIZA++ and Boost (1.48.0), but without
   IRSTLM as I didn't manage to install it. The problem is that I receive
   an error in several stages which says "... Cannot find
   -lboost_thread-mt" and the installation fails. I was wondering if
   anybody has some idea of what the reason could be. Here are some facts
  anybody has some idea of what could the reason be. Here are some facts
  about my installation:
 
  - I'm not using --with-IRSTLM option
  - I'm installing everything in non-standard directories as I don't
  have admin permission (but following hints on both boost and Moses
  installation guidelines regarding this matter)
  - The platform is OpenSuse
  - I'm new to Moses
  - Anything else needed to be clarified??
 
  Best Regards,
  Rasul.
  ___
  Moses-support mailing list
   Moses-support@mit.edu
  http://mailman.mit.edu/mailman/listinfo/moses-support
 
  ___
  Moses-support mailing list
   Moses-support@mit.edu
  http://mailman.mit.edu/mailman/listinfo/moses-support
 
 
 
  ___
  Moses-support mailing list
  Moses-support@mit.edu
  http://mailman.mit.edu/mailman/listinfo/moses-support

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] Future costs calculation in MOSES

2009-01-28 Thread Ergun Bicici
Dear Moses Users,

I am trying to understand how the future cost is calculated.

From http://www.statmt.org/moses/?n=Moses.Background:
The language model cost is usually calculated by a trigram language model.

However, we do not know the preceding English words for
a translation operation. Therefore, we approximate this
cost by computing the language model score for
the generated English words alone.


Here English refers to the target language.

- Why would already translated portions of the sentence
contribute to the future cost?



Is the Hypothesis::CalcFutureScore function retrieving the
translation cost for each non-translated portion of the hypothesis,
or does it (const SquareMatrix futureScore) also include the LM cost?

LanguageModel::CalcScore adds the n-gram score to retFullScore:
fullScore += ngramScore;

But then in TranslationOption::CalcScore, this is subtracted back:
m_futureScore = retFullScore - ngramScore
+
m_scoreBreakdown.InnerProduct(StaticData::Instance().GetAllWeights()) -
phraseSize * StaticData::Instance().GetWeightWordPenalty();


- Is the n-gram order (3) fixed for the LM cost calculations
used in the future cost? It does not look like it.


It would be helpful if someone could clarify the
future cost calculation further.

Thanks,
Ergun


Ergun Bicici
Koc University


On Wed, Sep 24, 2008 at 5:46 PM, Philipp Koehn pko...@inf.ed.ac.uk wrote:

 Hi,

  the future cost estimate includes an estimate of the phrase translation
 cost
 and language model cost, but not reordering costs. And yes, this is
 implemented
 as described in the Pharaoh manual.

 -phi

 On Wed, Sep 24, 2008 at 8:58 AM, Yee Seng Chan cha...@comp.nus.edu.sg
 wrote:
  Hi list members,
 
 
 
  Inside TranslationOption.cpp::CalcScore(), m_futureScore is effectively:
  retFullScore - (PhraseSize*WordPenalty)
 
  (Kindly correct me if I'm wrong).
 
 
 
   What's the reasoning for using the above as the futureScore? I know
   retFullScore is the n-gram score. Btw, does the approach here follow
   Section 3.5, Future Cost Estimation, in the Pharaoh manual?
 
 
 
  Best regards,
 
  Yee Seng Chan
 
 
 
  ___
  Moses-support mailing list
  Moses-support@mit.edu
  http://mailman.mit.edu/mailman/listinfo/moses-support
 
 
 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support


___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support