Re: [Moses-support] too few factors error in mert

2016-12-06 Thread Sašo Kuntaric
Please see my reply to another thread below. I believe the source side of
your tuning set needs to be factored as well.
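If it helps, a quick sanity check along these lines will point out any token with the wrong factor count (a sketch; corpus.factored is a placeholder filename, and nf should match the number of factors your model expects):

```shell
# Sanity-check a factored corpus: every whitespace-separated token should
# have exactly nf pipe-separated factors. Prints any offending token.
awk -v nf=4 '{
  for (i = 1; i <= NF; i++) {
    n = split($i, p, "|")
    if (n != nf) print "line " NR ": token " $i " has " n " factors"
  }
}' corpus.factored
```

Running this over both sides of the tuning set shows immediately which file (if any) is missing factors.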

On 30/06/2016 21:44, Sašo Kuntaric wrote:

Hi all,

I would like to ask one more question. When you say that my reference only
has the surface form, are you talking about the "tuning corpus", which in
the case of my command

~/mosesdecoder/scripts/training/mert-moses.pl ~/working/IT_corpus/TMX/txt/
factored_corpus/singles/tuning_corpus.tagged.clean.en
~/working/IT_corpus/TMX/txt/factored_corpus/singles/tuning
_corpus.tagged.clean.sl ~/mosesdecoder/bin/moses
~/working/IT_corpus/TMX/txt/factored_corpus/singles/test/model/moses.ini
--mertdir ~/mosesdecoder/bin/ --decoder-flags="-threads all"

are tuning_corpus.tagged.clean.en and tuning_corpus.tagged.clean.sl? Can
tuning be done with files that only contain surface forms?

It's usual that the reference tuning data does not have factors, even if
there are factors in the phrase table. After all, what you care about is
whether the output surface form is correct, not whether the other factors
are.

Will the results be compatible with tuning done with a factored tuning
corpus?

yes

Best regards,

Sašo

2016-12-06 10:18 GMT+01:00 Hasan Sait ARSLAN :

> Hi,
>
> I have a factored dataset. It involves 4 factors, 
> factor1|factor2|factor3|factor4.
> I have trained my model with such a dataset.
>
> Now, when I try to tune my model, I encounter the following error:
>
> Exception: moses/Word.cpp:159 in void
> Moses::Word::CreateFromString(Moses::FactorDirection, const
> std::vector&, const StringPiece&, bool, bool) threw
> util::Exception because `!isNonTerminal && i < factorOrder.size()'.
> Too few factors in string '-|-|Punc|Punc'
>
> The details of the error are in the mert.txt file, which is attached to
> this e-mail.
>
> Thanks,
>
> Kind Regards,
> Hasan Sait Arslan
>
> ___
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>


-- 
Regards,

Sašo


Re: [Moses-support] Tuning for factored phrase based systems

2016-12-06 Thread Sašo Kuntaric
Hi Angli,

Here is an excerpt of Hieu's answers on this topic from when I was
researching factored models; it might be of some help:

On 30/06/2016 21:44, Sašo Kuntaric wrote:

Hi all,

I would like to ask one more question. When you say that my reference only
has the surface form, are you talking about the "tuning corpus", which in
the case of my command

~/mosesdecoder/scripts/training/mert-moses.pl ~/working/IT_corpus/TMX/txt/
factored_corpus/singles/tuning_corpus.tagged.clean.en
~/working/IT_corpus/TMX/txt/factored_corpus/singles/tuning
_corpus.tagged.clean.sl ~/mosesdecoder/bin/moses
~/working/IT_corpus/TMX/txt/factored_corpus/singles/test/model/moses.ini
--mertdir ~/mosesdecoder/bin/ --decoder-flags="-threads all"

are tuning_corpus.tagged.clean.en and tuning_corpus.tagged.clean.sl? Can
tuning be done with files that only contain surface forms?

It's usual that the reference tuning data does not have factors, even if
there are factors in the phrase table. After all, what you care about is
whether the output surface form is correct, not whether the other factors
are.

Will the results be compatible with tuning done with a factored tuning
corpus?

yes

Best regards,

Sašo

2016-12-04 1:37 GMT+01:00 Hieu Hoang :

>
>
> Hieu
> Sent while bumping into things
>
> On 1 Dec 2016 07:01, "Angli Liu"  wrote:
>
> Hi, what's the major difference between the tuning process for a factored
> phrase based system (i.e., surface+pos data) and a simple baseline phrase
> based system?
>
>
> Nothing; tuning just optimises weights for feature functions.
>
> If you decompose your translation so that it has multiple phrase tables
> and generation models, then they are just extra feature functions with
> weights to be tuned.
>
> Do I need to organize the dev set the same way as the training set (i.e.,
> surface|pos)?
>
> Yes
>
> Is there a tutorial on the moses website on this topic?
>
> Maybe this
> http://www.statmt.org/moses/?n=FactoredTraining.FactoredTraining
>
>
> Thanks!
>
> -Angli
>


-- 
Regards,

Sašo


[Moses-support] Fwd: Moses Installation

2016-07-04 Thread Sašo Kuntaric
Hi Irene,

The command you posted is for training a translation model ... the error
states that you didn't tell Moses which corpus you want to use for the
model and where it is.
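For reference, the baseline tutorial's training invocation looks roughly like this (paths and corpus names are the tutorial's, not yours; the point is that -corpus, -f and -e are required):

```shell
# Baseline-style invocation: -corpus names the cleaned parallel corpus
# (without the .fr/.en extensions), -f/-e give source and target languages.
nohup nice ~/mosesdecoder/scripts/training/train-model.perl -root-dir train \
  -corpus ~/corpus/news-commentary-v8.fr-en.clean -f fr -e en \
  -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
  -lm 0:3:$HOME/lm/news-commentary-v8.fr-en.blm.en:8 \
  -external-bin-dir ~/mosesdecoder/tools >& training.out &
```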

Best regards,

Sašo

2016-07-04 20:43 GMT+02:00 Irene Nandutu :

> Hi all, I am seeking support.
> I am installing Moses using this link:
> http://www.statmt.org/moses/?n=moses.baseline.
> All the steps were fine, but when I came to run the training, I wanted
> to tell the training script where GIZA++ was installed using the
> -external-bin-dir argument.
>
> Here is the output in the commandline:
>
> irene@irene-HP-650-Notebook-PC:~/mosesdecoder/scripts/training$
> ./train-model.perl -external-bin-dir $HOME/mosesdecoder/tools
> Using SCRIPTS_ROOTDIR: /home/irene/mosesdecoder/scripts
> Using single-thread GIZA
> using gzip
> ERROR: use --corpus to specify corpus at ./train-model.perl line 499.
> irene@irene-HP-650-Notebook-PC:~/mosesdecoder/scripts/training$
>
> I will be grateful for any help and advice.
> Thanks
>
> Irene Nandutu
> +256781595675
>
>
>


-- 
Regards,

Sašo





Re: [Moses-support] Tuning giving zero weights in Factored Based SMT

2016-07-04 Thread Sašo Kuntaric
Hi Saumitra,

I had the same issue a few days ago. What worked in my case was using a
tagged source side and an untagged target side of the tuning corpus. The
other thing I changed was the language model, so you might want to
experiment with those two things and see if that resolves it.

Best regards,

Sašo

2016-07-04 2:03 GMT+02:00 Saumitra Yadav :

> Hi,
> I'm trying factored phrase-based SMT (Punjabi to English, with a POS tag
> as factor and translation factor 0,1-0), and while tuning I get BEST at 2:
> 0 0 0 0 0 0 0 0 ==> 0 as weights. Is there a reason for this?
> Regards,
> Saumitra Yadav
> MS CSE
> IIIT-Hyderabad
>
>
>


-- 
Regards,

Sašo


Re: [Moses-support] Fwd: Binarization fails with the Segmentation Fault error

2016-06-30 Thread Sašo Kuntaric
Hi all,

I would like to ask one more question. When you say that my reference only
has the surface form, are you talking about the "tuning corpus", which in
the case of my command

~/mosesdecoder/scripts/training/mert-moses.pl
~/working/IT_corpus/TMX/txt/factored_corpus/singles/tuning_corpus.tagged.clean.en
~/working/IT_corpus/TMX/txt/factored_corpus/singles/
tuning_corpus.tagged.clean.sl ~/mosesdecoder/bin/moses
~/working/IT_corpus/TMX/txt/factored_corpus/singles/test/model/moses.ini
--mertdir ~/mosesdecoder/bin/ --decoder-flags="-threads all"

are tuning_corpus.tagged.clean.en and tuning_corpus.tagged.clean.sl? Can
tuning be done with files that only contain surface forms? Will the
results be compatible with tuning done with a factored tuning corpus?

Models with one translation table work fine with corpora with only surface
forms, while models with 2 tables do not. Is that expected behavior?

I checked all my files and everything seems fine ... phrase table and
language model files look OK, there is almost 400 GB of free space, my
tuning set contains aligned source and target files.

The only strange thing that I could find in the tuning folder was the line
UnknownWordPenalty0 UNTUNEABLE in the features.list file ... everything
else has values, although they can be zero.

Best regards and thanks again for all the help,

Saso


2016-06-30 10:00 GMT+02:00 Hieu Hoang :

>
>
> Hieu Hoang
> http://www.hoang.co.uk/hieu
>
> On 30 June 2016 at 08:11, Sašo Kuntaric  wrote:
>
>> Hi Hieu,
>>
>> Thanks for the tip, unfortunately it didn't solve my problem. I tried
>> creating a very simple model with the command:
>>
>> ~/mosesdecoder/scripts/training/train-model.perl -root-dir test -corpus
>> ~/working/IT_corpus/TMX/txt/factored_corpus/singles/corpus.tagged.clean -f
>> en -e sl -lm
>> 0:3:$HOME/working/IT_corpus/TMX/txt/factored_corpus/language_model/
>> IT_corpus_surface.blm.sl -lm
>> 2:3:$HOME/working/IT_corpus/TMX/txt/factored_corpus/language_model/
>> IT_corpus_parts.blm.sl --translation-factors 0-0,2 -external-bin-dir
>> ~/mosesdecoder/tools --cores 32,
>>
>> however the results of the tuning are still the same ... all zeros after
>> the second run.
>>
>> Do I have to use a factored or unfactored corpus for tuning?
>>
>> There was one suggestion I found online, namely to add something like
>>
>> [output-factors]
>> 0
>> 1
>> 2
>>
>> to moses.ini. I tried it, but it made no difference. Should I explore it 
>> further?
>>
>> no, this will output all the other factors, as well as the surface form.
> I'm sure your reference only has the surface form
>
> Are you sure your phrase-table and language models contains data? And your
> tuning set contains data for the input and reference? There's plenty of
> space on your hard disk?
>
> I would suggest you look at the files the tuning process creates and debug
> it. It's likely to be a data problem.
>
>
>> If anyone has another suggestion, please let me know.
>>
>> Best regards,
>>
>> Sašo
>>
>>
>> 2016-06-29 15:44 GMT+02:00 Hieu Hoang :
>>
>>> I don't know the exact problem but your factored model looks too
>>> complicated so the tuning algorithm kinda just gives up.
>>> I would try a very simple model first, e.g.
>>>translate 0 -> 0,1,2,3
>>> or
>>>translate 0,1 -> 0,1,2,3
>>> Once you see that working correctly, add a generation model.
>>>
>>> You have to do this bit-by-bit and see what happens
>>>
>>>
>>> On 28/06/2016 20:44, Sašo Kuntaric wrote:
>>>
>>> Well, I installed Moses only a few months ago, so it should be the
>>> latest version.
>>>
>>> I find it really strange. I have tried everything - binarizing tables
>>> (which finishes with no problems), using the --no-filter-phrase-table
>>> parameter, adding language models for all the factors I have (this one gave
>>> me a segmentation fault) and I always get the same result. Tuning stops
>>> after two runs and all the weights get set to zero with the message
>>>
>>> (2) BEST at 2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 => 0 at Tue Jun 28
>>> 17:38:43 CEST 2016
>>> None of the weights changed more than 1e-05. Stopping.
>>>
>>> The translation models themselves are created with no issues. If I have
>>> one translation table, I can tune them with an unfactored corpus, but as
>>> soon as I use a factored one, everything goes south. If I have two
>>> translation tables, I cannot tune with an unfactored file, since it wa

Re: [Moses-support] Fwd: Binarization fails with the Segmentation Fault error

2016-06-30 Thread Sašo Kuntaric
Hi Hieu,

Thanks for the tip, unfortunately it didn't solve my problem. I tried
creating a very simple model with the command:

~/mosesdecoder/scripts/training/train-model.perl -root-dir test -corpus
~/working/IT_corpus/TMX/txt/factored_corpus/singles/corpus.tagged.clean -f
en -e sl -lm
0:3:$HOME/working/IT_corpus/TMX/txt/factored_corpus/language_model/
IT_corpus_surface.blm.sl -lm
2:3:$HOME/working/IT_corpus/TMX/txt/factored_corpus/language_model/
IT_corpus_parts.blm.sl --translation-factors 0-0,2 -external-bin-dir
~/mosesdecoder/tools --cores 32,

however the results of the tuning are still the same ... all zeros after
the second run.

Do I have to use a factored or unfactored corpus for tuning?

There was one suggestion I found online, namely to add something like

[output-factors]
0
1
2

to moses.ini. I tried it, but it made no difference. Should I explore
it further?

If anyone has another suggestion, please let me know.

Best regards,

Sašo


2016-06-29 15:44 GMT+02:00 Hieu Hoang :

> I don't know the exact problem but your factored model looks too
> complicated so the tuning algorithm kinda just gives up.
> I would try a very simple model first, e.g.
>translate 0 -> 0,1,2,3
> or
>translate 0,1 -> 0,1,2,3
> Once you see that working correctly, add a generation model.
>
> You have to do this bit-by-bit and see what happens
>
>
> On 28/06/2016 20:44, Sašo Kuntaric wrote:
>
> Well, I installed Moses only a few months ago, so it should be the latest
> version.
>
> I find it really strange. I have tried everything - binarizing tables
> (which finishes with no problems), using the --no-filter-phrase-table
> parameter, adding language models for all the factors I have (this one gave
> me a segmentation fault) and I always get the same result. Tuning stops
> after two runs and all the weights get set to zero with the message
>
> (2) BEST at 2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 => 0 at Tue Jun 28
> 17:38:43 CEST 2016
> None of the weights changed more than 1e-05. Stopping.
>
> The translation models themselves are created with no issues. If I have
> one translation table, I can tune them with an unfactored corpus, but as
> soon as I use a factored one, everything goes south. If I have two
> translation tables, I cannot tune with an unfactored file, since it wants
> the stated number of factors.
>
> I would really appreciate if someone has an idea what to do.
>
> Best regards,
>
> Saso
>
> 2016-06-27 14:45 GMT+02:00 Rajen Chatterjee 
> :
>
>> Hi, in the past I had a similar problem: the weights were going to 0
>> after one iteration of tuning. I do not know the cause, but as far as I
>> remember, when I used another version of Moses (I think Release-3.0) I
>> didn't have this problem.
>>
>> On Sun, Jun 26, 2016 at 1:40 PM, Sašo Kuntaric <
>> saso.kunta...@gmail.com> wrote:
>>
>>> Hi all again,
>>>
>>> A little more info, if someone has any ideas as I still haven't been
>>> able to figure it out.
>>>
>>> When I do tuning with models that only have one translation table, it
>>> works fine, but only with a non-factored tuning corpus. If I use a factored
>>> tuning corpus, Moses does one run and sets all weights to zero. If I have
>>> two translation tables, Moses doesn't do the tuning, as it is missing
>>> factors. If I use the factored corpus, I get a similar result as above:
>>> tuning stops after one run and sets all weights to zero. There was a
>>> similar error mentioned a few months back and the solution was to turn off
>>> MBR decoding, however I am not using it. I just use the command:
>>>
>>> ~/mosesdecoder/scripts/training/mert-moses.pl
>>> ~/working/IT_corpus/TMX/txt/tuning_corpus/tuning_corpus.tagged.en
>>> ~/working/IT_corpus/TMX/txt/tuning_corpus/tuning_corpus.tagged.sl
>>> ~/mosesdecoder/bin/moses
>>> ~/working/IT_corpus/TMX/txt/factored_corpus/complex/model/moses.ini
>>> --mertdir ~/mosesdecoder/bin/ --decoder-flags="-threads 32"
>>>
>>> Is there something I am missing? Do I have to add anything else for
>>> tuning a factored model?
>>>
>>> Any help will be greatly appreciated.
>>>
>>> Best regards,
>>>
>>> Saso
>>>
>>> -- Forwarded message --
>>> From: Sašo Kuntaric < saso.kunta...@gmail.com>
>>> Date: 2016-06-20 19:36 GMT+02:00
>>> Subject: Binarization fails with the Segmentation Fault error
>>> To: moses-support < moses-support@mit.edu>
>>>
>>>
>>> Hi all,
>>>
>> Me again (last time I hope). I have successfully trained and tuned my
>> factored model.

Re: [Moses-support] Fwd: Binarization fails with the Segmentation Fault error

2016-06-28 Thread Sašo Kuntaric
Well, I installed Moses only a few months ago, so it should be the latest
version.

I find it really strange. I have tried everything - binarizing tables
(which finishes with no problems), using the --no-filter-phrase-table
parameter, adding language models for all the factors I have (this one gave
me a segmentation fault) and I always get the same result. Tuning stops
after two runs and all the weights get set to zero with the message

(2) BEST at 2: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 => 0 at Tue Jun 28
17:38:43 CEST 2016
None of the weights changed more than 1e-05. Stopping.

The translation models themselves are created with no issues. If I have one
translation table, I can tune them with an unfactored corpus, but as soon
as I use a factored one, everything goes south. If I have two translation
tables, I cannot tune with an unfactored file, since it wants the stated
number of factors.

I would really appreciate if someone has an idea what to do.

Best regards,

Saso

2016-06-27 14:45 GMT+02:00 Rajen Chatterjee :

> Hi, in the past I had a similar problem: the weights were going to 0
> after one iteration of tuning. I do not know the cause, but as far as I
> remember, when I used another version of Moses (I think Release-3.0) I
> didn't have this problem.
>
> On Sun, Jun 26, 2016 at 1:40 PM, Sašo Kuntaric 
> wrote:
>
>> Hi all again,
>>
>> A little more info, if someone has any ideas as I still haven't been able
>> to figure it out.
>>
>> When I do tuning with models that only have one translation table, it
>> works fine, but only with a non-factored tuning corpus. If I use a factored
>> tuning corpus, Moses does one run and sets all weights to zero. If I have
>> two translation tables, Moses doesn't do the tuning, as it is missing
>> factors. If I use the factored corpus, I get a similar result as above:
>> tuning stops after one run and sets all weights to zero. There was a
>> similar error mentioned a few months back and the solution was to turn off
>> MBR decoding, however I am not using it. I just use the command:
>>
>> ~/mosesdecoder/scripts/training/mert-moses.pl
>> ~/working/IT_corpus/TMX/txt/tuning_corpus/tuning_corpus.tagged.en
>> ~/working/IT_corpus/TMX/txt/tuning_corpus/tuning_corpus.tagged.sl
>> ~/mosesdecoder/bin/moses
>> ~/working/IT_corpus/TMX/txt/factored_corpus/complex/model/moses.ini
>> --mertdir ~/mosesdecoder/bin/ --decoder-flags="-threads 32"
>>
>> Is there something I am missing? Do I have to add anything else for
>> tuning a factored model?
>>
>> Any help will be greatly appreciated.
>>
>> Best regards,
>>
>> Saso
>>
>> -- Forwarded message --
>> From: Sašo Kuntaric 
>> Date: 2016-06-20 19:36 GMT+02:00
>> Subject: Binarization fails with the Segmentation Fault error
>> To: moses-support 
>>
>>
>> Hi all,
>>
>> Me again (last time I hope). I have successfully trained and tuned my
>> factored model. Here are both moses.ini files:
>>
>> #
>> ### MOSES CONFIG FILE ###
>> #
>>
>> # input factors
>> [input-factors]
>> 0
>> 1
>>
>> # mapping steps
>> [mapping]
>> 0 T 0
>> 0 G 0
>> 0 T 1
>>
>> [distortion-limit]
>> 6
>>
>> # feature functions
>> [feature]
>> UnknownWordPenalty
>> WordPenalty
>> PhrasePenalty
>> PhraseDictionaryMemory name=TranslationModel0 num-features=4
>> path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/morphgen/model/phrase-table.0-1.gz
>> input-factor=0 output-factor=1
>> PhraseDictionaryMemory name=TranslationModel1 num-features=4
>> path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/morphgen/model/phrase-table.1-2.gz
>> input-factor=1 output-factor=2
>> Generation name=GenerationModel0 num-features=2
>> path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/morphgen/model/generation.1-0,3.gz
>> input-factor=1 output-factor=0,3
>> Distortion
>> KENLM name=LM0 factor=0
>> path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/language_model/
>> IT_corpus_surface.blm.sl order=3
>> KENLM name=LM1 factor=2
>> path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/language_model/
>> IT_corpus_parts.blm.sl order=3
>>
>> # dense weights for feature functions
>> [weight]
>> # The default weights are NOT optimized for translation quality. You MUST
>> tune the weights.
>> # Documentation for tuning is here:
>> http://www.statmt.org/moses/?n=FactoredTraining.Tuning
>> UnknownWordPenalty0= 1
WordPenalty0= -1

[Moses-support] Fwd: Binarization fails with the Segmentation Fault error

2016-06-26 Thread Sašo Kuntaric
Hi all again,

A little more info, if someone has any ideas as I still haven't been able
to figure it out.

When I do tuning with models that only have one translation table, it works
fine, but only with a non-factored tuning corpus. If I use a factored tuning
corpus, Moses does one run and sets all weights to zero. If I have two
translation tables, Moses doesn't do the tuning, as it is missing factors.
If I use the factored corpus, I get a similar result as above: tuning stops
after one run and sets all weights to zero. There was a similar error
mentioned a few months back and the solution was to turn off MBR decoding,
however I am not using it. I just use the command:

~/mosesdecoder/scripts/training/mert-moses.pl
~/working/IT_corpus/TMX/txt/tuning_corpus/tuning_corpus.tagged.en
~/working/IT_corpus/TMX/txt/tuning_corpus/tuning_corpus.tagged.sl
~/mosesdecoder/bin/moses
~/working/IT_corpus/TMX/txt/factored_corpus/complex/model/moses.ini
--mertdir ~/mosesdecoder/bin/ --decoder-flags="-threads 32"

Is there something I am missing? Do I have to add anything else for tuning
a factored model?

Any help will be greatly appreciated.

Best regards,

Saso

-- Forwarded message --
From: Sašo Kuntaric 
Date: 2016-06-20 19:36 GMT+02:00
Subject: Binarization fails with the Segmentation Fault error
To: moses-support 


Hi all,

Me again (last time I hope). I have successfully trained and tuned my
factored model. Here are both moses.ini files:

#
### MOSES CONFIG FILE ###
#

# input factors
[input-factors]
0
1

# mapping steps
[mapping]
0 T 0
0 G 0
0 T 1

[distortion-limit]
6

# feature functions
[feature]
UnknownWordPenalty
WordPenalty
PhrasePenalty
PhraseDictionaryMemory name=TranslationModel0 num-features=4
path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/morphgen/model/phrase-table.0-1.gz
input-factor=0 output-factor=1
PhraseDictionaryMemory name=TranslationModel1 num-features=4
path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/morphgen/model/phrase-table.1-2.gz
input-factor=1 output-factor=2
Generation name=GenerationModel0 num-features=2
path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/morphgen/model/generation.1-0,3.gz
input-factor=1 output-factor=0,3
Distortion
KENLM name=LM0 factor=0
path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/language_model/
IT_corpus_surface.blm.sl order=3
KENLM name=LM1 factor=2
path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/language_model/
IT_corpus_parts.blm.sl order=3

# dense weights for feature functions
[weight]
# The default weights are NOT optimized for translation quality. You MUST
tune the weights.
# Documentation for tuning is here:
http://www.statmt.org/moses/?n=FactoredTraining.Tuning
UnknownWordPenalty0= 1
WordPenalty0= -1
PhrasePenalty0= 0.2
TranslationModel0= 0.2 0.2 0.2 0.2
TranslationModel1= 0.2 0.2 0.2 0.2
GenerationModel0= 0.3 0
Distortion0= 0.3
LM0= 0.5
LM1= 0.5

# MERT optimized configuration
# decoder /home/ksaso/mosesdecoder/bin/moses
# BLEU 0 on dev
/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/tuning/tuning-corpus.tagged.en
# We were before running iteration 2
# finished Mon Jun 20 16:19:08 CEST 2016
### MOSES CONFIG FILE ###
#

# input factors
[input-factors]
0
1

# mapping steps
[mapping]
0 T 0
0 G 0
0 T 1

[distortion-limit]
6

# feature functions
[feature]
UnknownWordPenalty
WordPenalty
PhrasePenalty
PhraseDictionaryMemory name=TranslationModel0 num-features=4
path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/morphgen/model/phrase-table.0-1.gz
input-factor=0 output-factor=1
PhraseDictionaryMemory name=TranslationModel1 num-features=4
path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/morphgen/model/phrase-table.1-2.gz
input-factor=1 output-factor=2
Generation name=GenerationModel0 num-features=2
path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/morphgen/model/generation.1-0,3.gz
input-factor=1 output-factor=0,3
Distortion
KENLM name=LM0 factor=0
path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/language_model/
IT_corpus_surface.blm.sl order=3
KENLM name=LM1 factor=2
path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/language_model/
IT_corpus_parts.blm.sl order=3

# dense weights for feature functions

[threads]
16
[weight]

Distortion0= 0
LM0= 0
LM1= 0
WordPenalty0= 0
PhrasePenalty0= 0
TranslationModel0= 0 0 0 0
TranslationModel1= 0 0 0 0
GenerationModel0= 0 0
UnknownWordPenalty0= 1

First of all, is it strange that I get all zeroes after tuning?

My problem is that the translation with this model is spectacularly slow (a
few days to translate a couple of thousand words with a 2.4-million-line
corpus), so naturally I tried to binarize my phrase tables with the commands

~/mosesdecoder/bin/processPhraseTableMin -in
~/working/IT_corpus/TMX/txt/factored_corpus/morphgen/model/phrase-table.0-1.gz
-nscores 4 -out ~/working/binarised_model/phrase-table.0-1 and
~/mosesdecoder/bin/processPhraseTableMin -in
~/working/IT_corpus/TMX/txt/factored_corpus/morphgen/model/phrase-table.1-2.gz
-nscores 4 -out ~/working/binarised_model/phrase-table.1-2

[Moses-support] Binarization fails with the Segmentation Fault error

2016-06-20 Thread Sašo Kuntaric
Hi all,

Me again (last time I hope). I have successfully trained and tuned my
factored model. Here are both moses.ini files:

#
### MOSES CONFIG FILE ###
#

# input factors
[input-factors]
0
1

# mapping steps
[mapping]
0 T 0
0 G 0
0 T 1

[distortion-limit]
6

# feature functions
[feature]
UnknownWordPenalty
WordPenalty
PhrasePenalty
PhraseDictionaryMemory name=TranslationModel0 num-features=4
path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/morphgen/model/phrase-table.0-1.gz
input-factor=0 output-factor=1
PhraseDictionaryMemory name=TranslationModel1 num-features=4
path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/morphgen/model/phrase-table.1-2.gz
input-factor=1 output-factor=2
Generation name=GenerationModel0 num-features=2
path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/morphgen/model/generation.1-0,3.gz
input-factor=1 output-factor=0,3
Distortion
KENLM name=LM0 factor=0
path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/language_model/
IT_corpus_surface.blm.sl order=3
KENLM name=LM1 factor=2
path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/language_model/
IT_corpus_parts.blm.sl order=3

# dense weights for feature functions
[weight]
# The default weights are NOT optimized for translation quality. You MUST
tune the weights.
# Documentation for tuning is here:
http://www.statmt.org/moses/?n=FactoredTraining.Tuning
UnknownWordPenalty0= 1
WordPenalty0= -1
PhrasePenalty0= 0.2
TranslationModel0= 0.2 0.2 0.2 0.2
TranslationModel1= 0.2 0.2 0.2 0.2
GenerationModel0= 0.3 0
Distortion0= 0.3
LM0= 0.5
LM1= 0.5

# MERT optimized configuration
# decoder /home/ksaso/mosesdecoder/bin/moses
# BLEU 0 on dev
/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/tuning/tuning-corpus.tagged.en
# We were before running iteration 2
# finished Mon Jun 20 16:19:08 CEST 2016
### MOSES CONFIG FILE ###
#

# input factors
[input-factors]
0
1

# mapping steps
[mapping]
0 T 0
0 G 0
0 T 1

[distortion-limit]
6

# feature functions
[feature]
UnknownWordPenalty
WordPenalty
PhrasePenalty
PhraseDictionaryMemory name=TranslationModel0 num-features=4
path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/morphgen/model/phrase-table.0-1.gz
input-factor=0 output-factor=1
PhraseDictionaryMemory name=TranslationModel1 num-features=4
path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/morphgen/model/phrase-table.1-2.gz
input-factor=1 output-factor=2
Generation name=GenerationModel0 num-features=2
path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/morphgen/model/generation.1-0,3.gz
input-factor=1 output-factor=0,3
Distortion
KENLM name=LM0 factor=0
path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/language_model/
IT_corpus_surface.blm.sl order=3
KENLM name=LM1 factor=2
path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/language_model/
IT_corpus_parts.blm.sl order=3

# dense weights for feature functions

[threads]
16
[weight]

Distortion0= 0
LM0= 0
LM1= 0
WordPenalty0= 0
PhrasePenalty0= 0
TranslationModel0= 0 0 0 0
TranslationModel1= 0 0 0 0
GenerationModel0= 0 0
UnknownWordPenalty0= 1

First of all, is it strange that I get all zeroes after tuning?

My problem is that the translation with this model is spectacularly slow (a
few days to translate a couple of thousand words with a 2.4-million-line
corpus), so naturally I tried to binarize my phrase tables with the commands

~/mosesdecoder/bin/processPhraseTableMin -in
~/working/IT_corpus/TMX/txt/factored_corpus/morphgen/model/phrase-table.0-1.gz
-nscores 4 -out ~/working/binarised_model/phrase-table.0-1 and
~/mosesdecoder/bin/processPhraseTableMin -in
~/working/IT_corpus/TMX/txt/factored_corpus/morphgen/model/phrase-table.1-2.gz
-nscores 4 -out ~/working/binarised_model/phrase-table.1-2

The process itself finishes without errors and I can run the translation
with the command

~/mosesdecoder/bin/moses -f
/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/morphgen/binarised_model/moses.ini

But when I try to enter my text, I get the following:

 Translating: use|NN of|IN light|JJ
Line 1: Initialize search took 0.000 seconds total
Segmentation fault (core dumped)

When I try to filter my model, I get the same error. Any ideas what could
be causing this?
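When a crash gives no more information than this, one way to narrow it down (a sketch, assuming gdb is available and moses was built with debug symbols; input.tagged stands for your factored input file) is to run the decoder under a debugger and capture a backtrace:

```shell
# Run the decoder under gdb; when it segfaults, 'bt' shows where.
gdb --args ~/mosesdecoder/bin/moses -f \
  ~/working/IT_corpus/TMX/txt/factored_corpus/morphgen/binarised_model/moses.ini
# then inside gdb:
#   (gdb) run < input.tagged
#   (gdb) bt
```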

My final moses.ini file looks like this:

# MERT optimized configuration
# decoder /home/ksaso/mosesdecoder/bin/moses
# BLEU 0 on dev
/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/tuning/tuning-corpus.tagged.en
# We were before running iteration 2
# finished Mon Jun 20 16:19:08 CEST 2016
### MOSES CONFIG FILE ###
#

# input factors
[input-factors]
0
1

# mapping steps
[mapping]
0 T 0
0 G 0
0 T 1

[distortion-limit]
6

# feature functions
[feature]
UnknownWordPenalty
WordPenalty
PhrasePenalty
PhraseDictionaryCompact name=TranslationModel0 num-features=4
path=/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/morphgen/binarised_model/phrase-table.0-1.minphr
input-factor=0 output-factor=1

Re: [Moses-support] Moses "died with error 11" error in factored training

2016-06-13 Thread Sašo Kuntaric
Hi Hieu,

let me try to explain. The mxpost program tags text in such a way that it
separates the factors with underscores, for example: We_PRP collect_VBP
information_NN ,_, with_IN a_DT view_NN to_TO improve_VBG our_PRP$
website_NN and_CC provide_VBG users_NNS with_IN better_JJR experience_NN
._. Moses, however, only accepts text where the factors are separated by
the pipe symbol, for example: We|PRP collect|VBP information|NN ,|, with|IN
a|DT view|NN to|TO improve|VBG our|PRP$ website|NN and|CC provide|VBG
users|NNS with|IN better|JJR experience|NN .|.

My question is: can mxpost be told, via some parameter, to produce the
second output directly? I realize it's only a simple substitution, but one
has to be careful or errors like the one stated above occur, and it is an
extra step.
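For what it's worth, the substitution itself can be made reasonably safe by only touching the last underscore of each token, e.g. (a sketch; assumes whitespace-tokenized mxpost output):

```shell
# Convert mxpost output (word_TAG) to Moses factored format (word|TAG),
# replacing only the LAST underscore of each token so that words which
# themselves contain underscores are left intact.
echo 'We_PRP collect_VBP information_NN ._.' \
  | sed -E 's/_([^_ ]+) /|\1 /g; s/_([^_ ]+)$/|\1/'
# → We|PRP collect|VBP information|NN .|.
```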

The second part of the question: can mxpost tag text with additional
factors, like lemmas, so that instead of surface|POS my text would be in
the format surface|POS|lemma?

And two more general questions. After doing the factored training, should I
tune the model, or is that not necessary in factored training?

In the factored training tutorial there is the command train-model.perl
--root-dir pos --corpus factored-corpus/proj-syndicate.1000 --f de --e en
--lm 0:3:factored-corpus/surface.lm --lm 2:3:factored-corpus/pos.lm
--translation-factors 0-0,2 --external-bin-dir .../tools. What is the first
parameter in the LM specification, namely the 2 in --lm
2:3:factored-corpus/pos.lm? The 3 stands for a 3-gram model, but I am not
sure about the first parameter.

Sorry for the long e-mail.

Best regards,

Sašo

2016-06-13 12:12 GMT+02:00 Hieu Hoang :

>
>
> Hieu Hoang
> http://www.hoang.co.uk/hieu
>
> On 13 June 2016 at 07:51, Sašo Kuntaric  wrote:
>
>> Thanks for the tip, however in my case the problem was that after tagging
>> the files with mxpost and post-processing I had some standalone |PRP tags
>> in the source file.
>>
> that suggest the corpus file has not been cleaned. eg. there may be
> multiple white spaces '   '
>
>
>> Once I removed those, training resumed.
>>
>> Which leads me to another question. Since mxpost was used for the Moses
>> tutorial, I was wondering how did you create the input files for Moses
>> after tagging? Was there any post-processing done or can mxpost use the
>> pipes (|) instead of underlines? And one more thing, how can lemmas be
>> added, was a custom tagger project made or is there a parameter which tells
>> mxpost to do it?
>>
> not sure what you mean
>
>>
>> Best regards,
>>
>> Sašo
>>
>> 2016-06-12 21:08 GMT+02:00 Hieu Hoang :
>>
>>> judging by the source code in mgiza's getSentence.cpp line 366,
>>>
>>>cerr << "ERROR: Forbidden zero sentence length " <<
>>> sent.sentenceNo << endl;
>>> the 0 in your output is the line number.
>>>
>>> It may be that your corpora was produced on windows and has a BOM at the
>>> beginning
>>>
>>>
>>> On 12/06/2016 10:40, Sašo Kuntaric wrote:
>>>
>>>> Forbidden zero sentence
>>>>
>>>
>>>
>>
>>
>> --
>> lp,
>>
>> Sašo
>>
>
>


-- 
lp,

Sašo
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] POS-tag Language Model

2016-06-13 Thread Sašo Kuntaric
Hi Dai,

This is what you do. When you tag your files in the tagger of your choice,
a random sentence in your file will look something like this:

please|VB Sign|NNP In|IN or|CC try|VB another|DT email|NN address|NN
explore|VB your|PRP list|NN of|IN registered|JJ products|NNS and|CC
register|VB your|PRP$ new|JJ purchases|NNS .|.

You need to write a script or manually process the file in such a way that
you only keep the POS tags. The above sentence will then look like this:

VB NNP IN CC VB DT NN NN VB PRP NN IN JJ NNS CC VB PRP JJ NNS .

Out of this file you now create a language model just as you would any
other.
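Such a script can be a one-liner in Python (a hypothetical sketch; it assumes well-formed tokens, and with multiple factors it keeps whatever comes after the last pipe):

```python
def pos_only(line):
    # keep only the POS tag (the part after the last pipe) of each token
    return " ".join(tok.rsplit("|", 1)[-1] for tok in line.split())
```

Running this over the tagged corpus produces the tag-only file that the standard LM training tools can then consume directly.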

Best regards,

Sašo


2016-06-13 15:59 GMT+02:00 dai xin :

> Hi,
>
> I am trying to do factored training namely with POS tags. I followed
> instructions of 'Train a model with POS tags' on
> http://www.statmt.org/moses/?n=Moses.FactoredTutorial#ntoc2 .
>
> The data used in the instructions contains a pos.lm, which is the language
> model of POS tags. Does anyone know how to generate the POS-tag language
> model?
>
> Hoping someone has some ideas about that. Thanks in advance.
>
> Best regards
>
> Xin
>
> ___
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>


-- 
lp,

Sašo


Re: [Moses-support] Moses "died with error 11" error in factored training

2016-06-12 Thread Sašo Kuntaric
Thanks for the tip; however, in my case the problem was that after tagging
the files with mxpost and post-processing, I had some standalone |PRP tags
in the source file. Once I removed those, training resumed.

Which leads me to another question. Since mxpost was used for the Moses
tutorial, I was wondering how you created the input files for Moses
after tagging. Was there any post-processing done, or can mxpost use
pipes (|) instead of underscores? And one more thing: how can lemmas be
added? Was a custom tagger project made, or is there a parameter that tells
mxpost to do it?

Best regards,

Sašo

2016-06-12 21:08 GMT+02:00 Hieu Hoang :

> judging by the source code in mgiza's getSentence.cpp line 366,
>
>cerr << "ERROR: Forbidden zero sentence length " << sent.sentenceNo
> << endl;
> the 0 in your output is the line number.
>
> It may be that your corpora was produced on windows and has a BOM at the
> beginning
>
>
> On 12/06/2016 10:40, Sašo Kuntaric wrote:
>
>> Forbidden zero sentence
>>
>
>


-- 
lp,

Sašo


[Moses-support] Moses "died with error 11" error in factored training

2016-06-12 Thread Sašo Kuntaric
Hi all,

I am trying to perform factored training on a corpus that I have prepared.
While doing it on a small subset, everything works fine. However, while
doing it on the whole corpus, I get the following error.

ERROR: Execution of: /home/ksaso/mosesdecoder/tools/GIZA++
-CoocurrenceFile
/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/morphgen/giza.en-sl/en-sl.cooc
-c
/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/morphgen/corpus/en-sl-int-train.snt
-m1 5 -m2 0 -m3 3 -m4 3 -model1dumpfrequency 1 -model4smoothfactor 0.4
-nodumps 1 -nsmooth 4 -o
/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/morphgen/giza.en-sl/en-sl
-onlyaldumps 1 -p0 0.999 -s
/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/morphgen/corpus/sl.vcb
-t
/home/ksaso/working/IT_corpus/TMX/txt/factored_corpus/morphgen/corpus/en.vcb
  died with signal 11, with coredump

I have observed the training: Moses creates the vcb files fine and
completes a few iterations of the Model 1 training. However, when it starts
the HMM training ("Hmm Training Started at: Sun Jun 12 11:30:47 2016"), it
crashes with the above error. I do get a few "ERROR: Forbidden zero
sentence length 0" errors, so I take it the zero-length sentences are the
issue. I tried the Moses cleaning script, but it doesn't fix the error. I
tried checking manually, but can't find any empty lines.

Is there a way for Moses to tell me the line numbers that are problematic
during training, or does the problem lie somewhere else?
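One way to hunt for the offending lines yourself is a small check (a hypothetical sketch, not a Moses tool) that reports lines which are empty once whitespace and a stray UTF-8 BOM are stripped, the usual causes of "Forbidden zero sentence length":

```python
def find_empty_lines(path):
    """Return 1-based numbers of lines that are effectively empty.

    A line counts as empty if, after removing any UTF-8 BOM bytes and
    surrounding whitespace, nothing remains -- GIZA++ chokes on such lines.
    """
    bad = []
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            if not raw.replace(b"\xef\xbb\xbf", b"").strip():
                bad.append(lineno)
    return bad
```

Running it on both sides of the parallel corpus and comparing the reported line numbers also reveals alignment drift between the two files.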

Best regards,

Saso


Re: [Moses-support] "Feature name SRILM is not registered."

2016-05-30 Thread Sašo Kuntaric
Hi Anna,

I wouldn't change it manually, especially since you have multiple LMs in
factored training and you can get the "old and new format don't mix" error.
Just use the command that Kenneth has now changed on the website and you'll
be fine.

p.s. No problem Kenneth, we don't mind giving you extra work :-).

Best regards,

Sašo

2016-05-29 23:23 GMT+02:00 Kenneth Heafield :

> It's KENLM, not KenLM according to Hieu, who did name it after all.
>
> Kenneth
>
> On 05/29/2016 10:19 PM, Anna Garbar wrote:
> > Hi Sašo,
> >
> > Thanks for your reply. Before recompiling moses with srilm, I also tried
> > to change SRILM to KenLM in the moses.ini (under feature functions),
> > but received "Feature name KenLM is not registered." Or did I have to
> > make any other changes?
> >
> > Best,
> > Anna
> >
> > 2016-05-29 23:10 GMT+02:00 Sašo Kuntaric  > <mailto:saso.kunta...@gmail.com>>:
> >
> > Hi Anna,
> >
> > You are probably using KenLM as it's the default language model
> > making tool. The factored tutorial however has the parameter for
> > using SRILM. In the "lm 0:3:factored-corpus/surface.lm:0" part of
> > the command, leave the last zero parameter ":0" out. Moses will then
> > use the default KenLM tool.
> >
> > To the website authors: maybe it would be a good thing to mention
> > this in the tutorial as I had the same issue.
> >
> > Best regards,
> >
> > Sašo
> >
> > 2016-05-29 22:48 GMT+02:00 Rajen Chatterjee
> > mailto:rajen.k.chatter...@gmail.com
> >>:
> >
> > Hi,
> >
> > Since you are using SRILM have you compiled Moses with this flag
> > "--with-srilm=/path/to/srilm" (look here for various compilation
> > options http://www.statmt.org/moses/?n=Development.GetStarted).
> >
> > On Sun, May 29, 2016 at 9:48 PM, Anna Garbar
> > mailto:anna.gar...@gmail.com>> wrote:
> >
> > Dear Moses Team,
> >
> > I am working through the factored tutorial
> > (http://www.statmt.org/moses/?n=Moses.FactoredTutorial) and
> > when trying out the pos model on a sample sentence ('putin
> > beschreibt menschen .'), I receive
> >
> > Exception: moses\FF\Factory.cpp:321 in void
> > Moses::FeatureRegistry::Construct(const string&, const
> > string&) threw UnknownFeatureException because `i ==
> > registry_.end()'.
> > Feature name SRILM is not registered.
> >
> > The last command was:
> >
> > /home/AG/mosesdecoder/bin/moses -f pos/model/moses.ini < in
> >
> > Where should I look for the mistake?
> >
> > Thanks in advance,
> > Anna
> >
> > ___
> > Moses-support mailing list
> > Moses-support@mit.edu <mailto:Moses-support@mit.edu>
> > http://mailman.mit.edu/mailman/listinfo/moses-support
> >
> >
> >
> >
> > --
> > -Regards,
> >  Rajen Chatterjee.
> >
> > ___
> > Moses-support mailing list
> > Moses-support@mit.edu <mailto:Moses-support@mit.edu>
> > http://mailman.mit.edu/mailman/listinfo/moses-support
> >
> >
> >
> >
> > --
> > lp,
> >
> > Sašo
> >
> >
> >
> >
> > ___
> > Moses-support mailing list
> > Moses-support@mit.edu
> > http://mailman.mit.edu/mailman/listinfo/moses-support
> >
> ___
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>



-- 
lp,

Sašo


Re: [Moses-support] "Feature name SRILM is not registered."

2016-05-29 Thread Sašo Kuntaric
Hi Anna,

You are probably using KenLM, as it is the default language-model tool. The
factored tutorial, however, has the parameter for using SRILM. In the
"lm 0:3:factored-corpus/surface.lm:0" part of the command, leave the last
parameter ":0" out. Moses will then use the default KenLM tool.

To the website authors: maybe it would be a good thing to mention this in
the tutorial as I had the same issue.

Best regards,

Sašo

2016-05-29 22:48 GMT+02:00 Rajen Chatterjee :

> Hi,
>
> Since you are using SRILM have you compiled Moses with this flag "
> --with-srilm=/path/to/srilm" (look here for various compilation options
> http://www.statmt.org/moses/?n=Development.GetStarted).
>
> On Sun, May 29, 2016 at 9:48 PM, Anna Garbar 
> wrote:
>
>> Dear Moses Team,
>>
>> I am working through the factored tutorial (
>> http://www.statmt.org/moses/?n=Moses.FactoredTutorial) and when trying
>> out the pos model on a sample sentence ('putin beschreibt menschen .'),
>> I receive
>>
>> Exception: moses\FF\Factory.cpp:321 in void
>> Moses::FeatureRegistry::Construct(const string&, const string&) threw
>> UnknownFeatureException because `i == registry_.end()'.
>> Feature name SRILM is not registered.
>>
>> The last command was:
>>
>> /home/AG/mosesdecoder/bin/moses -f pos/model/moses.ini < in
>>
>> Where should I look for the mistake?
>>
>> Thanks in advance,
>> Anna
>>
>> ___
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>
>
>
> --
> -Regards,
>  Rajen Chatterjee.
>
> ___
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>


-- 
lp,

Sašo


[Moses-support] Moses hangs without an error message while training a factored model

2016-05-15 Thread Sašo Kuntaric
Hi all,

I am trying to train a factored model, but Moses hangs while performing the
step "(1.0.5) reducing factors to produce
/home/ksaso/Obeliks/test/prepared/tm_tagged/model/aligned.0,1.sl".

I have prepared the English corpus with the mxpost tool and manually
replaced the underscores with the pipe sign. It looks something like this:

streaming|VBG music|NN at|IN the|DT touch|NN of|IN a|DT button|NN
introducing|VBG SoundTouch|NNP ™|.
how|WRB it|PRP works|VBZ
SoundTouch|NNP ™|NNP Wi-Fi|NNP ®|NNP music|NN systems|NNS are|VBP much|RB
more|JJR than|IN just|RB speakers|NNS because|IN they|PRP connect|VBP
directly|RB to|TO the|DT Internet|NN over|IN your|PRP$ Wi-Fi|JJ network|NN
.|.
it|PRP makes|VBZ listening|VBG to|TO your|PRP$ favorite|JJ music|NN
easier|JJR .|.
all|DT around|IN your|PRP$ home|NN .|.
wirelessly|RB .|.
SoundTouch|NNP ™|NNP Wi-Fi|NNP ®|NNP music|NN systems|NNS are|VBP much|RB
more|JJR than|IN just|RB speakers|NNS because|IN they|PRP connect|VBP
directly|RB to|TO the|DT Internet|NN over|IN your|PRP$ Wi-Fi|JJ network|NN
so|IN you|PRP can|MD stream|NN Internet|NNP radio|NN and|CC your|PRP$
music|NN library|NN without|IN having|VBG go|NN to|TO your|PRP$ computer|NN
or|CC open|VB an|DT app|NN .|.
it|PRP makes|VBZ listening|VBG to|TO your|PRP$ favorite|JJ music|NN
quicker|NN and|CC easier|JJR .|.

The Slovenian side was prepared with a specialized tool and is converted
from the XML format. It looks like this:

pretakanje|pretakanje|S|Soset glasbe|glasba|S|Sozer z|z|D|Do
dotikom|dotik|S|Someo gumba|gumb|S|Somer
predstavljamo|predstavljati|G|Ggnspm vam|ti|Z|Zod-md
SoundTouch|Soundtouch|S|Slmei
delovanje|delovanje|S|Soset
glasbeni|glasben|P|Ppnmmi sistemi|sistem|S|Sommi
SoundTouch|Soundtouch|S|Slmei Wi|Wi|S|Slmei -|-|- Fi|fi|S|Somei
so|biti|G|Gp-stm-n veliko|veliko|R|Rsn več|več|R|Rsr kot|kot|V|Vd
samo|samo|L|L zvočniki|zvočnik|S|Sommi  ,|,|,  saj|saj|V|Vp
se|se|Z|Zp--k povežejo|povezati|G|Ggdstm neposredno|neposredno|R|Rsn
z|z|D|Do internetom|internet|S|Someo prek|prek|D|Dr omrežja|omrežje|S|Soser
Wi|Wi|S|Slmei -|-|- Fi|fi|S|Somei  .|.|.
poslušanje|poslušanje|S|Sosei priljubljene|priljubljen|P|Ppnzer
glasbe|glasba|S|Sozer je|biti|G|Gp-ste-n tako|tako|R|Rsn
enostavnejše|enostaven|P|Pppsei  .|.|.
povsod|povsod|R|Rsn v|v|D|Dm vašem|vaš|Z|Zsdmemm domu|dom|S|Somem  .|.|.
brezžično|brezžičen|P|Ppnsei  .|.|.
SoundTouch|Soundtouch|S|Slmei Wi|Wi|S|Slmei -|-|- Fi|fi|S|Somei
so|biti|G|Gp-stm-n veliko|veliko|R|Rsn več|več|R|Rsr kot|kot|V|Vd
samo|samo|L|L zvočniki|zvočnik|S|Sommi  ,|,|,  saj|saj|V|Vp
se|se|Z|Zp--k povežejo|povezati|G|Ggdstm neposredno|neposredno|R|Rsn
z|z|D|Do internetom|internet|S|Someo prek|prek|D|Dr omrežja|omrežje|S|Soser
Wi|Wi|S|Slmei -|-|- Fi|fi|S|Somei  ,|,|,  tako|tako|V|Vp da|da|V|Vd
lahko|lahko|R|Rsn internetni|interneten|P|Ppnmeid radio|radio|S|Sometn
in|in|V|Vp glasbeno|glasben|P|Ppnzet knjižnico|knjižnica|S|Sozet
pretakate|pretakati|G|Ggnsdm  ,|,|,  ne|ne|L|L da|da|V|Vd bi|biti|G|Gp-g
pristopili|pristopiti|G|Ggdd-mm k|k|D|Dd računalniku|računalnik|S|Somed
ali|ali|V|Vp odprli|odpreti|G|Ggdd-mm aplikacijo|aplikacija|S|Sozet  .|.|.
poslušanje|poslušanje|S|Sosei priljubljene|priljubljen|P|Ppnzer
glasbe|glasba|S|Sozer je|biti|G|Gp-ste-n hitrejše|hitro|R|Rsr in|in|V|Vp
enostavneje|enostavno|R|Rsr  .|.|.

I ran the following command:

~/mosesdecoder/scripts/training/train-model.perl --root-dir tm_tagged
--corpus ~/Obeliks/test/prepared/bose_tagged --f en --e sl --lm
4:3:/home/ksaso/Obeliks/test/prepared/lm/bose_tagged.blm.sl:0
--translation-factors 0-0,1 --external-bin-dir ~/mosesdecoder/tools --cores
16 &>training.out &

Both corpora have the same number of lines and the Slovenian language model
was created successfully. It looks something like this:

-3.480905    pretakanje|pretakanje|S|Soset    -0.16498125
-2.820011    glasbe|glasba|S|Sozer    -0.25699353
-2.1714172    z|z|D|Do    -0.24650323
-3.9205537    dotikom|dotik|S|Someo    -0.09213096
-3.63479    gumba|gumb|S|Somer    -0.11902898
-3.9205537    predstavljamo|predstavljati|G|Ggnspm    -0.09213096
-3.3675106    vam|ti|Z|Zod-md    -0.12703034

I am attaching the whole training.out file in case it helps. Like I said,
there is no error message; Moses just hangs. I am assuming --lm 4:3:filename
is OK, since I have four factors? Are these parameters described in more
detail somewhere? I get the same result with different parameter values.
Does anyone have an idea what I am doing wrong?
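When a factored run fails or hangs silently, it can help to first verify that every token on every line really carries the expected number of pipe-separated factors. A minimal check (a hypothetical helper, not part of Moses) might look like:

```python
def check_factor_counts(path, expected=4):
    """Return (line_number, token) pairs whose factor count differs from `expected`.

    A token with `expected` factors has exactly `expected - 1` pipes;
    anything else (e.g. a bare "x|y" in a 4-factor corpus) will trip up
    factored training or decoding with "too few factors" errors.
    """
    bad = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            for tok in line.split():
                if tok.count("|") != expected - 1:
                    bad.append((lineno, tok))
    return bad
```

Note that a token like `-|-|-` carries only three factors, so in a four-factor corpus this check would flag it; that is precisely the kind of token behind the "Too few factors in string" exception discussed in the tuning thread.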

Thank you in advance and best regards,

Saso




Re: [Moses-support] Data for building a factored model

2016-05-06 Thread Sašo Kuntaric
Hi all,

Thank you, Philipp, for all the useful info; I will take a closer look at
the mentioned scripts.

I do have one follow-up question. Like I said, I really enjoyed working
with the factored corpora in the example. How were those created? Is there
a tool I can use to create similar ones?

Best regards,

Sašo

2016-05-06 0:08 GMT+02:00 Philipp Koehn :

> Hi,
>
> life is easier with factored models, if you use the experiment.perl set-up,
> where you just have to specify the factor set-up and scripts that generate
> factors.
>
> These scripts take the tokenized text and replace each word with a factor
> (e.g., replace each word with the POS tag).
>
> The POS LM is trained on such a corpus - each word is replaced by a
> POS tag, and then the standard LM training process is run over it.
>
> See $MOSES/scripts/ems/example/config.factored for an example.
>
> -phi
>
> On Wed, May 4, 2016 at 3:30 PM, Sašo Kuntaric 
> wrote:
> > Hello again,
> >
> > I believe I can wrap my head around the theoretical part, but the English
> > and German corpora in the Moses factored model tutorial
> > (http://www.statmt.org/moses/?n=Moses.FactoredTutorial) look beautifully
> > factored, so my question is how were the original corpora processed? Was
> a
> > specific tagger used and was there any manual/script postprocessing done?
> >
> > And since I am already bugging everyone, how is the language model pos.lm
> > created? Is it extracted from a file, created manually or in another way?
> >
> > Thank you in advance for all the replies.
> >
> > Best regards,
> >
> > Sašo
> >
> > 2016-05-02 19:45 GMT+02:00 Marwa Refaie :
> >>
> >> Corpus for translation model should be on 2 parallel files in the format
> >> Word | pos | Lema  For example , by a file for each language. You
> can
> >> prepare files using word net , Stanford , or any tagger & stemmer  as
> can
> >> deal with your language pairs. May be before enter the files to moses
> you
> >> should adjust the text files by a python script (write it your self)
> >>
> >> For language model ... You must build it as follows
> >> Verb noun noun
> >> Noun Det adj
> >> ... Depending on the target language only ,, Then build it as usual
> >> n-gram lm.
> >>
> >> Sent from my iPad
> >>
> >> > On May 2, 2016, at 10:11, Sašo Kuntaric 
> wrote:
> >> >
> >> > Hi all,
> >> >
> >> > I am having some issues producing the corpora in the correct format
> for
> >> > Moses to execute factored training.
> >> >
> >> > I am looking at the factored tutorial on the Moses website and I am
> >> > wondering, how to get such consistent corpora for two languages. What
> tools
> >> > are being used and can they be trained for specific languages
> (Slovenian in
> >> > my example). Are such tools available for download or is such data
> produced
> >> > with custom scripts?
> >> >
> >> > --
> >> > Best regards,
> >> >
> >> > Sašo
> >> > ___
> >> > Moses-support mailing list
> >> > Moses-support@mit.edu
> >> > http://mailman.mit.edu/mailman/listinfo/moses-support
> >
> >
> >
> >
> > --
> > lp,
> >
> > Sašo
> >
> > ___
> > Moses-support mailing list
> > Moses-support@mit.edu
> > http://mailman.mit.edu/mailman/listinfo/moses-support
> >
>



-- 
lp,

Sašo


Re: [Moses-support] Data for building a factored model

2016-05-04 Thread Sašo Kuntaric
Hello again,

I believe I can wrap my head around the theoretical part, but the English
and German corpora in the Moses factored model tutorial (
http://www.statmt.org/moses/?n=Moses.FactoredTutorial) look beautifully
factored, so my question is: how were the original corpora processed? Was a
specific tagger used, and was there any manual/script postprocessing done?

And since I am already bugging everyone: how is the language model pos.lm
created? Is it extracted from a file, created manually, or in another way?

Thank you in advance for all the replies.

Best regards,

Sašo

2016-05-02 19:45 GMT+02:00 Marwa Refaie :

> The corpus for the translation model should be in two parallel files in
> the format Word|POS|Lemma, for example, with one file for each language.
> You can prepare the files using WordNet, Stanford tools, or any tagger and
> stemmer that can deal with your language pair. Before feeding the files to
> Moses, you may need to adjust the text files with a Python script (written
> yourself).
>
> For the language model, you must build it as follows:
> Verb Noun Noun
> Noun Det Adj
> ... depending on the target language only. Then build it as a usual
> n-gram LM.
>
> Sent from my iPad
>
> > On May 2, 2016, at 10:11, Sašo Kuntaric  wrote:
> >
> > Hi all,
> >
> > I am having some issues producing the corpora in the correct format for
> Moses to execute factored training.
> >
> > I am looking at the factored tutorial on the Moses website and I am
> wondering, how to get such consistent corpora for two languages. What tools
> are being used and can they be trained for specific languages (Slovenian in
> my example). Are such tools available for download or is such data produced
> with custom scripts?
> >
> > --
> > Best regards,
> >
> > Sašo
> > ___
> > Moses-support mailing list
> > Moses-support@mit.edu
> > http://mailman.mit.edu/mailman/listinfo/moses-support
>



-- 
lp,

Sašo


[Moses-support] Data for building a factored model

2016-05-02 Thread Sašo Kuntaric
Hi all,

I am having some issues producing the corpora in the correct format for
Moses to execute factored training.

I am looking at the factored tutorial on the Moses website and I am
wondering, how to get such consistent corpora for two languages. What tools
are being used and can they be trained for specific languages (Slovenian in
my example). Are such tools available for download or is such data produced
with custom scripts?

-- 
Best regards,

Sašo


[Moses-support] Training a Moses translation system on multiple cores

2016-03-23 Thread Sašo Kuntaric
Hi all,

I am trying to train a translation system on a server with 2x8 cores. The
problem is that no matter how I add an argument for multiple cores, the
system states that it cannot recognize the option. I have tried:

nohup nice ~/mosesdecoder/scripts/training/train-model.perl -root-dir train
-corpus ~/corpus/Individual/combined.clean -f en -e sl -alignment
grow-diag-final-and -reordering msd-bidirectional-fe -lm
0:3:$HOME/corpus/Individual/corpus.blm.sl:8 -external-bin-dir
~/mosesdecoder/tools >& training.out & --decoder-flags="-threads 8"

and

nohup nice ~/mosesdecoder/scripts/training/train-model.perl -root-dir train
-corpus ~/corpus/Individual/combined.clean -f en -e sl -alignment
grow-diag-final-and -reordering msd-bidirectional-fe -lm
0:3:$HOME/corpus/Individual/corpus.blm.sl:8 -external-bin-dir
~/mosesdecoder/tools >& training.out & -cores

Any ideas what I am doing wrong?

Best regards,

Sašo


Re: [Moses-support] Preparing TMX files for use in Moses

2016-03-13 Thread Sašo Kuntaric
Thank you for your reply.

It's one of those errors that is hard to admit to, because it is so
trivial: I mistyped the language code (EN-US instead of en-US), since I am
mostly a Windows user. The script works fine now, and I can confirm it
works well with Studio-exported TMX files.

I do have another question regarding the training of the truecaser. In the
example shown on the Moses homepage, a truecase-model.en file is used;
however, it is downloaded with the example files. If I want to train my
truecaser for Slovenian, how do I get the truecase-model file? Is it
something I need to create myself, and how do I go about doing it?
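For intuition, the heart of a truecase model is just a table mapping each lowercased word to its most frequent surface casing in the training corpus. The toy sketch below (hypothetical; the real Moses script, train-truecaser.perl, additionally treats sentence-initial words specially) shows the idea:

```python
from collections import Counter, defaultdict

def train_truecase_model(corpus_path):
    """Map each lowercased word to its most frequent casing in the corpus."""
    counts = defaultdict(Counter)
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            for tok in line.split():
                # tally every surface casing of this word
                counts[tok.lower()][tok] += 1
    # the most frequent casing becomes the word's "true case"
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}
```

Training such a table on a monolingual Slovenian corpus is all the truecase-model file fundamentally contains.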

Thanks in advance for the replies.

Best regards,

Sašo

2016-03-13 12:03 GMT+01:00 Tom Hoar :

> I don't know the tmx2txt.pl script, but I can suggest where to look for
> problems.
>
> The most frequent problem we have when extracting data from TMX files
> comes from files that don't comply with the TMX specification, especially
> regarding compliance with the srclang attributes. The spec states this
> about how to identify the source language:
>
> "*the  holding the source segment will have its xml:lang attribute
> set to the same value as srclang. (except if srclang is set to "*all*"). If
> a  element does not have a srclang attribute specified, it uses the one
> defined in the  element.*"
>
> Sadly, many TMX creation tools, including tools from SDL, do not properly
> identify the source language. Each tool that looks for the source language
> TUV according to the spec handles erroneous TMX segments in its own way.
> So, you need to learn how your TMX declares the srclang attribute, and then
> study the script to see where there's a mismatch.
>
> You can see how we managed these sloppy TMX files in this post, only a
> week old: https://pttools.freshdesk.com/discussions/topics/634251
>
> Hope this helps.
>
> Tom
>
>
> On 3/12/2016 8:57 PM, moses-support-requ...@mit.edu wrote:
>
> Date: Sat, 12 Mar 2016 13:42:05 +0100
From: Sašo Kuntaric  
> Subject: [Moses-support] Preparing TMX files for use in Moses
> To: moses-support@mit.edu
>
> Hi all,
>
> I have a question that is not connected directly to Moses. I am trying to
> prepare the corpora for training my engine. I have exported a few of my TMs
> to the TMX format and now I am trying to create two separate UTF-8 text
> files. I have tried it with the extract-tmx-corpus and tmx2txt.pl tools. I
> get empty text files for both (the former tool claims that the input file
> can't be read). Are there any special setting I need to set when extracting
> the TMX files? I am using SDL Trados Studio 2015 for exporting the files.
>
> Has anyone come across anything like this?
>
> --
> lp,
>
Sašo
>
>
>
> ___
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>


-- 
lp,

Sašo


[Moses-support] Preparing TMX files for use in Moses

2016-03-12 Thread Sašo Kuntaric
Hi all,

I have a question that is not connected directly to Moses. I am trying to
prepare the corpora for training my engine. I have exported a few of my TMs
to the TMX format and now I am trying to create two separate UTF-8 text
files. I have tried it with the extract-tmx-corpus and tmx2txt.pl tools. I
get empty text files from both (the former tool claims that the input file
can't be read). Are there any special settings I need to apply when
extracting the TMX files? I am using SDL Trados Studio 2015 for exporting
the files.

Has anyone come across anything like this?
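As a fallback when the ready-made extractors choke, a minimal TMX reader can be written with the standard library (a sketch under assumptions: segments carry xml:lang or lang attributes, and language codes are matched case-insensitively, which sidesteps the EN-US vs en-US pitfall discussed elsewhere in this thread):

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace URI
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def extract_pairs(tmx_path, src="en-US", tgt="sl-SI"):
    """Return (source, target) segment pairs from a TMX file."""
    pairs = []
    root = ET.parse(tmx_path).getroot()
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # accept both xml:lang (TMX 1.4) and the older lang attribute
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() flattens inline markup inside <seg>
                segs[lang] = "".join(seg.itertext()).strip()
        if src.lower() in segs and tgt.lower() in segs:
            pairs.append((segs[src.lower()], segs[tgt.lower()]))
    return pairs
```

Writing the two tuple columns to separate UTF-8 files then yields the parallel corpus Moses expects.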

-- 
lp,

Sašo