Re: [Moses-support] Moses vocabulary code

2015-10-10 Thread Lane Schwartz
Wouldn't factor->GetId() be the unique integer ID of the string?

On Fri, Oct 9, 2015 at 5:54 PM, Hieu Hoang wrote:

> const Factor* is the vocab id. It's guaranteed to be unique for each
> unique string. You can map directly to the string using
>     factor->GetString()
>
>
>
> On 09/10/2015 22:55, Lane Schwartz wrote:
>
> Thanks, Marcin.
>
> So when the various components of Moses pass words back and forth, what do
> they send each other? std::string? StringPiece?
>
> On Fri, Oct 9, 2015 at 4:28 PM, Marcin Junczys-Dowmunt wrote:
>
>> For instance in my phrase table that would be
>>
>> mosesdecoder/moses/TranslationModel/CompactPT/PhraseDecoder.h
>>
>>   StringVector m_sourceSymbols;
>>   StringVector m_targetSymbols;
>>
>> That's a memory-mapped vector of strings.
>>
>> On 09.10.2015 at 23:22, Lane Schwartz wrote:
>>
>> Seriously? That sounds inefficient.
>>
>> I've found code in KenLM that maps from strings to integers, but not the
>> other way around.
>>
>> Marcin, do you know, for example, where any Moses code is for doing the
>> mapping for any data structure?
>>
>>
>> On Fri, Oct 9, 2015 at 4:14 PM, Marcin Junczys-Dowmunt <
>> junc...@amu.edu.pl> wrote:
>>
>>> Hi,
>>> This would only be a simple thing if there was a common framework for
>>> that, but there isn't. Each data structure implements its own vocabularies
>>> and look-up tables. There is no common set of integers.
>>> Best,
>>> Marcin
>>>
>>> On 09.10.2015 at 23:11, Lane Schwartz wrote:
>>>
>>> Hey,
>>>
>>> I know this should be a simple thing to find, but what code in Moses is
>>> responsible for mapping back and forth between strings and integers?
>>>
>>> Thanks,
>>> Lane
>>>
>>>
>>>
>>> ___
>>> Moses-support mailing list
>>> Moses-support@mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>>
>>
>>
>> --
>> When a place gets crowded enough to require ID's, social collapse is not
>> far away.  It is time to go elsewhere.  The best thing about space travel
>> is that it made it possible to go elsewhere.
>> -- R.A. Heinlein, "Time Enough For Love"
>>
>>
>>
>
>
> --
> Hieu Hoang
> http://www.hoang.co.uk/hieu
>
>


-- 
When a place gets crowded enough to require ID's, social collapse is not
far away.  It is time to go elsewhere.  The best thing about space travel
is that it made it possible to go elsewhere.
-- R.A. Heinlein, "Time Enough For Love"


[Moses-support] Segmentation Fault during Tuning

2015-10-10 Thread Alex Martinez

Hello,
I'm trying to build a factored system using EMS based on this example from the 
tutorial:
-
% train-model.perl \
    --corpus factored-corpus/proj-syndicate.1000 \
    --root-dir morphgen-backoff \
    --f de --e en \
    --lm 0:3:factored-corpus/surface.lm:0 \
    --lm 2:3:factored-corpus/pos.lm:0 \
    --translation-factors 1-1+3-2+0-0,2 \
    --generation-factors 1-2+1,2-0 \
    --decoding-steps t0,g0,t1,g1:t2 \
    --external-bin-dir .../tools
--
I'm getting a segmentation fault during tuning and I have the feeling that the 
problem is related to the line defining the decoding-steps.
What I have in my EMS config file to get a similar model is:

### factored training: specify here which factors used
# if none specified, single factor training is assumed
# (one translation step, surface to surface)
#
input-factors = word lemma pos
output-factors = word lemma pos
alignment-factors = "word+lemma -> word+lemma"
translation-factors = "lemma -> lemma, pos -> pos, word -> word + pos"
reordering-factors = "word -> word"
generation-factors = "lemma -> pos, lemma+pos -> word"
decoding-steps = "t0,g0,t1,g1:t2"
generation-type = single
prune-generation = "$moses-bin-dir/pruneGeneration 100"
-

The training fails in the tuning step and I'm getting this in the 
TUNING_tune.1.STDERR:

Executing: /opt/moses/bin/moses -threads all -v 0   -config 
/mnt/a62/devel/en_es/processfin/model/moses.bin.ini.1 -weight-overwrite 
'WordPenalty0= -0.128205 TranslationModel0= 0.025641 0.025641 0.025641 0.025641 
LM2= 0.064103 LM0= 0.064103 GenerationModel1= 0.038462 0.00 TranslationModel2= 
0.025641 0.025641 0.025641 0.025641 GenerationModel0= 0.038462 PhrasePenalty0= 
0.025641 Distortion0= 0.038462 TranslationModel1= 0.025641 0.025641 0.025641 
0.025641 LexicalReordering0= 0.038462 0.038462 0.038462 0.038462 0.038462 0.038462 
LM1= 0.064103'  -n-best-list run1.best100.out 100 distinct  -input-file 
/mnt/a62/devel/en_es/data/corpora.tuning.en > run1.out
Segmentation fault (core dumped)
Exit code: 139
The decoder died. CONFIG WAS -weight-overwrite 'WordPenalty0= -0.128205 
TranslationModel0= 0.025641 0.025641 0.025641 0.025641 LM2= 0.064103 LM0= 
0.064103 GenerationModel1= 0.038462 0.00 TranslationModel2= 0.025641 
0.025641 0.025641 0.025641 GenerationModel0= 0.038462 PhrasePenalty0= 0.025641 
Distortion0= 0.038462 TranslationModel1= 0.025641 0.025641 0.025641 0.025641 
LexicalReordering0= 0.038462 0.038462 0.038462 0.038462 0.038462 0.038462 LM1= 
0.064103' 
cp: cannot stat ‘/mnt/a62/devel/en_es/processfin/tuning/tmp.1/moses.ini’: No 
such file or directory
---

If I change this line in the config file from

decoding-steps = "t0,g0,t1,g1:t2"

 to

decoding-steps = "t0,g0,t1,g1"

then the training ends without errors. 
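
For readers unfamiliar with the decoding-steps string, the sketch below shows how the two settings differ structurally. It assumes, as far as I understand the Moses factored-model documentation, that commas separate the steps inside one decoding path and that a colon starts an alternative path. ParseDecodingSteps is a hypothetical helper written for this illustration; it is not code from Moses or EMS.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not part of Moses): splits a decoding-steps string
// such as "t0,g0,t1,g1:t2" into paths (separated by ':') and, within each
// path, into steps (separated by ',').
std::vector<std::vector<std::string>> ParseDecodingSteps(const std::string& spec) {
  std::vector<std::vector<std::string>> paths(1);  // always at least one path
  std::string step;
  for (char c : spec) {
    if (c == ',' || c == ':') {
      paths.back().push_back(step);  // finish the current step
      step.clear();
      if (c == ':') paths.emplace_back();  // colon opens a new path
    } else {
      step += c;
    }
  }
  paths.back().push_back(step);  // the final step has no trailing separator
  return paths;
}
```

Under that assumption, "t0,g0,t1,g1:t2" yields two paths (four steps, then the single step t2), while "t0,g0,t1,g1" yields only the first one, so the failing and working configurations differ exactly by the alternative path.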

I'd appreciate any suggestions on how to solve this.

Alex




Re: [Moses-support] Moses vocabulary code

2015-10-10 Thread Lane Schwartz
BTW, what's the rationale for using StringPiece instead of std::string? I
thought the main reason for using StringPiece was for implicit conversion
from char *.

On Fri, Oct 9, 2015 at 5:15 PM, Kenneth Heafield wrote:

> The Moses common vocabulary is moses/FactorCollection.h.  Common
> practice in core Moses code is to pass around a const Factor * (which
> can be resolved to a StringPiece or a consecutive ID).
>
> If a feature/phrase table has its own ids because e.g. it's baked into
> the binary file, then there's a std::vector to map from Moses ID to
> feature function ID.  See moses/LM/Ken.h:99 for an example.
>
> std::string (or even StringPiece) conversion at decode time is a bug.  A
> sadly common one.
>
> On 10/09/2015 10:22 PM, Lane Schwartz wrote:
> > Seriously? That sounds inefficient.
> >
> > I've found code in KenLM that maps from strings to integers, but not the
> > other way around.
> >
> > Marcin, do you know, for example, where any Moses code is for doing the
> > mapping for any data structure?
> >
> >
> > On Fri, Oct 9, 2015 at 4:14 PM, Marcin Junczys-Dowmunt wrote:
> >
> > Hi,
> > This would only be a simple thing if there was a common framework
> > for that, but there isn't. Each data structure implements its own
> > vocabularies and look-up tables. There is no common set of integers.
> > Best,
> > Marcin
> >
> > On 09.10.2015 at 23:11, Lane Schwartz wrote:
> >> Hey,
> >>
> >> I know this should be a simple thing to find, but what code in
> >> Moses is responsible for mapping back and forth between strings
> >> and integers?
> >>
> >> Thanks,
> >> Lane
> >>
> >>
> >>
>



-- 
When a place gets crowded enough to require ID's, social collapse is not
far away.  It is time to go elsewhere.  The best thing about space travel
is that it made it possible to go elsewhere.
-- R.A. Heinlein, "Time Enough For Love"


Re: [Moses-support] Moses vocabulary code

2015-10-10 Thread Hieu Hoang
Yep. The const Factor* is the original unique vocab ID, and it's more useful
IMO because you can get the string back without referring back to the vocab
factory. But use what you like.

StringPiece is apparently faster for some operations.
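
To illustrate how a pointer can itself serve as the vocabulary ID, here is a minimal sketch of string interning. The Vocab class and its Intern method are hypothetical names made up for this example; this is not the Moses FactorCollection API, only the general technique, assuming each distinct string is stored exactly once.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical stand-in for a shared vocabulary (NOT the Moses API).
class Vocab {
public:
  // Returns a stable pointer for `word`. Equal strings always yield the
  // same pointer, so the pointer can be used as a unique ID.
  const std::string* Intern(const std::string& word) {
    // insert() leaves the set unchanged if the word is already present.
    // std::unordered_set does not move its elements when it rehashes,
    // so the returned pointer stays valid while the Vocab is alive.
    return &*words_.insert(word).first;
  }

  std::size_t Size() const { return words_.size(); }

private:
  std::unordered_set<std::string> words_;
};
```

With this scheme, comparing two words is a pointer comparison, and dereferencing the pointer gives the string back without consulting the vocabulary again.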
On 10 Oct 2015 5:35 pm, "Lane Schwartz" wrote:

> Wouldn't factor->GetId() be the unique integer ID of the string?
>
> On Fri, Oct 9, 2015 at 5:54 PM, Hieu Hoang wrote:
>
>> const Factor* is the vocab id. It's guaranteed to be unique for each
>> unique string. You can map directly to the string using
>>     factor->GetString()
>>
>>
>>
>> On 09/10/2015 22:55, Lane Schwartz wrote:
>>
>> Thanks, Marcin.
>>
>> So when the various components of Moses pass words back and forth, what
>> do they send each other? std::string? StringPiece?
>>
>> On Fri, Oct 9, 2015 at 4:28 PM, Marcin Junczys-Dowmunt <
>> junc...@amu.edu.pl> wrote:
>>
>>> For instance in my phrase table that would be
>>>
>>> mosesdecoder/moses/TranslationModel/CompactPT/PhraseDecoder.h
>>>
>>>   StringVector m_sourceSymbols;
>>>   StringVector m_targetSymbols;
>>>
>>> That's a memory-mapped vector of strings.
>>>
>>> On 09.10.2015 at 23:22, Lane Schwartz wrote:
>>>
>>> Seriously? That sounds inefficient.
>>>
>>> I've found code in KenLM that maps from strings to integers, but not the
>>> other way around.
>>>
>>> Marcin, do you know, for example, where any Moses code is for doing the
>>> mapping for any data structure?
>>>
>>>
>>> On Fri, Oct 9, 2015 at 4:14 PM, Marcin Junczys-Dowmunt <
>>> junc...@amu.edu.pl> wrote:
>>>
>>>> Hi,
>>>> This would only be a simple thing if there was a common framework for
>>>> that, but there isn't. Each data structure implements its own vocabularies
>>>> and look-up tables. There is no common set of integers.
>>>> Best,
>>>> Marcin
>>>>
>>>> On 09.10.2015 at 23:11, Lane Schwartz wrote:
>>>>
>>>> Hey,
>>>>
>>>> I know this should be a simple thing to find, but what code in Moses is
>>>> responsible for mapping back and forth between strings and integers?
>>>>
>>>> Thanks,
>>>> Lane



>> --
>> Hieu Hoang
>> http://www.hoang.co.uk/hieu
>>
>>
>
>


Re: [Moses-support] BLEU score difference about 0.13 for one dataset is normal?

2015-10-10 Thread Tom Hoar
Yes. Each tuning run with the same test set will give you small variations in
the final BLEU. Yours look like they're in a normal range.



Date: Sun, 11 Oct 2015 04:23:56 +
From: Davood Mohammadifar 
Subject: [Moses-support] BLEU score difference about 0.13 for one
dataset is  normal?
To: Moses Support 

Hello everyone,

I noticed different BLEU scores for the same dataset. The difference is small,
about 0.13.

I trained on my dataset and tuned on the development set for Persian-English
translation. After testing, the score was 21.95. The second time I ran the same
process and obtained 21.82. (My tools were mgiza, mert, ...)

Is this difference normal?

My system:
CPU: Core i7-4790K
RAM: 16GB
OS: ubuntu 12.04

Thanks


[Moses-support] BLEU score difference about 0.13 for one dataset is normal?

2015-10-10 Thread Davood Mohammadifar
Hello everyone,

I noticed different BLEU scores for the same dataset. The difference is small,
about 0.13.

I trained on my dataset and tuned on the development set for Persian-English
translation. After testing, the score was 21.95. The second time I ran the same
process and obtained 21.82. (My tools were mgiza, mert, ...)

Is this difference normal?

My system:
CPU: Core i7-4790K
RAM: 16GB
OS: ubuntu 12.04

Thanks


Re: [Moses-support] Moses vocabulary code

2015-10-10 Thread Kenneth Heafield
Agreed about the cuteness of const Factor *.

Let's say you're reading space-delimited file input.

std::string line("Foo Bar Baz Quux .");

One can make a StringPiece(line.data(), 3) that looks and for most
purposes acts like std::string("Foo") but requires zero memory
allocation.  It's not null terminated.  It's just a const char * and a
length without owning the underlying memory.  This makes it super fast
to parse/split text.  util/tokenize_piece.hh provides an iterator
operation for string splitting.

Taking it a step further, util::FilePiece does a rolling mmap of a text
file and gives you StringPiece.  Zero-copy file reading.

In Moses, the preference order for function parameters is: const Factor *,
StringPiece, std::string or char *.
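
The zero-allocation splitting described above can be sketched with std::string_view, the C++17 standard-library type that plays a similar role to StringPiece (a pointer plus a length that does not own the memory). This is an illustration under that assumption, not the util/tokenize_piece.hh implementation, and SplitOnSpaces is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Splits `line` on single spaces without copying any characters.
// Each std::string_view is only a pointer and a length into `line`,
// so the views are valid only while the underlying buffer is alive
// and unmodified. Empty tokens (from repeated spaces) are skipped.
std::vector<std::string_view> SplitOnSpaces(std::string_view line) {
  std::vector<std::string_view> tokens;
  std::size_t start = 0;
  while (start <= line.size()) {
    std::size_t end = line.find(' ', start);
    if (end == std::string_view::npos) end = line.size();
    if (end > start) tokens.push_back(line.substr(start, end - start));
    start = end + 1;  // continue after the separator
  }
  return tokens;
}
```

Calling SplitOnSpaces on the std::string line from the example gives five views; the first one points at line.data() itself, which is what makes the operation copy-free. Like StringPiece, the views are not null terminated.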

On 10/10/2015 06:22 PM, Hieu Hoang wrote:
> Yep. The const Factor* is the original unique vocab ID, and it's more
> useful IMO because you can get the string back without referring back to
> the vocab factory. But use what you like.
> 
> StringPiece is apparently faster for some operations.
> 
> On 10 Oct 2015 5:35 pm, "Lane Schwartz" wrote:
> 
> Wouldn't factor->GetId() be the unique integer ID of the string?
> 
> On Fri, Oct 9, 2015 at 5:54 PM, Hieu Hoang wrote:
> 
> const Factor* is the vocab id. It's guaranteed to be unique for
> each unique string. You can map directly to the string using
>     factor->GetString()
> 
> 
> 
> On 09/10/2015 22:55, Lane Schwartz wrote:
>> Thanks, Marcin.
>>
>> So when the various components of Moses pass words back and
>> forth, what do they send each other? std::string? StringPiece? 
>>
>> On Fri, Oct 9, 2015 at 4:28 PM, Marcin Junczys-Dowmunt wrote:
>>
>> For instance in my phrase table that would be
>>
>> mosesdecoder/moses/TranslationModel/CompactPT/PhraseDecoder.h
>>
>>   StringVector m_sourceSymbols;
>>   StringVector m_targetSymbols;
>>
>> That's a memory-mapped vector of strings.
>>
>> On 09.10.2015 at 23:22, Lane Schwartz wrote:
>>> Seriously? That sounds inefficient.
>>>
>>> I've found code in KenLM that maps from strings to
>>> integers, but not the other way around.
>>>
>>> Marcin, do you know, for example, where any Moses code is
>>> for doing the mapping for any data structure?
>>>
>>>
>>> On Fri, Oct 9, 2015 at 4:14 PM, Marcin Junczys-Dowmunt
>>> <junc...@amu.edu.pl> wrote:
>>>
>>> Hi,
>>> This would only be a simple thing if there was a
>>> common framework for that, but there isn't. Each
>>> data structure implements its own vocabularies and
>>> look-up tables. There is no common set of integers.
>>> Best,
>>> Marcin
>>>
>>> On 09.10.2015 at 23:11, Lane Schwartz wrote:
>>>> Hey,
>>>>
>>>> I know this should be a simple thing to find, but
>>>> what code in Moses is responsible for mapping back
>>>> and forth between strings and integers?
>>>>
>>>> Thanks,
>>>> Lane




Re: [Moses-support] BLEU score difference about 0.13 for one dataset is normal?

2015-10-10 Thread Michael Denkowski
Hi Davood,

Optimizers like MERT will give you a slightly different result every time
you run them, leading to variance in BLEU score.  It's generally a good
idea to use multiple optimizer runs, especially when comparing two
systems.  There's a good paper on hypothesis testing for MT that goes into
detail on this.  Some
other parts of a standard system like word alignment can also be
non-deterministic but the optimizer is the most frequent cause of
fluctuating metric scores.
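
As a small worked example of summarising several optimizer runs, the sketch below computes the mean and sample standard deviation of a list of BLEU scores. Mean and SampleStdDev are hypothetical helper names, and this is plain descriptive arithmetic rather than the hypothesis test from the paper; two runs are far too few to draw real conclusions.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Arithmetic mean of the scores; requires at least one score.
double Mean(const std::vector<double>& xs) {
  assert(!xs.empty());
  double sum = 0.0;
  for (double x : xs) sum += x;
  return sum / static_cast<double>(xs.size());
}

// Sample standard deviation (divides by n - 1); requires at least two scores.
double SampleStdDev(const std::vector<double>& xs) {
  assert(xs.size() >= 2);
  const double mean = Mean(xs);
  double squared_diffs = 0.0;
  for (double x : xs) squared_diffs += (x - mean) * (x - mean);
  return std::sqrt(squared_diffs / static_cast<double>(xs.size() - 1));
}
```

For the two scores reported in this thread, 21.95 and 21.82, the mean is 21.885 and the sample standard deviation is 0.13 / sqrt(2), roughly 0.09 BLEU. A difference between two systems would need to be judged against that run-to-run spread, ideally over more runs.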

Best,
Michael

On Sun, Oct 11, 2015 at 12:23 AM, Davood Mohammadifar wrote:

> Hello everyone,
>
> I noticed different BLEU scores for the same dataset. The difference is
> small, about 0.13.
>
> I trained on my dataset and tuned on the development set for Persian-English
> translation. After testing, the score was 21.95. The second time I ran the
> same process and obtained 21.82. (My tools were mgiza, mert, ...)
>
> Is this difference normal?
>
> My system:
> CPU: Core i7-4790K
> RAM: 16GB
> OS: ubuntu 12.04
>
> Thanks
>