[Moses-support] Phrase extraction with --IncludeSentenceId messes up phrase table counts

2014-07-23 Thread Marcin Junczys-Dowmunt
Hi, I am using train-model.perl with --extract-options=--IncludeSentenceId, and it seems that the sentence ID somehow ends up in the phrase table as a count and is later used to calculate the phrase translation weights. For instance, the extract (last column is the ID): #c the compound or

Re: [Moses-support] Phrase extraction with --IncludeSentenceId messes up phrase table counts

2014-07-23 Thread Marcin Junczys-Dowmunt
I was planning to use it for a custom feature function later. On 23.07.2014 13:11, Hieu Hoang wrote: i can change it so that the sentence id is put into a key-value field in the last column. what is the sentence id used for? is it just for debugging purposes? On 23 July 2014 11:36,

Re: [Moses-support] Phrase extraction with --IncludeSentenceId messes up phrase table counts

2014-07-23 Thread Marcin Junczys-Dowmunt
Key-value format would actually be fine. On 23.07.2014 13:12, Marcin Junczys-Dowmunt wrote: I was planning to use it for a custom feature function later. On 23.07.2014 13:11, Hieu Hoang wrote: i can change it so that the sentence id is put into a key-value field in the last column.

Re: [Moses-support] Phrase extraction with --IncludeSentenceId messes up phrase table counts

2014-07-23 Thread Philipp Koehn
Hi, the sentence ID is used for the domain indicator features. If you run phrase-extract's score and specify a domain file, it then uses the sentence IDs to find out which domain the phrase pair was found in. This is a standard feature in Edinburgh's phrase-based system for the

Re: [Moses-support] Phrase extraction with --IncludeSentenceId messes up phrase table counts

2014-07-23 Thread Hieu Hoang
ah ok. I thought it was just for debugging. I'm not gonna change it since it's gonna involve months of debugging. Ideally, the extract format should be fixed like the phrase-table, with the last column being key-value pairs. Also, the way the key-value pairs are processed should be automatic
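
To illustrate the key-value idea discussed above, here is a minimal sketch of how a reader could keep such a field apart from the counts. Everything about the layout is an assumption made for illustration: the sample line, the `key=value` syntax, the `SentenceId` key and the helper name `parse_extract_line` are hypothetical and are not the actual Moses extract format.

```python
def parse_extract_line(line):
    """Split a ' ||| '-separated line and read the LAST field as key=value pairs.

    Hypothetical layout for illustration only; not the real Moses extract format.
    """
    fields = [field.strip() for field in line.split("|||")]
    *core, last = fields
    properties = {}
    for token in last.split():
        key, sep, value = token.partition("=")
        if not sep:
            # A bare token in the last column is rejected instead of being
            # silently interpreted as a count.
            raise ValueError(f"expected key=value, got {token!r}")
        properties[key] = value
    return core, properties


core, properties = parse_extract_line(
    "the house ||| das Haus ||| 0-0 1-1 ||| SentenceId=42"
)
print(core)
print(properties)
```

With named keys, a scorer that does not know a key can skip it, so an ID cannot be mistaken for a count.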

Re: [Moses-support] Phrase extraction with --IncludeSentenceId messes up phrase table counts

2014-07-23 Thread Marcin Junczys-Dowmunt
So, how come this is not damaging the Edinburgh system? On 23.07.2014 17:32, Hieu Hoang wrote: ah ok. I thought it was just for debugging. I'm not gonna change it since it's gonna involve months of debugging. Ideally, the extract format should be fixed like the phrase-table, with the

Re: [Moses-support] Phrase extraction with --IncludeSentenceId messes up phrase table counts

2014-07-23 Thread Barry Haddow
Because calculating translation probabilities from sentence ids is unexpectedly beneficial? On 23/07/14 16:34, Marcin Junczys-Dowmunt wrote: So, how come this is not damaging the Edinburgh system? On 23.07.2014 17:32, Hieu Hoang wrote: ah ok. I thought it was just for debugging. I'm

Re: [Moses-support] Phrase extraction with --IncludeSentenceId messes up phrase table counts

2014-07-23 Thread Hieu Hoang
it's likely we're using fractional counts, so there's an extra column. On 23 July 2014 16:34, Marcin Junczys-Dowmunt junc...@amu.edu.pl wrote: So, how come this is not damaging the Edinburgh system? On 23.07.2014 17:32, Hieu Hoang wrote: ah ok. I thought it was just for debugging. I'm

Re: [Moses-support] Phrase extraction with --IncludeSentenceId messes up phrase table counts

2014-07-23 Thread Marcin Junczys-Dowmunt
In a corpus with sentences sorted by release date this could actually make sense :) On 23.07.2014 17:40, Barry Haddow wrote: Because calculating translation probabilities from sentence ids is unexpectedly beneficial? On 23/07/14 16:34, Marcin Junczys-Dowmunt wrote: So, how come

Re: [Moses-support] Phrase extraction with --IncludeSentenceId messes up phrase table counts

2014-07-23 Thread Philipp Koehn
Hi, this is how extract is called:
    extract corpus.en corpus.fr align extract 5 --IncludeSentenceId
this is how score is called:
    score extract lex.f2e phrase-table.half --GoodTuring --DomainIndicator domains.5
phrase table looks fine to me -phi On Wed, Jul 23, 2014 at 11:42 AM, Marcin

Re: [Moses-support] Phrase extraction with --IncludeSentenceId messes up phrase table counts

2014-07-23 Thread Marcin Junczys-Dowmunt
So, adding --IgnoreSentenceId to score might fix that without messing up your stuff? I guess I can do that if you can't be bothered, Hieu. On 23.07.2014 17:53, Philipp Koehn wrote: Hi, this is how extract is called: extract corpus.en corpus.fr align extract 5

Re: [Moses-support] Phrase extraction with --IncludeSentenceId messes up phrase table counts

2014-07-23 Thread Barry Haddow
Hi Marcin, it appears that there is an --IgnoreSentenceId argument already, added by Maria during last year's MTM: [gna]bhaddow: git blame ScoreFeature.cpp | grep Ignore bff12363 (maria nadejde 2013-09-13 12:45:46 +0200 42) if (args[i] == "--IgnoreSentenceId") { cheers - Barry On 23/07/14

Re: [Moses-support] Phrase extraction with --IncludeSentenceId messes up phrase table counts

2014-07-23 Thread Hieu Hoang
i was doing it, but mine was a more holistic approach that would have broken compatibility, so i can't be bothered. On 23 July 2014 16:56, Marcin Junczys-Dowmunt junc...@amu.edu.pl wrote: So, adding --IgnoreSentenceId to score might fix that without messing up your stuff? I guess I can

Re: [Moses-support] Phrase extraction with --IncludeSentenceId messes up phrase table counts

2014-07-23 Thread Marcin Junczys-Dowmunt
Oh. Good! I guess there is a lesson to be learned somewhere. Thanks. On 23.07.2014 18:06, Barry Haddow wrote: Hi Marcin, it appears that there is an --IgnoreSentenceId argument already, added by Maria during last year's MTM: [gna]bhaddow: git blame ScoreFeature.cpp | grep Ignore

[Moses-support] 2014 EAMT Best Thesis Award

2014-07-23 Thread Mikel Forcada
Dear Moses Support list members: the European Association for Machine Translation (EAMT) has published the call for candidacies to the 2014 EAMT Best Thesis Award. For details, please visit the following URL: http://www.eamt.org/news/news_best_thesis2014.php Best regards, Mikel L. Forcada

[Moses-support] 2014 EAMT call for proposals and internships

2014-07-23 Thread Mikel Forcada
Dear list members: the European Association for Machine Translation (EAMT) has published the 2014 call for proposals and the 2015 call for student internships. For details, please visit the following URLs: http://www.eamt.org/news/news_call_for_proposals2014.php

[Moses-support] Deadline extension for SSST-8, 8th Workshop on Syntax, Semantics and Structure in Statistical Translation (EMNLP 2014)

2014-07-23 Thread Carpuat, Marine
Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8) EMNLP 2014 / SIGMT / SIGLEX Workshop Oct 2014, Doha, Qatar http://www.cse.ust.hk/~dekai/ssst/ *** New submission deadline for papers and abstracts: August 1st, 2014 *** *** Special theme: Compositional

Re: [Moses-support] Some questions about the output of GIZA++

2014-07-23 Thread Hieu Hoang
I'm not an expert on giza++, but one problem is that it creates files whose names differ only in case, e.g. file.a3 and file.A3. On operating systems with case-insensitive filesystems (Windows/cygwin, Mac OSX) this causes problems because the files overwrite each other. I personally
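
The case-collision problem described above can be checked for ahead of time. The following is a minimal sketch, not part of GIZA++ or Moses: the helper name `case_collisions` is hypothetical, and it works on a plain list of intended output names, because on a case-insensitive filesystem the overwrite has already happened by the time the directory is listed.

```python
from collections import defaultdict


def case_collisions(names):
    """Group names that become identical once letter case is ignored."""
    groups = defaultdict(set)
    for name in names:
        # casefold() is the standard way to compare strings case-insensitively.
        groups[name.casefold()].add(name)
    # Keep only the groups in which two or more distinct spellings collide.
    return [sorted(group) for group in groups.values() if len(group) > 1]


planned_outputs = ["file.a3", "file.A3", "other.txt"]
for group in case_collisions(planned_outputs):
    print("would overwrite each other on a case-insensitive filesystem:", group)
```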