i was doing it, but mine was a more holistic approach that would have broken compatibility.
so i can't be bothered

On 23 July 2014 16:56, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:

> So, adding "--IgnoreSentenceId" to "score" might fix that without
> messing up your stuff? I guess I can do that if you can't be bothered,
> Hieu.
>
> On 23.07.2014 17:53, Philipp Koehn wrote:
>
>> Hi,
>>
>> this is how extract is called:
>> extract corpus.en corpus.fr align extract 5 --IncludeSentenceId
>>
>> this is how score is called:
>> score extract lex.f2e phrase-table.half --GoodTuring --DomainIndicator domains.5
>>
>> phrase table looks fine to me
>>
>> -phi
>>
>> On Wed, Jul 23, 2014 at 11:42 AM, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>>
>>> In a corpus with sentences sorted by release date this could
>>> actually make sense :)
>>>
>>> On 23.07.2014 17:40, Barry Haddow wrote:
>>>
>>>> Because calculating translation probabilities from sentence ids is
>>>> unexpectedly beneficial?
>>>>
>>>> On 23/07/14 16:34, Marcin Junczys-Dowmunt wrote:
>>>>
>>>>> So, how come this is not damaging the Edinburgh system?
>>>>>
>>>>> On 23.07.2014 17:32, Hieu Hoang wrote:
>>>>>
>>>>>> ah ok.
>>>>>>
>>>>>> I thought it was just for debugging. I'm not gonna change it since
>>>>>> it's gonna involve months of debugging.
>>>>>>
>>>>>> Ideally, the extract format should be fixed like the phrase-table,
>>>>>> with the last column being key-value pairs. Also, the way the
>>>>>> key-value pairs are processed should be automatic like in the decoder.
>>>>>>
>>>>>> marcin - sorry mate. you're on your own
>>>>>>
>>>>>> On 23/07/14 16:20, Philipp Koehn wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> the sentence ID is being used for the domain indicator features.
>>>>>>>
>>>>>>> If you run phrase-extract's score and specify a domain file,
>>>>>>> it then uses the sentence IDs to find out which domain the
>>>>>>> phrase pair was found in.
>>>>>>>
>>>>>>> This has been a standard feature in Edinburgh's phrase-based system
>>>>>>> for the last 1-2 years, so if you want to make changes, make
>>>>>>> sure that this functionality still works (see [1381-5] for an example
>>>>>>> with extract* files still in place).
>>>>>>>
>>>>>>> -phi
>>>>>>>
>>>>>>> On Wed, Jul 23, 2014 at 7:15 AM, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>>>>>>>
>>>>>>>> Key-value format would actually be fine.
>>>>>>>>
>>>>>>>> On 23.07.2014 13:12, Marcin Junczys-Dowmunt wrote:
>>>>>>>>
>>>>>>>>> I was planning to use it for a custom feature function later.
>>>>>>>>>
>>>>>>>>> On 23.07.2014 13:11, Hieu Hoang wrote:
>>>>>>>>>
>>>>>>>>>> i can change it so that the sentence id is put into a
>>>>>>>>>> key-value field in the last column.
>>>>>>>>>>
>>>>>>>>>> what is the sentence id used for? is it just for debugging
>>>>>>>>>> purposes?
>>>>>>>>>>
>>>>>>>>>> On 23 July 2014 11:36, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>> I am using train-model.perl with
>>>>>>>>>>>
>>>>>>>>>>> --extract-options="--IncludeSentenceId"
>>>>>>>>>>>
>>>>>>>>>>> and it seems that the sentence id is somehow getting into the
>>>>>>>>>>> phrase table as a count and is later used for phrase translation
>>>>>>>>>>> weight calculation. For instance, the extract (last column is
>>>>>>>>>>> the Id):
>>>>>>>>>>>
>>>>>>>>>>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374618
>>>>>>>>>>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374619
>>>>>>>>>>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374620
>>>>>>>>>>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374621
>>>>>>>>>>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374622
>>>>>>>>>>> #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 4587318
>>>>>>>>>>>
>>>>>>>>>>> results in a phrase table entry like this:
>>>>>>>>>>>
>>>>>>>>>>> #c the compound or process ||| #c verbindung oder verfahren ||| 1 0.0100206 5.23542e-07 0.524577 ||| 0-0 2-1 3-2 4-3 ||| 6 1.14604e+07 6 ||| |||
>>>>>>>>>>>
>>>>>>>>>>> The count is equal to the sum of sentence ids, which of
>>>>>>>>>>> course makes the phrase probability useless.

--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu
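
For anyone following along, here is a small Python sketch of the arithmetic in Marcin's original report. It is only an illustration under stated assumptions, not the Moses score code: the helper names (split_fields, misread_as_count, occurrence_count) are made up for this example, and the six lines are the extract entries quoted in the thread. Summing the trailing sentence IDs as if they were counts reproduces the 1.14604e+07 value seen in the phrase table, whereas counting the lines gives the 6 that also appears there.

```python
# Toy illustration of the symptom reported in the thread.
# NOT the Moses `score` implementation; all helper names are hypothetical.

SENTENCE_IDS = [1374618, 1374619, 1374620, 1374621, 1374622, 4587318]

# The six extract lines quoted in the report (last column is the sentence ID).
EXTRACT_LINES = [
    "#c the compound or process ||| #c verbindung oder verfahren "
    f"||| 0-0 2-1 3-2 4-3 ||| {sentence_id}"
    for sentence_id in SENTENCE_IDS
]


def split_fields(line):
    """Split an extract line on the '|||' separator and trim whitespace."""
    return [field.strip() for field in line.split("|||")]


def misread_as_count(lines):
    """Sum the last column as if it were a count (the behaviour observed)."""
    return sum(float(split_fields(line)[-1]) for line in lines)


def occurrence_count(lines):
    """Count one occurrence per line, ignoring the trailing sentence ID."""
    return len(lines)


if __name__ == "__main__":
    # Printed with the same general float format as the phrase table entry.
    print(f"sum of sentence ids: {misread_as_count(EXTRACT_LINES):g}")
    print(f"number of occurrences: {occurrence_count(EXTRACT_LINES)}")
```

The second function shows what ignoring the sentence ID would amount to for this one phrase pair; how an "--IgnoreSentenceId" option would actually be wired into score is left open here, as it is in the thread.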
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support