Re: [Moses-support] Phrase extraction with --IncludeSentenceId messes up phrase table counts

Hieu Hoang Wed, 23 Jul 2014 09:26:46 -0700

i was doing it, but mine was a more holistic approach that would have
broken compatibility.


so i can't be bothered



On 23 July 2014 16:56, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:

>  So, adding "--IgnoreSentenceId" to "score" might fix that without
> messing up your stuff? I guess I can do that if you can't be bothered,
> Hieu.
>
> On 23.07.2014 17:53, Philipp Koehn wrote:
>
> Hi,
>
> this is how extract is called:
> extract corpus.en corpus.fr align extract  5 --IncludeSentenceId
>
> this is how score is called:
> score extract lex.f2e phrase-table.half --GoodTuring --DomainIndicator domains.5
>
>  phrase table looks fine to me
>
>  -phi
>
>
> On Wed, Jul 23, 2014 at 11:42 AM, Marcin Junczys-Dowmunt <
> junc...@amu.edu.pl> wrote:
>
>> In a corpus with sentences sorted by release date this could
>> actually make sense :)
>>
>> On 23.07.2014 17:40, Barry Haddow wrote:
>>
>>> Because calculating translation probabilities from sentence ids is
>>> unexpectedly beneficial?
>>>
>>> On 23/07/14 16:34, Marcin Junczys-Dowmunt wrote:
>>>
>>>>
>>>> So, how come this is not damaging the Edinburgh system?
>>>>
>>>> On 23.07.2014 17:32, Hieu Hoang wrote:
>>>>
>>>>> ah ok.
>>>>>
>>>>> I thought it was just for debugging. I'm not gonna change it since
>>>>> it's gonna involve months of debugging.
>>>>>
>>>>> Ideally, the extract format should be fixed like the phrase-table,
>>>>> with the last column being key-value pairs. Also, the way the key-value
>>>>> pairs are processed should be automatic like in the decoder.
>>>>>
>>>>> marcin - sorry mate. you're on your own
>>>>>
>>>>> On 23/07/14 16:20, Philipp Koehn wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> the sentence ID is being used for the domain indicator features.
>>>>>>
>>>>>> If you run phrase-extract's score and specify a domain file,
>>>>>> it then uses the sentence IDs to find out which domain the
>>>>>> phrase pair was found in.
>>>>>>
>>>>>> This has been a standard feature in Edinburgh's phrase-based system
>>>>>> for the last 1-2 years, so if you want to make changes, make
>>>>>> sure that this functionality still works (see [1381-5] for an example
>>>>>> with extract* files still in place).
>>>>>>
>>>>>> -phi
>>>>>>
>>>>>>
>>>>>> On Wed, Jul 23, 2014 at 7:15 AM, Marcin Junczys-Dowmunt
>>>>>> <junc...@amu.edu.pl> wrote:
>>>>>>
>>>>>>     Key-value format would actually be fine.
>>>>>>
>>>>>>     On 23.07.2014 13:12, Marcin Junczys-Dowmunt wrote:
>>>>>>
>>>>>>>     I was planning to use it for a custom feature function later.
>>>>>>>
>>>>>>>     On 23.07.2014 13:11, Hieu Hoang wrote:
>>>>>>>
>>>>>>>>     i can change it so that the sentence id is put into a
>>>>>>>>     key-value field in the last column.
>>>>>>>>
>>>>>>>>     what is the sentence id used for? is it just for debugging
>>>>>>>>     purposes?
>>>>>>>>
>>>>>>>>
>>>>>>>>     On 23 July 2014 11:36, Marcin Junczys-Dowmunt
>>>>>>>>     <junc...@amu.edu.pl> wrote:
>>>>>>>>
>>>>>>>>         Hi,
>>>>>>>>         I am using train-model.perl with
>>>>>>>>
>>>>>>>>         --extract-options="--IncludeSentenceId"
>>>>>>>>
>>>>>>>>         and it seems that the sentence id is somehow getting into
>>>>>>>>         the phrase table as a count and is later used for phrase
>>>>>>>>         translation weight calculation. For instance, the extract
>>>>>>>>         (last column is the Id):
>>>>>>>>
>>>>>>>>         #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374618
>>>>>>>>         #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374619
>>>>>>>>         #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374620
>>>>>>>>         #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374621
>>>>>>>>         #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 1374622
>>>>>>>>         #c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| 4587318
>>>>>>>>
>>>>>>>>         results in a phrase table entry like this:
>>>>>>>>
>>>>>>>>         #c the compound or process ||| #c verbindung oder verfahren ||| 1 0.0100206 5.23542e-07 0.524577 ||| 0-0 2-1 3-2 4-3 ||| 6 1.14604e+07 6 ||| |||
>>>>>>>>
>>>>>>>>         The count is equal to the sum of the sentence ids, which of
>>>>>>>>         course makes the phrase probability useless.
>>>>>>>>
>>>>>>>>         _______________________________________________
>>>>>>>>         Moses-support mailing list
>>>>>>>>         Moses-support@mit.edu <mailto:Moses-support@mit.edu>
>>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>     --     Hieu Hoang
>>>>>>>>     Research Associate
>>>>>>>>     University of Edinburgh
>>>>>>>>     http://www.hoang.co.uk/hieu
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>
>
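
The arithmetic behind Marcin's report can be checked directly from the six
extract lines quoted above. The sketch below is only an illustration, not
code from Moses: parse_extract_line is a hypothetical helper, and it assumes
the sentence id is the last "|||"-separated field, as in the quoted extract.
Summing the ids reproduces the 1.14604e+07 seen in the phrase table entry,
whereas counting occurrences gives 6.

```python
# Illustrative sketch only; parse_extract_line is a hypothetical helper,
# not part of the Moses code base.

PREFIX = "#c the compound or process ||| #c verbindung oder verfahren ||| 0-0 2-1 3-2 4-3 ||| "
IDS = [1374618, 1374619, 1374620, 1374621, 1374622, 4587318]

# The six extract lines quoted in the thread, rebuilt from their common prefix.
extract_lines = [PREFIX + str(i) for i in IDS]


def parse_extract_line(line):
    """Split an extract line into (source, target, alignment, sentence_id).

    Assumes the layout shown in the thread: four fields separated by '|||',
    with the sentence id last.
    """
    source, target, alignment, sentence_id = [f.strip() for f in line.split("|||")]
    return source, target, alignment, int(sentence_id)


sentence_ids = [parse_extract_line(line)[3] for line in extract_lines]

# Intended count: one per extracted occurrence of the phrase pair.
occurrences = len(sentence_ids)

# Reported behaviour: the ids themselves are added up as if they were counts.
summed_ids = sum(sentence_ids)

print(occurrences)          # 6
print(summed_ids)           # 11460418
print(f"{summed_ids:.6g}")  # 1.14604e+07
```

The last printed value matches the middle count in the quoted phrase table
entry, which is consistent with the ids being read as counts.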


-- 
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu
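
For readers unfamiliar with the domain indicator features Philipp mentions,
the lookup from sentence id to domain can be sketched as below. This is a
guess at the idea, not the actual score implementation: the domain file
layout (one "last_sentence_id domain_name" pair per line, in ascending
order), the domain names "patents" and "news", the boundary values, and the
functions load_domains and domain_of are all assumptions made for
illustration.

```python
from bisect import bisect_left

# Hypothetical domain file contents: each line gives the last sentence id
# belonging to a domain, followed by the domain name (assumed format).
domain_file_lines = [
    "2000000 patents",
    "5000000 news",
]


def load_domains(lines):
    """Parse 'last_sentence_id domain_name' lines into two parallel lists."""
    bounds, names = [], []
    for line in lines:
        end, name = line.split()
        bounds.append(int(end))
        names.append(name)
    return bounds, names


def domain_of(sentence_id, bounds, names):
    """Return the first domain whose upper bound is >= sentence_id."""
    i = bisect_left(bounds, sentence_id)
    if i == len(bounds):
        raise ValueError(f"sentence id {sentence_id} is beyond the last domain")
    return names[i]


bounds, names = load_domains(domain_file_lines)
print(domain_of(1374618, bounds, names))  # patents
print(domain_of(4587318, bounds, names))  # news
```

Under these assumptions the sentence id is used only as a key into the
domain table, which is why it has to survive extraction but should not take
part in the count arithmetic.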

