Hi Henry,
If you are willing to use XLIFF-style placeholders, M4Loc/Okapi already
provide support for placeholder preservation with XML input, as well as
plain-text translation with placeholder reinsertion: http://code.google.com/p/m4loc .

The latter method uses phrase-alignment info from the decoder. It does not yet
use the word-alignment information that Philipp was talking about (which
requires special configuration during training:
http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc7 ).
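
If you only need the decoder to copy the placeholders through unchanged, you
can also mark them up for Moses's XML input and run the decoder with
-xml-input exclusive. A rough sketch of the markup step (the {N} placeholder
pattern, the tag name "ph", and the function name are only for illustration):

    import re

    # Wrap each {N} placeholder in XML markup with a fixed translation so the
    # decoder keeps it in place when started with "-xml-input exclusive".
    def mark_placeholders(line):
        return re.sub(r'\{\d+\}',
                      lambda m: '<ph translation="%s">%s</ph>' % (m.group(0), m.group(0)),
                      line)

    print(mark_placeholders("{33} Processor {34}"))
    # -> <ph translation="{33}">{33}</ph> Processor <ph translation="{34}">{34}</ph>

This only constrains where the placeholders end up; the surrounding text is
still translated normally.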

Achim 


-----Original Message-----
From: moses-support-boun...@mit.edu [mailto:moses-support-boun...@mit.edu]
On Behalf Of Henry Hu
Sent: Wednesday, August 01, 2012 2:46 AM
To: Tom Hoar
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] Placeholder drift

Thank you all very much.

Yes, the original text was split into 3 tokens:

{} Processor {}

The translated text also includes 3 tokens:

{} {} processeur

Then I ran another test, following Daniel's suggestion. The input text is:

{33} Processor {34}

I still got the same kind of translation result:

{33} {34} processeur

After reading your replies carefully, I think the XML markup may be a
solution to the issue. I cannot simply remove the placeholders from the text,
because without them there is no way to re-populate the tags into the
translated text.
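
For what it's worth, the round trip I have in mind looks roughly like this
(an illustrative sketch only; the tag regex and function names are made up
for the example):

    import re

    # Replace HTML tags with numbered placeholders, remembering the originals.
    def extract_tags(text):
        tags = []
        def repl(m):
            tags.append(m.group(0))
            return "{%d}" % len(tags)      # <b> -> {1}, </b> -> {2}, ...
        return re.sub(r'</?[a-zA-Z][^>]*>', repl, text), tags

    # Put the original tags back into the translated text by number.
    def restore_tags(translated, tags):
        return re.sub(r'\{(\d+)\}', lambda m: tags[int(m.group(1)) - 1], translated)

    src, tags = extract_tags("<b>Processor</b>")       # src == "{1}Processor{2}"
    print(restore_tags("{1}processeur{2}", tags))      # -> "<b>processeur</b>"

Of course this only works if the decoder keeps the placeholders where they
belong, which is exactly what is going wrong at the moment.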

Thanks,
Henry

On Tue, Jul 31, 2012 at 10:58 PM, Tom Hoar
<tah...@precisiontranslationtools.com> wrote:
> John, this is true if there were three tokens, but {}Processor{} has
> no spaces. Assuming that the target side should be {}processeur{}
> without spaces in both the parallel and LM data, the phrase tables and the
> language model will treat it as one token and not break it up.
>
> Henry, I suspect your corpus preparation inserts spaces to create
> {} Processor {} (3 tokens). John's description is much more
> likely if this is the case.
>
> One oddity is the {}{} in the output, because it is one token, not two.
> Moses won't remove a space to splice two tokens together, so it would seem
> your target data contains {}{} as a single token somewhere in the tables or the LM.
>
> I suggest you double-check your tokenization and other preparation to
> ensure the source and target strings are still single tokens when you start training.
>
> Tom
>
>
>
> On Tue, 31 Jul 2012 10:08:43 -0400, John D Burger <j...@mitre.org> wrote:
>>
>> Are there any such placeholders in your language modeling data and 
>> your parallel training data?  If not, all the models are going to 
>> treat them as unknown words.  In the case of the language model, it 
>> doesn't surprise me too much that the placeholders all get pushed 
>> together, as that will produce fewer discontiguous subsequences, 
>> which the language model will prefer.
>>
>> - John Burger
>>   MITRE
>>
>> On Jul 31, 2012, at 03:05 , Henry Hu wrote:
>>
>>> Hi,
>>>
>>> I use a model to translate English to French. First, I replaced HTML
>>> tags such as <a> and <b> with the placeholder {}, like this:
>>>
>>> {}Processor{}
>>>
>>> Then I ran the decoder. To my confusion, I got the result:
>>>
>>> {}{} processeur
>>>
>>> instead of {}processeur{}. Why did the placeholder move? How can I
>>> keep it in place? Thanks for any suggestions.
>>>
>>> Henry
>>
>>
>>
>
>

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
