Hi Henry,

If you are willing to use XLIFF-style placeholders, M4Loc/Okapi already provide support for placeholder preservation with XML input, and for plain-text translation with placeholder reinsertion: http://code.google.com/p/m4loc .

The latter method uses phrase-alignment information from the decoder, not yet the word-alignment information that Philipp was talking about (which requires special configuration during training: http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc7 ).

Achim
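To make that reinsertion idea concrete, here is a minimal sketch in Python of alignment-based placeholder reinsertion. It is not M4Loc's actual code; the function name and data layout are invented for the example, and it only illustrates the approach Achim describes (strip placeholders before decoding, then use the decoder's source-target alignment points to put them back near the right words).

# Minimal sketch of alignment-based placeholder reinsertion (illustrative only).
def reinsert_placeholders(placeholders, target_tokens, alignment):
    """
    placeholders  -- maps a source token index to the placeholder that stood
                     immediately before that index, e.g. {0: "{33}", 1: "{34}"}
    target_tokens -- translated tokens, placeholders removed
    alignment     -- (source_index, target_index) pairs reported by the decoder
    """
    # For each source position, keep the first target position aligned to it.
    src_to_tgt = {}
    for s, t in sorted(alignment):
        src_to_tgt.setdefault(s, t)

    # Collect insertions, then apply them right-to-left so indices stay valid.
    insertions = []
    for src_idx, ph in placeholders.items():
        tgt_idx = src_to_tgt.get(src_idx, len(target_tokens))  # fall back to the end
        insertions.append((tgt_idx, ph))

    out = list(target_tokens)
    for tgt_idx, ph in sorted(insertions, reverse=True):
        out.insert(tgt_idx, ph)
    return out

# Example: "{33} Processor {34}" decoded as "processeur" with alignment 0-0.
print(reinsert_placeholders({0: "{33}", 1: "{34}"}, ["processeur"], [(0, 0)]))
# -> ['{33}', 'processeur', '{34}']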
-----Original Message-----
From: moses-support-boun...@mit.edu [mailto:moses-support-boun...@mit.edu] On Behalf Of Henry Hu
Sent: Wednesday, August 01, 2012 2:46 AM
To: Tom Hoar
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] Placeholder drift

Thank you all very much.

Yes, the original text was split into 3 tokens:

{} Processor {}

The translated text also includes 3 tokens:

{} {} processeur

Then I took another test, according to Daniel's suggestion. The input text was:

{33} Processor {34}

I still got the following translation:

{33} {34} processeur

After reading your replies carefully, I guess that the XML markup may be a solution to the issue. I cannot remove the placeholders from the text, because without them in the original text there is no way to re-populate the tags into the translated text.

Thanks,
Henry

On Tue, Jul 31, 2012 at 10:58 PM, Tom Hoar <tah...@precisiontranslationtools.com> wrote:
> John, this is true if there were three tokens, but {}Processor{} has no spaces. Assuming that the target language should be {}processeur{} without spaces in both the parallel and LM data, the tables and the language model will treat it as one token and not break it up.
>
> Henry, I suspect your corpus preparation inserts spaces between them to create {} Processor {} (3 tokens). John's description is much more viable if this is the case.
>
> One oddity is the output {}{} token, because it's one token, not two. Moses won't remove the space to splice the two. It would seem your target data contains this as a token from somewhere in the tables or LM.
>
> I suggest you double-check your tokenization and other preparation to ensure source and target are still one token when you start training.
>
> Tom
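As a small sketch of the kind of pre-training check Tom recommends, the snippet below assumes the "{...}" placeholder convention used in this thread; the helper names are illustrative and not part of any Moses tooling. It forces placeholders to tokenize the same way on both sides and flags parallel lines where the placeholder counts disagree.

# Illustrative consistency check for placeholder tokenization (not Moses code).
import re

PLACEHOLDER = re.compile(r"\{\d*\}")

def normalize_placeholders(line):
    # Force each placeholder to be its own whitespace-separated token,
    # e.g. "{}Processor{}" -> "{} Processor {}".
    spaced = PLACEHOLDER.sub(lambda m: " " + m.group(0) + " ", line)
    return re.sub(r"\s+", " ", spaced).strip()

def check_parallel(src_path, tgt_path):
    # Report lines where source and target disagree on the number of placeholders.
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        for n, (s, t) in enumerate(zip(src, tgt), start=1):
            s_count = len(PLACEHOLDER.findall(s))
            t_count = len(PLACEHOLDER.findall(t))
            if s_count != t_count:
                print("line %d: %d placeholder(s) in source, %d in target" % (n, s_count, t_count))

print(normalize_placeholders("{}Processor{}"))   # -> "{} Processor {}"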
> On Tue, 31 Jul 2012 10:08:43 -0400, John D Burger <j...@mitre.org> wrote:
>> Are there any such placeholders in your language modeling data and your parallel training data? If not, all the models are going to treat them as unknown words. In the case of the language model, it doesn't surprise me too much that the placeholders all get pushed together, as that will produce fewer discontiguous subsequences, which the language model will prefer.
>>
>> - John Burger
>> MITRE
>>
>> On Jul 31, 2012, at 03:05 , Henry Hu wrote:
>>
>>> Hi,
>>>
>>> I use a model to translate English to French. First, I replaced HTML tags such as <a> and <b> with the placeholder {}, like this:
>>>
>>> {}Processor{}
>>>
>>> Then I decoded. To my confusion, I got the result:
>>>
>>> {}{} processeur
>>>
>>> instead of {}processeur{}. Why did the placeholder move? How can I keep it in place? Thanks for any suggestion.
>>>
>>> Henry
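For the XML-markup route Henry mentions, the rough idea is to wrap each placeholder in a tag whose forced translation is the placeholder itself, so the decoder keeps it in place when run with -xml-input exclusive (see the Moses XML markup documentation). The wrapper below and the tag name "x" are only illustrative, and the "{...}" pattern is simply this thread's convention.

# Illustrative only: produce XML-marked input for the Moses decoder.
import re

PLACEHOLDER = re.compile(r"\{\d*\}")

def mark_up(line):
    # Wrap each placeholder so its translation is forced to be itself.
    return PLACEHOLDER.sub(lambda m: f'<x translation="{m.group(0)}">{m.group(0)}</x>', line)

print(mark_up("{33} Processor {34}"))
# -> <x translation="{33}">{33}</x> Processor <x translation="{34}">{34}</x>
# Feed lines like this to the decoder, e.g.:  moses -f moses.ini -xml-input exclusive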