I agree with you that Moses should only work with text, and that it should be up to the user to strip out XML from the input.
However, escaping XML is really a defensive strategy, so that the decoder doesn't choke if the input hasn't been cleaned of XML. I think this has made decoding a bit more reliable. What else do you suggest?

On 19 February 2013 17:10, Nicholas Ruiz <[email protected]> wrote:
> Hi everyone,
>
> Question/comment/feature request regarding tokenizer.perl.
>
> Question: Why does tokenizer.perl version 1.1 provide html/xml encodings
> of special characters, such as the apostrophe?
> e.g. Please rise , then , for this minute &apos;s silence .
> https://webmail.fbk.eu/owa/?ae=PreFormAction&t=AddressBook&a=Done&ctx=2#
> Comment: If this is for XML compatibility, couldn't any relevant XML
> markup be annotated with CDATA to ignore parsing? Why should this be done
> within Moses? In my opinion**, Moses should just work with text. Otherwise,
> it's up to the user to decode the text in order to use POS taggers, etc
> that typically use the same tokenization strategies as tokenizer.perl 1.0.
> (Of course, we still need a "|" encoding)
>
> ** My opinion as a PhD student -- the value of that is left to the reader.
>
> Feature request: There's already a -x flag that skips XML fields. What do
> you think about a flag to enable/disable encodings? (In my opinion, it
> should default to being disabled.)
>
> Thanks for your time,
> Nick Ruiz
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>

--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk
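
For reference, the escaping discussed in this thread can be sketched roughly as below. This is only an illustration in Python, not the code in tokenizer.perl: the character-to-entity table, the function names and the `enabled` switch (standing in for the requested enable/disable flag) are all assumptions made for the example.

```python
# Illustrative only: the exact set of characters that tokenizer.perl escapes
# is an assumption here, and these function names are hypothetical.
ESCAPES = [
    ("&", "&amp;"),   # must come first so later entities are not double-escaped
    ("|", "&#124;"),  # "|" is reserved as the factor separator in Moses input
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
]


def escape_special(text, enabled=True):
    """Replace special characters with entities; a no-op when disabled."""
    if not enabled:
        return text
    for char, entity in ESCAPES:
        text = text.replace(char, entity)
    return text


def unescape_special(text):
    """Undo escape_special, e.g. before passing text to a POS tagger."""
    # Reverse order so that "&amp;" is decoded last.
    for char, entity in reversed(ESCAPES):
        text = text.replace(entity, char)
    return text


line = "Please rise , then , for this minute 's silence ."
escaped = escape_special(line)
print(escaped)
print(unescape_special(escaped) == line)
```

With escaping disabled the text passes through unchanged, which is the behaviour the feature request asks for; with it enabled, a user who needs plain text downstream has to run the reverse mapping first.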
