Hi everyone,

Question/comment/feature request regarding tokenizer.perl.

Question: Why does tokenizer.perl version 1.1 produce HTML/XML entity encodings 
of special characters, such as the apostrophe?
e.g. Please rise , then , for this minute &apos;s silence .
Comment: If this is for XML compatibility, couldn't any relevant XML markup be 
wrapped in CDATA sections so that its contents are not parsed? Why should this 
be done within Moses? In my opinion**, Moses should just work with plain text. 
Otherwise, it's up to the user to decode the text before using POS taggers, 
etc., which typically use the same tokenization strategy as tokenizer.perl 1.0. 
(Of course, we still need an encoding for "|".)
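
To illustrate the decoding step this pushes onto the user, here is a rough sketch in Python. The function name and the entity table are my own assumptions for illustration, not something taken from tokenizer.perl. It decodes the named entities in a single pass and deliberately leaves a numeric "|" encoding such as &#124; alone.

```python
import re

# Entities assumed to be produced by the tokenizer's escaping step.
# This table is an illustrative assumption, not copied from tokenizer.perl.
ENTITIES = {
    "&apos;": "'",
    "&quot;": '"',
    "&lt;": "<",
    "&gt;": ">",
    "&amp;": "&",
}

# One alternation over all known entities, so each match is decoded exactly once.
_PATTERN = re.compile("|".join(re.escape(e) for e in ENTITIES))


def decode_for_tagger(line):
    """Decode the named entities above; anything else (e.g. &#124;) is kept."""
    return _PATTERN.sub(lambda m: ENTITIES[m.group(0)], line)


if __name__ == "__main__":
    print(decode_for_tagger("Please rise , then , for this minute &apos;s silence ."))
```

Decoding in one pass (rather than chaining replacements) avoids turning an escaped "&amp;apos;" into an apostrophe by accident.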

** My opinion as a PhD student -- the value of that is left to the reader.

Feature request: There's already a -x flag that skips XML fields. What do you 
think about a flag to enable/disable these encodings? (In my opinion, they 
should be disabled by default.)

Thanks for your time,
Nick Ruiz