Hi,

Ah, in that case it can actually cause problems: your training data should always be formatted in the same way as your dev/test data.
There are two possibilities:

- re-tokenize the training data with the actual tokenizer script to get the same mark-up (then retrain your system)
- re-tokenize your dev/test data with the same (possibly older) tokenizer script as was used for your training data (then run tuning/decoding)

HTH,
Thomas

On 21 February 2014 14:49, cyrine.na...@univ-lorraine.fr <cyrine.na...@gmail.com> wrote:

> Thank you Thomas,
>
> So, I keep the text with these special characters; it will not cause
> problems? Because the training corpus is without these characters, and only
> the development and test corpora are like this.
>
> Thank you :)
>
> Best
>
>
> 2014-02-21 14:40 GMT+01:00 Thomas Meyer <ithurts...@gmail.com>:
>
>> Hi,
>>
>> That is not a 'problem' but XML entities
>> <http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references>
>> mark-up for special characters. You don't have to worry about this, as the
>> tokenizer script does it for all characters in a consistent way.
>>
>> Best,
>> Thomas
>>
>>
>> On 21 February 2014 14:20, cyrine.na...@univ-lorraine.fr <
>> cyrine.na...@gmail.com> wrote:
>>
>>> Hello all,
>>>
>>> I have a problem with the tokenizer.pl script. I get as a result a text
>>> with some special punctuation, like this for example:
>>>
>>> EU &apos;s Luxembourg-based statistical office reported
>>>
>>> The input file is a .txt file.
>>>
>>> Is there any solution for this problem?
>>>
>>> Thank you in advance
>>>
>>>
>>> Best
>>> --
>>> *Cyrine*
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> Moses-support@mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>>
>>
>
>
> --
>
> *Cyrine NASRI*
> *Ph.D. Student in Computer Science*
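The entity mark-up discussed in the quoted reply can be sketched as follows. This is an illustrative Python snippet, not the actual code of the Moses tokenizer script; the exact set of escaped characters is an assumption based on typical Moses-style escaping, so check your tokenizer's source for the definitive list:

```python
# Illustrative sketch of Moses-style entity escaping (NOT the real
# tokenizer.perl code). Characters that are special to the decoder
# (e.g. the factor separator '|' and XML delimiters) are replaced
# with XML entities after tokenization.
ESCAPES = {
    "&": "&amp;",   # must come first so other entities are not double-escaped
    "|": "&#124;",
    "<": "&lt;",
    ">": "&gt;",
    "'": "&apos;",
    '"': "&quot;",
    "[": "&#91;",
    "]": "&#93;",
}

def escape_moses(text: str) -> str:
    """Apply the (assumed) escape map in order."""
    for char, entity in ESCAPES.items():
        text = text.replace(char, entity)
    return text

print(escape_moses("EU 's Luxembourg-based statistical office reported"))
# -> EU &apos;s Luxembourg-based statistical office reported
```

Whichever corpus was tokenized without this escaping step will look different from the one tokenized with it, which is exactly the train/dev mismatch described above. (The Moses tokenizer.perl script also has a -no-escape option that suppresses this escaping, but the safer fix is the consistent re-tokenization described above.)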