Hi,

Ah, in that case it can actually cause problems: your training data should
always be formatted in the same way as your dev/test data.

2 possibilities:

- re-tokenize training data with the actual tokenizer script to have the
same mark-up (then retrain your system)
- re-tokenize your dev/test data with the same (possibly older) tokenizer
script as was used for your training data (then run tuning/decoding)
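In case it helps to check whether two corpora carry the same mark-up, the escaping the tokenizer applies can be sketched roughly like this. The ENTITIES mapping below is my approximation of what tokenizer.pl does; check your own version of the script for the exact character set:

```python
# Rough sketch of the XML-entity escaping applied by the Moses tokenizer.
# The exact mapping may differ between tokenizer.pl versions.
ENTITIES = {
    "&": "&amp;",   # must come first so later entities are not double-escaped
    "|": "&#124;",  # Moses factor separator
    "<": "&lt;",
    ">": "&gt;",
    "'": "&apos;",
    '"': "&quot;",
    "[": "&#91;",
    "]": "&#93;",
}

def escape(line: str) -> str:
    """Replace special characters with their XML entities, '&' first."""
    for char, entity in ENTITIES.items():
        line = line.replace(char, entity)
    return line

print(escape("EU 's Luxembourg-based statistical office reported"))
# EU &apos;s Luxembourg-based statistical office reported
```

Running this over a few lines of each corpus and comparing the output should make it obvious whether one side was tokenized without the escaping step.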

HTH,
Thomas


On 21 February 2014 14:49, cyrine.na...@univ-lorraine.fr <
cyrine.na...@gmail.com> wrote:

> Thank you Thomas,
>
> So, if I keep the text with these special characters, it will not cause
> problems? Because the training corpus is without these characters; only
> the development and test corpus are like this.
>
> Thank you :)
>
> Best
>
>
> 2014-02-21 14:40 GMT+01:00 Thomas Meyer <ithurts...@gmail.com>:
>
>>
>>
>> Hi,
>>
>> That is not a 'problem' but XML entity mark-up
>> <http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references>
>> for special characters. You don't have to worry about this, as the
>> for special characters. You don't have to worry about this, as the
>> tokenizer script does it for all characters in a consistent way.
>>
>> Best,
>> Thomas
>>
>>
>> On 21 February 2014 14:20, cyrine.na...@univ-lorraine.fr <
>> cyrine.na...@gmail.com> wrote:
>>
>>>
>>> Hello all,
>>>
>>> I have a problem with the tokenizer.pl script. I get as a result a text
>>> with some special punctuation, like this for example:
>>>
>>> EU &apos;s Luxembourg-based statistical office reported
>>>
>>> The input file is a .txt file
>>>
>>> Is there any solution for this problem?
>>>
>>> Thank you in advance
>>>
>>>
>>> Best
>>> --
>>> *Cyrine*
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> Moses-support@mit.edu
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>>
>>
>
>
> --
>
> *Cyrine NASRI*
> *Ph.D. Student in Computer Science*
>
