I just ran the same sentence through a fresh clone from GitHub (pulled today):

corporamgr@domt-v2:~/Public/src/mosesdecoder/scripts/tokenizer$ ./tokenizer.perl -no-escape -q -l en < test.txt
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .

This is not a Perl script problem. What shell and command line are you using for your "in the file" results? You'll find the problem in either your shell or your custom tool chain(s) before you run tokenizer.perl.
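
One possible cause worth ruling out (an assumption, not something confirmed in this thread) is that the file contains a typographic apostrophe such as U+2019 instead of the ASCII apostrophe U+0027, for example after a copy-paste from a word processor. The sketch below is a hypothetical helper, not part of Moses; the function name find_lookalike_quotes and the set of characters it checks are illustrative choices.

```python
import sys
import unicodedata

# Quote characters that look like the ASCII apostrophe (U+0027) but are not it.
# This set is an illustrative assumption, not an exhaustive list.
LOOKALIKES = {"\u2018", "\u2019", "\u02bc", "\u00b4", "\u201c", "\u201d"}


def find_lookalike_quotes(lines):
    """Return (line_no, column, char, unicode_name) for each look-alike found."""
    hits = []
    for line_no, line in enumerate(lines, start=1):
        for col, ch in enumerate(line, start=1):
            if ch in LOOKALIKES:
                hits.append((line_no, col, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits


# Only scan a file when a path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    path = sys.argv[1]
    with open(path, encoding="utf-8") as handle:
        for line_no, col, ch, name in find_lookalike_quotes(handle):
            print(f"{path}:{line_no}:{col}: U+{ord(ch):04X} {name}")
```

Run it as `python3 check_quotes.py test.txt` (the script name is arbitrary). No output means none of the listed characters were found, and the difference has to come from somewhere else in the pipeline.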



On 01/14/2015 04:13 PM, Ihab Ramadan wrote:

Dears,

I still have this problem. To avoid confusing the decoder, I used the "-no-escape" parameter with the tokenizer.perl script, but it still adds an extra space after the apostrophe when tokenizing files. When I tokenize a single segment, the output comes without the extra space.

For example

In the file

“which will guide you through connecting and configuring your printer's wireless connection.” → “which will guide you through connecting and configuring your printer ' s wireless connection .”

As a segment

“which will guide you through connecting and configuring your printer's wireless connection.” → “which will guide you through connecting and configuring your printer 's wireless connection .”

If it is the same script, I wonder why it generates two different outputs.

I have no experience with Perl, so I could not find the line of code that behaves differently depending on whether the segment is in a file or is passed as a parameter to the script.

Please help

*From:*Ihab Ramadan [mailto:i.rama...@saudisoft.com]
*Sent:* Monday, January 5, 2015 10:09 AM
*To:* moses-support@mit.edu
*Subject:* Tokenization problem

Dears,

Using the tokenizer on the training files replaces the apostrophes with “&apos; s” (with a space), but if I use the same script to tokenize a single sentence it produces “&apos;s” (without a space).

This problem confuses the decoder during translation.

How can I solve this problem?

Thanks

Best Regards

/Ihab Ramadan/ | Senior Developer | Saudisoft <http://www.saudisoft.com/> - Egypt | Tel +2 02 330 320 37 Ext- 0 | Mob +201007570826 | Fax +20233032036



_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
