Re: [Moses-support] Tokenization problem

Tom Hoar Wed, 14 Jan 2015 17:36:07 -0800

I just ran the same sentence through the newest GitHub clone (today).

corporamgr@domt-v2:~/Public/src/mosesdecoder/scripts/tokenizer$ ./tokenizer.perl -no-escape -q -l en < test.txt
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .

This is not a Perl script problem. What shell and command line are you using for your "in the file" results? You'll find the problem in either your shell or your custom tool chain(s) before you run tokenizer.perl.
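
One way to start narrowing this down is to look at the raw characters in the input file before it reaches tokenizer.perl. The Python sketch below is only an illustration, and it rests on an assumption this thread does not confirm: that the file and the single segment differ in some invisible way, such as a typographic apostrophe (U+2019) instead of the ASCII one, a no-break space, or Windows line endings. The function name describe_suspect_chars is hypothetical and is not part of Moses.

```python
import unicodedata


def describe_suspect_chars(text):
    """List every character that is not plain printable ASCII.

    Hypothetical diagnostic helper (not part of Moses). It returns
    (line_number, column, code_point, unicode_name) tuples for non-ASCII
    characters and for control characters other than tab, which includes
    the carriage return left behind by Windows line endings.
    """
    findings = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            code = ord(ch)
            if code > 126 or (code < 32 and ch != "\t"):
                # Control characters have no Unicode name, so fall back
                # to a placeholder instead of raising ValueError.
                name = unicodedata.name(ch, "UNNAMED")
                findings.append((line_no, col, f"U+{code:04X}", name))
    return findings


def scan_file(path):
    """Read a file as UTF-8, keeping line endings untouched, and scan it."""
    # newline="" stops Python from silently converting \r\n to \n.
    with open(path, encoding="utf-8", newline="") as handle:
        return describe_suspect_chars(handle.read())
```

Running scan_file over test.txt and describe_suspect_chars over the segment typed at the command line should show whether the two inputs are really identical. An empty list for both would point away from the input characters and toward the shell or the surrounding tool chain.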




On 01/14/2015 04:13 PM, Ihab Ramadan wrote:

Dears,
I still have this problem. To avoid confusing the decoder, I used the "-no-escape" parameter of the tokenizer.perl script, but the tokenizer still adds an extra space after the apostrophe when tokenizing files. When tokenizing a single segment, however, the output comes without the extra space.
For example

In the file:
“which will guide you through connecting and configuring your printer's wireless connection.” -> “which will guide you through connecting and configuring your printer ' s wireless connection .”
As a segment:
“which will guide you through connecting and configuring your printer's wireless connection.” -> “which will guide you through connecting and configuring your printer 's wireless connection .”
I wonder why the same script generates two different outputs.
I have no experience in Perl, so I could not find the line of code that behaves differently depending on whether the segment is in a file or is a single segment passed as a parameter to the script.
Please help.

*From:* Ihab Ramadan [mailto:i.rama...@saudisoft.com]
*Sent:* Monday, January 5, 2015 10:09 AM
*To:* moses-support@mit.edu
*Subject:* Tokenization problem

Dears,
Using the tokenizer on the training files replaces the apostrophes with “' s” (with a space), but if I use the same script to tokenize a single sentence, it renders the apostrophes as “'s” (without a space).
This problem confuses the decoder during translation.

How can I solve this problem?

Thanks

Best Regards
/Ihab Ramadan/ | Senior Developer | Saudisoft <http://www.saudisoft.com/> - Egypt | Tel +2 02 330 320 37 Ext-0 | Mob +201007570826 | Fax +20233032036
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

