I just ran the same sentence through a fresh GitHub clone (pulled today).
corporamgr@domt-v2:~/Public/src/mosesdecoder/scripts/tokenizer$ ./tokenizer.perl -no-escape -q -l en < test.txt
which will guide you through connecting and configuring your printer 's
wireless connection .
which will guide you through connecting and configuring your printer 's
wireless connection .
which will guide you through connecting and configuring your printer 's
wireless connection .
which will guide you through connecting and configuring your printer 's
wireless connection .
which will guide you through connecting and configuring your printer 's
wireless connection .
This is not a Perl script problem. What shell and command line are you
using for your "in the file" results? You'll find the problem in either
your shell or your custom tool chain(s) before you run tokenizer.perl.
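
One way to check the input before it reaches tokenizer.perl is to look for
apostrophe-like characters that are not the plain ASCII apostrophe. This is
only a sketch under assumptions: it assumes the file is UTF-8, it assumes a
typographic apostrophe might be involved (a guess, not something confirmed
in this thread), and find_odd_apostrophes is a hypothetical helper name that
is not part of Moses.

```shell
#!/bin/sh
# Sketch: report lines that contain an apostrophe-like character other than
# the plain ASCII apostrophe (byte 0x27).
# find_odd_apostrophes is a hypothetical helper, not part of Moses.
find_odd_apostrophes() {
    # UTF-8 byte sequences, written in octal so printf stays portable:
    #   U+2019 right single quotation mark = E2 80 99
    #   U+2018 left single quotation mark  = E2 80 98
    #   U+00B4 acute accent                = C2 B4
    rsquo=$(printf '\342\200\231')
    lsquo=$(printf '\342\200\230')
    acute=$(printf '\302\264')
    # LC_ALL=C makes grep compare raw bytes; -n prefixes each matching
    # line with its line number. Exit status is 1 when nothing matches.
    LC_ALL=C grep -n -e "$rsquo" -e "$lsquo" -e "$acute" "$1"
}
```

Running `find_odd_apostrophes test.txt` prints nothing when every apostrophe
in the file is plain ASCII. If it does print lines, the file and the typed
segment are not the same input. It may also be worth confirming that both
runs pass the same -l value as the command above.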
On 01/14/2015 04:13 PM, Ihab Ramadan wrote:
Dears,
I still have this problem. To avoid confusing the decoder I used the
"-no-escape" parameter of the tokenizer.perl script, but when tokenizing
files it still adds an extra space after the apostrophe, whereas when
tokenizing a single segment the output comes without the extra space.
For example:
In the file:
"which will guide you through connecting and configuring your
printer's wireless connection." → "which will guide you through
connecting and configuring your printer ' s wireless connection ."
As a segment:
"which will guide you through connecting and configuring your
printer's wireless connection." → "which will guide you through
connecting and configuring your printer 's wireless connection ."
I wonder why the same script generates two different outputs.
I have no experience in Perl, so I could not find the line of code that
behaves differently depending on whether the segment is in a file or
passed to the script as a single segment.
Please help.
*From:*Ihab Ramadan [mailto:i.rama...@saudisoft.com]
*Sent:* Monday, January 5, 2015 10:09 AM
*To:* moses-support@mit.edu
*Subject:* Tokenization problem
Dears,
Using the tokenizer on the training files replaces the apostrophes
with "' s" (with a space), but if I use the same script to tokenize
a single sentence it turns the apostrophes into "'s" (without a space).
This problem confuses the decoder during translation.
How can I solve this problem?
Thanks
Best Regards
Ihab Ramadan | Senior Developer | Saudisoft <http://www.saudisoft.com/> - Egypt
Tel +2 02 330 320 37 Ext- 0 | Mob +201007570826 | Fax +20233032036
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support