Re: [Moses-support] Tokenization

2020-04-12 Thread Justin Cunningham

Thanks for replying! It actually ended up being a spelling error in the code.

Thanks,
Justin


___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support

Re: [Moses-support] Tokenization

2020-04-12 Thread Hieu Hoang

The Moses tokenizer expects its input on standard input (stdin).

Hieu Hoang
http://statmt.org/hieu
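
As a minimal sketch of what reading from standard input means in practice, the wrapper below redirects a file into whatever command TOKENIZER_CMD holds and writes the result to a second file. The default command path, the "-l en" language choice, and the names TOKENIZER_CMD and tokenize_file are assumptions for illustration; they are not taken from the linked script.

```shell
# Command that reads text on stdin and writes tokens to stdout.
# The default path is hypothetical; point it at your own Moses checkout.
TOKENIZER_CMD=${TOKENIZER_CMD:-"$HOME/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en"}

# tokenize_file INPUT OUTPUT
# Redirects INPUT into the tokenizer's stdin. Without the "<" redirect a
# program that reads stdin waits for keyboard input and can look stuck.
tokenize_file() {
    # $TOKENIZER_CMD is left unquoted on purpose so its flags split into words.
    $TOKENIZER_CMD < "$1" > "$2"
}
```

A call would then look like "tokenize_file corpus.en corpus.tok.en" (both file names are placeholders).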



[Moses-support] Tokenization

2020-04-12 Thread Justin Cunningham

Hi,

I’m currently working on a Neural Machine Translator but I am quite new to it 
all. I am trying to tokenise my files in Linux using the following shell script 
(https://github.com/JustCunn/IrishNMT/blob/master/GaeilgePrepare.sh) and these 
files:

http://opus.nlpl.eu/download.php?f=EUbookshop/v2/moses/en-ga.txt.zip
http://opus.nlpl.eu/download.php?f=QED/v2.0a/moses/en-ga.txt.zip

But it just won’t work. Sometimes the tokenization step is skipped; other
times it gets stuck at the “Tokenizer... number of threads...” message. For
context, they are all plain text files. Am I not formatting the text correctly?

I’d appreciate it if someone could help me with this, as it would be a huge
help in my understanding of it all.

Thanks,
Justin

Re: [Moses-support] Tokenization problem

2015-01-15 Thread Ihab Ramadan

Many thanks to all of you.
As you mentioned, the problem was not in the script; it was in the text sent
to the terminal from my web app. I found that some characters do not arrive
unchanged and come through as unexpected Unicode characters.
Thanks, everybody.
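
One way to check for the kind of unexpected characters described above is to count the bytes outside the ASCII range in each input. This is a sketch under assumptions: the function name count_non_ascii_bytes is hypothetical, and the typographic apostrophe (U+2019) mentioned below is only an example of a character that could differ, not something confirmed from the original data.

```shell
# count_non_ascii_bytes FILE
# Prints how many bytes in FILE fall outside the 7-bit ASCII range.
# LC_ALL=C makes tr operate on raw bytes rather than multibyte characters.
count_non_ascii_bytes() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}
```

A file containing printer's with a plain ASCII apostrophe reports 0, while the same word written with U+2019 reports 3, because that character takes three bytes in UTF-8. Running "od -c FILE" then shows where those bytes sit.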

-Original Message-
From: moses-support-boun...@mit.edu [mailto:moses-support-boun...@mit.edu]
On Behalf Of moses-support-requ...@mit.edu
Sent: Thursday, January 15, 2015 3:39 AM
To: moses-support@mit.edu
Subject: Moses-support Digest, Vol 99, Issue 28

Message: 1
Date: Thu, 15 Jan 2015 08:54:06 +0800
From: iamzcy_hit iamzcy_hit iamzcy...@gmail.com
Subject: [Moses-support] how to align some new parallel sentences
using a trained model
To: moses-support@mit.edu moses-support@mit.edu
Message-ID:
CAGLowvLWHXb_J+=vZqMeOVCOD7Z=Uzyz_Sn=yjv+ptsfsyv...@mail.gmail.com
Content-Type: text/plain; charset=utf-8

Hi all,
  If I have trained an alignment model on a huge parallel corpus with the
help of GIZA++, MGIZA or fast_align, and I am now given some new sentence
pairs and want to align the words in them, how should I do it?
  Best regards


Re: [Moses-support] Tokenization problem

2015-01-14 Thread Tom Hoar

Good catch, Ken. I see your point. For example, considering the likely
language pair (EN-AR), there could be some non-printing characters in
the text file that the copy/paste clipboard drops.



Re: [Moses-support] Tokenization problem

2015-01-14 Thread Kenneth Heafield

I'll interject that it is plausible there is some weird Unicode going on
there, and copy-paste on Linux sometimes canonicalizes graphemes.  Whilst
I'm inclined to side with Tom, the only way to sort this out is with the
raw file from Ihab, e.g. as a gzipped attachment.

Kenneth
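
Since copy-paste may alter characters, comparing the raw bytes of the two inputs is a quick first check before sharing the file. A minimal sketch, assuming the file-based input and the pasted segment have each been saved to a file; same_bytes and report_difference are hypothetical helper names.

```shell
# same_bytes FILE_A FILE_B
# Succeeds (exit status 0) only when both files are byte-for-byte identical.
# cmp -s compares silently, so the caller decides what to report.
same_bytes() {
    cmp -s "$1" "$2"
}

# report_difference FILE_A FILE_B
# Prints "identical" or "different" so the result is easy to read in a log.
report_difference() {
    if same_bytes "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}
```

If the two inputs look the same on screen but this reports "different", the discrepancy is in the data rather than in the tokenizer.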


Re: [Moses-support] Tokenization problem

2015-01-14 Thread Tom Hoar


I just ran the same sentence through the newest github clone (today).

corporamgr@domt-v2:~/Public/src/mosesdecoder/scripts/tokenizer$ ./tokenizer.perl -no-escape -q -l en < test.txt
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .

This is not a Perl script problem. What shell and command line are you
using to produce your "in the file" results? You'll find the problem in
either your shell or your custom tool chain(s) before you run tokenizer.perl.





Re: [Moses-support] Tokenization problem

2015-01-14 Thread Ihab Ramadan

Dears,

I still have this problem. To avoid confusing the decoder I used the
-no-escape parameter of the tokenizer.perl script, but an extra space is
still added after the apostrophe when I tokenize files, whereas when I
tokenize a single segment the extra space does not appear.

For example:

In the file

which will guide you through connecting and configuring your printer's
wireless connection.  ->  which will guide you through connecting and
configuring your printer ' s wireless connection .

As a segment

which will guide you through connecting and configuring your printer's
wireless connection.  ->  which will guide you through connecting and
configuring your printer 's wireless connection .

If it is the same script, I wonder why it generates two different outputs.

I have no experience in Perl, so I could not find the line of code that
treats a segment in a file differently from a single segment passed as a
parameter to the script.

Please help

 

 

 


Re: [Moses-support] Tokenization problem

2015-01-14 Thread Ihab Ramadan

Dears,

I found the problem.

At line 289 of the tokenizer.perl script, just add a space, like this:

The original code

$text =~ s/([\p{IsAlpha}])[']([\p{IsAlpha}])/$1 ' $2/g;

The modified one 

$text =~ s/([\p{IsAlpha}])[']([\p{IsAlpha}])/$1 '  $2/g;

With this modification, tokenizing files gives the same result as tokenizing
a single segment.
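
To make the two spacings discussed in this thread concrete, here is a sketch that uses sed as a stand-in for a substitution of the kind quoted above. It is an approximation for illustration, not the tokenizer's own code: the function names are hypothetical, and [[:alpha:]] only roughly corresponds to Perl's \p{IsAlpha}.

```shell
# split_apostrophe_joined: "printer's" -> "printer 's"
# Inserts a space before the apostrophe and keeps it attached to what follows.
split_apostrophe_joined() {
    sed "s/\([[:alpha:]]\)'\([[:alpha:]]\)/\1 '\2/g"
}

# split_apostrophe_spaced: "printer's" -> "printer ' s"
# Puts a space on both sides of the apostrophe.
split_apostrophe_spaced() {
    sed "s/\([[:alpha:]]\)'\([[:alpha:]]\)/\1 ' \2/g"
}
```

Both functions read stdin and write stdout. A system trained on one form and queried with the other sees different token sequences, which is why training and test data need the same tokenization.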

Thanks

 


Re: [Moses-support] Tokenization problem

2015-01-14 Thread Tom Hoar

I don't see the problem. I get the same results from the original
tokenizer.perl script whether I use echo on the command line or read from
a file, i.e. no space between the apostrophe and the s.

tahoar@asus-notebook:~$ echo "which will guide you through connecting and configuring your printer's wireless connection." | tokenizer.perl -q -l en
which will guide you through connecting and configuring your printer &apos;s wireless connection .

tahoar@asus-notebook:~$ tokenizer.perl -q -l en < test.txt
which will guide you through connecting and configuring your printer &apos;s wireless connection .
which will guide you through connecting and configuring your printer &apos;s wireless connection .
which will guide you through connecting and configuring your printer &apos;s wireless connection .
which will guide you through connecting and configuring your printer &apos;s wireless connection .
which will guide you through connecting and configuring your printer &apos;s wireless connection .

(five copies of your sentence in test.txt)




[Moses-support] Tokenization problem

2015-01-05 Thread Ihab Ramadan

Dears,

Using the tokenizer on the training files replaces apostrophes with
"&apos; s" (with a space), but if I use the same script to tokenize a single
sentence, apostrophes become "&apos;s" (without a space).

This confuses the decoder during translation.

How can I solve this problem?

Thanks

 

Best Regards

Ihab Ramadan | Senior Developer | Saudisoft - Egypt (http://www.saudisoft.com/)
Tel +2 02 330 320 37 Ext- 0 | Mob +201007570826 | Fax +20233032036
Follow us on:
http://www.linkedin.com/company/77017?trk=vsrp_companies_res_nametrkInfo=VSRPsearchId%3A1489659901402995947155%2CVSRPtargetId%3A77017%2CVSRPcmpt%3Aprimary
https://www.facebook.com/pages/Saudisoft-Co-Ltd/289968997768973?ref_type=bookmark
https://twitter.com/Saudisoft

 


Re: [Moses-support] Tokenization problem

2015-01-05 Thread Barry Haddow

Hi Ihab

If you run the tokeniser with the same arguments then it should give the
same results in test as in training. The spaces around the apostrophe
depend on the context. Maybe if you post the full sentences, someone can
explain why they are handled differently.

cheers - Barry


Re: [Moses-support] Tokenization issue

2014-11-10 Thread Ihab Ramadan

Hi Hieu,

Should I apply tokenization and truecasing to both the corpus file and the
parallel files, or only to the parallel files?

Thanks 

 

From: hieuho...@gmail.com [mailto:hieuho...@gmail.com] On Behalf Of Hieu Hoang
Sent: Monday, November 3, 2014 8:18 PM
To: i.rama...@saudisoft.com
Cc: moses-support
Subject: Re: [Moses-support] Tokenization issue

 

Hi Ihab,

At its most basic, tokenization separates punctuation from words. However, it
can also be used to split a word into its morphemes to make it easier to
process.
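
A rough illustration of that basic step, separating punctuation from words: the sed sketch below pads a handful of punctuation marks with spaces and then collapses repeated spaces. It is a deliberate oversimplification (the Moses tokenizer treats periods, abbreviations and language-specific cases far more carefully), and separate_punct is a hypothetical name.

```shell
# separate_punct: reads text on stdin and writes it with each listed
# punctuation mark turned into a separate, space-delimited token.
separate_punct() {
    sed -e 's/\([.,!?;:"]\)/ \1 /g' \
        -e 's/  */ /g' \
        -e 's/^ //' \
        -e 's/ $//'
}
```

The punctuation is not discarded; it becomes a token of its own, so the word "translated" is the same token whether or not a period or quotation mark was attached to it in the input.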

Moses doesn't include a very good Arabic tokeniser. Each language needs a
nonbreaking_prefix file, located in
   scripts/share/nonbreaking_prefixes

This doesn't exist for Arabic, so the tokenizer uses the English file instead.

If you create a nonbreaking_prefix file for Arabic, please share it with us.
Or use a tool like MADA to tokenize your Arabic data.
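
Whether a language has its own prefix file can be checked before tokenizing. A small sketch, assuming the files in that directory follow the nonbreaking_prefix.<lang> naming; the function name has_nonbreaking_prefix is hypothetical, and the directory in the usage line should be adjusted to your own checkout.

```shell
# has_nonbreaking_prefix DIR LANG
# Succeeds (exit status 0) when DIR contains a prefix file for language LANG.
has_nonbreaking_prefix() {
    [ -f "$1/nonbreaking_prefix.$2" ]
}
```

For example: has_nonbreaking_prefix scripts/share/nonbreaking_prefixes ar || echo "no prefix file for ar"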

 





-- 

Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu


Re: [Moses-support] Tokenization issue

2014-11-04 Thread Ihab Ramadan

Thanks, Hieu.

Sure, if I make a nonbreaking_prefix file for Arabic, I will share it.

 

 


[Moses-support] Tokenization issue

2014-10-28 Thread Ihab Ramadan

Dears,

I think I misunderstand what tokenization really does.

My understanding is that it makes the translation of text such as
translated text
give the same output as “translated” text, translated.text, or
translated text .
In other words, it ignores any punctuation in the text.

Am I right?

I tokenized my data, but this is not what happens.

Note: the tokenizer script must be given the language, and it could not
recognize Arabic (ar), which is my target language.

 

