Re: [Moses-support] word alignment-words' indexes and sentences' length

2014-01-25 Thread Tom Hoar
You might check tokenizer.perl's new argument: -protected. This option 
reads simple regex search patterns from a file and protects the patterns 
from tokenization. I've never used it so you'll need to study how it works.
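
A patterns file for -protected might be prepared along these lines. This is only a sketch: the file name protected.txt and both patterns are illustrative assumptions, U+200C (ZERO WIDTH NON-JOINER) is assumed to be the half-space character in question, and the preview uses Python's re module, whose syntax is close to Perl's but not identical. It has not been checked against tokenizer.perl itself.

```python
import re
from pathlib import Path

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER, assumed to be the "half-space"

# Hypothetical patterns: one regular expression per line of the file.
PATTERNS = [
    r"https?://\S+",          # URLs
    r"\S+" + ZWNJ + r"\S+",   # two word parts joined by a half-space
]

def write_patterns(path, patterns=PATTERNS):
    """Write one pattern per line, UTF-8 encoded."""
    Path(path).write_text("\n".join(patterns) + "\n", encoding="utf-8")

def preview(text, patterns=PATTERNS):
    """Return the substrings the patterns match, using Python's re module."""
    found = []
    for pattern in patterns:
        found.extend(re.findall(pattern, text))
    return found
```

After write_patterns("protected.txt"), the file would be passed to the tokenizer as "tokenizer.perl -protected protected.txt < input > output". Whether the half-space pattern behaves as hoped there still needs testing.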






___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support




Re: [Moses-support] word alignment-words' indexes and sentences' length

2014-01-24 Thread amir haghighi
Thank you Barry for your help.

Hi Amin,
I can't see the link. Could you please attach it to your email?

Regards
Amir


Re: [Moses-support] word alignment-words' indexes and sentences' length

2014-01-24 Thread Amin Farajian
Dear Amir,

I recently developed a normalizer and tokenizer for the Persian
language. We used the first version of this tool in our IWSLT
submission this year and got an improvement of around 1.5 BLEU points
over the baseline (which was tokenized using the Moses built-in
tokenizer).
I am going to make it publicly available soon, but if you are
interested and want to use it in your experiments now, I can share
the code with you.

Best,
Amin



Re: [Moses-support] word alignment-words' indexes and sentences' length

2014-01-24 Thread Barry Haddow
Hi Amir

You can use this tokeniser as a basis for creating your own tokeniser,
or you can swap in your own tokeniser. For EMS, a tokeniser should read
from stdin and write to stdout, so you can run it like this:

tokeniser [options] < input > output

cheers - Barry
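
A replacement tokeniser that follows the stdin/stdout convention described above could look roughly like this. It is a minimal sketch, not the Moses tokenizer: the punctuation set is an assumption chosen for illustration, and U+200C (ZERO WIDTH NON-JOINER) is assumed to be the Persian half-space, which the code leaves inside words because Python does not treat it as whitespace.

```python
import re
import sys

# ASCII punctuation plus Arabic comma, semicolon and question mark.
# This set is illustrative only; a real tokeniser needs a fuller list.
PUNCT = re.compile(r'([.,!?;:()"«»\u060c\u061b\u061f])')

def tokenise_line(line):
    """Separate punctuation and return tokens joined by single spaces."""
    spaced = PUNCT.sub(r" \1 ", line)
    # str.split() with no argument drops empty tokens, so no double
    # spaces survive; U+200C is not whitespace and stays in the word.
    return " ".join(spaced.split())

def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read lines from stdin and write one tokenised line per input line."""
    for line in stdin:
        stdout.write(tokenise_line(line) + "\n")
```

Calling main() at the bottom of the script turns it into a filter that can be run as "python3 my_tokeniser.py < input > output" (the script name is hypothetical). The input encoding follows the Python locale settings, so UTF-8 should be checked or forced.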



Re: [Moses-support] word alignment-words' indexes and sentences' length

2014-01-24 Thread amir haghighi
I use the built-in tokenizer in Moses.
How can I change this tokenizer? Should I change the source code?

Regards
Amir


Re: [Moses-support] word alignment-words' indexes and sentences' length

2014-01-23 Thread Tom Hoar
What tokenizer are you using? You can either edit/configure the
tokenizer to treat the half-spaces as non-whitespace, or escape them
before passing the text to the tokenizer.
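
The escaping route could be sketched as follows. The assumptions are that the half-space is U+200C (ZERO WIDTH NON-JOINER), that the placeholder string (a made-up name) never occurs in the corpus, and that the tokenizer passes a run of ASCII letters through unchanged.

```python
ZWNJ = "\u200c"                # ZERO WIDTH NON-JOINER, assumed half-space
PLACEHOLDER = "HALFSPACEMARK"  # hypothetical marker, assumed absent from the corpus

def escape_half_spaces(line):
    """Hide half-spaces behind a placeholder before tokenisation."""
    return line.replace(ZWNJ, PLACEHOLDER)

def restore_half_spaces(line):
    """Put the half-spaces back after tokenisation."""
    return line.replace(PLACEHOLDER, ZWNJ)

def tokenise_protected(line, tokenise):
    """Run any tokenise(str) -> str callable with half-spaces hidden from it."""
    return restore_half_spaces(tokenise(escape_half_spaces(line)))
```

Here tokenise stands for whatever tokenizer is in use, wrapped as a function from string to string. If the tokenizer splits or lowercases the placeholder, the restore step will miss it, so the output should be checked for leftovers.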






Re: [Moses-support] word alignment-words' indexes and sentences' length

2014-01-23 Thread amir haghighi
I removed all of the double spaces from the corpus, but there are still
some double spaces in the tokenised file.
My source language is Persian and I have half-spaces in my corpus. I
noticed that after the tokenisation step, these half-spaces are converted
to double spaces. This conversion disturbs the sentence length and the
alignment.
How can I prevent this conversion?
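
To see which invisible characters are actually present in a line before and after tokenisation, something like the following may help. It is a diagnostic sketch only; it assumes that the characters of interest are either Unicode whitespace other than the plain space or format characters (category Cf, which includes U+200C ZERO WIDTH NON-JOINER).

```python
import unicodedata

def suspicious_characters(line):
    """List (position, code point, name) for invisible or unusual spacing characters."""
    report = []
    for index, char in enumerate(line.rstrip("\n")):
        if char == " ":
            continue  # the ordinary space is expected
        if char.isspace() or unicodedata.category(char) == "Cf":
            name = unicodedata.name(char, "UNKNOWN")
            report.append((index, f"U+{ord(char):04X}", name))
    return report
```

Running it on the same sentence from the raw and the tokenised file shows whether a half-space has been replaced by spaces.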

Thank you again
Amir




Re: [Moses-support] word alignment-words' indexes and sentences' length

2014-01-22 Thread Hieu Hoang
Yes, remove the double space. Sometimes the double space is ignored, and
sometimes it's counted as a 'word' with no characters, depending on exactly
how the program tokenizes the line.
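
The two behaviours can be reproduced in a few lines. This illustrates the general effect in Python and is not a description of how GIZA or Moses split lines internally.

```python
def count_split_on_space(line):
    """Split on every single space: a double space yields an empty 'word'."""
    return len(line.split(" "))

def count_split_on_whitespace(line):
    """Split on runs of whitespace: double spaces are ignored."""
    return len(line.split())

def collapse_spaces(line):
    """Rewrite the line with exactly one space between tokens."""
    return " ".join(line.split())
```

For "a  b c" (double space after "a") the first count is 4 and the second is 3. An 11-token line containing two double spaces gives 13 with the first method, which would be consistent with the numbers reported in this thread, although that cause is not confirmed here.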






-- 
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu


Re: [Moses-support] word alignment-words' indexes and sentences' length

2014-01-22 Thread amir haghighi
Thank you Hieu,

The corpus is UTF-8, but there is a double space in this line. Are double
spaces regarded as a word?
Should I remove double spaces from the lines manually to get the correct
sentence length?





Re: [Moses-support] word alignment-words' indexes and sentences' length

2014-01-20 Thread Hieu Hoang


On 20/01/2014 13:45, amir haghighi wrote:
> Hello
>
> I've some questions about the giza word alignment.
>
> 1-where is the final alignment file?Is it the aligned.1.grow in
> the model folder?

yes.

> 2-do indexes of the words of both target and source sentences start
> from 0?

yes

> 3- how does giza calculate the length of a sentence?

the number of words

> I have a sentence with 11 tokens that are separated with space, but in
> the alignment file it length is 13.

strange. Are you sure your corpus file is encoded as UTF8? Are there
double spaces in the line?

> Regards
> Amir
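
The answers above (0-based indexes, length counted in words) can be checked against a sentence pair with a short script. The sketch assumes each alignment line holds space-separated "source-target" index pairs such as "0-0 1-2", and that tokens are counted by splitting on whitespace. The function names are made up for the example.

```python
def parse_alignment(line):
    """Parse '0-0 1-2' into [(0, 0), (1, 2)]."""
    pairs = []
    for item in line.split():
        src, tgt = item.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs

def out_of_range_pairs(source, target, alignment):
    """Return alignment pairs whose 0-based indexes exceed the sentence lengths."""
    n_src = len(source.split())
    n_tgt = len(target.split())
    return [(s, t) for s, t in parse_alignment(alignment)
            if s >= n_src or t >= n_tgt]
```

An empty result means every index fits the token counts. A non-empty result points to a length mismatch of the kind discussed in this thread, for example extra empty "words" from double spaces.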





[Moses-support] word alignment-words' indexes and sentences' length

2014-01-20 Thread amir haghighi
Hello,

I have some questions about the GIZA word alignment.

1- Where is the final alignment file? Is it the aligned.1.grow in the
model folder?

2- Do the indexes of the words in both the target and source sentences
start from 0?

3- How does GIZA calculate the length of a sentence? I have a sentence with
11 tokens separated by spaces, but in the alignment file its length is 13.

Regards
Amir