The simplest approach would be to use another character to join words
together. The tokeniser thinks you have hyphenated words, which is
probably not what you want.
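
As a rough sketch of that idea (my own assumptions, not part of the
Moses scripts): the hypothetical helper join_words below groups the
whitespace-separated tokens of a line into non-overlapping units of n
words, joined by a configurable character. The '_' default is only a
placeholder; check that whichever character you pick survives your
tokeniser, or apply the joining to text that is already tokenised. The
same grouping would have to be applied to both the training data and
the decoder input.

```python
def join_words(line, n=2, joiner="_"):
    """Group the tokens of one line into units of n words.

    Hypothetical helper for illustration only.
    n      -- words per unit (2 for pairs, 3 for triplets)
    joiner -- placeholder character used in place of the white space
    """
    tokens = line.split()  # split on any run of white space
    # Take the tokens n at a time; a shorter final group is kept as-is.
    groups = [joiner.join(tokens[i:i + n]) for i in range(0, len(tokens), n)]
    return " ".join(groups)


# Example: pairs of words become single tokens.
print(join_words("the man walks home"))  # the_man walks_home
```

With n=3 the same call produces triplets, so "the man walks home"
becomes "the_man_walks home". Non-overlapping grouping is only one
possible reading of "pairs or triplets"; a different grouping scheme
would need a different loop.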

Miles

On 13 June 2011 18:39, Anna c <annac...@hotmail.com> wrote:
> Hi,
> I've tried what you suggested, but I'm not sure if I'm doing it right.
> I've replaced all the occurrences in the input files as you said, adding a
> '~' between the words (as in "the~man"), but when I look at the file
> training.tok.en or training.tok.es (produced by the first steps in the
> guide), the words have been separated and appear as "the ~ man". Should
> I change tokenizer.perl to ignore the '~', or should I skip that step?
> Or is it correct that way?
>
> Thank you very much!
> Best regards,
> Anna
>
>> Date: Fri, 10 Jun 2011 10:48:07 +0100
>> Subject: Re: [Moses-support] How to change phrase representation
>> From: pko...@inf.ed.ac.uk
>> To: annac...@hotmail.com
>> CC: moses-support@mit.edu
>>
>> Hi,
>>
>> I am not entirely sure if I fully understand your question,
>> but let me try to answer.
>>
>> The phrase-based model implementation treats each token
>> separated by white space as a word. It also learns
>> translation entries for sequences of words ("phrases").
>>
>> If you want to group words into larger tokens, then you
>> have to replace the white spaces.
>>
>> For instance, if you want to force the training setup and decoder
>> to treat "the man" as a unit, then you should replace all
>> occurrences (in training data and decoder input) with "the~man".
>>
>> -phi
>>
>> On Fri, Jun 10, 2011 at 10:38 AM, Anna c <annac...@hotmail.com> wrote:
>> > Hi!
>> > I'm doing a master's degree and I need some help with one of my
>> > subjects. I've already installed GIZA++ and Moses correctly, and
>> > followed the step-by-step guide on the website, checking that
>> > everything was OK. But I'm a newbie at this and I'm a bit lost. What I
>> > have to do is change the representation so that the basic unit is not
>> > the word but pairs or triplets of words, and compare it with the normal
>> > representation. How do I do that? Do I have to change the preparation
>> > step in the training?
>> >
>> > Thank you very much!
>> > Best regards,
>> > Anna
>> >
>> > _______________________________________________
>> > Moses-support mailing list
>> > Moses-support@mit.edu
>> > http://mailman.mit.edu/mailman/listinfo/moses-support
>> >
>> >
>



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.