Hi.

The fix was checked in a few hours ago.
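
For anyone following along, here is a rough sketch of what protecting every match on a line can look like. It is not the committed change: `protect_all`, `restore_all`, and the `PROTECTED000` placeholder format are hypothetical names made up for illustration. The point is only that the substitution needs /g (or an explicit loop) to reach the matches after the first one.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every match of every pattern with a
# numbered placeholder so later tokenization steps cannot split it.
sub protect_all {
    my ($text, @patterns) = @_;
    my @saved;
    for my $pattern (@patterns) {
        # /g visits every match on the line, not only the first one;
        # /e evaluates the replacement as code so the match can be stored.
        $text =~ s/($pattern)/push @saved, $1; sprintf("PROTECTED%03d", $#saved)/ge;
    }
    return ($text, \@saved);
}

# Hypothetical inverse: put the saved strings back in place.
sub restore_all {
    my ($text, $saved) = @_;
    $text =~ s/PROTECTED(\d{3})/$saved->[$1]/g;
    return $text;
}

my ($protected, $saved) = protect_all('see <a> and <b> here', qr/<[^>]+>/);
print "$protected\n";                          # see PROTECTED000 and PROTECTED001 here
print restore_all($protected, $saved), "\n";   # see <a> and <b> here
```

One caveat with this sketch: a later pattern could match inside an earlier placeholder, so real code would need placeholders that no pattern can hit.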

-phi


On Tue, Aug 5, 2014 at 2:35 PM, Judah Schvimer <judah.schvi...@mongodb.com>
wrote:

> Hi,
>
> I've been playing around with this and I noticed that the protected flag
> only "protects" the first match of a regex on a line. Is there any way to
> fix this so that it protects every occurrence?
>
> Thanks,
> Judah
>
>
> On Thu, Jul 31, 2014 at 9:32 AM, Philipp Koehn <pko...@inf.ed.ac.uk>
> wrote:
>
>> Hi,
>>
>> -no-escape turns off this:
>>
>>     if (!$NO_ESCAPING)
>>       {
>>         $text =~ s/\&/\&amp;/g;   # escape escape
>>         $text =~ s/\|/\&#124;/g;  # factor separator
>>         $text =~ s/\</\&lt;/g;    # xml
>>         $text =~ s/\>/\&gt;/g;    # xml
>>         $text =~ s/\'/\&apos;/g;  # xml
>>         $text =~ s/\"/\&quot;/g;  # xml
>>         $text =~ s/\[/\&#91;/g;   # syntax non-terminal
>>         $text =~ s/\]/\&#93;/g;   # syntax non-terminal
>>       }
>>
>> Not escaping the "|" in particular will cause trouble, since it is the
>> factor separator.
>>
>> So, you should not turn this off -- it is completely reversible by the
>> detokenizer anyway.
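
To illustrate why the escaping is reversible, here is a minimal round-trip sketch. `escape_special` repeats the substitutions quoted above; `unescape_special` is an assumed inverse written for this example, not a copy of the detokenizer's code. Both function names are hypothetical.

```perl
use strict;
use warnings;

# Same substitutions as the quoted tokenizer snippet.
sub escape_special {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;      # must run first so later entities are not re-escaped
    $text =~ s/\|/&#124;/g;    # factor separator
    $text =~ s/</&lt;/g;       # xml
    $text =~ s/>/&gt;/g;       # xml
    $text =~ s/'/&apos;/g;     # xml
    $text =~ s/"/&quot;/g;     # xml
    $text =~ s/\[/&#91;/g;     # syntax non-terminal
    $text =~ s/\]/&#93;/g;     # syntax non-terminal
    return $text;
}

# Assumed inverse: undo each entity, restoring "&" last.
sub unescape_special {
    my ($text) = @_;
    $text =~ s/&#124;/|/g;
    $text =~ s/&lt;/</g;
    $text =~ s/&gt;/>/g;
    $text =~ s/&apos;/'/g;
    $text =~ s/&quot;/"/g;
    $text =~ s/&#91;/[/g;
    $text =~ s/&#93;/]/g;
    $text =~ s/&amp;/&/g;      # must run last so "&amp;lt;" becomes "&lt;", not "<"
    return $text;
}

my $raw     = q{a|b & <c> [d]};
my $escaped = escape_special($raw);
print "$escaped\n";
print unescape_special($escaped) eq $raw ? "round trip ok\n" : "round trip FAILED\n";
```

The ordering matters: "&" is escaped first and unescaped last, so input that already contains a literal entity such as "&lt;" comes back unchanged.
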
>>
>> -phi
>>
>>
>>
>> On Thu, Jul 31, 2014 at 9:09 AM, Judah Schvimer <
>> judah.schvi...@mongodb.com> wrote:
>>
>>> Thanks, that makes sense. One more question: if I use the -no-escape
>>> flag, will that cause any problems for Moses, or does it still escape the
>>> special characters that break Moses?
>>>
>>> Judah
>>>
>>>
>>> On Thu, Jul 31, 2014 at 8:52 AM, Philipp Koehn <pko...@inf.ed.ac.uk>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> this is done deliberately:
>>>>
>>>>     # turn ` into '
>>>>     $text =~ s/\`/\'/g;
>>>>
>>>>     # turn '' into "
>>>>     $text =~ s/\'\'/ \" /g;
>>>>
>>>> The motivation is to normalize corpora that used more ``creative'' ways
>>>> of quoting. You may want to remove these lines from the tokenizer or
>>>> create a switch for the script to optionally turn it off.
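
A switch of the kind suggested above might look like the following sketch. The function `normalize_quotes` and its `$enabled` argument are hypothetical and invented for this example; they are not existing tokenizer options.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the two quoted substitutions.
# $enabled is an assumed switch; pass 0 to leave quoting untouched.
sub normalize_quotes {
    my ($text, $enabled) = @_;
    return $text unless $enabled;
    $text =~ s/`/'/g;       # turn ` into '
    $text =~ s/''/ " /g;    # turn '' into "
    return $text;
}

print normalize_quotes('run `ls` now', 1), "\n";   # run 'ls' now
print normalize_quotes('run `ls` now', 0), "\n";   # run `ls` now
```

With the switch off, backticks survive tokenization and therefore come back unchanged after detokenization.
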
>>>>
>>>> -phi
>>>>
>>>>
>>>> On Wed, Jul 30, 2014 at 5:38 PM, Judah Schvimer <
>>>> judah.schvi...@mongodb.com> wrote:
>>>>
>>>>> It seems that backticks (`) are being tokenized to apostrophes ('), so
>>>>> when they get detokenized they show up as an apostrophe and not a
>>>>> backtick. Additionally, "-no-escape" seems to turn backticks into
>>>>> apostrophes as well. I think this is a bug in the tokenizer. Let me know
>>>>> if you think I'm doing something wrong.
>>>>>
>>>>> Thanks,
>>>>> Judah
>>>>>
>>>>> _______________________________________________
>>>>> Moses-support mailing list
>>>>> Moses-support@mit.edu
>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>
>>>>>
>>>>
>>>
>>
>