Hi,

Should that be part of the tokenizer and/or the
escape-special-characters script?

-phi

On Sat, May 31, 2014 at 8:04 PM, Hieu Hoang <hieu.ho...@ed.ac.uk> wrote:
> thanks everybody.
>
> I took Marcin's suggestion and wrote a wrapper script. It seems to be doing
> ok. It's gotten past the previous step that it failed on, and BLEU scores
> haven't been affected.
>
> I've added it to Moses if anyone wants it:
>
> https://github.com/moses-smt/mosesdecoder/commit/57235268323f97c53a9f214e3bec6e722437230f
>
>
> On 30 May 2014 18:07, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>>
>> How's this?
>>
>> cat baa | perl -C -pe 'chomp; s/\p{C}/ /g; $_="$_\n"'
>>
>>
>> On 30.05.2014 18:01, Hieu Hoang wrote:
>>
>> in the attached file, there are 2 or more non-printing chars on the 1st
>> line, between the words 'place' and 'binding'. They should be
>> removed/replaced with a space. Those chars are deleted by parsers, making
>> the word alignments incorrect and crashing extract
>>
>> The 2nd line is perfectly good utf8. It shouldn't be touched.
>>
>> just another friday nlp malaise
>>
>>
>>
>> On 30 May 2014 17:51, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
>>>
>>> it is trivial to change it to say a ? mark.
>>>
>>> but I'm not sure what you want as output now. The original request
>>> was for removing non-printable characters, which the Perl does.
>>>
>>> Miles
>>>
>>> On 30 May 2014 12:43, Hieu Hoang <hieu.ho...@ed.ac.uk> wrote:
>>> > forgot to say. The input is utf8. The snippet turns
>>> >    gonzález
>>> > to
>>> >    gonz lez
>>> >
>>> >
>>> > On 30 May 2014 17:22, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
>>> >>
>>> >> this perl snippet:
>>> >>
>>> >> $line =~ tr/\040-\176/ /c;
>>> >>
>>> >> On 30 May 2014 12:17,  <moses-support-requ...@mit.edu> wrote:
>>> >> > Today's Topics:
>>> >> >
>>> >> >    1. removing non-printing character (Hieu Hoang)
>>> >> >
>>> >> >
>>> >> >
>>> >> > ----------------------------------------------------------------------
>>> >> >
>>> >> > Message: 1
>>> >> > Date: Fri, 30 May 2014 16:24:30 +0100
>>> >> > From: Hieu Hoang <hieu.ho...@ed.ac.uk>
>>> >> > Subject: [Moses-support] removing non-printing character
>>> >> > To: moses-support <moses-support@mit.edu>
>>> >> > Message-ID:
>>> >> >
>>> >> > <caekmkbj4tedzyvgeastmg51+w-5sye5ygrmibcypc2j8ybk...@mail.gmail.com>
>>> >> > Content-Type: text/plain; charset="utf-8"
>>> >> >
>>> >> > does anyone have a script/program that can remove all non-printing
>>> >> > characters?
>>> >> >
>>> >> > I don't care if it's fast or slow, as long as it ABSOLUTELY removes
>>> >> > all non-printing chars
>>> >> >
>>> >> > --
>>> >> > Hieu Hoang
>>> >> > Research Associate
>>> >> > University of Edinburgh
>>> >> > http://www.hoang.co.uk/hieu
>>> >> > _______________________________________________
>>> >> > Moses-support mailing list
>>> >> > Moses-support@mit.edu
>>> >> > http://mailman.mit.edu/mailman/listinfo/moses-support
>>> >> >
>>> >> >
>>> >> > End of Moses-support Digest, Vol 91, Issue 52
>>> >> > *********************************************
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> The University of Edinburgh is a charitable body, registered in
>>> >> Scotland, with registration number SC005336.
>>> >
>>>
>>
>
>
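
For reference, the approach discussed in the thread (replace every character in a Unicode "Other" category with a space, and leave valid UTF-8 text such as "gonzález" alone) can be sketched in Python as below. This is only an illustrative equivalent of the quoted Perl one-liner, not the wrapper script that was committed to Moses; the function name `replace_nonprinting` and its `replacement` parameter are made up for this example. It works on decoded characters rather than bytes, which is what the byte-range `tr` snippet was missing.

```python
import unicodedata


def replace_nonprinting(line: str, replacement: str = " ") -> str:
    """Replace each character in a Unicode 'Other' category (Cc, Cf, Cs, Co, Cn)
    with `replacement`, roughly mirroring the Perl s/\\p{C}/ /g.

    `line` must already be decoded text (str), not raw bytes; operating on
    bytes is what splits multi-byte characters such as 'á'.
    """
    # Like the Perl chomp: drop one trailing newline so it is not replaced,
    # since '\n' is itself a control character (category Cc).
    body = line[:-1] if line.endswith("\n") else line
    cleaned = "".join(
        replacement if unicodedata.category(ch).startswith("C") else ch
        for ch in body
    )
    # Like $_ = "$_\n": always terminate the line with a newline.
    return cleaned + "\n"
```

Applied line by line to input opened with `encoding="utf-8"`, this keeps the token count stable because each offending character becomes a space instead of silently disappearing. Note that tabs are control characters too, so they are also turned into spaces, and the exact set of "unassigned" code points depends on the Unicode version shipped with the interpreter.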
