As far as I know, no such general-purpose tool exists. We wrote a
custom in-house script that removes many, but not all, possible
non-printing Unicode characters as part of our WMT submission.

I am interested in writing one, though.

I think the right way to do this would be to parse the Unicode
character database for all characters of certain classes, and build
the tool from that data.
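
A minimal sketch of that idea, not our in-house script. It assumes
Python's standard unicodedata module (which exposes the general category
of each character from the Unicode Character Database) can stand in for
parsing the database files directly. The function name and the set of
categories treated as "non-printing" are my own illustrative choices and
would need reviewing for real data:

```python
import unicodedata

# Assumption: "non-printing" means these Unicode general categories.
# Cc = control, Cf = format (e.g. zero-width space), Cs = surrogate,
# Co = private use, Cn = unassigned, Zl/Zp = line/paragraph separator.
NON_PRINTING_CATEGORIES = {"Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp"}

# Tab and newline are control characters (Cc) but are kept here so that
# line structure and tab-delimited fields survive; adjust as needed.
KEEP = {"\n", "\t"}


def replace_non_printing(text, replacement=" "):
    """Replace every character in a non-printing category with `replacement`.

    `text` must already be decoded (a str, not UTF-8 bytes), so that
    multi-byte characters such as accented letters are seen as one unit.
    """
    return "".join(
        replacement
        if ch not in KEEP and unicodedata.category(ch) in NON_PRINTING_CATEGORIES
        else ch
        for ch in text
    )


if __name__ == "__main__":
    import sys

    # Read UTF-8 from stdin and write the cleaned text as UTF-8 to stdout.
    for raw in sys.stdin.buffer:
        line = raw.decode("utf-8")
        sys.stdout.buffer.write(replace_non_printing(line).encode("utf-8"))
```

Because the check is done per decoded character rather than per byte,
a word like gonzález passes through unchanged, while control and format
characters between two words become a space.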

Lane


On Fri, May 30, 2014 at 1:01 PM, Hieu Hoang <hieu.ho...@ed.ac.uk> wrote:
> in the attached file, there are 2 or more non-printing chars on the 1st
> line, between the words 'place' and 'binding'. They should be
> removed/replaced with a space. Those chars are deleted by parsers, making
> the word alignments incorrect and crashing extract
>
> The 2nd line is perfectly good utf8. It shouldn't be touched.
>
> just another friday nlp malaise
>
>
>
> On 30 May 2014 17:51, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
>>
>> it is trivial to change it to say a ? mark.
>>
>> but I'm not sure what you want as output now.  the original request
>> was for removing non-printable characters, which the Perl does.
>>
>> Miles
>>
>> On 30 May 2014 12:43, Hieu Hoang <hieu.ho...@ed.ac.uk> wrote:
>> > forgot to say. The input is utf8. The snippet turns
>> >    gonzález
>> > to
>> >    gonz lez
>> >
>> >
>> > On 30 May 2014 17:22, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
>> >>
>> >> this perl snippet:
>> >>
>> >> $line =~ tr/\040-\176/ /c;
>> >>
>> >> On 30 May 2014 12:17,  <moses-support-requ...@mit.edu> wrote:
>> >> > Message: 1
>> >> > Date: Fri, 30 May 2014 16:24:30 +0100
>> >> > From: Hieu Hoang <hieu.ho...@ed.ac.uk>
>> >> > Subject: [Moses-support] removing non-printing character
>> >> > To: moses-support <moses-support@mit.edu>
>> >> > Message-ID:
>> >> >
>> >> > <caekmkbj4tedzyvgeastmg51+w-5sye5ygrmibcypc2j8ybk...@mail.gmail.com>
>> >> > Content-Type: text/plain; charset="utf-8"
>> >> >
>> >> > does anyone have a script/program that can remove all non-printing
>> >> > characters?
>> >> >
>> >> > I don't care if it's fast or slow, as long as it ABSOLUTELY removes
>> >> > all non-printing chars
>> >> >
>> >> > --
>> >> > Hieu Hoang
>> >> > Research Associate
>> >> > University of Edinburgh
>> >> > http://www.hoang.co.uk/hieu
>> >> >
>> >> > _______________________________________________
>> >> > Moses-support mailing list
>> >> > Moses-support@mit.edu
>> >> > http://mailman.mit.edu/mailman/listinfo/moses-support
>> >>
>> >>
>> >>
>> >> --
>> >> The University of Edinburgh is a charitable body, registered in
>> >> Scotland, with registration number SC005336.



-- 
When a place gets crowded enough to require ID's, social collapse is not
far away.  It is time to go elsewhere.  The best thing about space travel
is that it made it possible to go elsewhere.
                -- R.A. Heinlein, "Time Enough For Love"

