How's this?

cat baa | perl -C -pe 'chomp; s/\p{C}/ /g; $_="$_\n"'


On 30.05.2014 18:01, Hieu Hoang wrote:
In the attached file, there are 2 or more non-printing chars on the 1st line, between the words 'place' and 'binding'. They should be removed or replaced with a space. Parsers delete those chars, which makes the word alignments incorrect and crashes extract.

The 2nd line is perfectly good UTF-8. It shouldn't be touched.

just another friday nlp malaise



On 30 May 2014 17:51, Miles Osborne <mi...@inf.ed.ac.uk> wrote:

    It is trivial to change it to, say, a ? mark.

    But I'm not sure what you want as output now. The original request
    was for removing non-printable characters, which the Perl does.

    Miles

    On 30 May 2014 12:43, Hieu Hoang <hieu.ho...@ed.ac.uk> wrote:
    > forgot to say. The input is utf8. The snippet turns
    >    gonzález
    > to
    >    gonz lez
    >
    >
    > On 30 May 2014 17:22, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
    >>
    >> this perl snippet:
    >>
    >> $line =~ tr/\040-\176/ /c;
    >>
    >> On 30 May 2014 12:17,  <moses-support-requ...@mit.edu> wrote:
    >> > Today's Topics:
    >> >
    >> >    1. removing non-printing character (Hieu Hoang)
    >> >
    >> >
    >> >
    >> > ----------------------------------------------------------------------
    >> >
    >> > Message: 1
    >> > Date: Fri, 30 May 2014 16:24:30 +0100
    >> > From: Hieu Hoang <hieu.ho...@ed.ac.uk>
    >> > Subject: [Moses-support] removing non-printing character
    >> > To: moses-support <moses-support@mit.edu>
    >> > Message-ID:
    >> > <caekmkbj4tedzyvgeastmg51+w-5sye5ygrmibcypc2j8ybk...@mail.gmail.com>
    >> > Content-Type: text/plain; charset="utf-8"
    >> >
    >> > does anyone have a script/program that can remove all non-printing
    >> > characters?
    >> >
    >> > I don't care if it's fast or slow, as long as it ABSOLUTELY removes
    >> > all non-printing chars
    >> >
    >> > --
    >> > Hieu Hoang
    >> > Research Associate
    >> > University of Edinburgh
    >> > http://www.hoang.co.uk/hieu
    >> > ------------------------------
    >> >
    >> > _______________________________________________
    >> > Moses-support mailing list
    >> > Moses-support@mit.edu <mailto:Moses-support@mit.edu>
    >> > http://mailman.mit.edu/mailman/listinfo/moses-support
    >> >
    >> >
    >> > End of Moses-support Digest, Vol 91, Issue 52
    >> > *********************************************
    >>
    >>
    >>
    >> --
    >> The University of Edinburgh is a charitable body, registered in
    >> Scotland, with registration number SC005336.
    >> _______________________________________________
    >> Moses-support mailing list
    >> Moses-support@mit.edu <mailto:Moses-support@mit.edu>
    >> http://mailman.mit.edu/mailman/listinfo/moses-support
    >
    >
    >
    >
    > --
    > Hieu Hoang
    > Research Associate
    > University of Edinburgh
    > http://www.hoang.co.uk/hieu
    >
    >
    > The University of Edinburgh is a charitable body, registered in
    > Scotland, with registration number SC005336.
    >



    --
    The University of Edinburgh is a charitable body, registered in
    Scotland, with registration number SC005336.




--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu



_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
