Hi, should that be part of the tokenizer and/or the escape-special-characters script?
-phi

On Sat, May 31, 2014 at 8:04 PM, Hieu Hoang <hieu.ho...@ed.ac.uk> wrote:
> thanks everybody.
>
> I took marcin's suggestion and wrote a wrapper script. It seems to be doing
> ok. It's gotten past the previous step that it failed on, and BLEU scores
> haven't been affected
>
> i've added it to moses if anyone wants it
>
> https://github.com/moses-smt/mosesdecoder/commit/57235268323f97c53a9f214e3bec6e722437230f
>
>
> On 30 May 2014 18:07, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> wrote:
>>
>> How's this?
>>
>> cat baa | perl -C -pe 'chomp; s/\p{C}/ /g; $_="$_\n"'
>>
>>
>> On 30.05.2014 18:01, Hieu Hoang wrote:
>>
>> in the attached file, there are 2 or more non-printing chars on the 1st
>> line, between the words 'place' and 'binding'. They should be
>> removed/replaced with a space. Those chars are deleted by parsers, making
>> the word alignments incorrect and crashing extract
>>
>> The 2nd line is perfectly good utf8. It shouldn't be touched.
>>
>> just another friday nlp malaise
>>
>>
>>
>> On 30 May 2014 17:51, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
>>>
>>> it is trivial to change it to say a ? mark.
>>>
>>> but I'm not sure what you want as output now. the original request
>>> was for removing non-printable characters, which the Perl does,
>>>
>>> Miles
>>>
>>> On 30 May 2014 12:43, Hieu Hoang <hieu.ho...@ed.ac.uk> wrote:
>>> > forgot to say. The input is utf8.
>>> > The snippet turns
>>> > gonzález
>>> > to
>>> > gonz lez
>>> >
>>> >
>>> > On 30 May 2014 17:22, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
>>> >>
>>> >> this perl snippet:
>>> >>
>>> >> $line =~ tr/\040-\176/ /c;
>>> >>
>>> >> On 30 May 2014 12:17, <moses-support-requ...@mit.edu> wrote:
>>> >> > Today's Topics:
>>> >> >
>>> >> > 1. removing non-printing character (Hieu Hoang)
>>> >> >
>>> >> > ----------------------------------------------------------------------
>>> >> >
>>> >> > Message: 1
>>> >> > Date: Fri, 30 May 2014 16:24:30 +0100
>>> >> > From: Hieu Hoang <hieu.ho...@ed.ac.uk>
>>> >> > Subject: [Moses-support] removing non-printing character
>>> >> > To: moses-support <moses-support@mit.edu>
>>> >> > Message-ID:
>>> >> > <caekmkbj4tedzyvgeastmg51+w-5sye5ygrmibcypc2j8ybk...@mail.gmail.com>
>>> >> > Content-Type: text/plain; charset="utf-8"
>>> >> >
>>> >> > does anyone have a script/program that can remove all non-printing
>>> >> > characters?
>>> >> >
>>> >> > I don't care if it's fast or slow, as long as it's ABSOLUTELY
>>> >> > removes all non-printing chars
>>> >> >
>>> >> > --
>>> >> > Hieu Hoang
>>> >> > Research Associate
>>> >> > University of Edinburgh
>>> >> > http://www.hoang.co.uk/hieu
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
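
For anyone reading this in the archive: here is a rough Python sketch of why the two Perl snippets quoted in this thread behave differently on utf8 input. It is not the wrapper script that was committed to moses; the function names are made up for illustration, and Python's Unicode tables may differ slightly from those of a given Perl build. The first function mimics the byte-range `tr/\040-\176/ /c` approach, the second mimics `s/\p{C}/ /g` by blanking every code point whose general category starts with "C" (control, format, surrogate, private use, unassigned).

```python
import unicodedata


def strip_non_ascii_bytes(line: str) -> str:
    # Hypothetical analogue of the byte-range tr snippet, applied to a line
    # without its trailing newline: every UTF-8 byte outside 0x20-0x7E
    # becomes a space, so a multi-byte letter turns into several spaces.
    raw = line.encode("utf-8")
    return bytes(b if 0x20 <= b <= 0x7E else 0x20 for b in raw).decode("ascii")


def strip_control_chars(line: str) -> str:
    # Hypothetical analogue of the \p{C} substitution: drop one trailing
    # newline, replace code points in the "Other" categories (Cc, Cf, Cs,
    # Co, Cn) with a space, then put the newline back.
    body = line[:-1] if line.endswith("\n") else line
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("C") else ch for ch in body
    )
    return cleaned + "\n"


if __name__ == "__main__":
    # "á" is two bytes in UTF-8, so the byte-level version damages the word.
    print(repr(strip_non_ascii_bytes("gonzález")))
    # The category-based version leaves accented letters alone and only
    # blanks the control and zero-width characters between the two words.
    print(repr(strip_control_chars("place\u0001\u200bbinding\n")))
    print(repr(strip_control_chars("gonzález\n")))
```

Note that tab is also a control character (category Cc), so the second function turns tabs into spaces; whether that is wanted depends on the file format being cleaned.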