Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52
The tokenizer.perl is getting too large and unwieldy. It duplicates escape-special-chars and i don't want it to duplicate this standalone functionality.

On 01/06/14 06:02, Philipp Koehn wrote:

Hi, should that be part of the tokenizer and/or the escape-special-characters script?

-phi

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52
Hi,

Fair enough - one can always pipe through both.

-phi

On Jun 1, 2014 3:00 PM, Hieu Hoang hieuho...@gmail.com wrote:

The tokenizer.perl is getting too large and unwieldy. It duplicates escape-special-chars and i don't want it to duplicate this standalone functionality.
Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52
thanks everybody. I took marcin's suggestion and wrote a wrapper script. It seems to be doing ok. It's gotten past the previous step that it failed on, and BLEU scores haven't been affected.

i've added it to moses if anyone wants it
https://github.com/moses-smt/mosesdecoder/commit/57235268323f97c53a9f214e3bec6e722437230f

On 30 May 2014 18:07, Marcin Junczys-Dowmunt junc...@amu.edu.pl wrote:

How's this?

cat baa | perl -C -pe 'chomp; s/\p{C}/ /g; $_ .= "\n"'
Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52
Hi,

should that be part of the tokenizer and/or the escape-special-characters script?

-phi

On Sat, May 31, 2014 at 8:04 PM, Hieu Hoang hieu.ho...@ed.ac.uk wrote:

thanks everybody. I took marcin's suggestion and wrote a wrapper script. It seems to be doing ok. It's gotten past the previous step that it failed on, and BLEU scores haven't been affected.

i've added it to moses if anyone wants it
https://github.com/moses-smt/mosesdecoder/commit/57235268323f97c53a9f214e3bec6e722437230f
Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52
this perl snippet:

$line =~ tr/\040-\176/ /c;

On 30 May 2014 12:17, moses-support-requ...@mit.edu wrote:

Today's Topics:

1. removing non-printing character (Hieu Hoang)

Message: 1
Date: Fri, 30 May 2014 16:24:30 +0100
From: Hieu Hoang hieu.ho...@ed.ac.uk
Subject: [Moses-support] removing non-printing character
To: moses-support moses-support@mit.edu
Message-ID: caekmkbj4tedzyvgeastmg51+w-5sye5ygrmibcypc2j8ybk...@mail.gmail.com
Content-Type: text/plain; charset=utf-8

does anyone have a script/program that can remove all non-printing characters? I don't care if it's fast or slow, as long as it ABSOLUTELY removes all non-printing chars

--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu

End of Moses-support Digest, Vol 91, Issue 52
Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52
forgot to say. The input is utf8. The snippet turns gonzález to gonz lez

On 30 May 2014 17:22, Miles Osborne mi...@inf.ed.ac.uk wrote:

this perl snippet:

$line =~ tr/\040-\176/ /c;
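A minimal Python sketch (not from the thread) of why an ASCII-range filter like the tr snippet mangles UTF-8 input. It assumes the filter sees raw bytes, as Perl does when the input has not been decoded; the function names are hypothetical. The accented letter is two bytes in UTF-8, both outside 040-176, so both are blanked. Even on decoded text the letter is still lost, because it is not ASCII.

```python
import re

def blank_non_ascii_bytes(line: bytes) -> bytes:
    """Byte-level filter: every byte outside 0x20-0x7E becomes a space."""
    return re.sub(rb"[^\x20-\x7e]", b" ", line)

def blank_non_ascii_chars(line: str) -> str:
    """Character-level filter with the same range, applied to decoded text."""
    return re.sub(r"[^\x20-\x7e]", " ", line)

word = "gonz\u00e1lez"          # "gonzález" with a precomposed U+00E1
raw = word.encode("utf-8")      # U+00E1 is encoded as the two bytes C3 A1

print(blank_non_ascii_bytes(raw))   # b'gonz  lez' (two spaces, one per byte)
print(blank_non_ascii_chars(word))  # 'gonz lez'  (one space, letter still lost)
```

Either way a printable letter disappears, so the range test has to be replaced by a test on Unicode character properties rather than on ASCII code values.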
Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52
it is trivial to change it to say a ? mark. but I'm not sure what you want as output now. the original request was for removing non-printable characters, which the Perl does,

Miles

On 30 May 2014 12:43, Hieu Hoang hieu.ho...@ed.ac.uk wrote:

forgot to say. The input is utf8. The snippet turns gonzález to gonz lez
Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52
in the attached file, there are 2 or more non-printing chars on the 1st line, between the words 'place' and 'binding'. They should be removed/replaced with a space. Those chars are deleted by parsers, making the word alignments incorrect and crashing extract

The 2nd line is perfectly good utf8. It shouldn't be touched.

just another friday nlp malaise

On 30 May 2014 17:51, Miles Osborne mi...@inf.ed.ac.uk wrote:

it is trivial to change it to say a ? mark. but I'm not sure what you want as output now. the original request was for removing non-printable characters, which the Perl does,

Miles

baa
Description: Binary data
Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52
How's this?

cat baa | perl -C -pe 'chomp; s/\p{C}/ /g; $_ .= "\n"'

On 30.05.2014 18:01, Hieu Hoang wrote:

in the attached file, there are 2 or more non-printing chars on the 1st line, between the words 'place' and 'binding'. They should be removed/replaced with a space. Those chars are deleted by parsers, making the word alignments incorrect and crashing extract

The 2nd line is perfectly good utf8. It shouldn't be touched.

just another friday nlp malaise
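For readers who do not use Perl, here is a rough Python equivalent of the one-liner above, offered as a sketch rather than as the wrapper script that was committed to Moses. The function name is hypothetical. It treats any character whose Unicode general category starts with "C" (control, format, surrogate, private use, unassigned) as non-printing. That mirrors \p{C}, though the exact set depends on the Unicode version bundled with the interpreter. The newline is set aside first, for the same reason the Perl version calls chomp: a newline is itself a control character.

```python
import unicodedata

def blank_other_chars(line: str) -> str:
    """Replace every category-C character with a space, keeping the line ending."""
    has_newline = line.endswith("\n")
    body = line[:-1] if has_newline else line
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("C") else ch
        for ch in body
    )
    return cleaned + ("\n" if has_newline else "")

# U+200B (zero width space) and U+0007 (bell) stand in for the invisible
# characters between 'place' and 'binding'; the real attachment is not shown.
print(repr(blank_other_chars("place\u200b\u0007binding\n")))
print(repr(blank_other_chars("gonz\u00e1lez\n")))
```

Note that tabs and carriage returns are also category C, so they become spaces too.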
Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52
As far as I know, no such general-purpose tool exists. As part of our WMT submission, we wrote a custom in-house script that removes many, but not all, of the possible non-printing Unicode characters. I am interested in writing a general one, though. I think the right way to do this would be to parse the Unicode Character Database for all characters of certain classes, and build the tool from that data.

Lane

On Fri, May 30, 2014 at 1:01 PM, Hieu Hoang hieu.ho...@ed.ac.uk wrote:
in the attached file, there are 2 or more non-printing chars on the 1st line, between the words 'place' and 'binding'. They should be removed/replaced with a space. Those chars are deleted by parsers, making the word alignments incorrect and crashing extract
The 2nd line is perfectly good utf8. It shouldn't be touched.
just another friday nlp malaise

On 30 May 2014 17:51, Miles Osborne mi...@inf.ed.ac.uk wrote:
it is trivial to change it to say a ? mark. but I'm not sure what you want as output now. the original request was for removing non-printable characters, which the Perl does,
Miles

On 30 May 2014 12:43, Hieu Hoang hieu.ho...@ed.ac.uk wrote:
forgot to say. The input is utf8. The snippet turns gonzález to gonz lez

On 30 May 2014 17:22, Miles Osborne mi...@inf.ed.ac.uk wrote:
this perl snippet: $line =~ tr/\040-\176/ /c;

--
When a place gets crowded enough to require ID's, social collapse is not far away. It is time to go elsewhere. The best thing about space travel is that it made it possible to go elsewhere. -- R.A. Heinlein, Time Enough For Love
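A rough sketch of the "build it from the Unicode character data" idea, in Python rather than Perl. It is not the in-house WMT script mentioned above. It assumes that "non-printing" means every character whose Unicode general category starts with C (control, format, surrogate, private use, unassigned), which is the same set the `\p{C}` class covers in Perl. The function name `replace_nonprinting` is made up for illustration. Python's standard `unicodedata` module is generated from the Unicode Character Database, so the category lookup comes from that data rather than from a hand-written list.

```python
import unicodedata


def replace_nonprinting(line: str, replacement: str = " ") -> str:
    """Replace characters in any Unicode 'Other' category (C*) with `replacement`.

    Assumption: "non-printing" == general category Cc, Cf, Cs, Co or Cn.
    The trailing newline is set aside first, because "\n" is itself a
    control character (Cc) and would otherwise be replaced too.
    """
    body = line.rstrip("\n")
    newline = line[len(body):]
    cleaned = "".join(
        replacement if unicodedata.category(ch).startswith("C") else ch
        for ch in body
    )
    return cleaned + newline


# Illustrative input only: two zero-width spaces (U+200B, category Cf)
# between 'place' and 'binding' each become a space.
print(repr(replace_nonprinting("place\u200b\u200bbinding\n")))
# Accented letters are category Ll, so valid UTF-8 text is left alone.
print(replace_nonprinting("gonzález"))
```

Two caveats on this choice of categories: tabs are control characters and will also be replaced, and the Cf category includes the zero-width joiner and non-joiner, which carry meaning in some scripts, so a narrower set may be needed for those languages.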
Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52
We also used charlint. It might do what you want.

On Fri, May 30, 2014 at 1:21 PM, Lane Schwartz dowob...@gmail.com wrote:
As far as I know, no such general purpose tool exists. We wrote a custom in-house script that removes many, but not all, possible non-printing Unicode characters as part of our WMT submission. I am interested in writing one, though. I think the right way to do this would be to parse the Unicode character database for all characters of certain classes, and build the tool from that data.
Lane
Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52
For those specific characters:

perl -C -pe 's/\x{200B}//g' tmp/baa

But, as Lane mentions, you probably need to specify the set of naughty characters you want to deal with.

Miles

On 30 May 2014 13:23, Lane Schwartz dowob...@gmail.com wrote:
We also used charlint. It might do what you want.

On Fri, May 30, 2014 at 1:21 PM, Lane Schwartz dowob...@gmail.com wrote:
As far as I know, no such general purpose tool exists. We wrote a custom in-house script that removes many, but not all, possible non-printing Unicode characters as part of our WMT submission. I am interested in writing one, though. I think the right way to do this would be to parse the Unicode character database for all characters of certain classes, and build the tool from that data.
Lane

On Fri, May 30, 2014 at 1:01 PM, Hieu Hoang hieu.ho...@ed.ac.uk wrote:
in the attached file, there are 2 or more non-printing chars on the 1st line, between the words 'place' and 'binding'. They should be removed/replaced with a space. Those chars are deleted by parsers, making the word alignments incorrect and crashing extract
The 2nd line is perfectly good utf8. It shouldn't be touched.
just another friday nlp malaise
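For comparison, the "specify the set yourself" approach from the one-liner can be sketched in Python as well. This is only an illustration: the list `NAUGHTY_CODEPOINTS` and the function `strip_listed` are hypothetical names, and the list holds just the one code point that appears in the one-liner (U+200B); anything else would have to be added by hand.

```python
# Hypothetical blocklist; U+200B (ZERO WIDTH SPACE) is the only entry taken
# from the thread. Extend it with whatever else turns up in the data.
NAUGHTY_CODEPOINTS = [0x200B]


def strip_listed(text: str, codepoints=NAUGHTY_CODEPOINTS, replacement: str = "") -> str:
    """Delete (or replace) only the explicitly listed code points."""
    # str.translate takes a mapping from code point to replacement string;
    # an empty string removes the character, as s/\x{200B}//g does.
    table = {cp: replacement for cp in codepoints}
    return text.translate(table)


print(strip_listed("place\u200bbinding"))                   # characters deleted
print(strip_listed("place\u200bbinding", replacement=" "))  # replaced with a space
```

The trade-off is the one already raised in the thread: an explicit list never touches legitimate text, but it only catches the characters someone has thought to put on it.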
-- When a place gets crowded enough to require ID's, social collapse is not far away. It is time to go elsewhere. The best thing about space travel is that it made it possible to go elsewhere. -- R.A.