Re: [Moses-support] Get the probability of a given n-gram in a language model

2014-05-30 Thread Albert Llorens
Excellent. Thanks a lot Kenneth.

Albert


-Original Message-
From: Kenneth Heafield [mailto:mo...@kheafield.com] 
Sent: Monday, 26 May 2014 20:05
To: Albert Llorens; moses-support@mit.edu
Subject: Re: [Moses-support] Get the probability of a given n-gram in a 
language model

Hi,

Here's a cheap server for fragment scoring.

socat TCP4-LISTEN:2000,fork EXEC:"bin/fragment lm/test.arpa"

Then in another terminal

socat TCP4-CONNECT:localhost:2000 STDIO  and append  for 
translation.

Kenneth
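
For the use case in this thread (translating only fragments the language model finds probable), the scores still need a cutoff. Below is a minimal Python sketch, assuming each fragment has already been paired with its log10 score. The function name, the per-word normalisation and the threshold value are illustrative assumptions, not part of KenLM or Moses.

```python
def probable_fragments(scored, min_log10_per_word=-3.0):
    """Keep fragments whose average log10 probability per word meets a cutoff.

    `scored` is an iterable of (fragment, log10_probability) pairs.
    The cutoff of -3.0 per word is an arbitrary illustrative value.
    """
    kept = []
    for fragment, log10_prob in scored:
        words = fragment.split()
        if not words:
            continue  # skip empty fragments
        # Normalise by length so long fragments are not penalised
        # merely for containing more words.
        if log10_prob / len(words) >= min_log10_per_word:
            kept.append(fragment)
    return kept


if __name__ == "__main__":
    # Hypothetical scores, made up for illustration only.
    scored = [("the house", -2.5), ("house the of", -12.0)]
    print(probable_fragments(scored))
```

Dividing by the word count is only one possible normalisation; a raw log10 cutoff per n-gram order would work just as well if all fragments have the same length.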

On 05/26/14 02:04, Albert Llorens wrote:
> Thanks, Kenneth.
> 
> Yes, I want to score sentence fragments. I want to use Moses for fragment 
> translation, but only for frequent or probable fragments. I'll try what you 
> suggest. Any chance the query could be done remotely, using mosesserver or 
> anything else?
> 
> Kind regards.
> 
> Albert
> 
> 
> -Original Message-
> From: moses-support-boun...@mit.edu 
> [mailto:moses-support-boun...@mit.edu] On Behalf Of Kenneth Heafield
> Sent: Friday, 23 May 2014 17:34
> To: moses-support@mit.edu
> Subject: Re: [Moses-support] Get the probability of a given n-gram in 
> a language model
> 
> Hi,
> 
>   You can use bin/query on an ARPA or KenLM file.  Then just type 
> sentences at it (or use a file as stdin).  By default it will assume you are 
> scoring sentences.  You can pass -n to not wrap in <s> and </s>.
> 
>   It appears that you are asking to score sentence fragments.  The 
> leading words will be scored using unigrams, bigrams, etc. from, say, 
> a 5-gram model.  If you are using Kneser-Ney, these lower-order 
> probabilities (unigrams through 4-grams) are conditioned on having 
> backed off to them.  If you want accurate scores for sentence 
> fragments, build a model of order 1, order 2, order 3, etc. then 
> combine them using
> 
> build_binary -r "1.arpa 2.arpa 3.arpa 4.arpa" 5.arpa 5.rest
> 
> You can then use
> 
> bin/fragment 5.rest  
> to attain log10 frequencies.  For more on this rant, read
> 
> http://kheafield.com/professional/edinburgh/rest_paper.pdf
> 
> Kenneth   
> 
> On 05/23/14 05:13, Albert Llorens wrote:
>> Hi,
>>
>> Is there a straightforward way I can ask Moses for the probability
>> (or the frequency) of a given n-gram in a given language model? If
>> so, can I do the query through mosesserver?
>>
>> Thanks.
>>
>> Kind regards.
>>
>> Albert

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


[Moses-support] Coling 2014 - List of Accepted Papers & 1 week to Early Registration Deadline

2014-05-30 Thread COLING 2014 - Registration
The list of accepted papers is now available.

Early registration deadline: June 6th 2014.
Register for the main conference, 1 or 2 day workshops and half-day
tutorials.

Coling 2014
www.coling2014.org | coling2014...@keynotepco.ie





[Moses-support] removing non-printing character

2014-05-30 Thread Hieu Hoang
Does anyone have a script/program that can remove all non-printing
characters?

I don't care if it's fast or slow, as long as it ABSOLUTELY removes all
non-printing chars.

-- 
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu


Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52

2014-05-30 Thread Miles Osborne
this perl snippet:

$line =~ tr/\040-\176/ /c;




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52

2014-05-30 Thread Hieu Hoang
Forgot to say: the input is UTF-8. The snippet turns
   gonzález
to
   gonz lez
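
The effect is easy to reproduce. Here is a minimal Python sketch of the same idea as the tr/\040-\176/ /c snippet (anything outside printable ASCII becomes a space); the function name is made up for illustration. It shows why an accented letter is lost: it is a legitimate printing character, but it is not ASCII.

```python
import re


def ascii_only(text):
    # Same idea as the Perl tr/\040-\176/ /c: every character outside the
    # printable ASCII range 0x20-0x7E is replaced by a space.
    return re.sub(r"[^\x20-\x7e]", " ", text)


if __name__ == "__main__":
    word = "gonz\u00e1lez"
    print(ascii_only(word))      # the accented letter is replaced by a space
    print(word.encode("utf-8"))  # in UTF-8 the accented letter is two bytes, both above 0x7E
```

So a filter for UTF-8 input has to decide per code point whether something is printable, not per byte or per ASCII range.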





-- 
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu


Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52

2014-05-30 Thread Miles Osborne
It is trivial to change the replacement to, say, a ? mark.

But I'm not sure what you want as output now. The original request
was for removing non-printable characters, which the Perl does.

Miles



Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52

2014-05-30 Thread Hieu Hoang
In the attached file there are 2 or more non-printing chars on the 1st
line, between the words 'place' and 'binding'. They should be removed or
replaced with a space. Those chars are deleted by parsers, which makes
the word alignments incorrect and crashes extract.

The 2nd line is perfectly good utf8. It shouldn't be touched.

just another friday nlp malaise





-- 
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu


baa
Description: Binary data


Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52

2014-05-30 Thread Marcin Junczys-Dowmunt

How's this?

cat baa | perl -C -pe 'chomp; s/\p{C}/ /g; $_="$_\n"'
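
For anyone who would rather not use Perl, here is a rough Python equivalent of the one-liner above. It is a sketch only: the function name is made up, and it assumes the input has already been decoded as UTF-8. \p{C} is the Unicode "Other" group (control, format, surrogate, private-use and unassigned code points), which corresponds to general categories starting with "C" in Python's unicodedata module.

```python
import unicodedata


def blank_other_chars(line):
    """Replace every code point in a Unicode 'Other' (C*) category with a space.

    The trailing newline is stripped first and added back afterwards,
    mirroring the chomp / "$_\\n" steps of the Perl one-liner.
    """
    body = line.rstrip("\n")
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("C") else ch
        for ch in body
    )
    return cleaned + "\n"


if __name__ == "__main__":
    # Hypothetical sample lines, made up to resemble the case in this thread.
    print(blank_other_chars("place\u200b\u200bbinding\n"), end="")
    print(blank_other_chars("gonz\u00e1lez\n"), end="")
```

Note that tabs are control characters too, so they also become spaces, just as with the Perl version.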




Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52

2014-05-30 Thread Lane Schwartz
As far as I know, no such general purpose tool exists. We wrote a
custom in-house script that removes many, but not all, possible
non-printing Unicode characters as part of our WMT submission.

I am interested in writing one, though.

I think the right way to do this would be to parse the Unicode
character database for all characters of certain classes, and build
the tool from that data.

Lane
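
As a rough illustration of the database-driven approach: Python's unicodedata module bundles a copy of the Unicode character database, so a table can be built from it instead of parsing the raw files. This is only a sketch. The choice of categories (Cc and Cf), the replacement character and the names below are assumptions; a real tool would need to decide which classes count as non-printing.

```python
import sys
import unicodedata


def build_table(categories=("Cc", "Cf"), replacement=" ", keep="\n"):
    """Map every code point in the given general categories to `replacement`.

    Characters listed in `keep` (by default the newline) are left alone.
    The result is a dict suitable for str.translate().
    """
    table = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if ch not in keep and unicodedata.category(ch) in categories:
            table[cp] = replacement
    return table


# Built once up front; scanning all code points takes a moment.
TABLE = build_table()

if __name__ == "__main__":
    # Hypothetical sample strings, made up for illustration.
    print("place\u200bbinding".translate(TABLE))
    print("gonz\u00e1lez".translate(TABLE))
```

Building the table once and reusing it with str.translate keeps the per-line cost low, and the category list is the single place to adjust when a new class of unwanted characters shows up.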





-- 
When a place gets crowded enough to require ID's, social collapse is not
far away.  It is time to go elsewhere.  The best thing about space travel
is that it made it possible to go elsewhere.
-- R.A. Heinlein, "Time Enough For Love"



Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52

2014-05-30 Thread Lane Schwartz
We also used charlint. It might do what you want.


Re: [Moses-support] Moses-support Digest, Vol 91, Issue 52

2014-05-30 Thread Miles Osborne
for those specific characters:

perl -C -pe 's/\x{200B}//g' < tmp/baa

but as Lane mentions, you probably need to somehow specify the set of
naughty characters you need to deal with.

Miles
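
One way to find out which characters need specifying is to take an inventory of the corpus first. A minimal Python sketch, assuming the text is already decoded; the function name and output format are illustrative only.

```python
from collections import Counter
import unicodedata


def inventory(text):
    """Count code points whose general category starts with 'C'.

    Newlines are skipped because they are expected in a corpus.
    Returns a dict keyed by 'U+XXXX NAME'.
    """
    counts = Counter(
        ch for ch in text
        if ch != "\n" and unicodedata.category(ch).startswith("C")
    )
    return {
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}": n
        for ch, n in counts.items()
    }


if __name__ == "__main__":
    # Hypothetical sample, made up to resemble the line discussed in this thread.
    sample = "place\u200b\u200bbinding\n"
    for label, n in inventory(sample).items():
        print(n, label)
```

Running this over the training data gives a concrete list of code points, which can then be fed into a targeted substitution like the one above.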
