Re: Stripping out Unicode combining characters (diacritics)

2008-05-07 Thread Brad Baxter
Just to throw this out there: you may be interested in Text::Unidecode
(http://search.cpan.org/~sburke/Text-Unidecode-0.04/) if your ultimate
goal is to try to represent a unicode character with its closest ascii
(or perhaps I should say, "romanized") equivalent.

-- Brad
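
For anyone who wants to try this, a minimal sketch follows. It assumes
Text::Unidecode has been installed from CPAN (it is not a core module) and
that the input is already a decoded character string. The sample strings
are only illustrations.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Unidecode;    # exports unidecode() by default

# Illustrative inputs, written with escapes so the source file stays ASCII.
my @samples = ("Bart\x{F3}k", "Stra\x{DF}e");

# unidecode() returns a plain-ASCII approximation of each string.
print unidecode($_), "\n" for @samples;
```

Note that this romanizes rather than strips, so it changes more than just
the combining marks.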



RE: Stripping out Unicode combining characters (diacritics)

2008-05-07 Thread Doran, Michael D
I received a number of helpful suggestions and solutions.  The approach I 
decided to adopt in my larger script is to 'decode' all the incoming form input 
as UTF-8 as well as the input from the database that I'll be matching the form 
input against.  This seems to allow the '\p{M}' syntax to work as expected in a 
Perl regexp.  In my test.cgi script for form input it would look like this:

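#!/usr/local/bin/perl
use strict;
use warnings;
use CGI;
use Encode;
my $query = CGI->new;
# Decode the raw form octets into Perl characters so \pM can match.
my $search_term = decode('UTF-8', scalar $query->param('text'));
my $sans_diacritics  = $search_term;
$sans_diacritics =~ s/\pM+//g;
# Re-encode the decoded strings as UTF-8 on output.
binmode(STDOUT, ':encoding(UTF-8)');
print qq(Content-type: text/plain; charset=utf-8

search_term is $search_term
sans_diacritics is $sans_diacritics
);
exit(0);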

I'm slowly figuring out how to work with Unicode in my web scripts, but still 
have a lot to learn.  Thanks for all the help. :-)
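
To see why decoding makes the difference, here is a small standalone
sketch that runs outside CGI. The hard-coded octets stand in for what the
form would send for "Bartok" with a combining acute accent. That is an
assumption for illustration, not captured form data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# "Barto" + U+0301 (combining acute, UTF-8 octets CC 81) + "k"
my $octets = "Barto\xCC\x81k";

# Undecoded: Perl sees eight one-byte characters, none of which is a mark.
(my $undecoded = $octets) =~ s/\p{M}+//g;

# Decoded: U+0301 is one character carrying the Mark property.
my $chars = decode('UTF-8', $octets);
(my $stripped = $chars) =~ s/\p{M}+//g;

printf "undecoded: %d chars, changed: %s\n",
    length $undecoded, ($undecoded eq $octets ? 'no' : 'yes');
printf "decoded:   %d chars -> %s\n",
    length $stripped, encode('UTF-8', $stripped);
```

The regexp is the same in both cases; only the string it is applied to
differs.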

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -----Original Message-----
> From: Doran, Michael D [mailto:[EMAIL PROTECTED] 
> Sent: Monday, May 05, 2008 7:27 PM
> To: [EMAIL PROTECTED]
> Cc: Perl4lib
> Subject: Stripping out Unicode combining characters (diacritics)
> 
> I'm trying to strip out combining diacritics from some form 
> input using this code:
> 
> 
> #!/usr/local/bin/perl
> use CGI;
> $query = CGI::new();
> $search_term = $query->param('text');
> $sans_diacritics  = $search_term;
> $sans_diacritics  =~ s/\p{M}*//g;
> #$sans_diacritics  =~ s/o//g;
> print qq(Content-type: text/plain; charset=utf-8
> 
> $sans_diacritics
> );
> exit(0);
> 
> 
> In the form, I'm inputting the string "Bartók" with the 
> accented character being a base character (small Latin letter 
> "o") followed by a combining acute accent.  However, when I 
> print (to the web) $sans_diacritics, I get my input with no 
> change -- the combining diacritic is still there.  I know 
> that my input is not a precomposed accented character, 
> because I can strip out the base "o" and the combining accent 
> either stands alone or jumps to another character [2].
> 
> The "\p{M}" is a Unicode class name for the character class 
> of Unicode 'marks', for example accent marks [1].  I've tried 
> these variations (and many others) and none seem to be doing 
> what I want:
> 
>     $sans_diacritics =~ s#[\p{Mark}]*##g;
>     $sans_diacritics =~ tr#[\p{InCombiningDiacriticalMarks}]##;
>     $sans_diacritics =~ tr#[\p{M}]##;
>     $sans_diacritics =~ s/\p{M}*//g;
>     $sans_diacritics =~ s#[\p{M}]##g;
>     $sans_diacritics =~ s#\x{0301}##g;
>     $sans_diacritics =~ s#\x{006F}\x{0301}##g;
>     $sans_diacritics =~ s#[\x{0300}-\x{036F}]*##g;
> 
> I'm pulling my hair out on this... so any help would be 
> appreciated.  If there's any other info I can provide, let me know.
> 
> My Perl version is 5.8.8 and the script is running on a 
> server running Solaris 9.
> 
> -- Michael
> 
> [1] per http://perldoc.perl.org/perlretut.html and other documentation
> 
> [2] using $sans_diacritics  =~ s/o//g;
> 
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 mobile
> # [EMAIL PROTECTED]
> # http://rocky.uta.edu/doran/
> 


Re: Stripping out Unicode combining characters (diacritics)

2008-05-07 Thread David Kaufman
Hi Michael,

"Doran, Michael D" <[EMAIL PROTECTED]> wrote:

> I'm trying to strip out combining diacritics from some form input using 
> this code:
> [...]
> $sans_diacritics  =~ s/\p{M}*//g;

I do it like this:

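use Encode;
use Unicode::Normalize qw(normalize);

# $utf8 must already be a decoded character string, not raw octets.
# NFKD splits accented letters into base + combining mark; the callback
# then replaces every character that has no ASCII mapping with ''.
my $ascii = encode('ascii', normalize('KD', $utf8), sub { '' });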
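
A runnable sketch of the same idea, wrapped in a helper so it can be tried
from the command line. The name to_ascii and the sample strings are
hypothetical, and the input is assumed to be decoded characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(normalize);

# Hypothetical helper: decompose with NFKD, then drop anything non-ASCII.
sub to_ascii {
    my ($chars) = @_;
    # The code reference is called for each unmappable character;
    # returning '' removes that character from the output.
    return encode('ascii', normalize('KD', $chars), sub { '' });
}

print to_ascii("Bart\x{F3}k"), "\n";    # Bartok
print to_ascii("Stra\x{DF}e"), "\n";    # Strae (eszett has no decomposition)
```

The second line shows the trade-off: characters with no decomposition are
dropped outright rather than transliterated.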





RE: Stripping out Unicode combining characters (diacritics)

2008-05-06 Thread Doran, Michael D
Hi Leif,

> This is what I do. You can try that.
> See if it helps:
> 
> Encode::_utf8_on($str);  # <<<
> $str =~ s/\pM*//g;

That works!  I will gladly buy the beers Leif, should we ever meet in person.
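
One caution for anyone copying this: the Encode documentation describes
_utf8_on as internal, and it only sets the flag without checking that the
octets are well-formed UTF-8. A sketch of a stricter alternative using
decode() with FB_CROAK follows. The helper name decode_strict and the
sample octets are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode UTF-8 octets, dying on malformed input.
sub decode_strict {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on bad input; LEAVE_SRC keeps $octets intact.
    return decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
}

# Well-formed input: "Barto" + U+0301 + "k"
my $good = decode_strict("Barto\xCC\x81k");
(my $stripped = $good) =~ s/\pM+//g;
print "$stripped\n";

# A lone Latin-1 byte (F3) is not valid UTF-8, so this call dies.
my $ok = eval { decode_strict("Bart\xF3k"); 1 };
print $ok ? "accepted\n" : "rejected malformed input\n";
```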

> I mean - have you for instance tried running your cgi scripts 
> in tainted mode (-T)?

No, I do not run my CGI scripts in tainted mode (although I realize that I 
probably should).  

Thanks (once again) for your help.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
 

> -----Original Message-----
> From: Leif Andersson [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, May 06, 2008 3:33 AM
> To: Doran, Michael D
> Subject: Re: Stripping out Unicode combining characters (diacritics)
> 
> Oh, now I see your REAL question.
> 
> This is what I do. You can try that.
> See if it helps:
> 
> Encode::_utf8_on($str);  # <<<
> $str =~ s/\pM*//g;
> 
> You are not the only one having problems with Unicode.
> Esp. in web programming it can be very confusing.
> 
> I am quite surprised that there are not more discussions of this kind.
> Not even in the "official" channels.
> 
> I mean - have you for instance tried running your cgi scripts 
> in tainted mode (-T)?
> 
> I had all my scripts set up that way. Before Unicode.
> But basic Unicode stuff became broken with -T enabled.
> Have they fixed that now?
> I have at least seen no mentioning of it.
> 
> And screen scraping. If you want to mess around with 
> javascript embedded in an HTML page, you may find that the 
> content encoding is mixed. And Perl gets very confused 
> getting mixed character encodings.
> And so do I.
> 
> You may also have to deal with mixed encodings doing SQL 
> against the Voyager database.
> 
> What would we do if we could not fall back on "use bytes"
> every now and then! ;-)
> 
> Leif
> 
> ==
> Leif Andersson, Systems Librarian
> Stockholm University Library
> SE-106 91 Stockholm
> SWEDEN
> Phone : +46 8 162769
> Mobile: +46 70 6904281
> 
> 


Re: Stripping out Unicode combining characters (diacritics)

2008-05-06 Thread Leif Andersson
I've been doing it the way Mike R suggested for quite a while.
But some characters do not map nicely into this scheme.

So you may want to handle characters like the German eszett, the oe
ligature, and so on manually.

s/\x{00df}/ss/g;
s/\x{0152}/Oe/g;
s/\x{0153}/oe/g;
...to be continued...
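
Put together with the NFD approach, this could look something like the
sketch below. The helper name fold_to_ascii and the lookup table are
illustrative only. The table holds just the three characters listed above,
none of which has a Unicode decomposition, and would need extending for
real data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Characters with no decomposition, mapped by hand (incomplete on purpose).
my %special = (
    "\x{00DF}" => 'ss',    # eszett
    "\x{0152}" => 'Oe',    # capital oe ligature
    "\x{0153}" => 'oe',    # small oe ligature
);

# Hypothetical helper; expects a decoded character string.
sub fold_to_ascii {
    my ($text) = @_;
    $text =~ s/([\x{00DF}\x{0152}\x{0153}])/$special{$1}/g;
    $text = NFD($text);        # split base letters from combining marks
    $text =~ s/\pM+//g;        # drop the marks
    return $text;
}

print fold_to_ascii("Stra\x{DF}e"), "\n";    # Strasse
print fold_to_ascii("c\x{153}ur"), "\n";     # coeur
```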

Leif
==
Leif Andersson, Systems Librarian
Stockholm University Library
SE-106 91 Stockholm
SWEDEN
Phone : +46 8 162769
Mobile: +46 70 6904281

-----Original Message-----
From: Doran, Michael D [mailto:[EMAIL PROTECTED]
Sent: 6 May 2008 04:13
To: Mike Rylander
Cc: [EMAIL PROTECTED]; Perl4lib
Subject: RE: Stripping out Unicode combining characters (diacritics)

Hi Mike,

I appreciate the quick reply.  I am familiar with the Unicode::Normalize module 
(and will also be using that), but I left it out of this question because it's 
not relevant to the problem I'm currently trying to solve.  The text I'm trying 
to strip diacritics out of does not have precomposed accented characters.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/






RE: Stripping out Unicode combining characters (diacritics)

2008-05-05 Thread Doran, Michael D
Hi Mike,

I appreciate the quick reply.  I am familiar with the Unicode::Normalize module 
(and will also be using that), but I left it out of this question because it's 
not relevant to the problem I'm currently trying to solve.  The text I'm trying 
to strip diacritics out of does not have precomposed accented characters.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/



-----Original Message-----
From: Mike Rylander [mailto:[EMAIL PROTECTED]
Sent: Mon 5/5/2008 8:52 PM
To: Doran, Michael D
Cc: [EMAIL PROTECTED]; Perl4lib
Subject: Re: Stripping out Unicode combining characters (diacritics)
 
On Mon, May 5, 2008 at 8:26 PM, Doran, Michael D <[EMAIL PROTECTED]> wrote:
[snip]
>
>  I'm pulling my hair out on this... so any help would be appreciated.  If 
> there's any other info I can provide, let me know.
>

You'll want to transform the text to NFD format (nominally, base
characters plus combining marks) instead of NFC (precombined
characters) using Unicode::Normalize:

 use Unicode::Normalize;

 my $text = NFD($original);
 $text =~ s/\pM+//go;

Hope that helps.

-- 
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / The Evergreen Experts
 | phone: 1-877-OPEN-ILS (673-6457)
 | email: [EMAIL PROTECTED]
 | web: http://www.esilibrary.com



Re: Stripping out Unicode combining characters (diacritics)

2008-05-05 Thread Mike Rylander
On Mon, May 5, 2008 at 8:26 PM, Doran, Michael D <[EMAIL PROTECTED]> wrote:
[snip]
>
>  I'm pulling my hair out on this... so any help would be appreciated.  If 
> there's any other info I can provide, let me know.
>

You'll want to transform the text to NFD format (nominally, base
characters plus combining marks) instead of NFC (precombined
characters) using Unicode::Normalize:

 use Unicode::Normalize;

 my $text = NFD($original);
 $text =~ s/\pM+//go;

Hope that helps.
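
As a self-contained illustration of the above, the following sketch runs
the same two steps on both a precomposed and an already-decomposed string.
The helper name strip_marks and the sample strings are made up for the
example, and the input is assumed to be decoded characters, not raw octets.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose, then remove all combining marks.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);    # o-acute becomes "o" + U+0301
    $decomposed =~ s/\pM+//g;
    return $decomposed;
}

print strip_marks("Bart\x{F3}k"), "\n";      # precomposed input -> Bartok
print strip_marks("Barto\x{301}k"), "\n";    # decomposed input  -> Bartok
```

Running NFD first means the caller does not need to know which form the
input arrived in.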

-- 
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / The Evergreen Experts
 | phone: 1-877-OPEN-ILS (673-6457)
 | email: [EMAIL PROTECTED]
 | web: http://www.esilibrary.com