I received a number of helpful suggestions and solutions. The approach I decided to adopt in my larger script is to 'decode' all the incoming form input as UTF-8, as well as the input from the database that I'll be matching the form input against. This seems to allow the '\p{M}' syntax to work as expected in a Perl regexp. In my test.cgi script for form input it would look like this:

#!/usr/local/bin/perl

use strict;
use CGI;
use Encode;

my $query = CGI::new();
my $search_term = decode('UTF-8',$query->param('text'));
my $sans_diacritics = $search_term;
$sans_diacritics =~ s/\pM*//g;
print qq(Content-type: text/plain; charset=utf-8

search_term is $search_term
sans_diacritics is $sans_diacritics
);
exit(0);

I'm slowly figuring out how to work with Unicode in my web scripts, but still have a lot to learn. Thanks for all the help. :-)

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/

> -----Original Message-----
> From: Doran, Michael D [mailto:[EMAIL PROTECTED]
> Sent: Monday, May 05, 2008 7:27 PM
> To: [EMAIL PROTECTED]
> Cc: Perl4lib
> Subject: Stripping out Unicode combining characters (diacritics)
>
> I'm trying to strip out combining diacritics from some form
> input using this code:
>
> <head>
> <META http-equiv="Content-Type" content="text/html;
> charset=UTF-8"> </head> <body>
> <form action="test.cgi" accept-charset="UTF-8" method="get">
> <input type="text" name="text" value="" size="10">
> <input type="submit" value="submit">
> </form>
> </body>
> </html>
>
> #!/usr/local/bin/perl
> use CGI;
> $query = CGI::new();
> $search_term = $query->param('text');
> $sans_diacritics = $search_term;
> $sans_diacritics =~ s/\p{M}*//g;
> #$sans_diacritics =~ s/o//g;
> print qq(Content-type: text/plain; charset=utf-8
>
> $sans_diacritics
> );
> exit(0);
>
>
> In the form, I'm inputting the string "Bartók" with the
> accented character being a base character (small Latin letter
> "o") followed by a combining acute accent. However, when I
> print (to the web) $sans_diacritics, I get my input with no
> change -- the combining diacritic is still there.
> I know that my input is not a precomposed accented character,
> because I can strip out the base "o" and the combining accent
> either stands alone or jumps to another character [2].
>
> The "\p{M}" is a Unicode class name for the character class
> of Unicode 'marks', for example accent marks [1]. I've tried
> these variations (and many others) and none seem to be doing
> what I want:
>
> $sans_diacritics =~ s#[\p{Mark}]*##g;
> $sans_diacritics =~ tr#[\p{InCombiningDiacriticalMarks}]##;
> $sans_diacritics =~ tr#[\p{M}]##;
> $sans_diacritics =~ s/\p{M}*//g;
> $sans_diacritics =~ s#[\p{M}]##g;
> $sans_diacritics =~ s#\x{0301}##g;
> $sans_diacritics =~ s#\x{006F}\x{0301}##g;
> $sans_diacritics =~ s#[\x{0300}-\x{036F}]*##g;
>
> I'm pulling my hair out on this... so any help would be
> appreciated. If there's any other info I can provide, let me know.
>
> My Perl version is 5.8.8 and the script is running on a
> server running Solaris 9.
>
> -- Michael
>
> [1] per http://perldoc.perl.org/perlretut.html and other documentation
>
> [2] using $sans_diacritics =~ s/o//g;
>
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 mobile
> # [EMAIL PROTECTED]
> # http://rocky.uta.edu/doran/
>
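
P.S. For anyone finding this thread in the archives: below is a small stand-alone sketch (no CGI needed) of why the decode step seems to make the difference. It is only an illustration, not the code from my larger script. The helper name strip_marks and the hard-coded test bytes are made up for this example. The assumption is that the form hands the script raw UTF-8 bytes, so without decode() the regexp is looking at the two bytes 0xCC 0x81 rather than the single character U+0301, and neither byte is a 'mark'.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not in the original test.cgi): remove every
# character in the Unicode Mark category from the string passed in.
sub strip_marks {
    my ($text) = @_;
    $text =~ s/\p{M}+//g;
    return $text;
}

# "Bartok" with a combining acute accent (U+0301) after the "o",
# written out as raw UTF-8 bytes, which is assumed to be how the
# form input arrives.
my $bytes = "Barto\xCC\x81k";

# Without decoding, the regexp sees 0xCC and 0x81 as two separate
# characters, neither of which is a Mark, so nothing is removed.
my $undecoded = strip_marks($bytes);

# After decoding, 0xCC 0x81 becomes the single character U+0301,
# which \p{M} matches.
my $chars    = decode('UTF-8', $bytes);
my $stripped = strip_marks($chars);

printf "undecoded length: %d\n", length($undecoded);   # 8
printf "decoded length:   %d\n", length($chars);       # 7
print  "stripped:         $stripped\n";                # Bartok
```

The same helper could be applied to the decoded database values before comparing them with the decoded form input. If the stripped text still contains non-ASCII characters, the output probably needs to be encoded back to UTF-8 before printing, for example with binmode(STDOUT, ':encoding(UTF-8)').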