Re: character sets in a regexp

Brandon McCaig Fri, 11 Jan 2013 17:14:12 -0800

On Fri, Jan 11, 2013 at 11:01:45PM +0100, Christer Palm wrote:
> Hi!

Hello,


> I have a perl script that parses RSS streams from different
> news sources and experience problems with national characters
> in a regexp function used for matching a keyword list with the
> RSS data. 
> 
> Everything works fine with a simple regexp for plain English,
> i.e. words containing the letters A-Z, a-z, 0-9.
> 
> if ( $description =~ m/\b$key/i ) {….}
> 
> Keywords or RSS data with national characters don’t work at
> all. I’m not really surprised; this was expected, as the
> character sets used in the different RSS streams are outside my
> control.

The XML standard provides a way to specify the character set in
the XML document.

<?xml version="1.0" encoding="utf-8"?>
                    ^^^^^^^^^^^^^^^^

Are you parsing the XML unintelligently (e.g., regex) or are you
using an XML parser to do it? I have done limited XML parsing in
Perl, but I would seek an API that supports the XML standards for
encodings and ideally just does the Right Thing(tm). In theory,
it should Just Work(tm) if you can find an appropriate family of
modules.
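
To make concrete what such a parser does with that declaration,
here is a rough sketch using only core modules. The helper name
decode_declared and the sample document are made up for
illustration, and it only copes with ASCII-compatible encodings
(no UTF-16, no BOM handling), so treat it as a picture of the idea
rather than a substitute for a real XML parser:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode raw XML bytes using the encoding named in
# the XML declaration, falling back to UTF-8 when none is declared.
sub decode_declared {
    my ($bytes) = @_;
    my ($enc) = $bytes =~
        /\A<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/;
    $enc //= 'UTF-8';
    return decode($enc, $bytes);    # croaks on an unknown encoding name
}

# Made-up sample: "Malmö" stored as a single Latin-1 byte (0xF6).
my $raw  = qq{<?xml version="1.0" encoding="iso-8859-1"?><title>Malm\xF6</title>};
my $text = decode_declared($raw);   # now a string of characters, not bytes
```

A proper parser does this step for you and hands back already
decoded text.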

> I have the "use utf8;" function activated but I’m not really
> sure if it is needed. I can’t see any difference whether it is
> used or not.

As mentioned, the utf8 pragma just tells perl that the source
file itself is UTF-8 encoded (so that string literals and regex
patterns in your code are treated as UTF-8 text, for example). It
does not affect data read at run time, such as your RSS streams,
which is probably why you see no difference. The Encode module
can be used to manually decode and encode strings between various
encodings. E.g., if you know the text is UTF-16LE then you can do
this:

  use Encode;

  my $input = getRssStream();

  my $text = Encode::decode('UTF-16LE', $input);
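
Once the data is decoded, the keyword match from the original
question should start working for national characters too. A
small sketch, where the sample bytes and keyword list are made up
and the feed is assumed to be UTF-8:

```perl
use strict;
use warnings;
use utf8;                 # the keyword literals below are UTF-8 in the source
use Encode qw(decode);

# Stand-in for raw feed data: "Nyheter från Malmö" as UTF-8 bytes.
my $bytes       = "Nyheter fr\xC3\xA5n Malm\xC3\xB6";
my $description = decode('UTF-8', $bytes);

my @keys = ('malmö', 'göteborg');

# Matching against the undecoded bytes finds nothing ...
my @raw_hits = grep { $bytes =~ /\b\Q$_\E/i } @keys;

# ... while matching against the decoded characters finds 'malmö'.
my @hits = grep { $description =~ /\b\Q$_\E/i } @keys;
```

The \Q...\E is there so that regex metacharacters in a keyword are
matched literally.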

Encodings are also supported at the IO layer, so depending on
where you're getting it from you might be able to just inform
said layers of the encoding and have the rest automatic. E.g.,

  # Something like this:
  binmode $socket, ':encoding(UTF-16LE)';
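
The same layer can be given to open. A self-contained sketch that
reads from an in-memory handle instead of a real socket (the
sample data is made up):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Stand-in for a UTF-16LE source; a real script would have a socket or file.
my $bytes = encode('UTF-16LE', "Malm\x{F6}\nG\x{F6}teborg\n");

# The :encoding layer decodes transparently as the data is read.
open my $fh, '<:encoding(UTF-16LE)', \$bytes
    or die "Cannot open in-memory handle: $!";

chomp(my @lines = <$fh>);    # each line is already decoded characters
close $fh;
```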

> Do you have suggestions on this character issue? Is it possible
> to determine the character set of a text efficiently? Are there
> other ways to solve the problem?

There are some modules that guess encodings (e.g., Encode::Guess,
or File::BOM if the data starts with a byte order mark). Of
course, a guess can never be certain. It's best to rely on the
transport protocol or data format to declare the encoding so that
you know for sure what is expected and don't have to guess.
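
For what it's worth, a guess with Encode::Guess (which ships with
Encode) might look something like the sketch below. The sample
bytes are made up. It also shows how quickly a guess becomes
ambiguous once a single-byte encoding such as Latin-1 is added to
the list of suspects:

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding()

my $bytes = "Malm\xC3\xB6";    # "Malmö" as UTF-8 bytes

# Default suspects are ASCII and UTF-8 (plus BOM-marked UTF-16/32).
# On success an encoding object is returned, otherwise an error string.
my $guess = guess_encoding($bytes);

# With Latin-1 added, the same bytes are valid in two encodings, so the
# guess fails and a plain error string comes back instead of an object.
my $ambiguous = guess_encoding($bytes, 'latin1');

my $text = ref $guess ? $guess->decode($bytes) : undef;
```

Always check ref() on the result before calling decode on it.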

Regards,


-- 
Brandon McCaig <bamcc...@gmail.com> <bamcc...@castopulence.org>
Castopulence Software <https://www.castopulence.org/>
Blog <http://www.bamccaig.com/>
perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }.
q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.};
tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say'
