On Fri, Jan 11, 2013 at 11:01:45PM +0100, Christer Palm wrote:
> Hi!

Hello,

> I have a perl script that parses RSS streams from different
> news sources and experience problems with national characters
> in a regexp function used for matching a keyword list with the
> RSS data.
>
> Everything works fine with a simple regexp for plain english
> i.e. words containing the letters A-Z, a-z, 0-9.
>
> if ( $description =~ m/\b$key/i ) {….}
>
> Keywords or RSS data with national characters don’t work at
> all. I’m not really surprised this was expected as character
> sets used in the different RSS streams are outside my control.

The XML standard provides a way to specify the character set in
the XML document.

    <?xml version="1.0" encoding="utf-8"?>
                        ^^^^^^^^^^^^^^^^

Are you parsing the XML unintelligently (e.g., regex) or are you
using an XML parser to do it? I have done limited XML parsing in
Perl, but I would seek an API that supports the XML standards for
encodings and ideally just does the Right Thing(tm). In theory,
it should Just Work(tm) if you can find an appropriate family of
modules.

> I am have the ”use utf8;” function activated but I’m not really
> sure if it is needed. I can’t see any difference used or not.

As mentioned, the utf8 pragma basically just tells perl that the
source file is UTF-8 encoded (and so literal strings should be
considered UTF-8 text, for example).

The Encode module can be used to manually decode and encode
strings between various encodings. E.g., if you know the text is
UTF-16LE then you can do this:

    use Encode;

    my $input = getRssStream();
    my $text = Encode::decode('UTF-16LE', $input);

Encodings are also supported at the IO layer, so depending on
where you're getting it from you might be able to just inform
said layers of the encoding and have the rest automatic. E.g.,

    # Something like this:
    binmode $socket, ':encoding(UTF-16LE)';

> Do you have suggestions on this character issue? Is it possible
> to determine the character set of a text efficiently? Is it
> other ways to solve the problem?

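For the matching problem itself, here is a minimal sketch of the
decode-first idea, assuming you already know the feed's encoding
(for instance from the XML declaration). The helper name
matches_keyword and the sample text are made up for illustration;
they are not from your script. The point is only that both the
keyword and the RSS text should be character strings, not raw
bytes, before the regex runs.

```perl
use strict;
use warnings;
use utf8;                        # this source file contains UTF-8 literals
use Encode qw(decode encode);

# Hypothetical helper (not part of any module): decode the raw bytes
# using the encoding the feed declares, then check whether any keyword
# starts at a word boundary, ignoring case.
sub matches_keyword {
    my ($raw_bytes, $encoding, @keywords) = @_;

    # bytes -> Perl character string
    my $text = decode($encoding, $raw_bytes);

    for my $key (@keywords) {
        # \Q...\E keeps regex metacharacters in a keyword literal
        return 1 if $text =~ m/\b\Q$key\E/i;
    }
    return 0;
}

# Simulate a description that arrived as UTF-8 encoded bytes.
my $raw = encode('UTF-8', 'Nyheter från Örebro och Åre');

print matches_keyword($raw, 'UTF-8', 'örebro') ? "match\n" : "no match\n";
```

If the keywords come from a file rather than from the script
itself, they would need to be decoded the same way (or read
through an :encoding layer) for this to hold.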
There are some modules to guess encodings (e.g., File::BOM, which
looks for a byte order mark, or Encode::Guess, which tries a list
of candidate encodings you give it). Of course, it's impossible
to be certain. It's best to use the standards in the transport
protocol or data format to define the encoding so that you know
for sure what is expected and don't have to guess (because it
isn't always possible to detect it correctly).

Regards,


-- 
Brandon McCaig <bamcc...@gmail.com> <bamcc...@castopulence.org>
Castopulence Software <https://www.castopulence.org/>
Blog <http://www.bamccaig.com/>
perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }. q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.}; tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say'