Re: [pcre-dev] Problems and questions about locales

Patrice Guérin Fri, 14 Feb 2020 11:12:50 -0800

Hello Philip,

Thank you very much for your answer,

My programs run on both Windows and Linux (Debian), so I expect them to work in the same way...


[email protected] wrote:

On Thu, 13 Feb 2020, Patrice Guérin wrote:

I'm facing some problems with the locale character table definitions.

Locales are a nightmare. We will all be able to rejoice when Unicode is
everywhere. I'm afraid I know very little about locales, and as I'm a
Linux user, I know nothing about Windows versions except that there are
differences.

I agree. But locales are a lot more than just the 256-byte character table, like iso-8859-X or Windows-1252 (aka CP1252, cf. http://www.tachyonsoft.com/cp01252.htm).

    I really don't know who is right or not, and though the following
    characters are not widely used (except euro symbol), I think this
    can lead to inconsistencies.

I'm sure it can, but I suspect there isn't anything that can be done
about it.

      * Windows defines all unassigned characters (in hex) 81, 8D, 8F,
        90, 9D as Ctrl, Linux does not.
      * Linux defines char 0x80 (€ symbol) as Graph, Print and Punct,
        Windows does not.
      * Linux defines char 0x88 (U+02c6) as Alpha, Alnum, Graph and
        Print, Windows does not.

I do not understand what you mean by "0x88 (U+02c6)" because locales
handle only 256 characters.

Windows-1252 is almost the same as iso-8859-1 except for the first 32 bytes of the extended table (that is, 0x80 to 0x9f). As an example, the euro symbol € is 0x80, which corresponds to U+20AC in Unicode.

In iso-8859-15, the euro symbol is 0xA4 (it replaces the currency symbol).
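To make the 0x80 to 0x9f difference concrete, here is a minimal sketch of the Windows-1252 to Unicode mapping for that range. The function name cp1252_to_unicode is hypothetical, and returning U+FFFD for the five unassigned bytes (81, 8D, 8F, 90, 9D) is a choice made for this sketch only, not something either OS does.

```c
#include <stdint.h>

/* Windows-1252 code points for bytes 0x80..0x9F.
   0xFFFD marks the five unassigned positions (sketch-only convention). */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
};

/* Hypothetical helper: map one Windows-1252 byte to a Unicode code point.
   Outside 0x80..0x9F the byte value is the code point, as in iso-8859-1. */
uint32_t cp1252_to_unicode(unsigned char b)
{
    if (b < 0x80 || b > 0x9F)
        return b;
    return cp1252_high[b - 0x80];
}
```

With this table, 0x80 gives U+20AC (euro), 0x88 gives U+02C6 and 0x99 gives U+2122, which are the characters discussed in this thread.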

      * Linux defines char 0x98 (U+02dc) and 0x99 (U+2122) as Graph,
        Print and Punct, Windows does not.
      * Linux defines char 0xA0 (nbsp) as Graph, Print and Punct,
        Windows defines it as Space, Blank and Print.
      * Windows defines chars 0xAA (ª), 0xB5(µ) and 0xBA (º) as Punct,
        Linux does not.
      * Windows defines char 0xAD (Soft hyphen) as Ctrl, Linux does not.
      * Windows defines chars 0xB2 (²), 0xB3 (³) and 0xB9 (¹) as Alnum
        and Digit, Linux does not.
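Differences like the ones listed above can be reproduced with a small dump of the ctype classification for the upper half of the table. This is only a sketch: the names classify_byte, dump_high_half and the CLS_* flags are made up here, and the locale to test has to be selected with setlocale() beforehand, using whatever locale name the platform accepts.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical bit flags, one per ctype class of interest. */
enum {
    CLS_ALPHA = 1 << 0, CLS_ALNUM = 1 << 1, CLS_DIGIT = 1 << 2,
    CLS_PUNCT = 1 << 3, CLS_GRAPH = 1 << 4, CLS_PRINT = 1 << 5,
    CLS_SPACE = 1 << 6, CLS_BLANK = 1 << 7, CLS_CNTRL = 1 << 8
};

/* Classify one byte value (0..255) under the current LC_CTYPE locale. */
unsigned classify_byte(int c)
{
    unsigned m = 0;
    if (isalpha(c)) m |= CLS_ALPHA;
    if (isalnum(c)) m |= CLS_ALNUM;
    if (isdigit(c)) m |= CLS_DIGIT;
    if (ispunct(c)) m |= CLS_PUNCT;
    if (isgraph(c)) m |= CLS_GRAPH;
    if (isprint(c)) m |= CLS_PRINT;
    if (isspace(c)) m |= CLS_SPACE;
    if (isblank(c)) m |= CLS_BLANK;
    if (iscntrl(c)) m |= CLS_CNTRL;
    return m;
}

/* Write one "byte mask" line per value in 0x80..0xFF, in hex. */
void dump_high_half(FILE *out)
{
    for (int c = 0x80; c <= 0xFF; c++)
        fprintf(out, "0x%02X %03X\n", c, classify_byte(c));
}
```

Running dump_high_half(stdout) on each OS after the same setlocale(LC_CTYPE, ...) call and comparing the two outputs with a diff tool should show the entries that disagree.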

Now, I have some questions:

1. If I correctly understood the process of the chartables build at
    runtime,
     1. the ctype functions are used only in pcre2_maketables() so the
        locale can be set just before this call at thread level.

Yes.

     2. the char table returned should be freed after the calls to
        pcre2_match()

Yes.

     3. A compilation context is to be created to associate the char table.

Yes.

     4. Can it be freed just after the call to pcre2_compile()?

The compilation context can be freed, but the tables themselves must be
retained until all matches are done.

2. What do you think about the availability of a function to load the
    char tables as a binary file?
    This could be useful to get exactly the same tables on different OSes.

I suspect that this is a very specialist requirement, and I would rather
encourage people to switch to Unicode. Also, moving binary things
between OS is not that simple because of endian issues. And indeed
8/16/32 bit issues.

The work to switch to utf-8 is in progress.

However, it's not very efficient to translate MiB of external data to utf-8 each time I need to use a regexp. I can't modify the sources permanently.

In my opinion, pcre2_maketables() is independent of 8/16/32 bits since it's defined as uint8_t (i.e. bytes). For the same reason, I think there is no endianness issue in the computation of the table.

Saving and loading in binary should be ok.

If I'm wrong (I will return to school...), the same restrictions as for the serialization of REs can be applied.
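As an illustration of what I mean by saving and loading in binary, here is a minimal sketch using only stdio. The names save_tables, load_tables and ASSUMED_TABLES_LENGTH are hypothetical, and the length of 1088 bytes is an assumption taken from the internal table layout (256 + 256 + 320 + 256), not from the public API, so it would have to be checked against the PCRE2 version in use.

```c
#include <stddef.h>
#include <stdio.h>

/* Assumption: size in bytes of the block built by pcre2_maketables().
   This comes from PCRE2 internals and is not a documented constant. */
#define ASSUMED_TABLES_LENGTH 1088

/* Write len bytes of table data to path. Returns 0 on success, -1 on error. */
int save_tables(const char *path, const unsigned char *tables, size_t len)
{
    FILE *f = fopen(path, "wb");
    if (f == NULL) return -1;
    size_t written = fwrite(tables, 1, len, f);
    int close_rc = fclose(f);
    return (written == len && close_rc == 0) ? 0 : -1;
}

/* Read exactly len bytes from path into buf. Returns 0 on success,
   -1 if the file cannot be opened or is shorter than len. */
int load_tables(const char *path, unsigned char *buf, size_t len)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL) return -1;
    size_t got = fread(buf, 1, len, f);
    fclose(f);
    return (got == len) ? 0 : -1;
}
```

Since the data is a plain sequence of bytes, the file written on one OS would be read back byte for byte on the other; a loaded buffer could then be passed to pcre2_set_character_tables() in place of a freshly built table.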


Philip

Kind regards,
Patrice.

--

## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
