Hello Philip,
Thank you very much for your answer.
My programs run on both Windows and Linux (Debian), so I expect them to
work in the same way...
[email protected] wrote:
On Thu, 13 Feb 2020, Patrice Guérin wrote:
I'm facing some problems with the locale character table definitions.
Locales are a nightmare. We will all be able to rejoice when Unicode is
everywhere. I'm afraid I know very little about locales, and as I'm a
Linux user, I know nothing about Windows versions except that there are
differences.
I agree. But locales are a lot more than just the 256-byte character
table, like iso-8859-X or Windows-1252 (aka CP1252, cf.
http://www.tachyonsoft.com/cp01252.htm).
I really don't know who is right or not, and though the following
characters are not widely used (except euro symbol), I think this
can lead to inconsistencies.
I'm sure it can, but I suspect there isn't anything that can be done
about it.
* Windows defines all unassigned characters (in hex) 81, 8D, 8F,
90, 9D as Ctrl, Linux does not.
* Linux defines char 0x80 (€ symbol) as Graph, Print and Punct,
Windows does not.
* Linux defines char 0x88 (U+02c6) as Alpha, Alnum, Graph and
Print, Windows does not.
I do not understand what you mean by "0x88 (U+02c6)" because locales
handle only 256 characters.
Windows-1252 is almost the same as iso-8859-1 except for the first 32
bytes in the extended table (that is, 0x80 to 0x9f).
As an example, the euro symbol € is 0x80, which corresponds to U+20AC in
Unicode.
In iso-8859-15, the euro symbol is 0xA4 (it replaces the currency symbol).
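To make the "0x88 (U+02c6)" notation concrete, here is a minimal sketch of the byte-to-Unicode mapping for that 0x80 to 0x9f range. The function name cp1252_to_unicode and the table name are hypothetical (they are not part of PCRE2 or libc), and -1 is simply my choice for marking the five unassigned bytes.

```c
/* Unicode code points for Windows-1252 bytes 0x80..0x9F.
   -1 marks the five bytes the code page leaves unassigned
   (0x81, 0x8D, 0x8F, 0x90, 0x9D). */
static const long cp1252_high[32] = {
    0x20AC, -1,     0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, -1,     0x017D, -1,
    -1,     0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, -1,     0x017E, 0x0178
};

/* Hypothetical helper: returns the Unicode code point for a
   Windows-1252 byte, or -1 if the byte is unassigned. */
long cp1252_to_unicode(unsigned char byte)
{
    if (byte >= 0x80 && byte <= 0x9F)
        return cp1252_high[byte - 0x80];
    return (long)byte;  /* identical to iso-8859-1 outside that range */
}
```

So the locale table still has only 256 entries; the U+xxxx value just names which character a given byte stands for in that code page.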
* Linux defines char 0x98 (U+02dc) and 0x99 (U+2122) as Graph,
Print and Punct, Windows does not.
* Linux defines char 0xA0 (nbsp) as Graph, Print and Punct,
Windows defines it as Space, Blank and Print.
* Windows defines chars 0xAA (ª), 0xB5 (µ) and 0xBA (º) as Punct,
Linux does not.
* Windows defines char 0xAD (Soft hyphen) as Ctrl, Linux does not.
* Windows defines chars 0xB2 (²), 0xB3 (³) and 0xB9 (¹) as Alnum
and Digit, Linux does not.
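For anyone who wants to reproduce this kind of comparison, the following is a rough sketch using only the standard <ctype.h> functions. The names ctype_mask, dump_high_half and the CT_* flags are hypothetical helpers. The caller is assumed to have selected an 8-bit locale with setlocale(LC_CTYPE, ...) beforehand; the locale name is platform-specific and is not given here.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical flag values, one per class named in the list above. */
#define CT_ALPHA 0x001u
#define CT_ALNUM 0x002u
#define CT_DIGIT 0x004u
#define CT_PUNCT 0x008u
#define CT_GRAPH 0x010u
#define CT_PRINT 0x020u
#define CT_SPACE 0x040u
#define CT_BLANK 0x080u
#define CT_CTRL  0x100u

static const struct {
    unsigned bit;
    const char *name;
    int (*test)(int);
} ct_classes[] = {
    { CT_ALPHA, "Alpha", isalpha }, { CT_ALNUM, "Alnum", isalnum },
    { CT_DIGIT, "Digit", isdigit }, { CT_PUNCT, "Punct", ispunct },
    { CT_GRAPH, "Graph", isgraph }, { CT_PRINT, "Print", isprint },
    { CT_SPACE, "Space", isspace }, { CT_BLANK, "Blank", isblank },
    { CT_CTRL,  "Ctrl",  iscntrl }
};

/* Returns the set of classes the current LC_CTYPE locale gives to byte c. */
unsigned ctype_mask(unsigned char c)
{
    unsigned mask = 0;
    for (size_t i = 0; i < sizeof ct_classes / sizeof ct_classes[0]; i++)
        if (ct_classes[i].test(c))
            mask |= ct_classes[i].bit;
    return mask;
}

/* Prints one line per byte 0x80..0xFF with the names of its classes. */
void dump_high_half(FILE *out)
{
    for (int c = 0x80; c <= 0xFF; c++) {
        unsigned mask = ctype_mask((unsigned char)c);
        fprintf(out, "0x%02X:", (unsigned)c);
        for (size_t i = 0; i < sizeof ct_classes / sizeof ct_classes[0]; i++)
            if (mask & ct_classes[i].bit)
                fprintf(out, " %s", ct_classes[i].name);
        fputc('\n', out);
    }
}
```

Running dump_high_half(stdout) on each system and comparing the two outputs with a diff tool should show the kind of differences listed above.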
Now, I have some questions:
1. If I have correctly understood how the character tables are built at
runtime,
1. the ctype functions are used only in pcre2_maketables() so the
locale can be set just before this call at thread level.
Yes.
2. the char table returned should be freed after the calls to
pcre2_match()
Yes.
3. A compilation context is to be created to associate the char table.
Yes.
4. Can it be freed just after the call to pcre2_compile() ?
The compilation context can be freed, but the tables themselves must be
retained until all matches are done.
2. What do you think about providing a function to load the
char tables from a binary file?
This could be useful to get exactly the same tables on different OSes.
I suspect that this is a very specialist requirement, and I would rather
encourage people to switch to Unicode. Also, moving binary things
between OS is not that simple because of endian issues. And indeed
8/16/32 bit issues.
The work to switch to utf-8 is in progress.
However, it's not very efficient to translate MiB of external data to
utf-8 each time I need to use regexp. I can't modify the sources
permanently.
In my opinion, the result of pcre2_maketables() is independent of the
8/16/32-bit code unit width, since it's defined as uint8_t (i.e. bytes).
For the same reason, I think there is no endianness issue in the
computation of the table.
Saving and loading in binary should be ok.
If I'm wrong (I will return to school...), the same restrictions as for
the serialization of REs can be applied.
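What I have in mind is no more than the following sketch, written with plain stdio. The names save_table and load_table are hypothetical, not PCRE2 functions. The table length is left as a parameter because I have not stated it in this thread; the caller has to know it. This assumes the table really is a flat array of bytes, as I argue above.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: writes len bytes of table to path.
   Returns 0 on success, -1 on any I/O error. */
int save_table(const char *path, const uint8_t *table, size_t len)
{
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    size_t written = fwrite(table, 1, len, f);
    int close_failed = (fclose(f) != 0);
    return (written == len && !close_failed) ? 0 : -1;
}

/* Hypothetical helper: reads exactly len bytes from path into a
   malloc'ed buffer. Returns NULL if the file is missing, shorter or
   longer than len. The caller frees the result with free(). */
uint8_t *load_table(const char *path, size_t len)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;
    uint8_t *table = malloc(len);
    int ok = table != NULL
          && fread(table, 1, len, f) == len
          && fgetc(f) == EOF;   /* reject files longer than expected */
    fclose(f);
    if (!ok) {
        free(table);
        return NULL;
    }
    return table;
}
```

The buffer returned by load_table() would then take the place of the pcre2_maketables() result when calling pcre2_set_character_tables(), and it must stay allocated until all matches are done, as you said above.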
Philip
Kind regards,
Patrice.
--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev