Re: [Python-ideas] Windows Best Fit Encodings

2018-01-21 Thread M.-A. Lemburg
On 20.01.2018 08:01, Steve Dower wrote:
> On 20Jan2018 0518, M.-A. Lemburg wrote:
>> do you know of a definitive resource for Windows code pages
>> on MSDN or another official MS website?
> 
> I don't know of anything sorry, and my quick search didn't turn up
> anything public. But I can at least confirm that the internal table for
> cp1252 has the same undefined characters as on unicode.org, so
> presumably if MultiByteToWideChar is mapping those to "best fit"
> characters it's only because the flag has been passed. As far as I can
> tell, Microsoft has not been secretly redefining any encodings.

Thanks for confirming, Steve.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts (#1, Jan 21 2018)
>>> Python Projects, Coaching and Consulting ...  http://www.egenix.com/
>>> Python Database Interfaces ...   http://products.egenix.com/
>>> Plone/Zope Database Interfaces ...   http://zope.egenix.com/


::: We implement business ideas - efficiently in both time and costs :::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/
  http://www.malemburg.com/

___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Windows Best Fit Encodings

2018-01-20 Thread Random832
On Sat, Jan 20, 2018, at 02:01, Steve Dower wrote:
> On 20Jan2018 0518, M.-A. Lemburg wrote:
> > do you know of a definitive resource for Windows code pages
> > on MSDN or another official MS website?

I don't know what happened to this page, but I was able to find better-looking 
codepage tables at
http://web.archive.org/web/20160314211032/https://msdn.microsoft.com/en-us/goglobal/bb964654

Older versions at:
web.archive.org/web/*/http://www.microsoft.com:80/globaldev/reference/WinCP.asp
web.archive.org/web/*/http://www.microsoft.com:80/globaldev/reference/WinCP.mspx

See also, still live:
https://www.microsoft.com/typography/unicode/cscp.htm
(this has 0xCA in the graphical table for cp1255, the other does not)

> 
> I don't know of anything sorry, and my quick search didn't turn up 
> anything public. But I can at least confirm that the internal table for 
> cp1252 has the same undefined characters as on unicode.org, so 
> presumably if MultiByteToWideChar is mapping those to "best fit" 
> characters it's only because the flag has been passed.

I'm passing MB_ERR_INVALID_CHARS. And is this just as true for cp1255 0xCA as 
for the control characters? MultiByteToWideChar doesn't even *have* a flag for 
"best fit".

I was not able to identify any combination of flags that can be passed to 
either function on Windows 7 that would cause e.g. 0x81 in cp1252 to be treated 
any differently from any other character.
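
For anyone who wants to repeat the experiment, here is a rough ctypes sketch of the kind of call I mean. The helper name decode_with_winapi is made up for illustration; MultiByteToWideChar and the MB_ERR_INVALID_CHARS value (0x8) are from the Windows headers. It only does anything useful on Windows, and I am not claiming any particular result for any particular byte.

```python
import ctypes
import sys

# Flag value from the Windows SDK headers (winnls.h).
MB_ERR_INVALID_CHARS = 0x00000008


def decode_with_winapi(data: bytes, codepage: int,
                       flags: int = MB_ERR_INVALID_CHARS) -> str:
    """Decode `data` with kernel32's MultiByteToWideChar (hypothetical helper)."""
    if not data:
        return ""
    if sys.platform != "win32":
        raise OSError("MultiByteToWideChar is only available on Windows")

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    mb2wc = kernel32.MultiByteToWideChar
    mb2wc.argtypes = [ctypes.c_uint, ctypes.c_uint32, ctypes.c_char_p,
                      ctypes.c_int, ctypes.c_wchar_p, ctypes.c_int]
    mb2wc.restype = ctypes.c_int

    # First call: ask how many WCHARs the result needs.
    size = mb2wc(codepage, flags, data, len(data), None, 0)
    if size == 0:
        raise ctypes.WinError(ctypes.get_last_error())

    # Second call: do the actual conversion into a buffer of that size.
    buf = ctypes.create_unicode_buffer(size)
    written = mb2wc(codepage, flags, data, len(data), buf, size)
    if written == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    return buf[:written]
```

Calling something like decode_with_winapi(b"\x81", 1252) with and without the flag is the comparison described above.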

The C_1252.NLS file appears to consist of:

28 bytes of header
512 bytes WCHAR[256] of mappings e.g.
010c: 7800 7900 7a00 7b00 7c00 7d00 7e00 7f00  x.y.z.{.|.}.~...
011c: ac20 8100 1a20 9201 1e20 2620 2020 2120  . ... ... &   !
012c: c602 3020 6001 3920 5201 8d00 7d01 8f00  ..0 `.9 R...}...
013c: 9000 1820 1920 1c20 1d20 2220 1320 1420  ... . . . " . .
014c: dc02 2221 6101 3a20 5301 9d00 7e01 7801  .."!a.: S...~.x.
015c: a000 a100 a200 a300 a400 a500 a600 a700  
Six zero bytes
BYTE[65536] apparently of the best fit mappings, e.g.
02a2: 3f81 3f3f 3f3f 3f3f 3f3f 3f3f 3f8d 3f8f  ?.???.?.
02b2: 903f 3f3f 3f3f 3f3f 3f3f 3f3f 3f9d 3f3f  ..??
0312: f0f1 f2f3 f4f5 f6f7 f8f9 fafb fcfd feff  
0322: 4161 4161 4161 4363 4363 4363 4363 4464  AaAaAaCcCcCcCcDd
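
To make the layout concrete, here is a small Python sketch that slices a blob according to the sizes listed above (28-byte header, WCHAR[256], six zero bytes, BYTE[65536]). The layout is only my reading of the dump, not a documented format, and the names parse_sbcs_nls and make_demo_blob are invented for this example. The demo blob is synthetic; only a handful of entries are copied from the dump.

```python
import struct

HEADER_SIZE = 28         # "28 bytes of header"
DECODE_ENTRIES = 256     # WCHAR[256], little-endian 16-bit values
GAP_SIZE = 6             # "Six zero bytes"
ENCODE_ENTRIES = 65536   # BYTE[65536], indexed by UTF-16 code unit

DECODE_OFFSET = HEADER_SIZE
ENCODE_OFFSET = DECODE_OFFSET + 2 * DECODE_ENTRIES + GAP_SIZE


def parse_sbcs_nls(blob: bytes):
    """Split a single-byte codepage blob into (decode_table, encode_table)."""
    if len(blob) < ENCODE_OFFSET + ENCODE_ENTRIES:
        raise ValueError("blob is too short for the assumed layout")
    decode_table = struct.unpack_from("<%dH" % DECODE_ENTRIES, blob, DECODE_OFFSET)
    encode_table = blob[ENCODE_OFFSET:ENCODE_OFFSET + ENCODE_ENTRIES]
    return decode_table, encode_table


def make_demo_blob() -> bytes:
    """Build a synthetic blob; NOT the real C_1252.NLS contents."""
    decode = list(range(256))        # placeholder identity mapping
    decode[0x80] = 0x20AC            # "ac20" at 011c in the dump
    encode = bytearray(b"?" * ENCODE_ENTRIES)
    for byte in range(0x80):
        encode[byte] = byte          # ASCII maps to itself
    encode[0x0081] = 0x81            # "3f81" at 02a2: U+0081 -> 0x81
    encode[0x20AC] = 0x80
    encode[0x0100] = ord("A")        # "4161" at 0322: best fit for U+0100
    encode[0x0101] = ord("a")
    return (bytes(HEADER_SIZE)
            + struct.pack("<%dH" % DECODE_ENTRIES, *decode)
            + bytes(GAP_SIZE)
            + bytes(encode))


decode_table, encode_table = parse_sbcs_nls(make_demo_blob())
```

The offsets agree with the dump: the entry for byte 0x80 sits at 28 + 2*0x80 = 0x11c, and the encode entry for U+0080 sits at 28 + 512 + 6 + 0x80 = 0x2a2.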

I don't see where the file format even has room to identify characters as 
invalid (or how WideCharToMultiByte disables the best fit mappings, unless it's 
by checking the result against the WCHAR[256] table), though CP1253 and CP1255 
seem to manage it. The ones in those codepages that do return an error are 
mapped (if the flag is not passed in, and in the NLS file tables) to private 
use characters U+F8xx.
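
The round-trip idea in the parenthesis above can be sketched in a few lines of Python. This is pure speculation about how the check might work, not a description of what WideCharToMultiByte actually does; encode_char and the toy tables are invented for illustration.

```python
def encode_char(ch, decode_table, encode_table, allow_best_fit=True):
    """Encode one character; return the byte, or None if it is rejected.

    With allow_best_fit=False, a result is accepted only if decoding the
    byte gives back the original code point (the guessed round-trip check).
    """
    cp = ord(ch)
    if cp > 0xFFFF:
        return None                      # the table only covers the BMP
    byte = encode_table[cp]
    if not allow_best_fit and decode_table[byte] != cp:
        return None                      # lossy mapping, treat as an error
    return byte


# Toy tables: identity for 0x00-0xFF, '?' elsewhere, one best-fit entry.
toy_decode = list(range(256))
toy_encode = bytearray(b"?" * 65536)
for b in range(256):
    toy_encode[b] = b
toy_encode[0x0100] = ord("A")            # U+0100 best-fits to 'A'

best_fit = encode_char("\u0100", toy_decode, toy_encode)
strict = encode_char("\u0100", toy_decode, toy_encode, allow_best_fit=False)
plain = encode_char("A", toy_decode, toy_encode, allow_best_fit=False)
```

Here best_fit is the byte for 'A', strict is None because 'A' decodes to U+0041 rather than U+0100, and plain survives the check because it round-trips.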

> As far as I can 
> tell, Microsoft has not been secretly redefining any encodings.

Not so much redefining as holding back these characters from the published 
definition. I was being a bit overly dramatic with the 'for some unknown 
reason' bit; it seems obvious that the reason is they wanted to reserve the 
ability to add new characters in the future, as they did for the Euro sign. 
And there's nothing wrong with that, per se, though it's unfortunate that 
their own conversion functions can't treat these bytes as errors.

Looking at the actual files, it looks like the ones in the "best fit" directory 
are in a format used internally by Microsoft (at a glance, they seem to contain 
enough information to generate the .NLS files, including stuff like the 
question marks in the header and the structure of DBCS tables), and the ones in 
the other mappings directory are sanitized and converted to more or less the 
same format as the other mappings.

(As for 1255 0xCA, the comment in the best fit file suggests that it was 
unclear what Hebrew vowel point it was meant to be.)


Re: [Python-ideas] Windows Best Fit Encodings

2018-01-19 Thread Steve Dower

On 20Jan2018 0518, M.-A. Lemburg wrote:
> do you know of a definitive resource for Windows code pages
> on MSDN or another official MS website?


I don't know of anything sorry, and my quick search didn't turn up 
anything public. But I can at least confirm that the internal table for 
cp1252 has the same undefined characters as on unicode.org, so 
presumably if MultiByteToWideChar is mapping those to "best fit" 
characters it's only because the flag has been passed. As far as I can 
tell, Microsoft has not been secretly redefining any encodings.


Cheers,
Steve


Re: [Python-ideas] Windows Best Fit Encodings

2018-01-19 Thread M.-A. Lemburg
Hi Steve,

do you know of a definitive resource for Windows code pages
on MSDN or another official MS website?

I tried to find some links, but only got these ancient
ones:

https://msdn.microsoft.com/en-us/library/cc195054.aspx

(this version of cp1252 doesn't even have the euro sign yet)

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com



On 19.01.2018 18:17, M.-A. Lemburg wrote:
> On 19.01.2018 17:24, Random832 wrote:
>> On Fri, Jan 19, 2018, at 08:30, M.-A. Lemburg wrote:
>>>> Someone did discover that Microsoft's current implementations of the
>>>> windows-* encodings match the WHAT-WG spec, rather than the Unicode
>>>> spec that Microsoft originally wrote.
>>>
>>> No, MS implements something called "best fit encodings"
>>> and these are different than what WHATWG uses.
>>
>> NO. I made this absolutely clear in my previous message, best fit mappings 
>> can be clearly distinguished from regular mappings by the behavior of the 
>> native conversion functions with certain argument flags (the mapping of 0xA0 
>> to some private use character in cp932, for example, is a best-fit mapping 
>> in the decoding direction - but is treated as a regular mapping for encoding 
>> purposes), and the mapping of 0x81 to U+0081 in cp1252 etc is NOT a best fit 
>> mapping or in any way different from the rest of the mappings.
>>
>> We are not talking about implementing the best fit mappings. We are talking 
>> about real regular mappings that actually exist in these codepages that were 
>> for some unknown reason not included in the files published by Unicode.
> 
> I only know the best fit encoding maps that are available
> on the Unicode site.
> 
> If I read your comment correctly, you are saying that MS has
> moved away from the standard code pages towards something
> else - perhaps even something other than the best fit encodings
> listed on the Unicode site ?
> 
> Do you have some references for this ?
> 
> Note that the Windows code page codecs implemented in Python
> are all based on the Unicode mapping files and those were
> created by MS.
> 
>>> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx
>>>
>>> unfortunately uses the above mentioned best fit encodings,
>>> but this can and should be switched off by specifying the
>>> WC_NO_BEST_FIT_CHARS for anything that requires validation
>>> or needs to be interoperable:
>>
>> Specifying this flag (and MB_ERR_INVALID_CHARS in the other direction) in 
>> fact does not disable the mappings we are discussing.
> 
> Interesting. The CP1252 mapping clearly defines 0x81 to map
> to undefined, whereas the bestfit1252 maps it to 0x0081:
> 
> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
> http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt
> 
> Same for the example you gave for CP932:
> 
> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
> http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit932.txt
> 
> So at least following the documentation you'd expect the function
> to implement the regular mappings.
> 



Re: [Python-ideas] Windows Best Fit Encodings (was: Support WHATWG versions of legacy encodings)

2018-01-19 Thread M.-A. Lemburg
On 19.01.2018 17:24, Random832 wrote:
> On Fri, Jan 19, 2018, at 08:30, M.-A. Lemburg wrote:
>>> Someone did discover that Microsoft's current implementations of the
>>> windows-* encodings match the WHAT-WG spec, rather than the Unicode
>>> spec that Microsoft originally wrote.
>>
>> No, MS implements something called "best fit encodings"
>> and these are different than what WHATWG uses.
> 
> NO. I made this absolutely clear in my previous message, best fit mappings 
> can be clearly distinguished from regular mappings by the behavior of the 
> native conversion functions with certain argument flags (the mapping of 0xA0 
> to some private use character in cp932, for example, is a best-fit mapping in 
> the decoding direction - but is treated as a regular mapping for encoding 
> purposes), and the mapping of 0x81 to U+0081 in cp1252 etc is NOT a best fit 
> mapping or in any way different from the rest of the mappings.
> 
> We are not talking about implementing the best fit mappings. We are talking 
> about real regular mappings that actually exist in these codepages that were 
> for some unknown reason not included in the files published by Unicode.

I only know the best fit encoding maps that are available
on the Unicode site.

If I read your comment correctly, you are saying that MS has
moved away from the standard code pages towards something
else - perhaps even something other than the best fit encodings
listed on the Unicode site ?

Do you have some references for this ?

Note that the Windows code page codecs implemented in Python
are all based on the Unicode mapping files and those were
created by MS.
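
As a quick illustration of what that means in practice: Python's cp1252 codec rejects byte 0x81 because the Unicode mapping file marks it undefined. The sketch below shows one possible way to get a more lenient decode by passing such bytes through as the code point of the same value; the handler name "c1-passthrough" and the helper decode_cp1252_lenient are made up for this example and are not part of the stdlib.

```python
import codecs


def c1_passthrough(exc):
    """Error handler: map each undecodable byte to the same code point."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(b) for b in bad), exc.end


# Hypothetical handler name, registered only for this example.
codecs.register_error("c1-passthrough", c1_passthrough)


def decode_cp1252_lenient(data: bytes) -> str:
    """Decode cp1252, letting undefined bytes through as C1 controls."""
    return data.decode("cp1252", errors="c1-passthrough")


try:
    b"\x81".decode("cp1252")
    strict_rejects_0x81 = False
except UnicodeDecodeError:
    strict_rejects_0x81 = True

lenient = decode_cp1252_lenient(b"\x80\x81")
```

With the strict codec the decode of 0x81 fails, while the lenient variant still maps 0x80 to the euro sign and only falls back for the undefined byte.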

>> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx
>>
>> unfortunately uses the above mentioned best fit encodings,
>> but this can and should be switched off by specifying the
>> WC_NO_BEST_FIT_CHARS for anything that requires validation
>> or needs to be interoperable:
> 
> Specifying this flag (and MB_ERR_INVALID_CHARS in the other direction) in 
> fact does not disable the mappings we are discussing.

Interesting. The CP1252 mapping clearly defines 0x81 to map
to undefined, whereas the bestfit1252 maps it to 0x0081:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

Same for the example you gave for CP932:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit932.txt

So at least following the documentation you'd expect the function
to implement the regular mappings.
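
A small Python sketch of how one could compare the two kinds of files mechanically. The sample lines are hand-written in the approximate style of those files (an assumption, not verbatim copies), showing byte 0x81 as undefined in the regular mapping and as 0x0081 in the best fit one; parse_mapping and diff_mappings are hypothetical helper names.

```python
import re

# Hand-written samples, approximating the style of the published files.
SAMPLE_CP1252 = """\
0x80\t0x20AC\t#EURO SIGN
0x81\t      \t#UNDEFINED
0x82\t0x201A\t#SINGLE LOW-9 QUOTATION MARK
"""

SAMPLE_BESTFIT1252 = """\
CODEPAGE 1252
0x80\t0x20ac\t;Euro Sign
0x81\t0x0081
0x82\t0x201a\t;Single Low-9 Quotation Mark
"""


def parse_mapping(text):
    """Return {byte: code point or None} from 'byte unicode comment' lines."""
    table = {}
    for line in text.splitlines():
        fields = re.split(r"[#;]", line, maxsplit=1)[0].split()
        if not fields or not fields[0].lower().startswith("0x"):
            continue                      # blank line or keyword line
        byte = int(fields[0], 16)
        table[byte] = int(fields[1], 16) if len(fields) > 1 else None
    return table


def diff_mappings(regular, best_fit):
    """Return {byte: (regular, best_fit)} for bytes that map differently."""
    return {
        byte: (regular.get(byte), best_fit.get(byte))
        for byte in sorted(set(regular) | set(best_fit))
        if regular.get(byte) != best_fit.get(byte)
    }


differences = diff_mappings(parse_mapping(SAMPLE_CP1252),
                            parse_mapping(SAMPLE_BESTFIT1252))
```

On these samples the only difference reported is byte 0x81, undefined on one side and U+0081 on the other.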

-- 
Marc-Andre Lemburg
eGenix.com

