Steve Hay wrote on 2009-11-27:
> Nicholas Clark wrote on 2009-11-27:
>> On Fri, Nov 27, 2009 at 10:20:12AM -0000, Steve Hay wrote:
>>> I think setlocale() works okay:
>>> 
>>> C:\gitclients\perl\t>..\perl -MPOSIX -e "print strftime('%#x', localtime), qq[\n]; setlocale(LC_TIME, 'C'); print strftime('%#x', localtime)"
>>> 2009年11月27日
>>> Friday, November 27, 2009
>>> 
>> 
>> Strange. Then I don't understand. The test appears to be failing with
>> the %d failing to be interpolated, but the %Y and %m being handled just
>> fine:
>> 
>> is(strftime("%Y\x{5e74}%m\x{6708}%d\x{65e5}", gmtime($jan_16)),
>>    "1970\x{5e74}01\x{6708}16\x{65e5}",
>>    "strftime() can handle unicode chars in the format string");
>> 
>> This makes no sense to me.
>> 
> 
> The 'fmt' argument arriving in POSIX::strftime() (and hence also the
> 'fmt' in Perl_my_strftime()) doesn't look right in the debugger:
> 
> %Y<Japanese character>I%m<Japanese character>*d<Japanese character>*
> 
> Not sure where the I or * characters come from


Ah. I think that's the debugger decoding the 'fmt' as cp932 (it is
running in Japanese, after all, so it's bound to assume that, rather
than understanding that it's actually perl's utf8 format).

Sure enough, the first Japanese character in the 'fmt', \x{5e74}, has
this encoding in UTF-8:

\xe5\xb9\xb4

but if you decode those three octets as cp932 then you see two(!)
characters, \x{87f7}\x{ff74}. The latter is HALFWIDTH KATAKANA LETTER E,
which actually looks exactly like an I...
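
The misreading is easy to reproduce outside the debugger. Here is a minimal sketch in Python (used purely for illustration, not Perl), assuming Python's built-in "cp932" codec agrees with the Windows code page for these octets:

```python
import unicodedata

# U+5E74 encoded as UTF-8, i.e. the octets perl holds for it internally.
utf8_octets = "\u5e74".encode("utf-8")
print(utf8_octets.hex(" "))

# Misread the very same octets as cp932, as a Japanese-locale tool would.
misread = utf8_octets.decode("cp932")
for ch in misread:
    # Print each resulting code point with its Unicode name.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
```

The three octets pair up as one double-byte character plus one single-byte halfwidth katakana, which is why one character turns into two.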

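I think the % in the %d getting swallowed up is something to do with the
fact that the UTF-8 encoding (\xe6\x9c\x88) of the character before it
(\x{6708}) gives only one complete character (\x{8b5b}) as cp932, leaving
the third octet, \x88, dangling as a cp932 lead byte. If you include the %
character as a fourth octet to be (mis)interpreted as cp932 then you get
three characters (\x{8b5b}\x{fffd}%), so anything that simply pairs a lead
byte with whatever octet follows it would take the % as a trail byte.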
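
To make the swallowed % concrete, here is a rough Python sketch. The naive_mbcs_split() helper is hypothetical: it only imitates a scanner that trusts lead bytes without validating the trail byte, and I am assuming (not asserting) that the CRT does something along those lines. The lead-byte ranges used are the usual cp932 ones, 0x81-0x9F and 0xE0-0xFC.

```python
# UTF-8 octets of U+6708 followed by the "%d" from the format string.
octets = "\u6708".encode("utf-8") + b"%d"

# A strict cp932 decoder flags the dangling lead byte and keeps the '%'.
strict = octets.decode("cp932", errors="replace")


def naive_mbcs_split(data):
    """Hypothetical scanner: pair every cp932 lead byte with the next
    octet, without checking that the next octet is a valid trail byte."""
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        is_lead = 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC
        step = 2 if is_lead and i + 1 < len(data) else 1
        units.append(data[i:i + step])
        i += step
    return units


units = naive_mbcs_split(octets)
print([u.hex() for u in units])
```

With the naive scanner the % never appears as a unit of its own, so a format parser built on top of it would see a bare "d" rather than "%d".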

I suspect that when these octets get passed in to the CRT's strftime(),
it also sees them as the cp932 encoding of two characters instead of the
UTF-8 encoding of one character, and that's somehow screwing things up.

Basically, it doesn't appear to be safe to pass UTF-8 data (or anything
else other than data in the native encoding) into strftime(), at least
not with my system's strftime(). I think it would work if 'fmt' were
transcoded into cp932 before being passed into the CRT's strftime() (and
the resulting 'buf' transcoded back to UTF-8 afterwards). Of course, the
Unicode characters in 'fmt' won't always be representable in the native
encoding, and handling those was probably half the point of the recent
Unicode change for strftime() in the first place.
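
A small Python sketch of that trade-off, again only as an illustration. The to_native() helper is a made-up name, and the Thai character is just an arbitrary example of something cp932 cannot represent:

```python
def to_native(fmt, encoding="cp932"):
    """Return fmt encoded in the native code page, or None when some
    character in it has no representation there."""
    try:
        return fmt.encode(encoding)
    except UnicodeEncodeError:
        return None


# The format from the failing test survives a cp932 round trip...
japanese_fmt = "%Y\u5e74%m\u6708%d\u65e5"
native = to_native(japanese_fmt)
round_tripped = native.decode("cp932")

# ...but a format containing, say, a Thai letter cannot be transcoded.
thai_fmt = "%Y\u0e01%m"
print(to_native(thai_fmt))
```

So transcoding would rescue this particular test on a Japanese system, but it cannot be a general fix.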

Not sure where that leaves us. Do other people's strftime()s mess around
with the 'fmt' like this, or is it just a MS CRT problem?
