Steve Hay wrote on 2009-11-27:
> Nicholas Clark wrote on 2009-11-27:
>> On Fri, Nov 27, 2009 at 10:20:12AM -0000, Steve Hay wrote:
>>> I think setlocale() works okay:
>>>
>>> C:\gitclients\perl\t>..\perl -MPOSIX -e "print strftime('%#x', localtime), qq[\n]; setlocale(LC_TIME, 'C'); print strftime('%#x', localtime)"
>>> 2009年11月27日
>>> Friday, November 27, 2009
>>>
>>
>> Strange. Then I don't understand. The test appears to be failing with
>> the %d failing to be interpolated, but the %Y and %m being handled
>> just fine:
>>
>> is(strftime("%Y\x{5e74}%m\x{6708}%d\x{65e5}", gmtime($jan_16)),
>>    "1970\x{5e74}01\x{6708}16\x{65e5}",
>>    "strftime() can handle unicode chars in the format string");
>>
>> This makes no sense to me.
>>
>
> The 'fmt' argument arriving in POSIX::strftime() (and hence also the
> 'fmt' in Perl_my_strftime()) doesn't look right in the debugger:
>
> %Y<Japanese character>I%m<Japanese character>*d<Japanese character>*
>
> Not sure where the I or * characters come from

Ah. I think that's the debugger decoding the 'fmt' as cp932 (it is
running in Japanese, after all, so it's bound to assume that, rather
than understanding that it's actually perl's utf8 format).

Sure enough, the first Japanese character in the 'fmt', \x{5e74}, has
this encoding in UTF-8:

\xe5\xb9\xb4

but if you decode those three octets as cp932 then you see the two(!)
characters \x{87f7}\x{ff74} instead! The latter is HALFWIDTH KATAKANA
LETTER E, which actually looks exactly like an I...

I think the % in the %d getting swallowed up is something to do with
the fact that the UTF-8 encoding (\xe6\x9c\x88) of the character before
it (\x{6708}) only gives one character (\x{8b5b}) as cp932, but if you
include the % character as a fourth octet to be (mis)interpreted as
cp932 then you get three characters (\x{8b5b}\x{fffd}%).

I suspect that when these octets get passed in to the CRT's strftime(),
it also sees them as the cp932 encoding of two characters instead of the
UTF-8 encoding of one character, and that's somehow screwing things up.

Basically, it doesn't appear to be safe to pass UTF-8 data (or anything
else other than data in the native encoding) into strftime(), at least
not with my system's strftime().

I think if 'fmt' were transcoded into cp932 before being passed into the
CRT's strftime() (and the resulting 'buf' transcoded back to UTF-8
afterwards) then it would work, but, of course, the Unicode characters
in 'fmt' won't always be representable in the native encoding, which was
probably half the point of the recent Unicode change for strftime() in
the first place.

Not sure where that leaves us. Do other people's strftime()s mess around
with the 'fmt' like this, or is it just a MS CRT problem?
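
To make the byte arithmetic above easier to check, here is a small Python sketch of the suspected failure. It is only a model: it assumes a double-byte-aware scanner in which every cp932 lead byte (0x81-0x9F or 0xE0-0xFC) consumes the following byte, whether or not that byte is a valid trail byte. I don't know that the MS CRT's strftime() actually scans this way, and the names is_cp932_lead(), swallowed_percents() and FMT are made up for the illustration.

```python
# Sketch only: models a double-byte-aware scanner for cp932, where any
# lead byte is assumed to consume the byte after it, valid trail or not.
# Whether the MS CRT's strftime() really scans this way is an assumption.

def is_cp932_lead(byte):
    """True for cp932 lead bytes (0x81-0x9F and 0xE0-0xFC)."""
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC


def swallowed_percents(data):
    """Return offsets of '%' bytes that get consumed as a trail byte."""
    swallowed = []
    i = 0
    while i < len(data):
        if is_cp932_lead(data[i]) and i + 1 < len(data):
            if data[i + 1] == ord("%"):
                swallowed.append(i + 1)
            i += 2  # lead byte plus whatever follows it
        else:
            i += 1  # ASCII or single-byte halfwidth katakana
    return swallowed


# The format string from the failing test, as UTF-8 octets.
FMT = "%Y\u5e74%m\u6708%d\u65e5".encode("utf-8")

if __name__ == "__main__":
    for offset in swallowed_percents(FMT):
        # Show which conversion specifier lost its '%'.
        print(offset, FMT[offset:offset + 2].decode("ascii"))
```

Under that model the only '%' that gets eaten is the one at offset 10, i.e. the one in %d, because the last UTF-8 octet of \x{6708} (\x88) is a cp932 lead byte. The last octets of \x{5e74} and \x{65e5} (\xb4 and \xa5) are single-byte halfwidth katakana in cp932, so nothing after them is consumed. That matches the test failure, but it doesn't prove that this is what the CRT is doing.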