On Fri, Oct 29, 2010 at 02:09:32PM +0100, Roger Leigh wrote:
> On Fri, Oct 29, 2010 at 11:36:59AM +0200, Adam Borowski wrote:
> > I really wonder why you still need to install "locales" to get UTF-8.
> > Even in current glibc, it's a second-class citizen.  Several years ago,
> > I benchmarked a mockup of hard-coding UTF-8 the way ISO-8859-1 and
> > KOI8-R were done in the past, and it shaved 20% off the whole
> > fork-exec-ld-setlocale-getopt-...-exit sequence almost every program
> > does.  The character classification tables are needlessly duplicated
> > for every locale as well -- try an ISO-8859-1 locale and look at
> > iswfoo() for chars >0xFF: even though there's a separate copy per
> > locale, for all but C and POSIX it's identical.
>
> #522776 has quite a bit of information about basic UTF-8 support without
> locales (creation of C.UTF-8).

C.UTF-8 would carry yet another copy of that big table and provide no
performance benefit, but indeed, having a guaranteed UTF-8 locale would be
really, really useful.  I've read #522776, and it gives compelling reasons
to add C.UTF-8 right now, for squeeze -- we can discuss better
implementations later.

> From the end of the report, there was talk of getting C.UTF-8 into
> squeeze, but I'm not sure what the status of that work is at present
> (it's a trivial glibc tweak to generate and package the additional
> locale).

Especially since it has already been done for a udeb.

> Do you still have your patch for hard-coding UTF-8?  I did start doing
> this, but didn't get as far as having a working locale.  It might be a
> good starting point if it still works with current glibc.

1. It was several major glibc versions ago.

2. It was merely a mockup, not proper code.  The classification functions
   were stubs like:

       if (ch < 128)
           return value_for_C(ch);
       else
           return 0;

   I assumed that making the library bigger by a large table in its data
   segment would not noticeably affect speed, as it merely means mmapping
   a bigger chunk, without even a single additional syscall.

3. I did not investigate anything but character classification.  I suspect
   uppercasing would work too, but I didn't test it.

4. It broke legacy locales.

5. I don't seem to have it anymore, just some test programs for character
   classes.
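A minimal sketch of the kind of test program I mean -- not one of the
originals, and the locale names are only examples; any two loadable
locales will do:

    #include <locale.h>
    #include <stdio.h>
    #include <wctype.h>

    #define NCHARS 0x110000L    /* all Unicode code points */

    static unsigned char a[NCHARS], b[NCHARS];

    int main(void)
    {
        long ch, diffs = 0;

        /* Classify every code point under two different loadable locales. */
        if (!setlocale(LC_CTYPE, "en_US.UTF-8"))
            return 1;
        for (ch = 0; ch < NCHARS; ch++)
            a[ch] = !!iswalpha((wint_t)ch);

        if (!setlocale(LC_CTYPE, "de_DE.ISO-8859-1"))
            return 1;
        for (ch = 0; ch < NCHARS; ch++)
            b[ch] = !!iswalpha((wint_t)ch);

        for (ch = 0; ch < NCHARS; ch++)
            diffs += a[ch] != b[ch];
        printf("iswalpha() differs for %ld code points\n", diffs);
        return 0;
    }

Swap in the other isw*() functions to cover the remaining classes.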
> I agree the duplication of character tables in glibc is totally insane;
> a single copy of each character set is more than plenty, and having both
> ASCII and UTF-8 hard-coded into glibc would be a major performance
> improvement, though it would require eliminating the duplication on
> locale loading.  Having the entire UTF-8 table duplicated for each
> different locale you use is just mad.

At least in the version I looked at (unstable in July 2006), all wctype()
functions returned the same values for all loadable locales.  The two
hard-coded ones, C and POSIX, carry data for characters 0..127 only, and
they are the only ones that differ.  The only function I found that was
actually locale-dependent was wcwidth().

For the 8-bit classification routines, legacy locales would need to iconv
at most 128 characters -- the API can't support multi-byte CJK encodings
anyway.  That's still a lot faster than opening a file.  (A sketch of the
idea follows in the P.S.)

Meow?
-- 
1KB	// Microsoft corollary to Hanlon's razor:
	//	Never attribute to stupidity what can be
	//	adequately explained by malice.
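P.S. To make the iconv() idea concrete, a rough sketch -- untested, and
the charset name is just an example -- of deriving a legacy locale's
8-bit "alpha" table from the single shared Unicode table:

    #include <iconv.h>
    #include <locale.h>
    #include <stdio.h>
    #include <wctype.h>

    int main(void)
    {
        /* "WCHAR_T" is glibc's name for its internal wide-char encoding. */
        iconv_t cd = iconv_open("WCHAR_T", "ISO-8859-2");
        unsigned char alpha[256] = { 0 };
        int i;

        if (cd == (iconv_t)-1)
            return 1;
        /* Any UTF-8 locale; with hard-coded tables this call would go away. */
        if (!setlocale(LC_CTYPE, "en_US.UTF-8"))
            return 1;

        /* Bytes 0..127 are plain ASCII; only the top half needs iconv. */
        for (i = 128; i < 256; i++) {
            unsigned char in = (unsigned char)i;
            wchar_t out;
            char *inp = (char *)&in, *outp = (char *)&out;
            size_t inb = 1, outb = sizeof out;

            if (iconv(cd, &inp, &inb, &outp, &outb) != (size_t)-1)
                alpha[i] = !!iswalpha((wint_t)out);
        }
        iconv_close(cd);

        for (i = 128; i < 256; i++)
            if (alpha[i])
                printf("0x%02x\n", i);
        return 0;
    }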