(Actually, this seems more like a job for a type class.)

2010/8/17 Gábor Lehel <illiss...@gmail.com>:
> Someone mentioned earlier that IHHO all of this messing around with
> encodings and conversions should be handled transparently, and I guess
> you could do something like have the internal representation be along
> the lines of Either UTF8 UTF16 (or perhaps even more encodings), and
> then implement every function in the API equivalently for each
> representation (with only the performance characteristics differing),
> with input/output functions being specialized for each encoding, and
> then only do a conversion when necessary or explicitly requested. But
> I assume that would have other problems (like the implicit conversions
> causing hard-to-track-down performance bugs when they're triggered
> unintentionally).
>
> On Tue, Aug 17, 2010 at 3:21 PM, Daniel Peebles <pumpkin...@gmail.com> wrote:
>> Sounds to me like we need a lazy Data.Text variation that allows UTF-8 and
>> UTF-16 "segments" in its list of strict text elements :) Then big chunks of
>> western text will be encoded efficiently, and same with CJK! Not sure what
>> to do about strict Data.Text though :)
>>
>> On Tue, Aug 17, 2010 at 1:40 PM, Ketil Malde <ke...@malde.org> wrote:
>>>
>>> Michael Snoyman <mich...@snoyman.com> writes:
>>>
>>> > As far as space usage, you are correct that CJK data will take up more
>>> > memory in UTF-8 than UTF-16.
>>>
>>> With the danger of sounding ... alphabetist? as well as belaboring a
>>> point I agree is irrelevant (the storage format):
>>>
>>> I'd point out that it seems at least as unfair to optimize for CJK at
>>> the cost of Western languages. UTF-16 uses two bytes for (most) CJK
>>> ideograms, and (all, I think) characters in Western and other phonetic
>>> scripts. UTF-8 uses one to two bytes for a lot of Western alphabets,
>>> but three for CJK ideograms.
>>>
>>> Now, CJK has about 20K ideograms, which is almost 15 bits per ideogram,
>>> while an ASCII letter is about six bits. Thus, the information density
>>> of CJK and ASCII is about equal for UTF-8, 5/8 vs 6/8 - compared to
>>> 15/16 vs 6/16 for UTF-16. In other words a given document translated
>>> between Chinese and English should occupy roughly the same space in
>>> UTF-8, but be 2.5 times longer in English for UTF-16.
>>>
>>> -k
>>> --
>>> If I haven't seen further, it is by standing in the footprints of giants
>>> _______________________________________________
>>> Haskell-Cafe mailing list
>>> Haskell-Cafe@haskell.org
>>> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
> --
> Work is punishment for failing to procrastinate effectively.
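
To make the type class remark at the top a bit more concrete, here is a
minimal sketch of what I have in mind. Everything in it is hypothetical
and for illustration only: the names Utf8, Utf16, Encoded, pack and
sizeInBytes are made up, and the representations are plain lists of code
units so that it needs nothing beyond base. It is not how Data.Text
works. It also does not reject lone surrogates.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word16, Word8)

-- Hypothetical representations: lists of code units, chosen only to
-- keep the sketch dependency-free (a real library would use arrays).
newtype Utf8  = Utf8  [Word8]  deriving (Eq, Show)
newtype Utf16 = Utf16 [Word16] deriving (Eq, Show)

-- Hypothetical class: every encoding offers the same API, and only the
-- performance and space characteristics differ between instances.
class Encoded t where
  pack        :: String -> t
  sizeInBytes :: t -> Int

instance Encoded Utf8 where
  pack = Utf8 . concatMap encodeChar8
  sizeInBytes (Utf8 ws) = length ws

instance Encoded Utf16 where
  pack = Utf16 . concatMap encodeChar16
  sizeInBytes (Utf16 ws) = 2 * length ws

-- Encode one code point as 1 to 4 UTF-8 bytes.
encodeChar8 :: Char -> [Word8]
encodeChar8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    hi s   = fromIntegral (n `shiftR` s)
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Encode one code point as 1 or 2 UTF-16 code units (surrogate pair
-- for code points above U+FFFF).
encodeChar16 :: Char -> [Word16]
encodeChar16 c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ 0xD800 .|. fromIntegral (m `shiftR` 10)
                  , 0xDC00 .|. fromIntegral (m .&. 0x3FF) ]
  where
    n = ord c
    m = n - 0x10000
```

With this, sizeInBytes (pack "hello" :: Utf8) is 5 and the Utf16
version is 10, while a three-ideogram CJK string comes out at 9 bytes
versus 6. The difference from the Either UTF8 UTF16 idea is that the
representation is picked statically by the type, so no conversion can
be triggered behind your back.
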
--
Work is punishment for failing to procrastinate effectively.
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe