Well, I'm not certain if it counts as a typical Chinese website, but here are the stats:

  UTF8:  64,198
  UTF16: 113,160

And just for fun, after gzipping:

  UTF8:  17,708
  UTF16: 19,367
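In case anyone wants to repeat this on other pages, here is a minimal sketch of one way to do the measurement, using the text, bytestring and zlib packages. The file name "page.html" is just a placeholder for a locally saved copy of the page, assumed to be UTF-8 on disk:

    import qualified Codec.Compression.GZip as GZip
    import qualified Data.ByteString as B
    import qualified Data.ByteString.Lazy as BL
    import qualified Data.Text.Encoding as TE
    import Text.Printf (printf)

    main :: IO ()
    main = do
        -- Placeholder: a locally saved copy of the page, UTF-8 on disk.
        bytes <- B.readFile "page.html"
        let txt   = TE.decodeUtf8 bytes
            utf8  = TE.encodeUtf8 txt
            utf16 = TE.encodeUtf16LE txt
            -- Size after gzip, for comparing on-the-wire costs.
            gzLen b = BL.length (GZip.compress (BL.fromChunks [b]))
        printf "UTF8:  %d\n" (B.length utf8)
        printf "UTF16: %d\n" (B.length utf16)
        printf "gzipped UTF8:  %d\n" (gzLen utf8)
        printf "gzipped UTF16: %d\n" (gzLen utf16)

Note how gzip mostly erases the difference in the numbers above: the extra zero bytes UTF-16 spends on ASCII-heavy markup compress extremely well.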
On Wed, Aug 18, 2010 at 2:59 AM, anderson leo <fireman...@gmail.com> wrote:
> Hi Michael, here is a web site: http://zh.wikipedia.org/zh-cn/. It is
> the Chinese Wikipedia.
>
> -Andrew
>
> On Tue, Aug 17, 2010 at 7:00 PM, Michael Snoyman <mich...@snoyman.com> wrote:
>> On Tue, Aug 17, 2010 at 1:50 PM, Yitzchak Gale <g...@sefer.org> wrote:
>>> Ketil Malde wrote:
>>> > I haven't benchmarked it, but I'm fairly sure that, if you try to
>>> > fit a 3 Gbyte file (the human genome, say¹) into a computer with
>>> > 4 Gbytes of RAM, UTF-16 will be slower than UTF-8...
>>>
>>> I don't think the genome is typical text, and I doubt that is true
>>> if the text is in a CJK language.
>>>
>>> > I think that *IF* we are aiming for a single, grand, unified text
>>> > library to Rule Them All, it needs to use UTF-8.
>>>
>>> Given the growth rate of China's economy, if CJK isn't already the
>>> majority of text being processed in the world, it will be soon. I
>>> have seen media reports claiming CJK is now a majority of text data
>>> going over the wire on the web, though I haven't seen anything
>>> scientific backing up those claims. It certainly seems reasonable.
>>> I believe Google's measurements based on their own web index,
>>> showing wide adoption of UTF-8, are very badly skewed due to a
>>> strong Western bias.
>>>
>>> In that case, if we have to pick one encoding for Data.Text, UTF-16
>>> is likely to be a better choice than UTF-8, especially if the cost
>>> is fairly low even for the special case of Western languages. Also,
>>> UTF-16 has become by far the dominant internal text format for most
>>> software and for most user platforms. Except on desktop Linux - and
>>> whether we like it or not, Linux desktops will remain a tiny
>>> minority for the foreseeable future.
>>
>> I think you are conflating two points here, and ignoring some
>> important data. Regarding the data: you haven't actually quoted any
>> statistics about the prevalence of CJK data, but even if the majority
>> of web pages served are in those three languages, a fairly high
>> percentage of the content will *still* be ASCII, due simply to the
>> HTML, CSS and JavaScript overhead. I'd hate to make up statistics on
>> the spot, especially when I don't have any numbers from you to
>> compare them with.
>>
>> As for the conflation, there are two questions with regard to the
>> choice of encoding: encoding/decoding time and space usage. I don't
>> think *anyone* is asserting that UTF-16 is a common encoding for
>> files anywhere, so by using UTF-16 we are simply incurring an
>> overhead in every case. We can't consider a CJK encoding for text, so
>> its prevalence is irrelevant to this topic. What *is* relevant is
>> that a very large percentage of web pages *are*, in fact,
>> standardizing on UTF-8, and that all 7-bit text files are by default
>> valid UTF-8.
>>
>> As for space usage, you are correct that CJK data will take up more
>> memory in UTF-8 than in UTF-16. The question still remains whether
>> the overall document size will be larger: I'd be interested in taking
>> a random sampling of CJK-encoded pages and comparing their UTF-8 and
>> UTF-16 file sizes. I think simply talking about this in the absence
>> of data is pointless. If anyone can recommend a representative CJK
>> website (or a few), I'll do the test myself.
>>
>> Michael
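Incidentally, Michael's point about markup overhead is easy to put a number on for any given page. A quick sketch along the same lines (again with "page.html" as a placeholder for a saved copy) that reports what fraction of a document's code points are ASCII, i.e. one byte in UTF-8 versus two in UTF-16:

    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE
    import Text.Printf (printf)

    main :: IO ()
    main = do
        txt <- fmap TE.decodeUtf8 (B.readFile "page.html")
        let total = T.length txt
            -- Code points below 128 take one byte in UTF-8, two in UTF-16.
            ascii = T.length (T.filter (< '\128') txt)
            ratio = fromIntegral ascii / fromIntegral total :: Double
        printf "ASCII: %d of %d code points (%.1f%%)\n" ascii total (100 * ratio)

The higher that percentage, the more the HTML/CSS/JavaScript scaffolding dominates and the less the CJK body text tips the balance towards UTF-16.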