Re: [Haskell-cafe] Re: String vs ByteString
John Millikin <jmilli...@gmail.com> writes:

> The reason many Japanese and Chinese users reject UTF-8 isn't due to space constraints (UTF-8 and UTF-16 are roughly equal), it's because they reject Unicode itself.

Probably because they don't think it's complicated enough¹?

> Shift-JIS and the various Chinese encodings both contain Han characters which are missing from Unicode, either due to Han unification or because they simply were not considered important enough to include.

Surely there's enough space left? I seem to remember some Han characters outside of the BMP, so I would have guessed this is an argument from back in the UCS-2 days. (BTW, on a long train ride, I brought the Linear B alphabet and practiced writing notes to my kids. So Linear B isn't entirely useless :-)

From casual browsing of Wikipedia, the current status in CJK-land seems to be something like this:

  China: GB2312 and its successor GB18030
  Taiwan, Macao, and Hong Kong: Big5
  Japan: Shift-JIS
  Korea: EUC-KR

It is interesting that some of these provide far fewer characters than Unicode. Another feature of several of them is that ASCII and e.g. kana scripts take up one byte while ideograms take up two, which correlates with the expected width of the glyphs. Several of the pages indicate that Unicode, mainly as UTF-8, is gradually taking over.

-k

¹ Those who remember Emacs in the MULE days will know what I mean.

-- 
If I haven't seen further, it is by standing in the footprints of giants

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Jinjing Wang wrote:
>> John Millikin wrote:
>>> The reason many Japanese and Chinese users reject UTF-8 isn't due to space constraints (UTF-8 and UTF-16 are roughly equal), it's because they reject Unicode itself.
>>
>> +1. This is the thing Unicode advocates don't want to admit. Until Unicode has code points for _all_ Chinese and Japanese characters, there will be active resistance to adoption.
>
> [...]
>
> However, many of the popular websites started during web 2.0 are adopting utf-8, for example:
>
> * renren.com (China's largest Facebook clone)
> * www.kaixin001.com (China's second largest Facebook clone)
> * t.sina.com.cn (an example of a Twitter clone)
>
> These websites adopted utf-8 because (I think) most web development tools have already standardized on utf-8, and there's little reason to change it.

Interesting. I don't know much about the politics of Chinese encodings, other than that the GB formats are/were dominant.

As for the politics of Japanese encodings, the last time I did web work (just at the beginning of web 2.0, before they started calling it that) there was still a lot of active resistance among the Japanese. Given some of the characters folks were complaining about, I think it's more an issue of principle than practicality. Then again, the Japanese do love their language games, so obscure and archaic characters are used far more often than would be expected... Whether web 2.0 has caused the Japanese to change too, I can't say. I got out of that line of work ^_^

> I'm not aware of any (at least common) chinese characters that can be represented by gb2312 but not in unicode, since the range of gb2312 is a subset of the range of gbk, which is a subset of the range of gb18030, and gb18030 is just another encoding of unicode.

All the specific characters I've seen folks complain about were very uncommon or even archaic. All the common characters are there for Japanese too. The only time I've run into an issue, it was for an archaic character used in a manga title. I was working on a library catalog, and was too pedantic to spell it wrong.

-- 
Live well,
~wren
Re: [Haskell-cafe] Re: String vs ByteString
Alright, here's the results for the first three in the list (please forgive me for being lazy; I am a Haskell programmer, after all):

  ifeng.com:    UTF-8: 299,949   UTF-16: 566,610
  dzh.mop.com:  GBK: 1,866   UTF-8: 1,891   UTF-16: 3,684
  www.csdn.net: UTF-8: 122,870   UTF-16: 217,420

Seems like UTF-8 is a consistent winner versus UTF-16, and not much of a loser to the native formats.

Michael

On Wed, Aug 18, 2010 at 11:01 AM, anderson leo <fireman...@gmail.com> wrote:
> More typical Chinese web sites:
> www.ifeng.com (web site like nytimes)
> dzh.mop.com (community for fun)
> www.csdn.net (web site for IT)
> www.sohu.com (web site like yahoo)
> www.sina.com (web site like yahoo)
> -- Andrew
>
> On Wed, Aug 18, 2010 at 11:40 AM, Michael Snoyman <mich...@snoyman.com> wrote:
>> Well, I'm not certain if it counts as a typical Chinese website, but here are the stats:
>>   UTF-8: 64,198   UTF-16: 113,160
>> And just for fun, after gzipping:
>>   UTF-8: 17,708   UTF-16: 19,367
>>
>> On Wed, Aug 18, 2010 at 2:59 AM, anderson leo <fireman...@gmail.com> wrote:
>>> Hi Michael, here is a web site: http://zh.wikipedia.org/zh-cn/. It is the Wikipedia for Chinese.
>>> -Andrew
>>>
>>> On Tue, Aug 17, 2010 at 7:00 PM, Michael Snoyman <mich...@snoyman.com> wrote:
>>>> On Tue, Aug 17, 2010 at 1:50 PM, Yitzchak Gale <g...@sefer.org> wrote:
>>>>> Ketil Malde wrote:
>>>>>> I haven't benchmarked it, but I'm fairly sure that, if you try to fit a 3-gigabyte file (the human genome, say¹) into a computer with 4 gigabytes of RAM, UTF-16 will be slower than UTF-8...
>>>>>
>>>>> I don't think the genome is typical text, and I doubt that is true if the text is in a CJK language. I think that *IF* we are aiming for a single, grand, unified text library to Rule Them All, it needs to use UTF-16. Given the growth rate of China's economy, if CJK isn't already the majority of text being processed in the world, it will be soon. I have seen media reports claiming CJK is now a majority of text data going over the wire on the web, though I haven't seen anything scientific backing up those claims. It certainly seems reasonable.
>>>>>
>>>>> I believe Google's measurements based on their own web index, showing wide adoption of UTF-8, are very badly skewed due to a strong Western bias. In that case, if we have to pick one encoding for Data.Text, UTF-16 is likely to be a better choice than UTF-8, especially if the cost is fairly low even for the special case of Western languages. Also, UTF-16 has become by far the dominant internal text format for most software and for most user platforms. Except on desktop Linux, and whether we like it or not, Linux desktops will remain a tiny minority for the foreseeable future.
>>>>
>>>> I think you are conflating two points here, and ignoring some important data.
>>>>
>>>> Regarding the data: you haven't actually quoted any statistics about the prevalence of CJK data, but even if the majority of web pages served are in those three languages, a fairly high percentage of the content will *still* be ASCII, due simply to the HTML, CSS and Javascript overhead. I'd hate to make up statistics on the spot, especially when I don't have any numbers from you to compare them with.
>>>>
>>>> As for the conflation, there are two questions with regard to the encoding choice: encoding/decoding time and space usage. I don't think *anyone* is asserting that UTF-16 is a common encoding for files anywhere, so by using UTF-16 we are simply incurring an overhead in every case. We can't consider a CJK encoding for text, so its prevalence is irrelevant to this topic. What *is* relevant is that a very large percentage of web pages *are*, in fact, standardizing on UTF-8, and that all 7-bit text files are by default UTF-8.
>>>>
>>>> As for space usage, you are correct that CJK data will take up more memory in UTF-8 than in UTF-16. The question still remains whether the overall document size will be larger: I'd be interested in taking a random sampling of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I think simply talking about this in a vacuum of data is pointless. If anyone can recommend a CJK website which would be considered representative (or a few), I'll do the test myself.
>>>>
>>>> Michael
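The page-size comparison above can be reproduced for any sample with a few lines of Haskell using the text package's encoders. A minimal sketch (the sample strings here are illustrative placeholders, not the actual pages measured above):

```haskell
-- Sketch: byte counts of the same Text under UTF-8 and UTF-16,
-- the comparison performed above on whole pages.
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8, encodeUtf16LE)

-- (UTF-8 bytes, UTF-16 bytes) for a given Text
encodedSizes :: T.Text -> (Int, Int)
encodedSizes t = (B.length (encodeUtf8 t), B.length (encodeUtf16LE t))

main :: IO ()
main = do
  -- ASCII markup: UTF-8 uses half the bytes of UTF-16
  print (encodedSizes (T.pack "<html><body>hello</body></html>"))
  -- four CJK characters: 3 bytes each in UTF-8, 2 each in UTF-16
  print (encodedSizes (T.pack "\20320\22909\19990\30028"))
```

This also illustrates why markup-heavy pages lean toward UTF-8 overall even when the payload text is CJK: the tags are ASCII.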
Re: [Haskell-cafe] Re: String vs ByteString
On Wed, Aug 18, 2010 at 2:12 AM, John Meacham <j...@repetae.net> wrote:
> ranty thing to follow
>
> That said, there is never a reason to use UTF-16; it is a vestigial remnant from the brief period when it was thought 16 bits would be enough for the Unicode standard. Any defense of it nowadays is after-the-fact justification for having accidentally standardized on it back in the day.

This is false. Text uses UTF-16 internally as early benchmarks indicated that it was faster. See Tom Harper's response in the other thread that was spawned off this thread by Ketil.

Text continues to be UTF-16 today because

* no one has written a benchmark that shows that UTF-8 would be faster *for use in Data.Text*, and
* no one has written a patch that converts Text to use UTF-8 internally.

I'm quite frustrated by this whole discussion; there's lots of talking, no coding, and only a little benchmarking (of web sites, not code). This will get us nowhere.

Cheers,
Johan
Re: [Haskell-cafe] Re: String vs ByteString
Johan Tibell <johan.tib...@gmail.com> writes:

> Text continues to be UTF-16 today because
>
> * no one has written a benchmark that shows that UTF-8 would be faster *for use in Data.Text*, and
> * no one has written a patch that converts Text to use UTF-8 internally.
>
> I'm quite frustrated by this whole discussion; there's lots of talking, no coding, and only a little benchmarking (of web sites, not code). This will get us nowhere.

This was my impression as well. If someone desperately wants Text to use UTF-8 internally, why not help code such a change rather than just waving the suggestion around in the air?

-- 
Ivan Lazar Miljenovic
ivan.miljeno...@gmail.com
IvanMiljenovic.wordpress.com
Re: [Haskell-cafe] Re: String vs ByteString
On Wed, Aug 18, 2010 at 2:39 PM, Johan Tibell <johan.tib...@gmail.com> wrote:
> On Wed, Aug 18, 2010 at 2:12 AM, John Meacham <j...@repetae.net> wrote:
>> ranty thing to follow
>>
>> That said, there is never a reason to use UTF-16; it is a vestigial remnant from the brief period when it was thought 16 bits would be enough for the Unicode standard. Any defense of it nowadays is after-the-fact justification for having accidentally standardized on it back in the day.
>
> This is false. Text uses UTF-16 internally as early benchmarks indicated that it was faster. See Tom Harper's response in the other thread that was spawned off this thread by Ketil.
>
> Text continues to be UTF-16 today because
>
> * no one has written a benchmark that shows that UTF-8 would be faster *for use in Data.Text*, and
> * no one has written a patch that converts Text to use UTF-8 internally.
>
> I'm quite frustrated by this whole discussion; there's lots of talking, no coding, and only a little benchmarking (of web sites, not code). This will get us nowhere.

Here's my response to the two points:

* I haven't written a patch showing that Data.Text would be faster using UTF-8 because that would require fulfilling the second point (which I'll get to in a second). I *have* shown where there are huge performance differences between text and ByteString/String. Unfortunately, the response has been "don't use bytestring, it's the wrong datatype; text will get fixed", which is quite underwhelming.

* Since the prevailing attitude has been such a disregard for the facts shown thus far, it seems that the effort required to learn the internals of the text package and attempt a patch would be wasted. In the meanwhile, Jasper has released blaze-builder, which does an amazing job of producing UTF-8 encoded data, which for the moment is my main need. As much as I'll be chastised by the community, I'll stick with this approach for the moment.

Now if you tell me that text would consider applying a UTF-8 patch, that would be a different story. But I don't have the time to maintain a separate UTF-8 version of text. For me, the whole point of this discussion was to determine whether we should attempt porting to UTF-8, which as I understand it would be a rather large undertaking.

Michael
Re: [Haskell-cafe] Re: String vs ByteString
On 18 August 2010 15:04, Michael Snoyman <mich...@snoyman.com> wrote:
> For me, the whole point of this discussion was to determine whether we should attempt porting to UTF-8, which as I understand it would be a rather large undertaking.

And the answer to that is: yes, but only if we have good reason to believe it will actually be faster, and that's where we're most interested in benchmarks rather than hand-waving. As Johan and others have said, the original choice to use UTF-16 was based on benchmarks showing it was faster (than UTF-8 or UTF-32). So if we want to counter that, then we need either to argue that these were the wrong choice of benchmarks and do not reflect real usage, or that with better implementations the balance would shift.

Now there is an interesting argument to be made that we spend more time shovelling strings about than actually processing them in any interesting way, and therefore that we should pick benchmarks that reflect that. This would then shift the balance in favour of an internal representation identical to some particular popular external representation, even if that internal representation is slower for many processing tasks.

Duncan
Re: [Haskell-cafe] Re: String vs ByteString
Hi Michael,

On Wed, Aug 18, 2010 at 4:04 PM, Michael Snoyman <mich...@snoyman.com> wrote:
> Here's my response to the two points:
>
> * I haven't written a patch showing that Data.Text would be faster using UTF-8 because that would require fulfilling the second point (which I'll get to in a second). I *have* shown where there are huge performance differences between text and ByteString/String. Unfortunately, the response has been "don't use bytestring, it's the wrong datatype; text will get fixed", which is quite underwhelming.

I went through all the emails you sent under the subjects "String vs ByteString" and "Re: String vs ByteString" and I can't find a single benchmark. I do agree with you that

* UTF-8 is more compact than UTF-16, and
* UTF-8 is by far the most used encoding on the web,

and that establishes a reasonable *theoretical* argument for why switching to UTF-8 might be faster. What I'm looking for is a program that shows a big difference so we can validate the hypothesis. As Duncan mentioned, we already ran some benchmarks early on that showed the opposite. Someone posted a benchmark earlier in this thread and Bryan addressed the issue raised by that poster. We want more of those.

> * Since the prevailing attitude has been such a disregard for the facts shown thus far, it seems that the effort required to learn the internals of the text package and attempt a patch would be wasted. In the meanwhile, Jasper has released blaze-builder, which does an amazing job of producing UTF-8 encoded data, which for the moment is my main need. As much as I'll be chastised by the community, I'll stick with this approach for the moment.

I'm not sure this discussion has surfaced that many facts. What we do have is plenty of theories. I can easily add some more:

* GHC is not doing a good job laying out the branches in the validation code that does arithmetic on the input byte sequence, to validate the input and compute the Unicode code point that should be streamed using fusion.
* The differences between text's and bytestring's fusion frameworks get in the way of some optimizations in GHC (text uses a more sophisticated fusion framework that handles some cases bytestring's can't, according to Bryan).
* Lingering space leaks are hurting performance (Bryan plugged one already).
* The use of a polymorphic loop state in the fusion framework gets in the way of unboxing.
* Extraneous copying in the Handle implementation slows down I/O.

All of these are plausible reasons why Text might perform worse than ByteString. We need to find out which ones are true by benchmarking and looking at the generated Core.

> Now if you tell me that text would consider applying a UTF-8 patch, that would be a different story. But I don't have the time to maintain a separate UTF-8 version of text. For me, the whole point of this discussion was to determine whether we should attempt porting to UTF-8, which as I understand it would be a rather large undertaking.

I don't see any reason why Bryan wouldn't accept a UTF-8 patch if it was faster on some set of benchmarks (starting with the ones already in the library) that we agree on.

Cheers,
Johan
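The kind of self-contained benchmark being asked for can start as small as this. A crude sketch, assuming nothing beyond base, bytestring and text (a real comparison should use criterion for statistically sound numbers; the input here is synthetic and only illustrates the shape):

```haskell
-- Crude timing sketch: the same word-splitting operation over String,
-- ByteString and Text, timed with System.CPUTime.  Not a substitute
-- for criterion; it only shows what such a micro-benchmark looks like.
import Control.Exception (evaluate)
import System.CPUTime (getCPUTime)
import qualified Data.ByteString.Char8 as B
import qualified Data.Text as T

-- time an action and report elapsed CPU time in microseconds
time :: String -> IO a -> IO ()
time label act = do
  t0 <- getCPUTime
  _  <- act
  t1 <- getCPUTime
  putStrLn (label ++ ": " ++ show ((t1 - t0) `div` 1000000) ++ " us")

main :: IO ()
main = do
  let s = concat (replicate 10000 "the quick brown fox ")
  -- forcing the length forces the whole split
  time "String"     (evaluate (length (words s)))
  time "ByteString" (evaluate (length (B.words (B.pack s))))
  time "Text"       (evaluate (length (T.words (T.pack s))))
```

Boiling a real workload down to something of this shape, and then looking at the generated Core for the slow case, is exactly the benchmark-then-inspect loop described above.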
Re: [Haskell-cafe] Re: String vs ByteString
On Wed, Aug 18, 2010 at 4:12 AM, wren ng thornton <w...@freegeek.org> wrote:
> There was a study recently on this. They found that there are four main parts of the Internet:
>
> * a densely connected core, where from any site you can get to any other;
> * an "in cone", from which you can reach the core (but not other in-cone members, since then you'd both be in the core);
> * an "out cone", which can be reached from the core (but whose members cannot reach each other);
> * and unconnected islands.
>
> The surprising part is they found that all four parts are approximately the same size. I forget the exact numbers, but they're all 25 +/- 5%. This implies that an exhaustive crawl of the web would require having about 50% of all websites as seeds (the in-cone plus the islands). If we're only interested in a representative sample, then we could get by with fewer. However, that depends a lot on the definition of "representative". And we can't have an accurate definition of "representative" without doing the entire crawl at some point in order to discover the appropriate distributions. Then again, distributions change over time... Thus, I would guess that Google only has 50~75% of the net: the core, the out-cone, and a fraction of the islands and in-cone.

That's an interesting result. However, if you weight each page by its page views, you'll probably find that Google (and other search engines) cover much more than that, since page views on sites tend to follow a power-law distribution.

-- Johan
Re: [Haskell-cafe] Re: String vs ByteString
On Wed, Aug 18, 2010 at 6:24 PM, Johan Tibell <johan.tib...@gmail.com> wrote:
> Hi Michael,
>
> I went through all the emails you sent under the subjects "String vs ByteString" and "Re: String vs ByteString" and I can't find a single benchmark. I do agree with you that
>
> * UTF-8 is more compact than UTF-16, and
> * UTF-8 is by far the most used encoding on the web,
>
> and that establishes a reasonable *theoretical* argument for why switching to UTF-8 might be faster. What I'm looking for is a program that shows a big difference so we can validate the hypothesis. As Duncan mentioned, we already ran some benchmarks early on that showed the opposite. Someone posted a benchmark earlier in this thread and Bryan addressed the issue raised by that poster. We want more of those.

Sorry, I thought I'd sent these out. While working on optimizing Hamlet I started playing around with the BigTable benchmark. I wrote two blog posts on the topic:

http://www.snoyman.com/blog/entry/bigtable-benchmarks/
http://www.snoyman.com/blog/entry/optimizing-hamlet/

Originally, Hamlet had been based on the text package; the huge slow-down introduced by text convinced me to migrate to bytestrings, and ultimately to blaze-html/blaze-builder. It could be that these were flaws in text that are correctable and have nothing to do with UTF-16; however, it will be difficult to produce a benchmark purely on the UTF-8/UTF-16 divide. Using UTF-16 bytestrings would probably overstate the impact, since it wouldn't be using Bryan's fusion logic.

> I'm not sure this discussion has surfaced that many facts. What we do have is plenty of theories. I can easily add some more:
>
> * GHC is not doing a good job laying out the branches in the validation code that does arithmetic on the input byte sequence, to validate the input and compute the Unicode code point that should be streamed using fusion.
> * The differences between text's and bytestring's fusion frameworks get in the way of some optimizations in GHC (text uses a more sophisticated fusion framework that handles some cases bytestring's can't, according to Bryan).
> * Lingering space leaks are hurting performance (Bryan plugged one already).
> * The use of a polymorphic loop state in the fusion framework gets in the way of unboxing.
> * Extraneous copying in the Handle implementation slows down I/O.
>
> All of these are plausible reasons why Text might perform worse than ByteString. We need to find out which ones are true by benchmarking and looking at the generated Core.
>
> I don't see any reason why Bryan wouldn't accept a UTF-8 patch if it was faster on some set of benchmarks (starting with the ones already in the library) that we agree on.

I think that's the main issue, and one that Duncan nailed on the head: we have to think about what the important benchmarks are. For Hamlet, I need fast UTF-8 bytestring generation. I don't care at all about algorithmic speed for splitting texts, as an example. My (probably uneducated) guess is that UTF-16 tends to perform many operations in memory faster, since almost all characters are represented in 16 bits, while the big benefits of UTF-8 are in reading UTF-8 data, rendering UTF-8 data and decreased memory usage. But as I said, that's an (uneducated) guess.

Some people have been floating the idea of multiple text packages. I personally would *not* want to go down that road, but it might be the only approach that allows top performance for all use cases. As is, I'm quite happy using blaze-builder for Hamlet.

Michael
Re: [Haskell-cafe] Re: String vs ByteString
On Wed, Aug 18, 2010 at 10:12 AM, Michael Snoyman <mich...@snoyman.com> wrote:
> While working on optimizing Hamlet I started playing around with the BigTable benchmark. I wrote two blog posts on the topic:
>
> http://www.snoyman.com/blog/entry/bigtable-benchmarks/
> http://www.snoyman.com/blog/entry/optimizing-hamlet/

Even though your benchmark didn't explicitly come up in this thread, Johan and I spent some time improving the performance of Text for it. As a result, in darcs HEAD, Text is faster than String, but slower than ByteString. I'd certainly like to close that gap more aggressively. If the other contributors to this thread took just one minute to craft a benchmark they cared about for every ten minutes they spend producing hot air, we'd be a lot better off.

> It could be that these were flaws in text that are correctable and have nothing to do with UTF-16;

Since the internal representation used by text is completely opaque, we could of course change it if necessary, with no user-visible consequences. I've yet to see any data that suggests it's specifically UTF-16 that is related to any performance shortfalls, however.

> Some people have been floating the idea of multiple text packages. I personally would *not* want to go down that road, but it might be the only approach that allows top performance for all use cases.

I'd be surprised if that proves necessary.
Re: [Haskell-cafe] Re: String vs ByteString
On Wed, Aug 18, 2010 at 7:12 PM, Michael Snoyman <mich...@snoyman.com> wrote:
> Sorry, I thought I'd sent these out. While working on optimizing Hamlet I started playing around with the BigTable benchmark. I wrote two blog posts on the topic:
>
> http://www.snoyman.com/blog/entry/bigtable-benchmarks/
> http://www.snoyman.com/blog/entry/optimizing-hamlet/
>
> Originally, Hamlet had been based on the text package; the huge slow-down introduced by text convinced me to migrate to bytestrings, and ultimately to blaze-html/blaze-builder. It could be that these were flaws in text that are correctable and have nothing to do with UTF-16; however, it will be difficult to produce a benchmark purely on the UTF-8/UTF-16 divide. Using UTF-16 bytestrings would probably overstate the impact, since it wouldn't be using Bryan's fusion logic.

Those are great. As Bryan mentioned, we've already improved performance and I think I know how to improve it further. I appreciate that it's difficult to show the UTF-8/UTF-16 divide. The approach we're trying at the moment is looking at benchmarks, improving performance, and repeating until we can't improve any more. It could be the case that we get a benchmark where the performance difference between bytestring and text cannot be explained or fixed by factors other than changing the internal encoding. That would be strong evidence that we should try to switch the internal encoding. We haven't seen any such benchmarks yet.

As for blaze, I'm not sure exactly how it deals with UTF-8 input. I tried to browse through the repo but couldn't find where input ByteStrings are actually validated. If they're not, it's a bit generous to say that it deals with UTF-8 data, as it would really just be concatenating byte sequences without validating them. We should ask Jasper about the current state.

> I think that's the main issue, and one that Duncan nailed on the head: we have to think about what the important benchmarks are. For Hamlet, I need fast UTF-8 bytestring generation. I don't care at all about algorithmic speed for splitting texts, as an example. My (probably uneducated) guess is that UTF-16 tends to perform many operations in memory faster, since almost all characters are represented in 16 bits, while the big benefits of UTF-8 are in reading UTF-8 data, rendering UTF-8 data and decreased memory usage. But as I said, that's an (uneducated) guess.

I agree. Let's create some more benchmarks. For example, lately I've been working on a benchmark, inspired by a real-world problem, where I iterate over the lines in a ~500 MB file, encoded in UTF-8, inserting each line into a Data.Map and doing a bunch of further processing on it (such as splitting the strings into words). This tests text I/O throughput, memory overhead, performance of string comparison, etc. We already have benchmarks for reading files (in UTF-8) in several different ways (lazy I/O and iteratee-style folds). Boil down the things you care about into a self-contained benchmark and send it to this list or put it somewhere where we can retrieve it.

Cheers,
Johan
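The line-counting benchmark described above boils down to something like the following sketch. The Data.Map.Strict module is an assumption (it needs a recent containers package), and the sample input here is inline; a real run would read a large file with Data.Text.IO.readFile instead:

```haskell
-- Sketch of the described benchmark: fold the lines of a UTF-8 text
-- into a Data.Map of occurrence counts, then split lines into words.
-- Data.Map.Strict assumes a recent containers package.
import qualified Data.Map.Strict as M
import qualified Data.Text as T

-- count how many times each distinct line occurs
lineCounts :: T.Text -> M.Map T.Text Int
lineCounts = foldr (\l m -> M.insertWith (+) l 1 m) M.empty . T.lines

-- "further processing": number of words on a line
totalWords :: T.Text -> Int
totalWords = length . T.words

main :: IO ()
main = do
  -- a real benchmark would do: contents <- Data.Text.IO.readFile path
  let contents = T.pack "to be\nor not\nto be"
      m = lineCounts contents
  print (M.size m)                        -- distinct lines
  print (sum (map totalWords (M.keys m))) -- words across distinct lines
```

Swapping T.Text for String or ByteString (with Data.ByteString.Char8.lines/words) gives the three variants to compare, which exercises exactly the I/O throughput, comparison and splitting costs mentioned.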
Re: [Haskell-cafe] Re: String vs ByteString
On Wed, Aug 18, 2010 at 11:58 PM, Johan Tibell <johan.tib...@gmail.com> wrote:
> As for blaze, I'm not sure exactly how it deals with UTF-8 input. I tried to browse through the repo but couldn't find where input ByteStrings are actually validated. If they're not, it's a bit generous to say that it deals with UTF-8 data, as it would really just be concatenating byte sequences without validating them. We should ask Jasper about the current state.

As far as I can tell, Blaze *never* validates input ByteStrings. The proper approach to inserting data into blaze is either via String or Text. I requested that Jasper provide an unsafeByteString function in Blaze for Hamlet's usage: Hamlet does the UTF-8 encoding at compile time and is able to gain a little extra performance boost. If you want to properly validate bytestrings before inputting them, I believe the best approach would be to use utf8-string or text to read in the bytestrings, but Jasper may have a better approach.

Michael
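The validation step under discussion is a one-liner with recent versions of the text package: decodeUtf8' reports malformed input as a Left instead of throwing. A minimal sketch:

```haskell
-- Sketch: validate a ByteString as UTF-8 before trusting it, using
-- text's pure decoder (decodeUtf8' returns Left on malformed input).
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8')

validUtf8 :: B.ByteString -> Maybe T.Text
validUtf8 = either (const Nothing) Just . decodeUtf8'

main :: IO ()
main = do
  print (validUtf8 (B.pack [104, 105]))    -- "hi": well-formed ASCII
  print (validUtf8 (B.pack [0xC0, 0x20]))  -- lone lead byte: rejected
```

Running input through this before handing it to blaze would close the "concatenating unvalidated byte sequences" gap, at the cost of one decoding pass.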
Re: [Haskell-cafe] Re: String vs ByteString
Benedikt Huber <benj...@gmx.net> writes:

> Despite all of this, I think the performance of the text package is very promising, and hope it will improve further!

I agree, Data.Text is great. Unfortunately, its internal use of UTF-16 makes it inefficient for many purposes. A large fraction, probably most, of textual data isn't natural-language text, but data formatted in textual form, and these formats are typically restricted to ASCII (except for a few text fields).

For instance, a typical project for me might be 10-100 GB of data, mostly in various text formats, with real text only making up a few percent of this. The combined (all languages) Wikipedia is 2G words, probably less than 20 GB.

Being agnostic about string encoding, viz. treating it as bytes, works okay, but it would be nice to allow Unicode in the bits that actually are text, like string fields and labels and such. Due to the sizes involved, I think that in order to efficiently process text-formatted data, UTF-8 is the no-brainer choice for encoding: certainly in storage, but also for in-memory processing. Unfortunately, there is no clear Data.Text-like effort here. There's (at least):

  utf8-string        - UTF-8 encoded lazy and strict bytestrings, some other data types (and a common class), and System.Environment functionality
  utf8-light         - encoding/decoding to/from (strict?) bytestrings
  regex-tdfa-utf8    - regular expressions on UTF-8 encoded lazy bytestrings
  utf8-env           - a UTF-8 aware System.Environment
  uhexdump           - hex dumps for UTF-8 (?)
  compact-string     - support for many different string encodings
  compact-string-fix - indicates that the above is unmaintained

From a quick glance, it appears that utf8-string is the most complete and well maintained of the crowd, but I could be wrong. It'd be nice if an effort similar to the one Data.Text has seen could be applied to e.g. utf8-string, to produce a similarly efficient and effective library and allow the deprecation of the others. IMO, this could in time replace .Char8 as the default ByteString string representation.

Hackathon, anyone?

-k

-- 
If I haven't seen further, it is by standing in the footprints of giants
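For reference, the utf8-string interface under discussion looks roughly like this. A sketch, assuming the package's Data.ByteString.UTF8 module, whose length counts code points rather than bytes (unlike .Char8, which conflates the two):

```haskell
-- Sketch: a strict ByteString treated as UTF-8 text via utf8-string.
-- Assumes the utf8-string package's Data.ByteString.UTF8 module.
import qualified Data.ByteString as B
import qualified Data.ByteString.UTF8 as U

main :: IO ()
main = do
  let bs = U.fromString "na\239ve"  -- "naive" with an i-umlaut: 2 bytes in UTF-8
  print (B.length bs)               -- byte length
  print (U.length bs)               -- code-point length (one less)
```

This byte-vs-code-point distinction is precisely what a Data.Text-quality UTF-8 library would need to get right (and fast) throughout its API.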
Re: [Haskell-cafe] Re: String vs ByteString
On Tue, Aug 17, 2010 at 10:08 AM, Ketil Malde ke...@malde.org wrote: Benedikt Huber benj...@gmx.net writes: Despite all this, I think the performance of the text package is very promising, and hope it will improve further! I agree, Data.Text is great. Unfortunately, its internal use of UTF-16 makes it inefficient for many purposes. [..] From a quick glance, it appears that utf8-string is the most complete and well maintained of the crowd, but I could be wrong. It'd be nice if an effort similar to the one Data.Text has seen could be applied to e.g. utf8-string, to produce a similarly efficient and effective library and allow the deprecation of the others. IMO, this could in time replace .Char8 as the default ByteString string representation. Hackathon, anyone? Let me ask the question a different way: what are the motivations for having the text package use UTF-16 internally? I know that some system APIs in Windows use it (at least, I think they do), and perhaps it's more efficient for certain types of processing, but overall do those benefits outweigh all of the reasons for UTF-8 pointed out in this thread? Michael ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
On Tue, Aug 17, 2010 at 9:08 AM, Ketil Malde ke...@malde.org wrote: Benedikt Huber benj...@gmx.net writes: Despite all this, I think the performance of the text package is very promising, and hope it will improve further! I agree, Data.Text is great. Unfortunately, its internal use of UTF-16 makes it inefficient for many purposes. It's not clear to me that using UTF-16 internally does make Data.Text noticeably slower. If we could get conclusive evidence that using UTF-16 hurts performance, we could look into changing the internal representation (a major undertaking). What Bryan and I need are benchmarks showing where Data.Text is performing poorly, compared to String or ByteString, so we can investigate the cause(s). Hypotheses are a good starting point for performance improvements, but they're not enough. We need benchmarks and people looking at profiling and compiler output to really understand what's going on. For example, how many know that the Handle implementation copies the input first into a mutable buffer and then into a Text value, for reads smaller than the buffer size (8k if I remember correctly)? One of these copies could be avoided. How do we know that it's the use of UTF-16 that's our current performance bottleneck and not this extra copy? We need to benchmark, change the code, and then benchmark again. Perhaps the outcome of all the benchmarking and investigation is indeed that UTF-16 is a problem; then we can change the internal encoding. But there are other possibilities, like poorly laid out branches in the generated code. We need to understand what's going on if we are to make progress. A large fraction - probably most - of textual data isn't natural language text, but data formatted in textual form, and these formats are typically restricted to ASCII (except for a few text fields). For instance, a typical project for me might be 10-100GB of data, mostly in various text formats, real text only making up a few percent of this. 
The combined (all languages) Wikipedia is 2G words, probably less than 20GB. I think this is an important observation. Cheers, Johan ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
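Johan's point that we should measure rather than guess is easy to act on. The criterion package is the usual tool for this; purely as a hedged illustration of the comparison's shape, here is a crude harness using only base and the bytestring boot package (timeIt and the workloads are illustrative names, not part of any API discussed in the thread):

```haskell
-- A crude timing sketch using only base and the bytestring boot package.
-- For serious numbers use criterion; this only shows the shape of the
-- String-vs-ByteString comparison Johan asks for.
import Control.Exception (evaluate)
import System.CPUTime (getCPUTime)
import qualified Data.ByteString.Char8 as B

timeIt :: String -> IO a -> IO a
timeIt label act = do
  t0 <- getCPUTime
  r  <- act
  t1 <- getCPUTime
  -- getCPUTime reports picoseconds
  putStrLn (label ++ ": " ++ show (fromIntegral (t1 - t0) / 1.0e9 :: Double) ++ " ms")
  return r

main :: IO ()
main = do
  let s = replicate 1000000 'x'      -- String: a list of Char
      b = B.replicate 1000000 'x'    -- ByteString: packed bytes
  n1 <- timeIt "String length"     (evaluate (length s))
  n2 <- timeIt "ByteString length" (evaluate (B.length b))
  print (n1 == n2)
```

Swapping the workloads for the operations one actually cares about (splitting, searching, decoding) is where the interesting differences would show up.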
Re: [Haskell-cafe] Re: String vs ByteString
I agree, Data.Text is great. Unfortunately, its internal use of UTF-16 makes it inefficient for many purposes. In the first iteration of the Text package, UTF-16 was chosen because it had a nice balance of arithmetic overhead and space. The arithmetic for UTF-8 started to have serious performance impacts in situations where the entire document was outside ASCII (i.e. a Russian or Arabic document), but UTF-16 was still relatively compact, compared to both the UTF-32 and String alternatives. This, however, obviously does not represent your use case. I don't know if your use case is the more common one (though it seems likely). The underlying principles of Text should work fine with UTF-8. It has changed a lot since its original writing (thanks to some excellent tuning and maintenance by bos), including some more efficient binary arithmetic. The situation may have changed with respect to the performance limitations of UTF-8, or there may be room for it and a UTF-16 version. Any takers for implementing a UTF-8 version and comparing the two? A large fraction - probably most - textual data isn't natural language text, but data formatted in textual form, and these formats are typically restricted to ASCII (except for a few text fields). For instance, a typical project for me might be 10-100GB of data, mostly in various text formats, real text only making up a few percent of this. The combined (all languages) Wikipedia is 2G words, probably less than 20GB. Being agnostic about string encoding - viz. treating it as bytes - works okay, but it would be nice to allow Unicode in the bits that actually are text, like string fields and labels and such. Is your point that ASCII characters take up the same amount of space (i.e. 16 bits) as higher code points? Do you have any comparisons that quantify how much this affects your ability to process text in real terms? Does it make it too slow? Infeasible memory-wise? 
___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
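Tom's point about the arithmetic overhead of UTF-8 is easiest to see by writing out one decode step for each encoding. The following is only a sketch over plain lists of code units, with no validation of malformed input, and is not how Data.Text is actually implemented:

```haskell
-- One decode step per encoding, over plain lists of code units.
-- Sketch only: no validation, and not Data.Text's real implementation.
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Word (Word16, Word8)

-- UTF-8: up to four length tests plus masks and shifts per code point.
utf8Next :: [Word8] -> Maybe (Int, [Word8])
utf8Next (b : bs)
  | b < 0x80 = Just (fromIntegral b, bs)
  | b < 0xE0, c1 : rest <- bs =
      Just ((fromIntegral (b .&. 0x1F) `shiftL` 6) .|. cont c1, rest)
  | b < 0xF0, c1 : c2 : rest <- bs =
      Just ((fromIntegral (b .&. 0x0F) `shiftL` 12)
            .|. (cont c1 `shiftL` 6) .|. cont c2, rest)
  | c1 : c2 : c3 : rest <- bs =
      Just ((fromIntegral (b .&. 0x07) `shiftL` 18)
            .|. (cont c1 `shiftL` 12) .|. (cont c2 `shiftL` 6) .|. cont c3, rest)
  where
    cont c = fromIntegral (c .&. 0x3F)
utf8Next _ = Nothing

-- UTF-16: a single range test to spot a surrogate pair.
utf16Next :: [Word16] -> Maybe (Int, [Word16])
utf16Next (u : us)
  | u < 0xD800 || u > 0xDBFF = Just (fromIntegral u, us)
  | lo : rest <- us =
      Just (0x10000 + ((fromIntegral (u .&. 0x3FF) `shiftL` 10)
                       .|. fromIntegral (lo .&. 0x3FF)), rest)
utf16Next _ = Nothing

main :: IO ()
main = do
  print (utf8Next [0xE4, 0xBD, 0xA0])  -- the three UTF-8 bytes of U+4F60
  print (utf16Next [0xD83D, 0xDE00])   -- the surrogate pair for U+1F600
```

For BMP-heavy text the UTF-16 path is a single compare per character, which is the balance Tom describes.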
Re: [Haskell-cafe] Re: String vs ByteString
Johan Tibell johan.tib...@gmail.com writes: It's not clear to me that using UTF-16 internally does make Data.Text noticeably slower. I haven't benchmarked it, but I'm fairly sure that, if you try to fit a 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of RAM, UTF-16 will be slower than UTF-8. Many applications will get away with streaming over data, retaining only a small part, but some won't. In other cases (e.g. processing CJK text, and perhaps also non-Latin1 text), I'm sure it'll be faster - but my (still unsubstantiated) guess is that the difference will be much smaller, and it'll be a case of winning some and losing some - and I'd also conjecture that having 3Gb of real text (i.e. natural language, as opposed to text-formatted data) is rare. I think that *IF* we are aiming for a single, grand, unified text library to Rule Them All, it needs to use UTF-8. Alternatively, we can have different libraries with different representations for different purposes, where you'll get another few percent of juice by switching to the most appropriate. Currently the latter approach looks to be in favor, so if we can't have one single library, let us at least aim for a set of libraries with consistent interfaces and optimal performance. Data.Text is great for UTF-16, and I'd like to have something similar for UTF-8. Is all I'm trying to say. -k -- If I haven't seen further, it is by standing in the footprints of giants ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Ketil Malde ke...@malde.org writes: Johan Tibell johan.tib...@gmail.com writes: It's not clear to me that using UTF-16 internally does make Data.Text noticeably slower. I haven't benchmarked it, but I'm fairly sure that, if you try to fit a 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of RAM, UTF-16 will be slower than UTF-8. Many applications will get away with streaming over data, retaining only a small part, but some won't. Seeing as how the genome just uses 4 base letters, wouldn't it be better to not treat it as text but use something else? Or do you just mean storage-wise to be able to be read in a text editor, etc. as well (in case someone is trying to do their mad genetic manipulation by hand)? -- Ivan Lazar Miljenovic ivan.miljeno...@gmail.com IvanMiljenovic.wordpress.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Ketil == Ketil Malde ke...@malde.org writes: Ketil Johan Tibell johan.tib...@gmail.com writes: It's not clear to me that using UTF-16 internally does make Data.Text noticeably slower. Ketil I think that *IF* we are aiming for a single, grand, unified Ketil text library to Rule Them All, it needs to use UTF-8. Ketil Alternatively, we can have different libraries with different Ketil representations for different purposes, where you'll get Ketil another few percent of juice by switching to the most Ketil appropriate. Why not instead allow the programmer to decide at the function level which internal encoding to use? -- Colin Adams Preston Lancashire () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org - against proprietary attachments ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Ketil Malde wrote: I haven't benchmarked it, but I'm fairly sure that, if you try to fit a 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of RAM, UTF-16 will be slower than UTF-8... I don't think the genome is typical text. And I doubt that is true if that text is in a CJK language. I think that *IF* we are aiming for a single, grand, unified text library to Rule Them All, it needs to use UTF-8. Given the growth rate of China's economy, if CJK isn't already the majority of text being processed in the world, it will be soon. I have seen media reports claiming CJK is now a majority of text data going over the wire on the web, though I haven't seen anything scientific backing up those claims. It certainly seems reasonable. I believe Google's measurements based on their own web index showing wide adoption of UTF-8 are very badly skewed due to a strong Western bias. In that case, if we have to pick one encoding for Data.Text, UTF-16 is likely to be a better choice than UTF-8, especially if the cost is fairly low even for the special case of Western languages. Also, UTF-16 has become by far the dominant internal text format for most software and for most user platforms. Except on desktop Linux - and whether we like it or not, Linux desktops will remain a tiny minority for the foreseeable future. Alternatively, we can have different libraries with different representations for different purposes, where you'll get another few percent of juice by switching to the most appropriate. Currently the latter approach looks to be in favor, so if we can't have one single library, let us at least aim for a set of libraries with consistent interfaces and optimal performance. Data.Text is great for UTF-16, and I'd like to have something similar for UTF-8. Is all I'm trying to say. I agree. Thanks, Yitz ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Tom Harper rtomhar...@gmail.com writes: 2010/8/17 Bulat Ziganshin bulat.zigans...@gmail.com: Hello Tom, snip i don't understand what you mean. are you support all 2^20 codepoints in Data.Text package? Bulat, Yes, its internal representation is UTF-16, which is capable of encoding *any* valid Unicode codepoint. Just like Char is capable of encoding any valid Unicode codepoint. -- Ivan Lazar Miljenovic ivan.miljeno...@gmail.com IvanMiljenovic.wordpress.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Ivan Lazar Miljenovic wrote: Tom Harper rtomhar...@gmail.com writes: 2010/8/17 Bulat Ziganshin bulat.zigans...@gmail.com: Hello Tom, snip i don't understand what you mean. are you support all 2^20 codepoints in Data.Text package? Bulat, Yes, its internal representation is UTF-16, which is capable of encoding *any* valid Unicode codepoint. Just like Char is capable of encoding any valid Unicode codepoint. Char is not an encoding, right? ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Miguel Mitrofanov miguelim...@yandex.ru writes: Ivan Lazar Miljenovic wrote: Tom Harper rtomhar...@gmail.com writes: 2010/8/17 Bulat Ziganshin bulat.zigans...@gmail.com: Hello Tom, snip i don't understand what you mean. are you support all 2^20 codepoints in Data.Text package? Bulat, Yes, its internal representation is UTF-16, which is capable of encoding *any* valid Unicode codepoint. Just like Char is capable of encoding any valid Unicode codepoint. Char is not an encoding, right? No, but in GHC at least it corresponds to a Unicode codepoint. -- Ivan Lazar Miljenovic ivan.miljeno...@gmail.com IvanMiljenovic.wordpress.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
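Ivan's claim is easy to verify in GHC with nothing but base: Char spans the entire Unicode codespace, surrogate code points included.

```haskell
-- Quick check that GHC's Char covers the full Unicode codespace.
import Data.Char (chr, ord)

main :: IO ()
main = do
  print (ord (maxBound :: Char))  -- 1114111, i.e. 0x10FFFF, the top of the codespace
  print (ord (minBound :: Char))  -- 0
  print (chr 0xD800)              -- surrogate code points are representable as Char
```

Note that chr only rejects values outside 0..0x10FFFF; it happily constructs surrogates, which is relevant to the discussion further down the thread.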
Re: [Haskell-cafe] Re: String vs ByteString
On Tue, Aug 17, 2010 at 1:50 PM, Yitzchak Gale g...@sefer.org wrote: Ketil Malde wrote: I haven't benchmarked it, but I'm fairly sure that, if you try to fit a 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of RAM, UTF-16 will be slower than UTF-8... I don't think the genome is typical text. And I doubt that is true if that text is in a CJK language. I think that *IF* we are aiming for a single, grand, unified text library to Rule Them All, it needs to use UTF-8. Given the growth rate of China's economy, if CJK isn't already the majority of text being processed in the world, it will be soon. I have seen media reports claiming CJK is now a majority of text data going over the wire on the web, though I haven't seen anything scientific backing up those claims. It certainly seems reasonable. I believe Google's measurements based on their own web index showing wide adoption of UTF-8 are very badly skewed due to a strong Western bias. In that case, if we have to pick one encoding for Data.Text, UTF-16 is likely to be a better choice than UTF-8, especially if the cost is fairly low even for the special case of Western languages. Also, UTF-16 has become by far the dominant internal text format for most software and for most user platforms. Except on desktop Linux - and whether we like it or not, Linux desktops will remain a tiny minority for the foreseeable future. I think you are conflating two points here, and ignoring some important data. Regarding the data: you haven't actually quoted any statistics about the prevalence of CJK data, but even if the majority of web pages served are in those three languages, a fairly high percentage of the content will *still* be ASCII, due simply to the HTML, CSS and Javascript overhead. I'd hate to make up statistics on the spot, especially when I don't have any numbers from you to compare them with. 
As far as the conflation, there are two questions with regard to the encoding choice: encoding/decoding time and space usage. I don't think *anyone* is asserting that UTF-16 is a common encoding for files anywhere, so by using UTF-16 we are simply incurring an overhead in every case. We can't consider a CJK encoding for text, so its prevalence is irrelevant to this topic. What *is* relevant is that a very large percentage of web pages *are*, in fact, standardizing on UTF-8, and that all 7-bit text files are by default UTF-8. As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16. The question still remains whether the overall document size will be larger: I'd be interested in taking a random sampling of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I think simply talking about this in the vacuum of data is pointless. If anyone can recommend a CJK website which would be considered representative (or a few), I'll do the test myself. Michael ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
On Tue, Aug 17, 2010 at 12:54, Ivan Lazar Miljenovic ivan.miljeno...@gmail.com wrote: Tom Harper rtomhar...@gmail.com writes: 2010/8/17 Bulat Ziganshin bulat.zigans...@gmail.com: Hello Tom, snip i don't understand what you mean. are you support all 2^20 codepoints in Data.Text package? Bulat, Yes, its internal representation is UTF-16, which is capable of encoding *any* valid Unicode codepoint. Just like Char is capable of encoding any valid Unicode codepoint. Unless a Char in Haskell is 32 bits (or at least more than 16 bits) it can NOT encode all Unicode points. -Tako ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Ivan Lazar Miljenovic ivan.miljeno...@gmail.com writes: Seeing as how the genome just uses 4 base letters, Yes, the bulk of the data is not really text at all, but each sequence (it's fragmented due to the molecular division into chromosomes, and due to incompleteness) also has a textual header. Generally, the Fasta format looks like this:

  >sequence-id some arbitrary metadata blah blah
  ACGATATACGCGCATGCGAT...
  ..lines and lines of letters...

(As an aside, although there are only four nucleotides (ACGT), there are occasional wildcard characters, the most common being N for aNy nucleotide, but there are defined wildcards for all subsets of the alphabet.) wouldn't it be better to not treat it as text but use something else? I generally use ByteStrings, with the .Char8 interface if/when appropriate. This is actually a pretty good choice; even if people use Unicode in the headers, I don't particularly want to care - as long as it is transparent. In some cases, I'd like to, say, search headers for some specific string - in these cases, a nice, tidy, rich, and optimized Data.ByteString(.Lazy).UTF8 would be nice. (But obviously not terribly essential at the moment, since I haven't bothered to test the available options.) I guess for my stuff, the (human-consumable) text bits are neither very performance-intensive nor large, so I could probably and fairly cheaply wrap the relevant operations or fields with Data.Text's {de,en}codeUtf8. And in practice - partly due to lacking software support, I'm sure - it's all ASCII anyway. :-) It'd be nice to have efficient substring searches, regular expressions, etc. for the sequence data, but often this will be better addressed by more specific algorithms, and in any case, a .Char8 implementation is likely to be more efficient than any gratuitous Unicode encoding. (in case someone is trying to do their mad genetic manipulation by hand)? 
You'd be surprised what a determined biologist can achive, armed only with Word, Excel, and a reckless disregard for surmountability. -k -- If I haven't seen further, it is by standing in the footprints of giants ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
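The ByteString.Char8 approach Ketil describes can be sketched as a minimal Fasta reader. This is purely illustrative (parseFasta and FastaRecord are made-up names); it assumes the conventional '>' prefix on header lines and does no validation:

```haskell
-- A minimal Fasta reader over strict ByteStrings; a sketch, not a robust parser.
import qualified Data.ByteString.Char8 as B

type FastaRecord = (B.ByteString, B.ByteString)  -- (header, concatenated sequence)

parseFasta :: B.ByteString -> [FastaRecord]
parseFasta = go . B.lines
  where
    go [] = []
    go (l : ls)
      | isHeader l =
          let (seqLines, rest) = break isHeader ls
          in (B.drop 1 l, B.concat seqLines) : go rest
      | otherwise = go ls          -- skip anything before the first header
    isHeader = B.isPrefixOf (B.pack ">")

main :: IO ()
main = mapM_ print (parseFasta (B.pack ">seq1 some metadata\nACGT\nNNAC\n>seq2\nGGG\n"))
```

Headers stay opaque bytes here; only the code that actually inspects them would need to care about a UTF-8 interpretation.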
Re: [Haskell-cafe] Re: String vs ByteString
Hi Ketil, On Tue, Aug 17, 2010 at 12:09 PM, Ketil Malde ke...@malde.org wrote: Johan Tibell johan.tib...@gmail.com writes: It's not clear to me that using UTF-16 internally does make Data.Text noticeably slower. I haven't benchmarked it, but I'm fairly sure that, if you try to fit a 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of RAM, UTF-16 will be slower than UTF-8. Many applications will get away with streaming over data, retaining only a small part, but some won't. I'm not sure if this is a great example, as genome data is probably much better stored in a vector (using a few bits per letter). I agree that whenever one data structure will fit in the available RAM and another won't, the smaller will win. I just don't know if this case is worth spending weeks' worth of work optimizing for. That's why I'd like to see benchmarks for more idiomatic use cases. In other cases (e.g. processing CJK text, and perhaps also non-Latin1 text), I'm sure it'll be faster - but my (still unsubstantiated) guess is that the difference will be much smaller, and it'll be a case of winning some and losing some - and I'd also conjecture that having 3Gb of real text (i.e. natural language, as opposed to text-formatted data) is rare. I would like to verify this guess. In my personal experience it's really hard to guess which changes will lead to a noticeable performance improvement. I'm probably wrong more often than I'm right. Cheers, Johan ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
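Johan's aside about storing genome data at a few bits per letter can be sketched with plain Word64 arithmetic. All names here are illustrative; real code would use something like Data.Vector.Unboxed and would need a wider code for the wildcard letters:

```haskell
-- Sketch of 2-bits-per-nucleotide packing; illustrative only.
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word64)

-- The four plain nucleotides; IUPAC wildcards (N etc.) would need a wider code.
encodeBase :: Char -> Word64
encodeBase 'A' = 0
encodeBase 'C' = 1
encodeBase 'G' = 2
encodeBase 'T' = 3
encodeBase c   = error ("not a plain nucleotide: " ++ [c])

-- Pack up to 32 bases into one Word64, first base in the high bits.
pack32 :: String -> Word64
pack32 = foldl (\acc c -> (acc `shiftL` 2) .|. encodeBase c) 0

-- Recover the bases, given how many were packed.
unpack32 :: Int -> Word64 -> String
unpack32 n w = [ "ACGT" !! fromIntegral ((w `shiftR` (2 * (n - 1 - i))) .&. 3)
               | i <- [0 .. n - 1] ]

main :: IO ()
main = do
  print (pack32 "ACGT")      -- 27, i.e. 0b00011011
  putStrLn (unpack32 4 27)   -- ACGT
```

At 2 bits per base the 3Gbyte file from Ketil's example shrinks to roughly 750MB, which is the kind of win Johan is pointing at.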
Re: [Haskell-cafe] Re: String vs ByteString
Tako Schotanus t...@codejive.org writes: On Tue, Aug 17, 2010 at 12:54, Ivan Lazar Miljenovic ivan.miljeno...@gmail.com wrote: Tom Harper rtomhar...@gmail.com writes: 2010/8/17 Bulat Ziganshin bulat.zigans...@gmail.com: Hello Tom, snip i don't understand what you mean. are you support all 2^20 codepoints in Data.Text package? Bulat, Yes, its internal representation is UTF-16, which is capable of encoding *any* valid Unicode codepoint. Just like Char is capable of encoding any valid Unicode codepoint. Unless a Char in Haskell is 32 bits (or at least more than 16 bits) it can NOT encode all Unicode points. http://www.haskell.org/onlinereport/lexemes.html -- Ivan Lazar Miljenovic ivan.miljeno...@gmail.com IvanMiljenovic.wordpress.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Michael Snoyman mich...@snoyman.com writes: I don't think *anyone* is asserting that UTF-16 is a common encoding for files anywhere, *ahem* http://en.wikipedia.org/wiki/UTF-16/UCS-2#Use_in_major_operating_systems_and_environments -- Ivan Lazar Miljenovic ivan.miljeno...@gmail.com IvanMiljenovic.wordpress.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
On Tue, Aug 17, 2010 at 13:00, Michael Snoyman mich...@snoyman.com wrote: On Tue, Aug 17, 2010 at 1:50 PM, Yitzchak Gale g...@sefer.org wrote: Ketil Malde wrote: I haven't benchmarked it, but I'm fairly sure that, if you try to fit a 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of RAM, UTF-16 will be slower than UTF-8... I don't think the genome is typical text. And I doubt that is true if that text is in a CJK language. As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16. The question still remains whether the overall document size will be larger: I'd be interested in taking a random sampling of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I think simply talking about this in the vacuum of data is pointless. If anyone can recommend a CJK website which would be considered representative (or a few), I'll do the test myself. Regardless of the outcome of that investigation (which in itself is interesting), I have to agree with Yitzchak that the human genome (or any other ASCII-based data that is not necessarily a representation of written human language) is not a good fit for the Text package. A package like this should IMHO be good at handling human languages, as many of them as possible, and support the common operations as efficiently as possible: sorting, upper/lowercasing (where those exist), finding word boundaries, whatever. Parsing some kind of file containing the human genome and the like would, I think, be much better served by a package focusing on handling large streams of bytes. No encodings to worry about, no parsing of the stream to determine code points, no calculations to determine string lengths. If you need to convert things to upper/lower case or do sorting you can just fall back on simple ASCII processing, no need to depend on a package dedicated to human text processing. 
I do think that in-memory processing of Unicode is better served with UTF-16 than UTF-8, because except in very rare circumstances you can just treat the text as an array of Char. You can't do that for UTF-8, so the efficiency of the algorithms would suffer. I also think that the memory problem is much easier worked around (for example by dividing the problem into smaller parts) than sub-optimal string processing caused by increased complexity. -Tako ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
On Tue, Aug 17, 2010 at 2:20 PM, Ivan Lazar Miljenovic ivan.miljeno...@gmail.com wrote: Michael Snoyman mich...@snoyman.com writes: I don't think *anyone* is asserting that UTF-16 is a common encoding for files anywhere, *ahem* http://en.wikipedia.org/wiki/UTF-16/UCS-2#Use_in_major_operating_systems_and_environments I was talking about the contents of the files, not the file names or how the system calls work. I know at least on Windows, Linux and FreeBSD, if you open up the default text editor, type in a few letters and hit save, the file will not be in UTF-16. Michael ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
On Tue, Aug 17, 2010 at 13:29, Ketil Malde ke...@malde.org wrote: Tako Schotanus t...@codejive.org writes: Just like Char is capable of encoding any valid Unicode codepoint. Unless a Char in Haskell is 32 bits (or at least more than 16 bits) it can NOT encode all Unicode points. And since it can encode (or rather, represent) any valid Unicode codepoint, it follows that it is 32 bits (and at least more than 16 bits). :-) (Char is basically a 32bit value, limited to valid Unicode code points, so it corresponds to UCS-4/UTF-32.) Yeah, I tried looking it up but couldn't find the technical definition of Char; in the end I found that maxBound was 0x10FFFF, making it basically 21 bits :) I know for example that Java uses only 16 bits for its Chars and therefore can NOT give you all Unicode code points with a single Char; with Strings you can, because of the extension points. -Tako ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Michael Snoyman mich...@snoyman.com writes: As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16. With the danger of sounding ... alphabetist? as well as belaboring a point I agree is irrelevant (the storage format): I'd point out that it seems at least as unfair to optimize for CJK at the cost of Western languages. UTF-16 uses two bytes for (most) CJK ideograms, and (all, I think) characters in Western and other phonetic scripts. UTF-8 uses one to two bytes for a lot of Western alphabets, but three for CJK ideograms. Now, CJK has about 20K ideograms, which is almost 15 bits per ideogram, while an ASCII letter is about six bits. Thus, the information density of CJK and ASCII is about equal for UTF-8, 5/8 vs 6/8 - compared to 15/16 vs 6/16 for UTF-16. In other words a given document translated between Chinese and English should occupy roughly the same space in UTF-8, but be 2.5 times longer in English for UTF-16. -k -- If I haven't seen further, it is by standing in the footprints of giants ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
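Ketil's density arithmetic can be checked mechanically with a base-only sketch that counts the bytes each encoding needs per code point (encodedSize and the sample strings are of course just illustrative):

```haskell
-- Bytes-per-code-point calculator for UTF-8 and UTF-16 (no validation;
-- surrogate issues are ignored since Char holds whole code points).
import Data.Char (ord)

utf8Bytes, utf16Bytes :: Char -> Int
utf8Bytes c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c
utf16Bytes c = if ord c < 0x10000 then 2 else 4

encodedSize :: (Char -> Int) -> String -> Int
encodedSize perChar = sum . map perChar

main :: IO ()
main = do
  let english = "information density"
      chinese = "\x4FE1\x606F\x5BC6\x5EA6"  -- four BMP ideograms
  print (encodedSize utf8Bytes english, encodedSize utf16Bytes english)  -- (19,38)
  print (encodedSize utf8Bytes chinese, encodedSize utf16Bytes chinese)  -- (12,8)
```

The numbers line up with the 5/8 vs 6/8 and 15/16 vs 6/16 densities above: ASCII doubles in size under UTF-16, while BMP ideograms grow from two bytes to three under UTF-8.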
Re: [Haskell-cafe] Re: String vs ByteString
Ivan == Ivan Lazar Miljenovic ivan.miljeno...@gmail.com writes: Char is not an encoding, right? Ivan No, but in GHC at least it corresponds to a Unicode codepoint. I don't think this is right, or shouldn't be right, anyway. Surely it stands for a character. Unicode codepoints include non-characters such as the surrogate codepoints used by UTF-16 to map non-BMP codepoints to pairs of 16-bit codepoints. I don't think you ought to be able to see a surrogate codepoint as a Char. -- Colin Adams Preston Lancashire () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org - against proprietary attachments ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Yitzchak Gale g...@sefer.org writes: I don't think the genome is typical text. I think the typical *large* collection of text is text-encoded data, and not, for lack of a better word, literature. Genomics data is just an example. -k -- If I haven't seen further, it is by standing in the footprints of giants ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
On Tue, Aug 17, 2010 at 13:40, Ketil Malde ke...@malde.org wrote: Michael Snoyman mich...@snoyman.com writes: As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16. With the danger of sounding ... alphabetist? as well as belaboring a point I agree is irrelevant (the storage format): I'd point out that it seems at least as unfair to optimize for CJK at the cost of Western languages. The thing is that here you're only talking about size optimizations; for somebody having to handle a lot of international text (and I'm not necessarily talking about Chinese or Japanese here), it would be important that this is handled in the most efficient way possible, because in the end storing and retrieving are done only once each, while maybe a lot of processing happens in between. And the on-disk storage or the over-the-wire format might very well be different from the in-memory format. Each can be selected for what it's best at. I'll repeat here that in my opinion a Text package should be good at handling text, human text, from whatever country. If I need to handle large streams of ASCII I'll use something else. :) Cheers, -Tako ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Michael Snoyman wrote: Regarding the data: you haven't actually quoted any statistics about the prevalence of CJK data True, I haven't seen any - except for Google, which I don't believe is accurate. I would like to see some good unbiased data. Right now we just have our intuitions based on anecdotal evidence and whatever years of experience we have in IT. For the anecdotal evidence, I really wish that people from CJK countries were better represented in this discussion. Unfortunately, Haskell is less prevalent in CJK countries, and there is somewhat of a language barrier. I'd hate to make up statistics on the spot, especially when I don't have any numbers from you to compare them with. I agree, I wish we had better numbers. even if the majority of web pages served are in those three languages, a fairly high percentage of the content will *still* be ASCII, due simply to the HTML, CSS and Javascript overhead... As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16. The question still remains whether the overall document size will be larger: I'd be interested in taking a random sampling of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I think simply talking about this in the vacuum of data is pointless. If anyone can recommend a CJK website which would be considered representative (or a few), I'll do the test myself. Again, I agree that some real data would be great. The problem is, I'm not sure if there is anyone in this discussion who is qualified to come up with anything even close to a fair random sampling or a CJK website that is representative. As far as I can tell, most of us participating in this discussion have absolutely zero perspective of what computing is like in CJK countries. As far as the conflation, there are two questions with regard to the encoding choice: encoding/decoding time and space usage. 
No, there is a third: using an API that results in robust, readable and maintainable code even in the face of changing encoding requirements. Unless you have proof that the difference in performance between that API and an API with a hard-wired encoding is the factor that is causing your particular application to fail to meet its requirements, the hard-wired approach is guilty of aggravated premature optimization. So for example, UTF-8 is an important option to have in a web toolkit. But if that's the only option, that web toolkit shouldn't be considered a general-purpose one in my opinion. I don't think *anyone* is asserting that UTF-16 is a common encoding for files anywhere, so by using UTF-16 we are simply incurring an overhead in every case. Well, to start with, all MS Word documents are in UTF-16. There are a few of those around I think. Most applications - in some sense of most - store text in UTF-16 Again, without any data, my intuition tells me that most of the text data stored in the world's files are in UTF-16. There is currently not much Haskell code that reads those formats directly, but I think that will be changing as usage of Haskell in the real world picks up. We can't consider a CJK encoding for text, Not as a default, certainly not as the only option. But nice to have as a choice. What *is* relevant is that a very large percentage of web pages *are*, in fact, standardizing on UTF-8, In Western countries. Regards, Yitz ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Colin Paul Adams co...@colina.demon.co.uk writes: Char is not an encoding, right? Ivan No, but in GHC at least it corresponds to a Unicode codepoint. I don't think this is right, or shouldn't be right, anyway. Surely it stands for a character. Unicode codepoints include non-characters such as the surrogate codepoints used by UTF-16 to map non-BMP codepoints to pairs of 16-bit code units.

Prelude> (toEnum 0xD800) :: Char
'\55296'

I don't think you ought to be able to see a surrogate codepoint as a Char. This is a bit confusing. From the Unicode glossary:

- Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader's understanding. (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. (4) The English name for the ideographic written elements of Chinese origin. [See ideograph (2).]
- Code Point. (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4, Characters and Encoding.) (2) A value, or position, for a character, in any coded character set.

From Wikipedia on UTF-16: Unicode and ISO/IEC 10646 do not, and will never, assign characters to any of the code points in the U+D800–U+DFFF range, so an individual code unit from a surrogate pair does not ever represent a character.

So: A Char holds a code point, that is, a value from 0 to 0x10FFFF. Some of these values do not correspond to Unicode characters. As far as I can tell, a surrogate pair in UTF-16 is both two (surrogate) code points of two bytes each, as well as a single code point encoded as four bytes. Implementations seem to differ about what the length of a string containing surrogate pairs is. 
-k -- If I haven't seen further, it is by standing in the footprints of giants ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
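Ketil's observation is easy to check in GHC itself. This small sketch (standard Data.Char only; the helper names are mine) confirms that Char ranges over the whole Unicode codespace and that GHC accepts a lone surrogate as a Char:

```haskell
import Data.Char (generalCategory, GeneralCategory(Surrogate))

-- Char covers the whole Unicode codespace: 0 to 0x10FFFF
charCodespaceMax :: Int
charCodespaceMax = fromEnum (maxBound :: Char)  -- 1114111 == 0x10FFFF

-- True for U+D800..U+DFFF, which GHC accepts as Char values even
-- though they can never denote an assigned character
isLoneSurrogate :: Char -> Bool
isLoneSurrogate c = generalCategory c == Surrogate
```

In GHCi, `isLoneSurrogate (toEnum 0xD800)` is True, which is exactly the `'\55296'` case above.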
Re: [Haskell-cafe] Re: String vs ByteString
Ketil Malde wrote: I'd point out that it seems at least as unfair to optimize for CJK at the cost of Western languages. Quite true. [...speculative calculation from which we conclude that] a given document translated between Chinese and English should occupy roughly the same space in UTF-8, but be 2.5 times longer in English for UTF-16. Could be. We really need data on that. If it's practical to maintain different backends with identical public APIs and different internal encodings, that would be the best. After a few years of widespread usage, we would know a lot more. Regards, Yitz ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
On Tue, Aug 17, 2010 at 1:36 PM, Tako Schotanus t...@codejive.org wrote: Yeah, I tried looking it up but I couldn't find the technical definition for Char, but in the end I found that maxBound was 0x10FFFF making it basically 24 bits :) I think that's enough to represent all the assigned Unicode code points. I also think the Unicode consortium (or whatever it is called) made some statement about the maximum number of bits they'll ever use. -- Johan ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Hello, Ketil Malde! On Tue, Aug 17, 2010 at 8:02 AM, Ketil Malde ke...@malde.org wrote: Ivan Lazar Miljenovic ivan.miljeno...@gmail.com writes: Seeing as how the genome just uses 4 base letters, Yes, the bulk of the data is not really text at all, but each sequence (it's fragmented due to the molecular division into chromosomes, and due to incompleteness) also has a textual header. Generally, the Fasta format looks like this:

>sequence-id some arbitrary metadata blah blah
ACGATATACGCGCATGCGAT...
..lines and lines of letters...

(As an aside, although there are only four nucleotides (ACGT), there are occasional wildcard characters, the most common being N for aNy nucleotide, but there are defined wildcards for all subsets of the alphabet.) As someone who knows and uses your bio package, I'm almost certain that Text really isn't the right data type for representing everything. Certainly *not* for the genomic data itself. In fact, a representation using 4 bits per base (4 nucleotides plus 12 other characters, such as gaps and Ns) is easy to represent using ByteStrings with two bases per byte and should halve the space requirements. However, the header of each sequence is text, in the sense of human language text, and ideally should be represented using Text. In other words, the sequence data type[1] currently is defined as:

type SeqData  = Data.ByteString.Lazy.ByteString
type QualData = Data.ByteString.Lazy.ByteString
data Sequence t = Seq !SeqData !SeqData !(Maybe QualData)

[1] http://hackage.haskell.org/packages/archive/bio/0.4.6/doc/html/Bio-Sequence-SeqData.html#t:Sequence

where the meaning is that in 'Seq header seqdata qualdata', 'header' would be something like "sequence-id some arbitrary metadata blah blah" and 'seqdata' would be ACGATATACGCGCATGCGAT. 
But perhaps we should really have:

type SeqData    = Data.ByteString.Lazy.ByteString
type QualData   = Data.ByteString.Lazy.ByteString
type HeaderData = Data.Text.Text -- strict is prolly a good choice here
data Sequence t = Seq !HeaderData !SeqData !(Maybe QualData)

Semantically, this is the right choice, putting Text where there is text. We can read everything with ByteStrings and then use[2]

decodeUtf8 :: ByteString -> Text

[2] http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-Encoding.html#v:decodeUtf8

only for the header bits. There is only one problem in this approach: UTF-8 for the input FASTA file would be hardcoded. Considering that probably nobody will be using UTF-16 or UTF-32 for the whole FASTA file, there remains only UTF-8 (of which ASCII is just a special case) and other 8-bit encodings (such as ISO8859-1, Shift-JIS, etc.). I haven't seen a FASTA file with characters outside the ASCII range yet, but I guess the choice of UTF-8 shouldn't be a big problem. wouldn't it be better to not treat it as text but use something else? I generally use ByteStrings, with the .Char8 interface if/when appropriate. This is actually a pretty good choice; even if people use Unicode in the headers, I don't particularly want to care - as long as it is transparent. In some cases, I'd like to, say, search headers for some specific string - in these cases, a nice, tidy, rich, and optimized Data.ByteString(.Lazy).UTF8 would be nice. (But obviously not terribly essential at the moment, since I haven't bothered to test the available options. I guess for my stuff, the (human consumable) text bits are neither very performance intensive, nor large, so I could probably and fairly cheaply wrap relevant operations or fields with Data.Text's {de,en}codeUtf8. And in practice - partly due to lacking software support, I'm sure - it's all ASCII anyway. :-) Oh, so I didn't read this paragraph closely enough :). 
In this e-mail I'm basically agreeing with your thoughts here =). And what do you think about creating a real SeqData data type with two bases per byte? In terms of processing speed I guess there will be a small penalty, but if you need to have large quantities of base pairs in memory this would double your capacity =). Cheers, -- Felipe. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
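Felipe's two-bases-per-byte idea can be sketched with plain Data.Bits. The 4-bit codes below follow the common IUPAC-style bitmask convention (A=1, C=2, G=4, T=8, with unions for wildcards, so N=15), but the exact table and function names are my assumption for illustration, not what the bio package actually does:

```haskell
import Data.Bits ((.&.), (.|.), shiftL, shiftR)
import Data.Word (Word8)

-- Hypothetical 4-bit nucleotide codes; unions of the four base bits
-- encode wildcards, so all subsets of ACGT fit in a nibble.
encodeBase :: Char -> Word8
encodeBase 'A' = 1
encodeBase 'C' = 2
encodeBase 'G' = 4
encodeBase 'T' = 8
encodeBase 'N' = 15  -- aNy = A|C|G|T
encodeBase _   = 0   -- gap / unknown

-- Pack two bases into one byte, high nibble first.
packPair :: Char -> Char -> Word8
packPair a b = (encodeBase a `shiftL` 4) .|. encodeBase b

-- Recover the two nibbles from a packed byte.
unpackPair :: Word8 -> (Word8, Word8)
unpackPair w = (w `shiftR` 4, w .&. 0x0F)
```

A whole sequence then packs into ⌈n/2⌉ bytes, halving the space of a plain Char8 ByteString, at the cost Felipe mentions: every comparison and lookup has to go through nibble arithmetic.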
Re: [Haskell-cafe] Re: String vs ByteString
Johan == Johan Tibell johan.tib...@gmail.com writes:

Johan> On Tue, Aug 17, 2010 at 1:36 PM, Tako Schotanus t...@codejive.org wrote:
Johan> Yeah, I tried looking it up but I couldn't find the
Johan> technical definition for Char, but in the end I found that
Johan> maxBound was 0x10FFFF making it basically 24 bits :)
Johan> I think that's enough to represent all the assigned Unicode
Johan> code points. I also think the Unicode consortium (or whatever
Johan> it is called) made some statement about the maximum number of
Johan> bits they'll ever use.

Yes. And UTF-16 is only capable of dealing with codepoints up to this limit. -- Colin Adams Preston Lancashire () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org - against proprietary attachments ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
On Tue, Aug 17, 2010 at 2:23 PM, Yitzchak Gale g...@sefer.org wrote: Michael Snoyman wrote: Regarding the data: you haven't actually quoted any statistics about the prevalence of CJK data True, I haven't seen any - except for Google, which I don't believe is accurate. I would like to see some good unbiased data. To my knowledge the data we have about the prevalence of encodings on the web is accurate. We crawl all pages we can get our hands on, by starting at some set of seeds and then following all the links. You cannot be sure that you've reached all web sites as there might be cliques in the web graph but we try our best to get them all. You're unlikely to get a better estimate anywhere else. I doubt many organizations have the machinery required to crawl most of the web. -- Johan ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
On Tue, Aug 17, 2010 at 3:23 PM, Yitzchak Gale g...@sefer.org wrote: Michael Snoyman wrote: Regarding the data: you haven't actually quoted any statistics about the prevalence of CJK data True, I haven't seen any - except for Google, which I don't believe is accurate. I would like to see some good unbiased data. Right now we just have our intuitions based on anecdotal evidence and whatever years of experience we have in IT. For the anecdotal evidence, I really wish that people from CJK countries were better represented in this discussion. Unfortunately, Haskell is less prevalent in CJK countries, and there is somewhat of a language barrier. I'd hate to make up statistics on the spot, especially when I don't have any numbers from you to compare them with. I agree, I wish we had better numbers. even if the majority of web pages served are in those three languages, a fairly high percentage of the content will *still* be ASCII, due simply to the HTML, CSS and Javascript overhead... As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16. The question still remains whether the overall document size will be larger: I'd be interested in taking a random sampling of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I think simply talking about this in the vacuum of data is pointless. If anyone can recommend a CJK website which would be considered representative (or a few), I'll do the test myself. Again, I agree that some real data would be great. The problem is, I'm not sure if there is anyone in this discussion who is qualified to come up with anything even close to a fair random sampling or a CJK website that is representative. As far as I can tell, most of us participating in this discussion have absolutely zero perspective of what computing is like in CJK countries. I won't call this a scientific study by any stretch of the imagination, but I did a quick test on the www.qq.com homepage. 
The original file encoding was GB2312; here are the file sizes: GB2312: 193014 UTF8: 200044 UTF16: 371938 As far as the conflation, there are two questions with regard to the encoding choice: encoding/decoding time and space usage. No, there is a third: using an API that results in robust, readable and maintainable code even in the face of changing encoding requirements. Unless you have proof that the difference in performance between that API and an API with a hard-wired encoding is the factor that is causing your particular application to fail to meet its requirements, the hard-wired approach is guilty of aggravated premature optimization. So for example, UTF-8 is an important option to have in a web toolkit. But if that's the only option, that web toolkit shouldn't be considered a general-purpose one in my opinion. I'm not talking about API changes here; the topic at hand is the internal representation of the stream of characters used by the text package. That is currently UTF-16; I would argue switching to UTF8. I don't think *anyone* is asserting that UTF-16 is a common encoding for files anywhere, so by using UTF-16 we are simply incurring an overhead in every case. Well, to start with, all MS Word documents are in UTF-16. There are a few of those around I think. Most applications - in some sense of most - store text in UTF-16 Again, without any data, my intuition tells me that most of the text data stored in the world's files are in UTF-16. There is currently not much Haskell code that reads those formats directly, but I think that will be changing as usage of Haskell in the real world picks up. I was referring to text files, not binary files with text embedded within them. While we might use the text package to deal with the data from a Word doc once in memory, we would almost certainly need to use ByteString (or binary perhaps) to actually parse the file. 
But at the end of the day, you're right: there would be an encoding penalty at a certain point, just not on the entire file. We can't consider a CJK encoding for text, Not as a default, certainly not as the only option. But nice to have as a choice. I think you're missing the point at hand: I don't think *anyone* is opposed to offering encoders/decoders for all the multitude of encoding types out there. In fact, I believe the text-icu package already supports every encoding type under discussion. The question is the internal representation for text, for which a language-specific encoding is *not* a choice, since it does not support all unicode code points. Michael ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
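Michael's per-page measurement can be approximated for any string directly from the UTF-8 and UTF-16 length rules, without transcoding a file at all. This sketch (function names mine; it ignores BOMs and assumes well-formed text) counts the bytes each encoding would need:

```haskell
import Data.Char (ord)

-- Bytes one code point occupies when encoded as UTF-8.
utf8Len :: Char -> Int
utf8Len c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c

-- Bytes one code point occupies when encoded as UTF-16:
-- 2 in the BMP, 4 (a surrogate pair) above it.
utf16Len :: Char -> Int
utf16Len c = if ord c < 0x10000 then 2 else 4

-- (UTF-8 bytes, UTF-16 bytes) for a whole string.
encodedSizes :: String -> (Int, Int)
encodedSizes s = (sum (map utf8Len s), sum (map utf16Len s))
```

For pure ASCII this gives (n, 2n); for a run of BMP ideograms, (3n, 2n). A real page mixes the two, which is why the qq.com numbers fall in between rather than at either extreme.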
Re: [Haskell-cafe] Re: String vs ByteString
Sounds to me like we need a lazy Data.Text variation that allows UTF-8 and UTF-16 segments in its list of strict text elements :) Then big chunks of western text will be encoded efficiently, and same with CJK! Not sure what to do about strict Data.Text though :) On Tue, Aug 17, 2010 at 1:40 PM, Ketil Malde ke...@malde.org wrote: Michael Snoyman mich...@snoyman.com writes: As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16. With the danger of sounding ... alphabetist? as well as belaboring a point I agree is irrelevant (the storage format): I'd point out that it seems at least as unfair to optimize for CJK at the cost of Western languages. UTF-16 uses two bytes for (most) CJK ideograms, and (all, I think) characters in Western and other phonetic scripts. UTF-8 uses one to two bytes for a lot of Western alphabets, but three for CJK ideograms. Now, CJK has about 20K ideograms, which is almost 15 bits per ideogram, while an ASCII letter is about six bits. Thus, the information density of CJK and ASCII is about equal for UTF-8, 5/8 vs 6/8 - compared to 15/16 vs 6/16 for UTF-16. In other words a given document translated between Chinese and English should occupy roughly the same space in UTF-8, but be 2.5 times longer in English for UTF-16. -k -- If I haven't seen further, it is by standing in the footprints of giants ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Felipe Lessa felipe.le...@gmail.com writes: [-snip- I've already spent too much time on the other stuff :-] And what do you think about creating a real SeqData data type with two bases per byte? In terms of processing speed I guess there will be a small penalty, but if you need to have large quantities of base pairs in memory this would double your capacity =). Yes, this is interesting in some cases. Obvious downsides would be a separate data type for protein sequences (20 characters, plus some wildcards), and more complicated string comparison (when a match is off by one). Oh, and lower case is sometimes used to signify less important regions, like repeats. Another choice is the 2bit format (used by BLAT, and supported in Bio for input/output, but not internally), which stores the alphabet proper directly in 2bit quantities, and uses separate lists for gaps, lower case masking, and Ns (and is obviously extensible to wildcards). Too much extending, and you're likely to lose any benefit, though. Basically, it boils down to a set of different tradeoffs, and I think ByteString is a fairly good choice in *most* cases, and it deals - if not particularly elegantly, then at least fairly effectively - with various conventions, like lower-casing or wild cards. -k -- If I haven't seen further, it is by standing in the footprints of giants ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Someone mentioned earlier that IMHO all of this messing around with encodings and conversions should be handled transparently, and I guess you could do something like have the internal representation be along the lines of Either UTF8 UTF16 (or perhaps even more encodings), and then implement every function in the API equivalently for each representation (with only the performance characteristics differing), with input/output functions being specialized for each encoding, and then only do a conversion when necessary or explicitly requested. But I assume that would have other problems (like the implicit conversions causing hard-to-track-down performance bugs when they're triggered unintentionally). On Tue, Aug 17, 2010 at 3:21 PM, Daniel Peebles pumpkin...@gmail.com wrote: Sounds to me like we need a lazy Data.Text variation that allows UTF-8 and UTF-16 segments in its list of strict text elements :) Then big chunks of western text will be encoded efficiently, and same with CJK! Not sure what to do about strict Data.Text though :) On Tue, Aug 17, 2010 at 1:40 PM, Ketil Malde ke...@malde.org wrote: Michael Snoyman mich...@snoyman.com writes: As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16. With the danger of sounding ... alphabetist? as well as belaboring a point I agree is irrelevant (the storage format): I'd point out that it seems at least as unfair to optimize for CJK at the cost of Western languages. UTF-16 uses two bytes for (most) CJK ideograms, and (all, I think) characters in Western and other phonetic scripts. UTF-8 uses one to two bytes for a lot of Western alphabets, but three for CJK ideograms. Now, CJK has about 20K ideograms, which is almost 15 bits per ideogram, while an ASCII letter is about six bits. 
In other words a given document translated between Chinese and English should occupy roughly the same space in UTF-8, but be 2.5 times longer in English for UTF-16. -k -- If I haven't seen further, it is by standing in the footprints of giants ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe -- Work is punishment for failing to procrastinate effectively. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
(Actually, this seems more like a job for a type class.) 2010/8/17 Gábor Lehel illiss...@gmail.com: Someone mentioned earlier that IHHO all of this messing around with encodings and conversions should be handled transparently, and I guess you could do something like have the internal representation be along the lines of Either UTF8 UTF16 (or perhaps even more encodings), and then implement every function in the API equivalently for each representation (with only the performance characteristics differing), with input/output functions being specialized for each encoding, and then only do a conversion when necessary or explicitly requested. But I assume that would have other problems (like the implicit conversions causing hard-to-track-down performance bugs when they're triggered unintentionally). On Tue, Aug 17, 2010 at 3:21 PM, Daniel Peebles pumpkin...@gmail.com wrote: Sounds to me like we need a lazy Data.Text variation that allows UTF-8 and UTF-16 segments in it list of strict text elements :) Then big chunks of western text will be encoded efficiently, and same with CJK! Not sure what to do about strict Data.Text though :) On Tue, Aug 17, 2010 at 1:40 PM, Ketil Malde ke...@malde.org wrote: Michael Snoyman mich...@snoyman.com writes: As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16. With the danger of sounding ... alphabetist? as well as belaboring a point I agree is irrelevant (the storage format): I'd point out that it seems at least as unfair to optimize for CJK at the cost of Western languages. UTF-16 uses two bytes for (most) CJK ideograms, and (all, I think) characters in Western and other phonetic scripts. UTF-8 uses one to two bytes for a lot of Western alphabets, but three for CJK ideograms. Now, CJK has about 20K ideograms, which is almost 15 bits per ideogram, while an ASCII letter is about six bits. 
Thus, the information density of CJK and ASCII is about equal for UTF-8, 5/8 vs 6/8 - compared to 15/16 vs 6/16 for UTF-16. In other words a given document translated between Chinese and English should occupy roughly the same space in UTF-8, but be 2.5 times longer in English for UTF-16. -k -- If I haven't seen further, it is by standing in the footprints of giants ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe -- Work is punishment for failing to procrastinate effectively. -- Work is punishment for failing to procrastinate effectively. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
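Gábor's type-class remark could look something like the minimal sketch below: one API, several interchangeable internal representations, with conversion only at the boundaries. The class and instance here are made up for illustration; real instances would wrap UTF-8 or UTF-16 byte arrays:

```haskell
-- Hypothetical class: one text API over interchangeable internal encodings.
class TextLike t where
  pack   :: String -> t
  unpack :: t -> String
  append :: t -> t -> t

-- A trivial instance, just to show the shape.
newtype PlainString = PlainString String

instance TextLike PlainString where
  pack = PlainString
  unpack (PlainString s) = s
  append (PlainString a) (PlainString b) = PlainString (a ++ b)
```

Code written against TextLike then works unchanged whichever representation the library picks, which is the "identical public APIs, different internal encodings" idea in type-class form.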
Re: [Haskell-cafe] Re: String vs ByteString
On Tue, Aug 17, 2010 at 06:12, Michael Snoyman mich...@snoyman.com wrote: I'm not talking about API changes here; the topic at hand is the internal representation of the stream of characters used by the text package. That is currently UTF-16; I would argue switching to UTF8. The Data.Text.Foreign module is part of the API, and is currently hardcoded to use UTF-16. Any change of the internal encoding will require breaking this module's API. We can't consider a CJK encoding for text, Not as a default, certainly not as the only option. But nice to have as a choice. I think you're missing the point at hand: I don't think *anyone* is opposed to offering encoders/decoders for all the multitude of encoding types out there. In fact, I believe the text-icu package already supports every encoding type under discussion. The question is the internal representation for text, for which a language-specific encoding is *not* a choice, since it does not support all unicode code points. Michael The reason many Japanese and Chinese users reject UTF-8 isn't due to space constraints (UTF-8 and UTF-16 are roughly equal), it's because they reject Unicode itself. Shift-JIS and the various Chinese encodings both contain Han characters which are missing from Unicode, either because of the Han unification or simply because they were not considered important enough to include (yet there's a codepage for Linear-B...). Ruby, which has an enormous Japanese userbase, solved the problem by essentially defining Text = (Encoding, ByteString), and then re-implementing text logic for each encoding. This allows very efficient operation with every possible encoding, at the cost of increased complexity (caching decoded characters, multi-byte handling, etc). ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
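The Ruby-style representation John describes — raw bytes tagged with their encoding, with every primitive dispatching on the tag — might be sketched in Haskell like this (type, constructor, and function names are mine, not Ruby's or the text package's; only two encodings shown):

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

data Encoding = UTF8 | UTF16LE
  deriving (Eq, Show)

-- The Ruby-style pair: an encoding tag plus undecoded bytes.
data TaggedText = TaggedText Encoding B.ByteString
  deriving Show

-- Every primitive branches on the tag. Example: how many bytes the
-- code unit sequence beginning with lead byte w occupies.
leadWidth :: Encoding -> Word8 -> Int
leadWidth UTF8 w
  | w < 0x80  = 1
  | w < 0xC0  = 1  -- continuation byte; malformed as a lead, count it as 1
  | w < 0xE0  = 2
  | w < 0xF0  = 3
  | otherwise = 4
-- UTF-16 code units are always 2 bytes; surrogate pairing would be
-- handled one level up, by inspecting both bytes of the unit.
leadWidth UTF16LE _ = 2
```

The branch in leadWidth is the per-call dispatch this design pays for: the encoding is a runtime value, so it generally cannot be resolved statically.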
Re: [Haskell-cafe] Re: String vs ByteString
On Tue, Aug 17, 2010 at 6:19 PM, John Millikin jmilli...@gmail.com wrote: Ruby, which has an enormous Japanese userbase, solved the problem by essentially defining Text = (Encoding, ByteString), and then re-implementing text logic for each encoding. This allows very efficient operation with every possible encoding, at the cost of increased complexity (caching decoded characters, multi-byte handling, etc). This design introduces overhead, as each function call needs to dispatch on the encoding, which is unlikely to be known statically. I don't know if this matters or not (yet another thing that needs to be measured). -- Johan ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Quoth John Millikin jmilli...@gmail.com, Ruby, which has an enormous Japanese userbase, solved the problem by essentially defining Text = (Encoding, ByteString), and then re-implementing text logic for each encoding. This allows very efficient operation with every possible encoding, at the cost of increased complexity (caching decoded characters, multi-byte handling, etc). Ruby actually comes from the CJK world in a way, doesn't it? Even if efficient per-encoding manipulation is a tough nut to crack, it at least avoids the fixed cost of bulk decoding, so an application designer doesn't need to think about the pay-off for a correct text approach vs. `binary'/ASCII, and the language/library designer doesn't need to think about whether genome data is a representative case etc. If Haskell had the development resources to make something like this work, would it actually take the form of a Haskell-level type like that - data Text = (Encoding, ByteString)? I mean, I know that's just a very clear and convenient way to express it for the purposes of the present discussion, and actual design is a little premature - ... but, I think you could argue that from the Haskell level, `Text' should be a single type, if the encoding differences aren't semantically interesting. Donn Cave, d...@avvanta.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
On Tue, Aug 17, 2010 at 9:30 PM, Donn Cave d...@avvanta.com wrote: Quoth John Millikin jmilli...@gmail.com, Ruby, which has an enormous Japanese userbase, solved the problem by essentially defining Text = (Encoding, ByteString), and then re-implementing text logic for each encoding. This allows very efficient operation with every possible encoding, at the cost of increased complexity (caching decoded characters, multi-byte handling, etc). Ruby actually comes from the CJK world in a way, doesn't it? Even if efficient per-encoding manipulation is a tough nut to crack, it at least avoids the fixed cost of bulk decoding, so an application designer doesn't need to think about the pay-off for a correct text approach vs. `binary'/ASCII, and the language/library designer doesn't need to think about whether genome data is a representative case etc. Remember that the cost of decoding is O(n) no matter what encoding is used internally as you always have to validate when going from ByteString to Text. If the external and internal encoding don't match then you also have to copy the bytes into a new buffer, but that is only one allocation (a pointer increment with a semi-space collector) and the copy is cheap since the data is in cache. -- Johan ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Hi michael, here is a web site http://zh.wikipedia.org/zh-cn/. It is the wikipedia for Chinese. -Andrew On Tue, Aug 17, 2010 at 7:00 PM, Michael Snoyman mich...@snoyman.com wrote: On Tue, Aug 17, 2010 at 1:50 PM, Yitzchak Gale g...@sefer.org wrote: Ketil Malde wrote: I haven't benchmarked it, but I'm fairly sure that, if you try to fit a 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of RAM, UTF-16 will be slower than UTF-8... I don't think the genome is typical text. And I doubt that is true if that text is in a CJK language. I think that *IF* we are aiming for a single, grand, unified text library to Rule Them All, it needs to use UTF-8. Given the growth rate of China's economy, if CJK isn't already the majority of text being processed in the world, it will be soon. I have seen media reports claiming CJK is now a majority of text data going over the wire on the web, though I haven't seen anything scientific backing up those claims. It certainly seems reasonable. I believe Google's measurements based on their own web index showing wide adoption of UTF-8 are very badly skewed due to a strong Western bias. In that case, if we have to pick one encoding for Data.Text, UTF-16 is likely to be a better choice than UTF-8, especially if the cost is fairly low even for the special case of Western languages. Also, UTF-16 has become by far the dominant internal text format for most software and for most user platforms. Except on desktop Linux - and whether we like it or not, Linux desktops will remain a tiny minority for the foreseeable future. I think you are conflating two points here, and ignoring some important data. Regarding the data: you haven't actually quoted any statistics about the prevalence of CJK data, but even if the majority of web pages served are in those three languages, a fairly high percentage of the content will *still* be ASCII, due simply to the HTML, CSS and Javascript overhead. 
I'd hate to make up statistics on the spot, especially when I don't have any numbers from you to compare them with. As far as the conflation, there are two questions with regard to the encoding choice: encoding/decoding time and space usage. I don't think *anyone* is asserting that UTF-16 is a common encoding for files anywhere, so by using UTF-16 we are simply incurring an overhead in every case. We can't consider a CJK encoding for text, so its prevalence is irrelevant to this topic. What *is* relevant is that a very large percentage of web pages *are*, in fact, standardizing on UTF-8, and that all 7-bit text files are by default UTF-8. As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16. The question still remains whether the overall document size will be larger: I'd be interested in taking a random sampling of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I think simply talking about this in the vacuum of data is pointless. If anyone can recommend a CJK website which would be considered representative (or a few), I'll do the test myself. Michael ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
On Tue, Aug 17, 2010 at 03:21:32PM +0200, Daniel Peebles wrote: Sounds to me like we need a lazy Data.Text variation that allows UTF-8 and UTF-16 segments in its list of strict text elements :) Then big chunks of western text will be encoded efficiently, and same with CJK! Not sure what to do about strict Data.Text though :) If space is really a concern, there should be a variant that uses LZO or some other fast compression algorithm that allows concatenation as the back end. ranty thing to follow That said, there is never a reason to use UTF-16; it is a vestigial remnant from the brief period when it was thought 16 bits would be enough for the Unicode standard, and any defense of it nowadays is after-the-fact justification for having accidentally standardized on it back in the day. When people chose to use the 16-bit representation, it was because they wanted a one-to-one mapping between codepoints and units of computation, which has many advantages. However, this is no longer true: if the one-to-one mapping is important, then nowadays you use UCS-4; otherwise, you use UTF-8. If space is very important, then you work with compressed text. In practice a mix of the two is fairly ideal. John -- John Meacham - ⑆repetae.net⑆john⑈ - http://notanumber.net/ ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
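John's point about UTF-16 losing the one-to-one mapping between code points and units of computation is easy to demonstrate with the text package's encoding functions (the character choices below are arbitrary):

```haskell
-- Outside the BMP, UTF-16 needs surrogate pairs, so a code point no
-- longer corresponds to a single 16-bit unit; UTF-32 (UCS-4) restores
-- the fixed-width mapping.
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

main :: IO ()
main = do
  let bmp    = T.singleton '\x3042'   -- U+3042 HIRAGANA LETTER A (in the BMP)
      astral = T.singleton '\x1D11E'  -- U+1D11E MUSICAL SYMBOL G CLEF (outside it)
  print (B.length (TE.encodeUtf16LE bmp))     -- 2 bytes: one 16-bit unit
  print (B.length (TE.encodeUtf16LE astral))  -- 4 bytes: a surrogate pair
  print (B.length (TE.encodeUtf32LE astral))  -- 4 bytes: always exactly one unit
```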
Re: [Haskell-cafe] Re: String vs ByteString
Bulat Ziganshin wrote: Johan wrote: So it's not clear to me that using UTF-16 makes the program noticeably slower or use more memory on a real program. it's clear misunderstanding. of course, not every program holds much text data in memory. but some does, and here you will double memory usage I write programs that hold onto quite a good deal of natural language text; a few million words at least. Getting efficient Unicode for that is a high priority. However, all of that text is in Japanese, Chinese, Arabic, Hindi, Urdu,... That's the reason I want Unicode. I'm pretty sure UTF-16 isn't going to be causing any special problems here. For NLP work, any language with a vaguely ASCII format isn't a problem. We've been shoving English and western European languages into a subset of ASCII for years (heck, we don't even allow real parentheses!). For the mostly English files on my harddrive, UTF-8 is a clear win. But when it comes to programming, I'm not so sure. I'd like to see some good benchmarks and a clear explanation of where the costs are. Relying on intuitions is notoriously bad for these kinds of encoding issues. -- Live well, ~wren ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
On Tue, Aug 17, 2010 at 12:30, Donn Cave d...@avvanta.com wrote: If Haskell had the development resources to make something like this work, would it actually take the form of a Haskell-level type like that - data Text = (Encoding, ByteString)? I mean, I know that's just a very clear and convenient way to express it for the purposes of the present discussion, and actual design is a little premature - ... but, I think you could argue that from the Haskell level, `Text' should be a single type, if the encoding differences aren't semantically interesting. It should be possible to create a Ruby-style Text in Haskell, using the existing Text API. The constructor would be something like data Text = Text !Encoding !ByteString , but there's no need to export it. The only significant improvements, performance-wise, would be that 1) encoding text to its internal encoding would be O(1) and 2) decoding text would only have to perform validation, instead of validation+copy+stream fusion muck. Downside: lazy decoding makes it very difficult to reason about failures, since even simple operations like 'append' might fail if you try to append two texts with mutually-incompatible characters. In any case, I suspect getting Haskell itself to support non-Unicode characters is much more difficult than writing an appropriate Text type. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
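A minimal sketch of the constructor John describes. Every name here is hypothetical — this is not the real text package's internals, just an illustration of the "tag the bytes with their encoding" idea:

```haskell
-- Hypothetical Ruby-style text type: the payload stays in its source
-- encoding, tagged with what that encoding is.
import qualified Data.ByteString as B

data Encoding = UTF8 | UTF16LE | ShiftJIS
  deriving (Eq, Show)

data Text = Text !Encoding !B.ByteString

-- O(1) when the requested encoding matches the stored one: just hand
-- back the bytes, with no copy and no re-validation.  A full
-- implementation would convert in the mismatch case instead of giving up.
encode :: Encoding -> Text -> Maybe B.ByteString
encode enc (Text enc' bytes)
  | enc == enc' = Just bytes
  | otherwise   = Nothing

main :: IO ()
main = do
  let t = Text UTF8 (B.pack [104, 105])  -- "hi" stored as UTF-8 bytes
  print (encode UTF8 t)     -- the bytes pass straight through
  print (encode UTF16LE t)  -- Nothing: would need a real conversion
```

As the message notes, the hard part is not this type but the failure semantics: with lazy decoding, even `append` can fail late.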
Re: [Haskell-cafe] Re: String vs ByteString
Michael Snoyman wrote: On Tue, Aug 17, 2010 at 2:20 PM, Ivan Lazar Miljenovic ivan.miljeno...@gmail.com wrote: Michael Snoyman mich...@snoyman.com writes: I don't think *anyone* is asserting that UTF-16 is a common encoding for files anywhere, *ahem* http://en.wikipedia.org/wiki/UTF-16/UCS-2#Use_in_major_operating_systems_and_environments I was talking about the contents of the files, not the file names or how the system calls work. I know at least on Windows, Linux and FreeBSD, if you open up the default text editor, type in a few letters and hit save, the file will not be in UTF-16. OSX, TextEdit, plain text mode is UTF-16 and cannot be altered. Also, if you load a UTF-8 plain text file in TextEdit it will be garbled because it assumes UTF-16. For html files you can choose the encoding, which defaults to UTF-8. But for plain text, it's always UTF-16. OSX is also fond of UTF-16 in Cocoa... -- Live well, ~wren ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
John Millikin wrote: The reason many Japanese and Chinese users reject UTF-8 isn't due to space constraints (UTF-8 and UTF-16 are roughly equal), it's because they reject Unicode itself. +1. This is the thing Unicode advocates don't want to admit. Until Unicode has code points for _all_ Chinese and Japanese characters, there will be active resistance to adoption. -- Live well, ~wren ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Johan Tibell wrote: To my knowledge the data we have about prevalence of encoding on the web is accurate. We crawl all pages we can get our hands on, by starting at some set of seeds and then following all the links. You cannot be sure that you've reached all web sites as there might be cliques in the web graph but we try our best to get them all. You're unlikely to get a better estimate anywhere else. I doubt many organizations have the machinery required to crawl most of the web. There was a study recently on this. They found that there are four main parts of the Internet: * a densely connected core, where from any site you can get to any other * an in cone, from which you can reach the core (but not other in-cone members, since then you'd both be in the core) * an out cone, which can be reached from the core (but which cannot reach each other) * and, unconnected islands The surprising part is they found that all four parts are approximately the same size. I forget the exact numbers, but they're all 25+/-5%. This implies that an exhaustive crawl of the web would require having about 50% of all websites as seeds (the in-cone plus the islands). If we're only interested in a representative sample, then we could get by with fewer. However, that depends a lot on the definition of representative. And we can't have an accurate definition of representative without doing the entire crawl at some point in order to discover the appropriate distributions. Then again, distributions change over time... Thus, I would guess that Google only has 50~75% of the net: the core, the out-cone, and a fraction of the islands and in-cone. -- Live well, ~wren ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
On 18 August 2010 12:12, wren ng thornton w...@freegeek.org wrote: Johan Tibell wrote: To my knowledge the data we have about prevalence of encoding on the web is accurate. We crawl all pages we can get our hands on, by starting at some set of seeds and then following all the links. You cannot be sure that you've reached all web sites as there might be cliques in the web graph but we try our best to get them all. You're unlikely to get a better estimate anywhere else. I doubt few organizations have the machinery required to crawl most of the web. There was a study recently on this. They found that there are four main parts of the Internet: * a densely connected core, where from any site you can get to any other * an in cone, from which you can reach the core (but not other in-cone members, since then you'd both be in the core) * an out cone, which can be reached from the core (but which cannot reach each other) * and, unconnected islands I'm guessing here that you're referring to what I've heard called the hidden web: databases, etc. that require sign-ins, etc. (as stuff that isn't in the core, to differing degrees: some of these databases are indexed by google but you can't actually read them without an account, etc.) ? -- Ivan Lazar Miljenovic ivan.miljeno...@gmail.com IvanMiljenovic.wordpress.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
On Aug 17, 2010, at 11:51 PM, Ketil Malde wrote: Yitzchak Gale g...@sefer.org writes: I don't think the genome is typical text. I think the typical *large* collection of text is text-encoded data, and not, for lack of a better word, literature. Genomics data is just an example. I have a collection of 100,000 patents I'm working with. 5.5GB of XML, most of it (US-)English text. After stripping out the XML markup, it's 4GB of text. It's a random sample from some 14 million patents I could have access to, but 100,000 was more than enough. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Ivan Lazar Miljenovic wrote: On 18 August 2010 12:12, wren ng thornton w...@freegeek.org wrote: Johan Tibell wrote: To my knowledge the data we have about prevalence of encoding on the web is accurate. We crawl all pages we can get our hands on, by starting at some set of seeds and then following all the links. You cannot be sure that you've reached all web sites as there might be cliques in the web graph but we try our best to get them all. You're unlikely to get a better estimate anywhere else. I doubt few organizations have the machinery required to crawl most of the web. There was a study recently on this. They found that there are four main parts of the Internet: * a densely connected core, where from any site you can get to any other * an in cone, from which you can reach the core (but not other in-cone members, since then you'd both be in the core) * an out cone, which can be reached from the core (but which cannot reach each other) * and, unconnected islands I'm guessing here that you're referring to what I've heard called the hidden web: databases, etc. that require sign-ins, etc. (as stuff that isn't in the core, to differing degrees: some of these databases are indexed by google but you can't actually read them without an account, etc.) ? Not so far as I recall. I'd have to find a copy of the paper to be sure though. Because the metric used was graph connectivity, if those hidden pages have links out into non-hidden pages (e.g., the login page), then they'd be counted in the same way as the non-hidden pages reachable from them. -- Live well, ~wren ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Well, I'm not certain if it counts as a typical Chinese website, but here are the stats; UTF8: 64,198 UTF16: 113,160 And just for fun, after gziping: UTF8: 17,708 UTF16: 19,367 On Wed, Aug 18, 2010 at 2:59 AM, anderson leo fireman...@gmail.com wrote: Hi michael, here is a web site http://zh.wikipedia.org/zh-cn/. It is the wikipedia for Chinese. -Andrew On Tue, Aug 17, 2010 at 7:00 PM, Michael Snoyman mich...@snoyman.comwrote: On Tue, Aug 17, 2010 at 1:50 PM, Yitzchak Gale g...@sefer.org wrote: Ketil Malde wrote: I haven't benchmarked it, but I'm fairly sure that, if you try to fit a 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of RAM, UTF-16 will be slower than UTF-8... I don't think the genome is typical text. And I doubt that is true if that text is in a CJK language. I think that *IF* we are aiming for a single, grand, unified text library to Rule Them All, it needs to use UTF-8. Given the growth rate of China's economy, if CJK isn't already the majority of text being processed in the world, it will be soon. I have seen media reports claiming CJK is now a majority of text data going over the wire on the web, though I haven't seen anything scientific backing up those claims. It certainly seems reasonable. I believe Google's measurements based on their own web index showing wide adoption of UTF-8 are very badly skewed due to a strong Western bias. In that case, if we have to pick one encoding for Data.Text, UTF-16 is likely to be a better choice than UTF-8, especially if the cost is fairly low even for the special case of Western languages. Also, UTF-16 has become by far the dominant internal text format for most software and for most user platforms. Except on desktop Linux - and whether we like it or not, Linux desktops will remain a tiny minority for the foreseeable future. I think you are conflating two points here, and ignoring some important data. 
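For reference, the kind of comparison Michael ran can be reproduced in a few lines with the text package's encoders. The sample strings below are made up, not taken from the measured page:

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Compare UTF-8 vs UTF-16 sizes of the same text, as in the
-- zh.wikipedia.org measurement above.
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

sizes :: T.Text -> (Int, Int)
sizes t = (B.length (TE.encodeUtf8 t), B.length (TE.encodeUtf16LE t))

main :: IO ()
main = do
  print (sizes "hello, world")  -- ASCII: UTF-8 wins, (12, 24)
  print (sizes "维基百科")       -- Han text: UTF-16 wins, (12, 8)
```

The gzip numbers above suggest that after compression the encoding choice matters much less, which fits John Meacham's "work with compressed text" point earlier in the thread.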
Regarding the data: you haven't actually quoted any statistics about the prevalence of CJK data, but even if the majority of web pages served are in those three languages, a fairly high percentage of the content will *still* be ASCII, due simply to the HTML, CSS and Javascript overhead. I'd hate to make up statistics on the spot, especially when I don't have any numbers from you to compare them with. As far as the conflation, there are two questions with regard to the encoding choice: encoding/decoding time and space usage. I don't think *anyone* is asserting that UTF-16 is a common encoding for files anywhere, so by using UTF-16 we are simply incurring an overhead in every case. We can't consider a CJK encoding for text, so its prevalence is irrelevant to this topic. What *is* relevant is that a very large percentage of web pages *are*, in fact, standardizing on UTF-8, and that all 7-bit text files are by default UTF-8. As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16. The question still remains whether the overall document size will be larger: I'd be interested in taking a random sampling of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I think simply talking about this in the vacuum of data is pointless. If anyone can recommend a CJK website which would be considered representative (or a few), I'll do the test myself. Michael ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
John Millikin wrote: The reason many Japanese and Chinese users reject UTF-8 isn't due to space constraints (UTF-8 and UTF-16 are roughly equal), it's because they reject Unicode itself. +1. This is the thing Unicode advocates don't want to admit. Until Unicode has code points for _all_ Chinese and Japanese characters, there will be active resistance to adoption. -- Live well, ~wren For mainland chinese websites: Most that became popular during web 1.0 (5-10 years ago) are using utf-8 incompatible format, e.g. gb2312. for example: * www.sina.com.cn * www.sohu.com They didn't switch to utf-8 probably just because they never have to. However, many of the popular websites started during web 2.0 are adopting utf-8 for example: * renren.com (chinese largest facebook clone) * www.kaixin001.com (chinese second largest facebook clone) * t.sina.com.cn (an example of twitter clone) These websites adopted utf-8 because (I think) most web development tools have already standardized on utf-8, and there's little reason change it. I'm not aware of any (at least common) chinese characters that can be represented by gb2312 but not in unicode. Since the range of gb2312 is a subset of the range of gbk, which is a subset of the range of gb18030. And gb18030 is just another encoding of unicode. ref: * http://en.wikipedia.org/wiki/GB_18030 -- jinjing ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Hi Bulat, On Monday 16 August 2010 07:35:44, Bulat Ziganshin wrote: Hello Daniel, Sunday, August 15, 2010, 10:39:24 PM, you wrote: That's great. If that performance difference is a show stopper, one shouldn't go higher-level than C anyway :) *all* speed measurements that find Haskell is as fast as C, was broken. That's a pretty bold claim, considering that you probably don't know all such measurements ;) But let's get serious. Bryan posted measurements showing the text (HEAD) package's performance within a reasonable factor of wc's. (Okay, he didn't give a complete description of his test, so we can only assume that all participants did the same job. I'm bold enough to assume that.) Lazy text being 7% slower than wc, strict 30%. If you are claiming that his test was flawed (and since the numbers clearly showed Haskell slower than C, just not much, I suspect you do, otherwise I don't see the point of your post), could you please elaborate why you think it's flawed? Let's see: D:\testing>read MsOffice.arc MsOffice.arc 317mb -- Done Time 0.407021 seconds (timer accuracy 0.00 seconds) Speed 779.505632 mbytes/sec I see nothing here, not knowing what `read' is. None of read (n), read (2), read (1p), read(3p) makes sense here, so it must be something else. Since it outputs a size in bytes, I doubt that it actually counts characters, like wc -m and, presumably, the text programmes Bryan benchmarked. Just counting bytes, wc and Data.ByteString[.Lazy] can do much faster than counting characters too. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
On Sat, Aug 14, 2010 at 10:46 PM, Michael Snoyman mich...@snoyman.comwrote: When I'm writing a web app, my code is sitting on a Linux system where the default encoding is UTF-8, communicating with a database speaking UTF-8, receiving request bodies in UTF-8 and sending response bodies in UTF-8. So converting all of that data to UTF-16, just to be converted right back to UTF-8, does seem strange for that purpose. Bear in mind that much of the data you're working with can't be readily trusted. UTF-8 coming from the filesystem, the network, and often the database may not be valid. The cost of validating it isn't all that different from the cost of converting it to UTF-16. And of course the internals of Data.Text are all fusion-based, so much of the time you're not going to be allocating UTF-16 arrays at all, but instead creating a pipeline of characters that are manipulated in a tight loop. This eliminates a lot of the additional copying that bytestring has to do, for instance. To give you an idea of how competitive Data.Text can be compared to C code, this is the system's wc command counting UTF-8 characters in a modestly large file: $ time wc -m huge.txt 32443330 real 0.728s This is Data.Text performing the same task: $ time ./FileRead text huge.txt 32443330 real 0.697s ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
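The source of the FileRead benchmark wasn't posted, so the details here are assumptions, but a program in the same spirit — count the characters in a file with lazy Data.Text, like wc -m — can be written in a few lines:

```haskell
-- A guess at a FileRead-style benchmark: decode a file as text and
-- count its characters (name and argument handling are assumptions).
import System.Environment (getArgs)
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TLIO

main :: IO ()
main = do
  [path] <- getArgs
  t <- TLIO.readFile path  -- reads and decodes the file lazily
  print (TL.length t)      -- number of Chars, comparable to wc -m
```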
Re: [Haskell-cafe] Re: String vs ByteString
Quoth John Millikin jmilli...@gmail.com, I don't see why [Char] is obvious -- you'd never use [Word8] for storing binary data, right? [Char] is popular because it's the default type for string literals, and due to simple inertia, but when there's a type based on packed arrays there's no reason to use the list representation. Well, yes, string literals - and pattern matching support, maybe that's the same thing. And I think it's fair to say that [Char] is a natural, elegant match for the language, I mean it leverages your basic Haskell skills if for example you want to parse something fairly simple. So even if ByteString weren't the monumental hassle it is today for simple stuff, String would have at least a little appeal. And if packed arrays really always mattered, [Char] would be long gone. They don't, you can do a lot of stuff with [Char] before it turns into a problem. Also, despite the name, ByteString and Text are for separate purposes. ByteString is an efficient [Word8], Text is an efficient [Char] -- use ByteString for binary data, and Text for...text. Most mature languages have both types, though the choice of UTF-16 for Text is unusual. Maybe most mature languages have one or more extra string types hacked on to support wide characters. I don't think it's necessarily a virtue. ByteString vs. ByteString.Char8, where you can choose more or less indiscriminately to treat the data as Char or Word8, seems to me like a more useful way to approach the problem. (Of course, ByteString.Char8 isn't a good way to deal with wide characters correctly, I'm just saying that's where I'd like to find the answer, not in some internal character encoding into which all text data must be converted.) Donn Cave, d...@avvanta.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
On Sat, Aug 14, 2010 at 10:07 PM, Donn Cave d...@avvanta.com wrote: Am I confused about this? It's why I can't see Text ever being simply the obvious choice. [Char] will continue to be the obvious choice if you want a functional data type that supports pattern matching etc. Actually, with view patterns, Text is pretty nice to pattern match against: foo (uncons -> Just (c,cs)) = whee despam (prefixed "spam" -> Just suffix) = whee `mappend` suffix ByteString will continue to be the obvious choice for big data loads. Don't confuse I have big data with I need bytes. If you are working with bytes, use bytestring. If you are working with text, outside of a few narrow domains you should use text. We'll have a three way choice between programming elegance, correctness and efficiency. If Haskell were more than just a research language, this might be its most prominent open sore, don't you think? No, that's just FUD. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
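Bryan's fragments, spelled out as a compilable sketch using Data.Text's uncons and stripPrefix (the function names and bodies here are made up for illustration):

```haskell
{-# LANGUAGE ViewPatterns, OverloadedStrings #-}
-- Pattern matching on Text via view patterns, in the style sketched above.
import Data.Char (toUpper)
import qualified Data.Text as T

-- Match head and tail, like (c:cs) on [Char]:
shout :: T.Text -> T.Text
shout (T.uncons -> Just (c, cs)) = T.cons (toUpper c) cs
shout t                          = t

-- Match and strip a literal prefix:
despam :: T.Text -> T.Text
despam (T.stripPrefix "spam " -> Just suffix) = suffix
despam t                                      = t

main :: IO ()
main = do
  print (shout "hello")       -- "Hello"
  print (despam "spam eggs")  -- "eggs"
```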
Re: [Haskell-cafe] Re: String vs ByteString
Bryan == Bryan O'Sullivan b...@serpentine.com writes: Bryan On Sat, Aug 14, 2010 at 10:46 PM, Michael Snoyman mich...@snoyman.com wrote: Bryan When I'm writing a web app, my code is sitting on a Linux Bryan system where the default encoding is UTF-8, communicating Bryan with a database speaking UTF-8, receiving request bodies in Bryan UTF-8 and sending response bodies in UTF-8. So converting all Bryan of that data to UTF-16, just to be converted right back to Bryan UTF-8, does seem strange for that purpose. Bryan Bear in mind that much of the data you're working with can't Bryan be readily trusted. UTF-8 coming from the filesystem, the Bryan network, and often the database may not be valid. The cost of Bryan validating it isn't all that different from the cost of Bryan converting it to UTF-16. But UTF-16 (apart from being an abomination for creating a hole in the codepoint space and making it impossible to ever extend it) is slow to process compared with UTF-32 - you can't get the nth character in constant time, so it seems an odd choice to me. -- Colin Adams Preston Lancashire () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org - against proprietary attachments ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Hi Colin, On Sun, Aug 15, 2010 at 9:34 AM, Colin Paul Adams co...@colina.demon.co.ukwrote: But UTF-16 (apart from being an abomination for creating a hole in the codepoint space and making it impossible to ever etxend it) is slow to process compared with UTF-32 - you can't get the nth character in constant time, so it seems an odd choice to me. Aside: Getting the nth character isn't very useful when working with Unicode text: * Most text processing is linear. * What we consider a character and what Unicode considers a character differs a bit e.g. since Unicode uses combining characters. Cheers, Johan ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
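Johan's combining-character point, as a small example with the text package: the "same" accented letter can be one code point or two, so even constant-time indexing into code points doesn't give you the nth character a user perceives.

```haskell
-- One user-perceived character, two different code point sequences.
import qualified Data.Text as T

main :: IO ()
main = do
  let precomposed = T.pack "\x00E9"   -- é as a single code point, U+00E9
      combining   = T.pack "e\x0301"  -- e followed by U+0301 COMBINING ACUTE ACCENT
  print (T.length precomposed)  -- 1
  print (T.length combining)    -- 2, though it renders as a single é
```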
Re: [Haskell-cafe] Re: String vs ByteString
Don Stewart d...@galois.com writes: * Pay attention to Haskell Cafe announcements * Follow the Reddit Haskell news. * Read the quarterly reports on Hackage * Follow Planet Haskell And yet there are still many packages that fall under the radar with no announcements of any kind on initial release or even new versions :( -- Ivan Lazar Miljenovic ivan.miljeno...@gmail.com IvanMiljenovic.wordpress.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
2010/8/15 Ivan Lazar Miljenovic ivan.miljeno...@gmail.com: Don Stewart d...@galois.com writes: * Pay attention to Haskell Cafe announcements * Follow the Reddit Haskell news. * Read the quarterly reports on Hackage * Follow Planet Haskell And yet there are still many packages that fall under the radar with no announcements of any kind on initial release or even new versions :( If you're interested in a comprehensive update list, you can follow Hackage on Twitter, or the news feed. Cheers, Thu ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Don Stewart wrote: So, to stay up to date, but without drowning in data. Do one of: * Pay attention to Haskell Cafe announcements * Follow the Reddit Haskell news. * Read the quarterly reports on Hackage * Follow Planet Haskell Interesting. Obviously I look at Haskell Cafe from time to time (although there's usually far too much traffic to follow it all). I wasn't aware of *any* of the other resources listed. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Vo Minh Thu not...@gmail.com writes: 2010/8/15 Ivan Lazar Miljenovic ivan.miljeno...@gmail.com: Don Stewart d...@galois.com writes: * Pay attention to Haskell Cafe announcements * Follow the Reddit Haskell news. * Read the quarterly reports on Hackage * Follow Planet Haskell And yet there are still many packages that fall under the radar with no announcements of any kind on initial release or even new versions :( If you're interested in a comprehensive update list, you can follow Hackage on Twitter, or the news feed. Except that that doesn't tell you: * The purpose of the library * How a release differs from a previous one * Why you should use it, etc. Furthermore, several interesting discussions have arisen out of announcement emails. -- Ivan Lazar Miljenovic ivan.miljeno...@gmail.com IvanMiljenovic.wordpress.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
2010/8/15 Ivan Lazar Miljenovic ivan.miljeno...@gmail.com: Vo Minh Thu not...@gmail.com writes: 2010/8/15 Ivan Lazar Miljenovic ivan.miljeno...@gmail.com: Don Stewart d...@galois.com writes: * Pay attention to Haskell Cafe announcements * Follow the Reddit Haskell news. * Read the quarterly reports on Hackage * Follow Planet Haskell And yet there are still many packages that fall under the radar with no announcements of any kind on initial release or even new versions :( If you're interested in a comprehensive update list, you can follow Hackage on Twitter, or the news feed. Except that that doesn't tell you: * The purpose of the library * How a release differs from a previous one * Why you should use it, etc. Furthermore, several interesting discussions have arisen out of announcement emails. Sure, nor does it write a book chapter about some practical usage. I mean (tongue in cheek) that neither the other resources nor even a proper announcement provide all that. I still remember the UHC announcement (a (nearly) complete Haskell 98 compiler) thread where most of it was about the lack of support for n+k patterns. But the bullet list above was to point Andrew to a few places where he could have learned about Text. Cheers, Thu ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
On 8/15/10 03:01 , Bryan O'Sullivan wrote: On Sat, Aug 14, 2010 at 10:07 PM, Donn Cave d...@avvanta.com mailto:d...@avvanta.com wrote: We'll have a three way choice between programming elegance, correctness and efficiency. If Haskell were more than just a research language, this might be its most prominent open sore, don't you think? No, that's just FUD. More to the point, there's nothing elegant about [Char] --- its sole advantage is requiring no thought. -- brandon s. allbery [linux,solaris,freebsd,perl] allb...@kf8nh.com system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
No, not really. Linked lists are very easy to deal with recursively and Strings automatically work with any already-defined list functions. On Sun, Aug 15, 2010 at 11:17 AM, Brandon S Allbery KF8NH allb...@ece.cmu.edu wrote: More to the point, there's nothing elegant about [Char] --- its sole advantage is requiring no thought. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Quoth Bryan O'Sullivan b...@serpentine.com, On Sat, Aug 14, 2010 at 10:07 PM, Donn Cave d...@avvanta.com wrote: ... ByteString will continue to be the obvious choice for big data loads. Don't confuse I have big data with I need bytes. If you are working with bytes, use bytestring. If you are working with text, outside of a few narrow domains you should use text. I wonder how many ByteString users are `working with bytes', in the sense you apparently mean where the bytes are not text characters. My impression is that in practice, there is a sizeable contingent out here using ByteString.Char8 and relatively few applications for the Word8 type. Some of it should no doubt move to Text, but the ability to work with native packed data - minimal processing and space requirements, interoperability with foreign code, mmap, etc. - is attractive enough that the choice can be less than obvious. Donn Cave, d...@avvanta.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
Quoth Bill Atkins watk...@alum.rpi.edu, No, not really. Linked lists are very easy to deal with recursively and Strings automatically work with any already-defined list functions. Yes, they're great - a terrible mistake, for a practical programming language, but if you fail to recognize the attraction, you miss some of the historical lesson on emphasizing elegance and correctness over practical performance. Donn Cave, d...@avvanta.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: String vs ByteString
On Sun, Aug 15, 2010 at 12:50 PM, Donn Cave d...@avvanta.com wrote: I wonder how many ByteString users are `working with bytes', in the sense you apparently mean where the bytes are not text characters. My impression is that in practice, there is a sizeable contingent out here using ByteString.Char8 and relatively few applications for the Word8 type. Some of it should no doubt move to Text, but the ability to work with native packed data - minimal processing and space requirements, interoperability with foreign code, mmap, etc. - is attractive enough that the choice can be less than obvious. Using ByteString.Char8 doesn't mean your data isn't a stream of bytes, it means that it is a stream of bytes but for convenience you prefer using Char8 functions. For example, a DNA sequence (AATCGATACATG...) is a stream of bytes, but it is better to write 'A' than 65. But yes, many users of ByteStrings should be using Text. =) Cheers! -- Felipe.
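Felipe's DNA example is easy to sketch with the bytestring package alone; `count` and `length` here are real Data.ByteString.Char8 functions:

```haskell
import qualified Data.ByteString.Char8 as B

-- The sequence genuinely is a stream of bytes, but Char8 lets us
-- write 'A' instead of the Word8 value 65.
main :: IO ()
main = do
  let dna = B.pack "AATCGATACATG"
  print (B.count 'A' dna)  -- adenine bases: 5
  print (B.length dna)     -- 12
```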
Re: [Haskell-cafe] Re: String vs ByteString
On 8/15/10 11:25, Bill Atkins wrote: No, not really. Linked lists are very easy to deal with recursively and Strings automatically work with any already-defined list functions. On Sun, Aug 15, 2010 at 11:17 AM, Brandon S Allbery KF8NH allb...@ece.cmu.edu wrote: More to the point, there's nothing elegant about [Char] --- its sole advantage is requiring no thought. Except that it seems to me that a number of functions in Data.List are really functions on Strings and not especially useful on generic lists. There is overlap but it's not as large as might be thought. -- brandon s. allbery [linux,solaris,freebsd,perl] allb...@kf8nh.com system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH
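Brandon's observation is easy to illustrate: some Prelude/Data.List functions, such as lines, words, unlines and unwords, only make sense for String, unlike genuinely generic functions such as map or foldr:

```haskell
-- lines and words operate specifically on String ([Char]),
-- not on arbitrary list element types.
main :: IO ()
main = do
  print (words "String vs ByteString")
  print (lines "one\ntwo")
```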
Re: [Haskell-cafe] Re: String vs ByteString
Donn Cave wrote: I wonder how many ByteString users are `working with bytes', in the sense you apparently mean where the bytes are not text characters. My impression is that in practice, there is a sizeable contingent out here using ByteString.Char8 and relatively few applications for the Word8 type. Some of it should no doubt move to Text, but the ability to work with native packed data - minimal processing and space requirements, interoperability with foreign code, mmap, etc. - is attractive enough that the choice can be less than obvious. I use ByteString for various binary-processing stuff. I also use it for string-processing, but that's mainly because I didn't know anything else existed. I'm sure lots of other people are using stuff like Data.Binary to serialise raw binary data using ByteString too.
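A minimal sketch of the Data.Binary pattern mentioned above, assuming the binary package is installed; encode produces a lazy ByteString and decode reads it back:

```haskell
import Data.Binary (decode, encode)
import qualified Data.ByteString.Lazy as BL

main :: IO ()
main = do
  let bs = encode (42 :: Int, "hello")  -- serialise to a lazy ByteString
  print (BL.length bs > 0)              -- some bytes were produced
  print (decode bs :: (Int, String))    -- round-trips to the original value
```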
Re: [Haskell-cafe] Re: String vs ByteString
Donn Cave wrote: Quoth Bill Atkins watk...@alum.rpi.edu, No, not really. Linked lists are very easy to deal with recursively and Strings automatically work with any already-defined list functions. Yes, they're great - a terrible mistake, for a practical programming language, but if you fail to recognize the attraction, you miss some of the historical lesson on emphasizing elegance and correctness over practical performance. And if you fail to recognise what a grave mistake placing performance before correctness is, you end up with things like buffer overflow exploits, SQL injection attacks, the Y2K bug, programs that can't handle files larger than 2GB or that don't understand Unicode, and so forth. All things that could have been almost trivially avoided if everybody wasn't so hung up on absolute performance at any cost. Sure, performance is a priority. But it should never be the top priority. ;-)
Re: [Haskell-cafe] Re: String vs ByteString
On 8/15/10 13:53, Andrew Coppin wrote: injection attacks, the Y2K bug, programs that can't handle files larger than 2GB or that don't understand Unicode, and so forth. All things that could have been almost trivially avoided if everybody wasn't so hung up on absolute performance at any cost. Now that's a bit unfair; nobody imagined back when lseek() was enshrined in the Unix API that it would still be in use when a (long) wasn't big enough :) (Remember that Unix is itself a practical example of a research platform avoiding success at any cost gone horribly wrong.)
Re: [Haskell-cafe] Re: String vs ByteString
On Sat, Aug 14, 2010 at 6:05 PM, Bryan O'Sullivan b...@serpentine.com wrote: - If it's not good enough, and the fault lies in a library you chose, report a bug and provide a test case. As a case in point, I took the string search benchmark that Daniel shared on Friday, and boiled it down to a simple test case: how long does it take to read a 31MB file?

GNU wc -m:
- en_US.UTF-8: 0.701s

text 0.7.1.0:
- lazy text: 1.959s
- strict text: 3.527s

darcs HEAD:
- lazy text: 0.749s
- strict text: 0.927s
Re: [Haskell-cafe] Re: String vs ByteString
Brandon S Allbery KF8NH wrote: (Remember that Unix is itself a practical example of a research platform avoiding success at any cost gone horribly wrong.) I haven't used Erlang myself, but I've heard it described in a similar way. (I don't know how true that actually is...)
Re: [Haskell-cafe] Re: String vs ByteString
On Sunday 15 August 2010 20:04:01, Bryan O'Sullivan wrote: On Sat, Aug 14, 2010 at 6:05 PM, Bryan O'Sullivan b...@serpentine.com wrote: - If it's not good enough, and the fault lies in a library you chose, report a bug and provide a test case. As a case in point, I took the string search benchmark that Daniel shared on Friday, and boiled it down to a simple test case: how long does it take to read a 31MB file?

GNU wc -m:
- en_US.UTF-8: 0.701s

text 0.7.1.0:
- lazy text: 1.959s
- strict text: 3.527s

darcs HEAD:
- lazy text: 0.749s
- strict text: 0.927s

That's great. If that performance difference is a show stopper, one shouldn't go higher-level than C anyway :) (doesn't mean one should stop thinking about further speed-up, though) Out of curiosity, what kind of speed-up did your Friday fix bring to the searching/replacing functions?
Re: [Haskell-cafe] Re: String vs ByteString
On 8/15/10 14:34, Andrew Coppin wrote: Brandon S Allbery KF8NH wrote: (Remember that Unix is itself a practical example of a research platform avoiding success at any cost gone horribly wrong.) I haven't used Erlang myself, but I've heard it described in a similar way. (I don't know how true that actually is...) Similar case, actually: internal research project with internal practical uses, then got discovered and productized by a different internal group.
Re: [Haskell-cafe] Re: String vs ByteString
On Sun, Aug 15, 2010 at 11:39 AM, Daniel Fischer daniel.is.fisc...@web.de wrote: Out of curiosity, what kind of speed-up did your Friday fix bring to the searching/replacing functions? Quite a bit!

text 0.7.1.0 and 0.7.2.1:
- 1.056s

darcs HEAD:
- 0.158s
Re: [Haskell-cafe] Re: String vs ByteString
Quoth Andrew Coppin andrewcop...@btinternet.com, ... And if you fail to recognise what a grave mistake placing performance before correctness is, you end up with things like buffer overflow exploits, SQL injection attacks, the Y2K bug, programs that can't handle files larger than 2GB or that don't understand Unicode, and so forth. All things that could have been almost trivially avoided if everybody wasn't so hung up on absolute performance at any cost. Sure, performance is a priority. But it should never be the top priority. ;-) You should never have to choose. Not to belabor the point, but to dismiss all that as the work of morons who weren't as wise as we are is the same mistake from the other side of the wall - performance counts. If you solve the problem by assigning a priority to one or the other, you aren't solving the problem. Donn Cave, d...@avvanta.com
Re: [Haskell-cafe] Re: String vs ByteString
On Sunday 15 August 2010 20:53:32, Bryan O'Sullivan wrote: On Sun, Aug 15, 2010 at 11:39 AM, Daniel Fischer daniel.is.fisc...@web.de wrote: Out of curiosity, what kind of speed-up did your Friday fix bring to the searching/replacing functions? Quite a bit!

text 0.7.1.0 and 0.7.2.1:
- 1.056s

darcs HEAD:
- 0.158s

Awesome :D
Re: [Haskell-cafe] Re: String vs ByteString
Ivan Lazar Miljenovic ivan.miljeno...@gmail.com writes: Don Stewart d...@galois.com writes:

* Pay attention to Haskell Cafe announcements
* Follow the Reddit Haskell news.
* Read the quarterly reports on Hackage
* Follow Planet Haskell

And yet there are still many packages that fall under the radar with no announcements of any kind on initial release or even new versions :( Subscribe to http://hackage.haskell.org/packages/archive/recent.rss in your RSS reader: problem solved! G -- Gregory Collins g...@gregorycollins.net
Re: [Haskell-cafe] Re: String vs ByteString
Bryan O'Sullivan wrote: As a case in point, I took the string search benchmark that Daniel shared on Friday, and boiled it down to a simple test case: how long does it take to read a 31MB file?

GNU wc -m:
- en_US.UTF-8: 0.701s

text 0.7.1.0:
- lazy text: 1.959s
- strict text: 3.527s

darcs HEAD:
- lazy text: 0.749s
- strict text: 0.927s

When should we expect to see the HEAD stamped and numbered? After some of the recent benchmark dueling re web frameworks, I know Text got a bad rap compared to ByteString. It'd be good to stop the FUD early. Repeating the above in the announcement should help a lot. -- Live well, ~wren
Re: [Haskell-cafe] Re: String vs ByteString
wren: Bryan O'Sullivan wrote: As a case in point, I took the string search benchmark that Daniel shared on Friday, and boiled it down to a simple test case: how long does it take to read a 31MB file?

GNU wc -m:
- en_US.UTF-8: 0.701s

text 0.7.1.0:
- lazy text: 1.959s
- strict text: 3.527s

darcs HEAD:
- lazy text: 0.749s
- strict text: 0.927s

When should we expect to see the HEAD stamped and numbered? After some of the recent benchmark dueling re web frameworks, I know Text got a bad rap compared to ByteString. It'd be good to stop the FUD early. Repeating the above in the announcement should help a lot. For what it's worth, for several bytestring announcements I published comprehensive function-by-function comparisons of performance on enormous data sets, until there was unambiguous evidence bytestring was faster than List. E.g. http://www.mail-archive.com/hask...@haskell.org/msg18596.html
Re: [Haskell-cafe] Re: String vs ByteString
Gregory Collins g...@gregorycollins.net writes: Ivan Lazar Miljenovic ivan.miljeno...@gmail.com writes: Don Stewart d...@galois.com writes:

* Pay attention to Haskell Cafe announcements
* Follow the Reddit Haskell news.
* Read the quarterly reports on Hackage
* Follow Planet Haskell

And yet there are still many packages that fall under the radar with no announcements of any kind on initial release or even new versions :( Subscribe to http://hackage.haskell.org/packages/archive/recent.rss in your RSS reader: problem solved! As I said in reply to someone else: that won't help you get the intent of a library, how it has changed from previous versions, etc. -- Ivan Lazar Miljenovic ivan.miljeno...@gmail.com IvanMiljenovic.wordpress.com
Re: [Haskell-cafe] Re: String vs ByteString
Johan Tibell wrote: On Fri, Aug 13, 2010 at 4:24 PM, Kevin Jardine kevinjard...@gmail.com wrote: One of the more puzzling aspects of Haskell for newbies is the large number of libraries that appear to provide similar/duplicate functionality. I agree. Here's a rule of thumb: If you have binary data, use Data.ByteString. If you have text, use Data.Text. Those libraries have benchmarks and have been well tuned by experienced Haskellers and should be the fastest and most memory compact in most cases. There are still a few cases where String beats Text but they are being worked on as we speak. Interesting. I've never even heard of Data.Text. When did that come into existence? More importantly: How does the average random Haskeller discover that a package has become available that might be relevant to their work?
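One reason the rule of thumb matters: Data.ByteString.Char8.pack silently truncates each Char to its low 8 bits, so anything outside Latin-1 is corrupted, whereas a real text type preserves it. A small sketch using only base and bytestring:

```haskell
import qualified Data.ByteString.Char8 as B
import Data.Char (ord)

main :: IO ()
main = do
  let snowman = "\x2603"  -- U+2603 SNOWMAN, well outside Latin-1
  print (map ord snowman)                      -- the original code point
  print (map ord (B.unpack (B.pack snowman)))  -- truncated to the low byte
```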
Re: [Haskell-cafe] Re: String vs ByteString
* Bryan O'Sullivan: If you know it's text and not binary data you are working with, you should still use Data.Text. There are a few good reasons. 1. The API is more correct. For instance, if you use Text.toUpper on a string containing latin1 ß (eszett, sharp S), you'll get the two-character sequence SS, which is correct. Using Char8.map Char.toUpper here gives the wrong answer. Data.Text is still incorrect for some scripts:

$ LANG=tr_TR.UTF-8 ghci
GHCi, version 6.12.1: http://www.haskell.org/ghc/ :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Prelude> import Data.Text
Prelude Data.Text> toUpper $ pack "i"
Loading package array-0.3.0.0 ... linking ... done.
Loading package containers-0.3.0.0 ... linking ... done.
Loading package deepseq-1.1.0.0 ... linking ... done.
Loading package bytestring-0.9.1.5 ... linking ... done.
Loading package text-0.7.2.1 ... linking ... done.
"I"
Prelude Data.Text>
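Both problems can be reproduced with base alone: Data.Char.toUpper is a Char-to-Char function, so it can neither expand ß into the two-character SS nor know that a Turkish locale wants the dotted capital İ:

```haskell
import Data.Char (toUpper)

-- A Char -> Char mapping has nowhere to put the second 'S', so ß
-- survives uppercasing unchanged; and with no locale parameter,
-- 'i' always becomes plain 'I'.
main :: IO ()
main = do
  putStrLn (map toUpper "straße")  -- not "STRASSE"
  putStrLn (map toUpper "i")       -- never the Turkish "İ"
```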
Re: [Haskell-cafe] Re: String vs ByteString
Andrew Coppin andrewcop...@btinternet.com writes: Interesting. I've never even heard of Data.Text. When did that come into existence? The first version hit Hackage in February last year... More importantly: How does the average random Haskeller discover that a package has become available that might be relevant to their work? Look on Hackage; subscribe to mailing lists (where package maintainers should really write announcement emails), etc. It's rather surprising you haven't heard of text: it was for benchmarking text that Bryan wrote criterion; there are emails on -cafe and blog posts that mention it on a semi-regular basis, etc. -- Ivan Lazar Miljenovic ivan.miljeno...@gmail.com IvanMiljenovic.wordpress.com
Re: [Haskell-cafe] Re: String vs ByteString
On Sat, Aug 14, 2010 at 12:15 PM, Florian Weimer f...@deneb.enyo.de wrote: * Bryan O'Sullivan: If you know it's text and not binary data you are working with, you should still use Data.Text. There are a few good reasons. 1. The API is more correct. For instance, if you use Text.toUpper on a string containing latin1 ß (eszett, sharp S), you'll get the two-character sequence SS, which is correct. Using Char8.map Char.toUpper here gives the wrong answer. Data.Text is still incorrect for some scripts:

$ LANG=tr_TR.UTF-8 ghci
GHCi, version 6.12.1: http://www.haskell.org/ghc/ :? for help
...
Prelude> import Data.Text
Prelude Data.Text> toUpper $ pack "i"
...
"I"

Yes. We need locale support for that one. I think Bryan is planning to add it. -- Johan