On Wed, Aug 18, 2010 at 6:24 PM, Johan Tibell <johan.tib...@gmail.com> wrote:
> Hi Michael,
>
> On Wed, Aug 18, 2010 at 4:04 PM, Michael Snoyman <mich...@snoyman.com> wrote:
>
>> Here's my response to the two points:
>>
>> * I haven't written a patch showing that Data.Text would be faster using
>> UTF-8 because that would require fulfilling the second point (I'll get to
>> it in a second). I *have* shown where there are huge performance
>> differences between text and ByteString/String. Unfortunately, the
>> response has been "don't use bytestring, it's the wrong datatype, text
>> will get fixed," which is quite underwhelming.
>
> I went through all the emails you sent with topic "String vs ByteString"
> and "Re: String vs ByteString" and I can't find a single benchmark. I do
> agree with you that
>
> * UTF-8 is more compact than UTF-16, and
> * UTF-8 is by far the most used encoding on the web,
>
> and that establishes a reasonable *theoretical* argument for why switching
> to UTF-8 might be faster.
>
> What I'm looking for is a program that shows a big difference so we can
> validate the hypothesis. As Duncan mentioned, we already ran some
> benchmarks early on that showed the opposite. Someone posted a benchmark
> earlier in this thread and Bryan addressed the issue raised by that
> poster. We want more of those.

Sorry, I thought I'd sent these out. While working on optimizing Hamlet I
started playing around with the BigTable benchmark. I wrote two blog posts
on the topic:

http://www.snoyman.com/blog/entry/bigtable-benchmarks/
http://www.snoyman.com/blog/entry/optimizing-hamlet/

Originally, Hamlet had been based on the text package; the huge slow-down
introduced by text convinced me to migrate to bytestrings, and ultimately to
blaze-html/blaze-builder. It could be that these were flaws in text that are
correctable and have nothing to do with UTF-16; however, it will be
difficult to produce a benchmark that isolates the UTF-8/UTF-16 divide.
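As an illustration of the compactness point (a quick sketch using the text
and bytestring packages, not one of the benchmarks discussed above): the
byte counts of the two encodings can be compared directly with
Data.Text.Encoding.

```haskell
import qualified Data.ByteString    as B
import qualified Data.Text          as T
import           Data.Text.Encoding (encodeUtf8, encodeUtf16LE)

main :: IO ()
main = do
  let ascii  = T.pack "<tr><td>BigTable row</td></tr>"  -- markup-heavy ASCII
      hebrew = T.pack "\x05e9\x05dc\x05d5\x05dd"        -- non-Latin text
  -- For ASCII-heavy data (e.g. HTML markup), UTF-8 uses half the bytes
  -- of UTF-16; for non-Latin scripts the gap narrows or reverses.
  print (B.length (encodeUtf8 ascii),  B.length (encodeUtf16LE ascii))
  print (B.length (encodeUtf8 hebrew), B.length (encodeUtf16LE hebrew))
```

This only demonstrates the space claim, of course; it says nothing by itself
about processing speed, which is what the requested benchmarks are for.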
Using UTF-16 bytestrings would probably overstate the impact, since it
wouldn't be using Bryan's fusion logic.

>> * Since the prevailing attitude has been such a disregard for any facts
>> shown thus far, it seems that the effort required to learn the internals
>> of the text package and attempt a patch would be wasted. In the
>> meanwhile, Jasper has released blaze-builder, which does an amazing job
>> at producing UTF-8 encoded data, which for the moment is my main need. As
>> much as I'll be chastised by the community, I'll stick with this approach
>> for the moment.
>
> I'm not sure this discussion has surfaced that many facts. What we do have
> is plenty of theories. I can easily add some more:
>
> * GHC is not doing a good job laying out the branches in the validation
> code that does arithmetic on the input byte sequence, to validate the
> input and compute the Unicode code point that should be streamed using
> fusion.
>
> * The differences between text's and bytestring's fusion frameworks get
> in the way of some optimizations in GHC (text uses a more sophisticated
> fusion framework that handles some cases bytestring can't, according to
> Bryan).
>
> * Lingering space leaks are hurting performance (Bryan plugged one
> already).
>
> * The use of a polymorphic loop state in the fusion framework gets in
> the way of unboxing.
>
> * Extraneous copying in the Handle implementation slows down I/O.
>
> All of these are plausible reasons why Text might perform worse than
> ByteString. We need to find out which ones are true by benchmarking and
> looking at the generated Core.
>
>> Now if you tell me that text would consider applying a UTF-8 patch, that
>> would be a different story. But I don't have the time to maintain a
>> separate UTF-8 version of text. For me, the whole point of this
>> discussion was to determine whether we should attempt porting to UTF-8,
>> which as I understand it would be a rather large undertaking.
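For concreteness, a minimal harness of the kind being asked for might look
like the following criterion sketch: the same BigTable-style workload
rendered through Text and through ByteString, so any gap can be measured
rather than theorized about. The workload here is a stand-in of my own, not
the actual Hamlet benchmark from the blog posts.

```haskell
import           Criterion.Main
import qualified Data.ByteString.Char8 as B
import qualified Data.Text             as T

-- One table row of the BigTable-style workload, as a String so that
-- both packages pay the same packing cost.
row :: Int -> String
row i = "<tr><td>" ++ show i ++ "</td></tr>"

main :: IO ()
main = defaultMain
  [ bench "text"       $ nf (T.length . T.concat . map (T.pack . row)) rows
  , bench "bytestring" $ nf (B.length . B.concat . map (B.pack . row)) rows
  ]
  where rows = [1 .. 1000] :: [Int]
```

Forcing through length keeps criterion's nf from measuring only the outer
constructor; whether this particular shape exercises the fusion paths listed
above is exactly the sort of thing reading the generated Core would settle.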
> I don't see any reason why Bryan wouldn't accept a UTF-8 patch if it was
> faster on some set of benchmarks (starting with the ones already in the
> library) that we agree on.

I think that's the main issue, and one that Duncan nailed on the head: we
have to think about which benchmarks are the important ones. For Hamlet, I
need fast UTF-8 bytestring generation. I don't care at all about algorithmic
speed for split texts, as an example. My (probably uneducated) guess is that
UTF-16 tends to perform many in-memory operations faster, since almost all
characters are represented as 16 bits, while the big benefit for UTF-8 is in
reading UTF-8 data, rendering UTF-8 data, and decreased memory usage. But as
I said, that's an (uneducated) guess.

Some people have been floating the idea of multiple text packages. I
personally would *not* want to go down that road, but it might be the only
approach that allows top performance for all use cases. As is, I'm quite
happy using blaze-builder for Hamlet.

Michael
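P.S. For anyone following along who hasn't seen the blaze-builder approach
mentioned above, here is a toy sketch of the idiom (my own example, not
Hamlet's actual code): fragments are turned into Builders, concatenated
monoidally, and only rendered to a UTF-8 lazy ByteString at the end, which
avoids repeated copying of intermediate strings.

```haskell
import           Blaze.ByteString.Builder           (Builder, toLazyByteString)
import           Blaze.ByteString.Builder.Char.Utf8 (fromString)
import qualified Data.ByteString.Lazy               as L
import           Data.Monoid                        (mappend, mconcat)

-- Wrap a value in a table cell; fromString does the UTF-8 encoding.
cell :: Show a => a -> Builder
cell x = fromString "<td>" `mappend` fromString (show x)
                           `mappend` fromString "</td>"

main :: IO ()
main = L.putStr . toLazyByteString . mconcat $ map cell [1 .. 10 :: Int]
```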
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe