Hi Ketil,

On Tue, Aug 17, 2010 at 12:09 PM, Ketil Malde <ke...@malde.org> wrote:
> Johan Tibell <johan.tib...@gmail.com> writes:
>
>> It's not clear to me that using UTF-16 internally does make Data.Text
>> noticeably slower.
>
> I haven't benchmarked it, but I'm fairly sure that, if you try to fit a
> 3 Gbyte file (the human genome, say¹) into a computer with 4 Gbytes of
> RAM, UTF-16 will be slower than UTF-8. Many applications will get away
> with streaming over the data, retaining only a small part, but some won't.

I'm not sure this is a great example, as genome data is probably much
better stored in a vector (using a few bits per "letter"). I agree that
whenever one data structure fits in the available RAM and another
doesn't, the smaller one will win. I just don't know whether this case
is worth spending weeks of work optimizing for. That's why I'd like to
see benchmarks for more idiomatic use cases.

> In other cases (e.g. processing CJK text, and perhaps also
> non-Latin-1 text), I'm sure it'll be faster - but my (still
> unsubstantiated) guess is that the difference will be much smaller, and
> it'll be a case of winning some and losing some - and I'd also
> conjecture that having 3 Gb of "real" text (i.e. natural language, as
> opposed to text-formatted data) is rare.

I would like to verify this guess. In my personal experience it's really
hard to guess which changes will lead to a noticeable performance
improvement. I'm probably wrong more often than I'm right.

Cheers,
Johan
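P.S. By "a few bits per letter" I mean something like the following
minimal sketch, which packs four nucleotides into one byte (two bits
each). The function names (encodeBase, packFour, etc.) are my own
invention for illustration, and it assumes the alphabet is just
{A, C, G, T}; a real genome representation would also handle N and
ambiguity codes, and would store the packed bytes in something like an
unboxed vector rather than converting one byte at a time:

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word8)

-- Two bits per base: A = 0, C = 1, G = 2, T = 3.
encodeBase :: Char -> Word8
encodeBase 'A' = 0
encodeBase 'C' = 1
encodeBase 'G' = 2
encodeBase 'T' = 3
encodeBase c   = error ("unexpected base: " ++ [c])

decodeBase :: Word8 -> Char
decodeBase 0 = 'A'
decodeBase 1 = 'C'
decodeBase 2 = 'G'
decodeBase _ = 'T'

-- Pack four bases into a single byte, most significant pair first.
packFour :: (Char, Char, Char, Char) -> Word8
packFour (a, b, c, d) =
      encodeBase a `shiftL` 6
  .|. encodeBase b `shiftL` 4
  .|. encodeBase c `shiftL` 2
  .|. encodeBase d

-- Recover the four bases from a packed byte.
unpackFour :: Word8 -> (Char, Char, Char, Char)
unpackFour w =
  ( decodeBase (w `shiftR` 6 .&. 3)
  , decodeBase (w `shiftR` 4 .&. 3)
  , decodeBase (w `shiftR` 2 .&. 3)
  , decodeBase (w .&. 3)
  )

main :: IO ()
main = print (unpackFour (packFour ('G', 'A', 'T', 'C')))
```

At two bits per base a 3-Gbase genome takes roughly 750 MB, versus
3 GB as UTF-8/Latin-1 text and 6 GB as UTF-16, which is why I don't
think raw sequence data is the right benchmark for a text library.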
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe