Re: Re[2]: [Haskell-cafe] Re: String vs ByteString
On Tue, Aug 17, 2010 at 1:05 PM, Bulat Ziganshin wrote:

> Hello Tako,
>
> Tuesday, August 17, 2010, 3:03:20 PM, you wrote:
>
> > Unless a Char in Haskell is 32 bits (or at least more than 16 bits)
> > it can NOT encode all Unicode points.
>
> it's 32 bit

Like Bulat said, it's 32 bit. It's *defined* as being the Unicode code point number. It has no relation to e.g. char in C.

-- Johan

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
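The point that a Char is a full Unicode code point, not a 16-bit unit, can be checked directly. What follows is only a small sketch using `ord` from Data.Char; the names `maxCodePoint`, `gClef` and `needsMoreThan16Bits` are made up for illustration.

```haskell
import Data.Char (ord)

-- | Largest code point a Char can hold (the upper bound of the Char type).
maxCodePoint :: Int
maxCodePoint = ord (maxBound :: Char)

-- | U+1D11E MUSICAL SYMBOL G CLEF, a character outside the
-- Basic Multilingual Plane (illustrative choice).
gClef :: Char
gClef = '\x1D11E'

-- | True when a character's code point does not fit in 16 bits.
needsMoreThan16Bits :: Char -> Bool
needsMoreThan16Bits c = ord c > 0xFFFF
```

In GHCi, `maxCodePoint == 0x10FFFF` and `needsMoreThan16Bits gClef` both evaluate to True, so a single Char holds values that a 16-bit char could not.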
Re: Re[2]: [Haskell-cafe] Re: String vs ByteString
On Tue, Aug 17, 2010 at 12:39 PM, Bulat Ziganshin wrote:

> Hello Tom,
>
> Tuesday, August 17, 2010, 2:09:09 PM, you wrote:
>
> > In the first iteration of the Text package, UTF-16 was chosen because
> > it had a nice balance of arithmetic overhead and space. The
> > arithmetic for UTF-8 started to have serious performance impacts in
> > situations where the entire document was outside ASCII (i.e. a Russian
> > or Arabic document), but UTF-16 was still relatively compact
>
> i don't understand what you mean. do you support all 2^20 codepoints
> in the Data.Text package?

Yes, UTF-16 can represent all Unicode code points, using surrogate pairs for those above U+FFFF.

-- Johan
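For readers unfamiliar with surrogate pairs, here is a minimal sketch of the standard UTF-16 encoding step. It is not the code Data.Text uses; `encodeUtf16` is a hypothetical helper, and it assumes the input is a valid scalar value (it does not reject lone surrogates).

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- | Encode one code point as one or two UTF-16 code units.
-- Hypothetical helper for illustration only; assumes the Char is not
-- itself a surrogate (U+D800..U+DFFF).
encodeUtf16 :: Char -> [Word16]
encodeUtf16 c
  | n < 0x10000 = [fromIntegral n]   -- fits in a single 16-bit unit
  | otherwise   = [hi, lo]           -- needs a surrogate pair
  where
    n  = ord c
    m  = n - 0x10000                 -- 20-bit offset above the BMP
    hi = fromIntegral (0xD800 + (m `shiftR` 10))  -- top 10 bits
    lo = fromIntegral (0xDC00 + (m .&. 0x3FF))    -- bottom 10 bits
```

For example, `encodeUtf16 '\x1D11E'` yields the two code units 0xD834 and 0xDD1E, while `encodeUtf16 'a'` yields the single unit 0x0061.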
Re: Re[2]: [Haskell-cafe] Re: String vs ByteString
2010/8/17 Bulat Ziganshin:

> Hello Tom,
> i don't understand what you mean. do you support all 2^20 codepoints
> in the Data.Text package?

Bulat,

Yes, its internal representation is UTF-16, which is capable of encoding *any* valid Unicode code point.

-- Tom
Re: Re[2]: [Haskell-cafe] Re: String vs ByteString
Hi Bulat,

On Tue, Aug 17, 2010 at 10:34 AM, Bulat Ziganshin wrote:

> > It's not clear to me that using UTF-16 internally does make
> > Data.Text noticeably slower.
>
> not slower but require 2x more memory. speed is the same since
> Unicode contains 2^20 codepoints

Yes, in theory a program could use as much as 2x the memory. That being said, most programs don't hold that much text data in memory at any given point, so that might be 2x of a small number. One experiment [1] found it difficult to show any difference in memory usage at all in Trac when switching Python's internal representation from UCS2 to UCS4. So it's not clear to me that using UTF-16 makes a real program noticeably slower or makes it use noticeably more memory.

1. http://www.cmlenz.net/archives/2008/07/the-truth-about-unicode-in-python

Cheers,
Johan
Re: Re[2]: [Haskell-cafe] Re: String vs ByteString
On Tue, Aug 17, 2010 at 10:34, Bulat Ziganshin wrote:

> Hello Johan,
>
> Tuesday, August 17, 2010, 12:20:37 PM, you wrote:
>
> > I agree, Data.Text is great. Unfortunately, its internal use of UTF-16
> > makes it inefficient for many purposes.
>
> > It's not clear to me that using UTF-16 internally does make
> > Data.Text noticeably slower.
>
> not slower but require 2x more memory. speed is the same since
> Unicode contains 2^20 codepoints

This is not entirely correct, because it all depends on your data. For western languages it normally holds true that UTF-16 occupies twice the memory of UTF-8, but for other languages a code point can take 3 bytes in UTF-8, or 4 for code points outside the Basic Multilingual Plane (see http://en.wikipedia.org/wiki/UTF-8).

That wikipedia page is a nice read anyway; it mentions some of the advantages and disadvantages of the different encodings. (For example, the complexity of the code that determines the length of a UTF string depends on the encoding.)

Cheers,
-Tako
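To make the "it depends on your data" point concrete, here is a rough sketch that counts how many bytes a String would need in each encoding. The function names are invented for this example and the counts follow the standard UTF-8 and UTF-16 length rules; it measures encoded size only, not what any particular library actually allocates.

```haskell
import Data.Char (ord)

-- | Bytes needed to encode one code point in UTF-8 (1 to 4).
utf8Bytes :: Char -> Int
utf8Bytes c
  | n < 0x80    = 1   -- ASCII
  | n < 0x800   = 2   -- e.g. Latin supplements, Cyrillic, Arabic
  | n < 0x10000 = 3   -- rest of the Basic Multilingual Plane
  | otherwise   = 4   -- supplementary planes
  where n = ord c

-- | Bytes needed to encode one code point in UTF-16 (2 or 4).
utf16Bytes :: Char -> Int
utf16Bytes c
  | ord c < 0x10000 = 2   -- single code unit
  | otherwise       = 4   -- surrogate pair

-- | (UTF-8 size, UTF-16 size) in bytes for a whole string.
encodedSizes :: String -> (Int, Int)
encodedSizes s = (sum (map utf8Bytes s), sum (map utf16Bytes s))
```

With these definitions, `encodedSizes "hello"` is (5,10), a six-letter Cyrillic word such as "\x43F\x440\x438\x432\x435\x442" gives (12,12), and the two CJK characters "\x4E2D\x6587" give (6,4). So the 2x figure holds for ASCII text, the two encodings tie for Cyrillic, and UTF-16 is the smaller one for CJK.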