Re: [julia-users] Re: UTF16String or UTF8String?

2015-09-28 Thread Jameson Nash
I find it interesting to note that the Wikipedia article points out that if size compression is the goal (and there is enough text for it to matter), then SCSU (or other attempts at creating a Unicode-specific compression scheme) is inferior to using a general-purpose compression algorithm. Since

Re: [julia-users] Re: UTF16String or UTF8String?

2015-09-28 Thread Scott Jones
On Monday, September 28, 2015 at 3:32:55 AM UTC-4, Jameson wrote: > > I find it interesting to note that the Wikipedia article points out that > if size compression is the goal (and there is enough text for it to > matter), then SCSU (or other attempts at creating a Unicode-specific >

[julia-users] Re: UTF16String or UTF8String?

2015-09-27 Thread Páll Haraldsson
UTF-16 came earlier (strictly speaking, UCS-2) and Windows adopted it (it is also used elsewhere). UTF-8 is better in almost all cases (except for East Asian languages, and not even there if you use something like HTML (or, I guess, XML) that has lots of ASCII for tags, etc.):
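The size trade-off described in this post can be checked with a small sketch. It is written in Python rather than Julia, and the two sample strings are hypothetical, chosen only for illustration; they are not taken from the thread. It counts the bytes each encoding needs for plain East Asian text and for the same kind of text wrapped in ASCII markup.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text, with no BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Hypothetical sample strings, chosen only for illustration.
plain = "日本語のテキスト"                   # CJK/kana only: 3 bytes each in UTF-8
markup = '<p class="note"><b>日本語</b></p>'  # mostly ASCII tags around CJK text

for label, sample in (("plain", plain), ("markup", markup)):
    utf8_bytes, utf16_bytes = encoded_sizes(sample)
    print(f"{label}: UTF-8 = {utf8_bytes} bytes, UTF-16 = {utf16_bytes} bytes")
```

For these samples the plain text is smaller in UTF-16, while the marked-up text is smaller in UTF-8, because every ASCII tag character costs two bytes in UTF-16 and one in UTF-8.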

Re: [julia-users] Re: UTF16String or UTF8String?

2015-09-27 Thread Scott Jones
On Sunday, September 27, 2015 at 4:56:28 PM UTC-4, Jameson wrote: > > UTF-16 is much faster in many situations than UTF-8. >> > > an encoding is not a speed. it is a format. Both formats are > variable-length encodings, and therefore both algorithms have the same time > and space complexity

Re: [julia-users] Re: UTF16String or UTF8String?

2015-09-27 Thread Jameson Nash
> > UTF-16 is much faster in many situations than UTF-8. > An encoding is not a speed; it is a format. Both formats are variable-length encodings, and therefore both algorithms have the same time and space complexity (although the implementation of UTF-16 does appear to be simpler from the length
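As a rough illustration of the point that both formats are variable-length, the sketch below (Python, not the Julia implementation under discussion) counts how many code units a single character needs in each encoding. The helper name `code_units` and the sample characters are assumptions made for this example.

```python
def code_units(ch):
    """Return (UTF-8 code units, UTF-16 code units) for one character."""
    utf8_units = len(ch.encode("utf-8"))             # 8-bit units
    utf16_units = len(ch.encode("utf-16-le")) // 2   # 16-bit units
    return utf8_units, utf16_units

# ASCII, Latin-1, a BMP CJK character, and a non-BMP character.
samples = ["A", "\u00e9", "\u65e5", "\U0001F600"]

for ch in samples:
    print(f"U+{ord(ch):04X}", code_units(ch))
```

UTF-8 uses one to four units here, and UTF-16 uses one unit inside the BMP and two (a surrogate pair) outside it, so indexing by character is not constant-time in either format.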

Re: [julia-users] Re: UTF16String or UTF8String?

2015-09-27 Thread Daniel Carrera
Thanks. On 27 September 2015 at 20:42, Páll Haraldsson wrote: > > UTF-16 was earlier (strictly speaking UCS-2) and Windows adopted it (and > also used elsewhere..). UTF-8 is almost in all cases better (except in > East-Asian languages, but not even there, if you use,

Re: [julia-users] Re: UTF16String or UTF8String?

2015-09-27 Thread Páll Haraldsson
2015-09-27 20:29 GMT+00:00 Scott Jones: > If it is mainly in North/South America, Western Europe, or Australia/NZ, > UTF-8 does OK. > UTF-8 is great for data interchange, but can really slow things down if > you have many non-ASCII characters > Did you mean non-BMP?

Re: [julia-users] Re: UTF16String or UTF8String?

2015-09-27 Thread Scott Jones
On Sunday, September 27, 2015 at 5:40:03 PM UTC-4, Páll Haraldsson wrote: > > 2015-09-27 21:26 GMT+00:00 Páll Haraldsson: > >> 2015-09-27 20:29 GMT+00:00 Scott Jones: >> >>> If it is mainly in North/South America, Western Europe, or

Re: [julia-users] Re: UTF16String or UTF8String?

2015-09-27 Thread Páll Haraldsson
2015-09-27 21:26 GMT+00:00 Páll Haraldsson: > 2015-09-27 20:29 GMT+00:00 Scott Jones: > >> If it is mainly in North/South America, Western Europe, or Australia/NZ, >> UTF-8 does OK. >> UTF-8 is great for data interchange, but can really

Re: [julia-users] Re: UTF16String or UTF8String?

2015-09-27 Thread Scott Jones
On Sunday, September 27, 2015 at 5:26:26 PM UTC-4, Páll Haraldsson wrote: > > 2015-09-27 20:29 GMT+00:00 Scott Jones: > >> If it is mainly in North/South America, Western Europe, or Australia/NZ, >> UTF-8 does OK. >> UTF-8 is great for data interchange, but can

Re: [julia-users] Re: UTF16String or UTF8String?

2015-09-27 Thread Scott Jones
UTF-16 is much faster in many situations than UTF-8. It really depends a lot on just what you are doing, and the data you are processing. If it is mainly in North/South America, Western Europe, or Australia/NZ, UTF-8 does OK. UTF-8 is great for data interchange, but can really slow things down

Re: [julia-users] Re: UTF16String or UTF8String?

2015-09-27 Thread Daniel Carrera
On 27 September 2015 at 23:41, Scott Jones wrote: > No. Most characters used in the countries I mentioned above can be > represented using just ANSI Latin1 > (which is why I specified *Western Europe*), so UTF-8 will take 1 or 2 > bytes for each character, > but when
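The "1 or 2 bytes" figure for Latin-1 characters quoted here can be checked directly. The following is a minimal sketch in Python; the helper `utf8_len` is a name introduced for this example, not something from the thread.

```python
def utf8_len(code_point):
    """Number of bytes UTF-8 uses for a single code point."""
    return len(chr(code_point).encode("utf-8"))

# Latin-1 covers code points U+0000 through U+00FF.
latin1_lengths = {cp: utf8_len(cp) for cp in range(0x100)}

one_byte = sum(1 for n in latin1_lengths.values() if n == 1)
two_byte = sum(1 for n in latin1_lengths.values() if n == 2)
print(one_byte, two_byte)
```

The ASCII half of the range takes one byte per character and the upper half takes two, so no Latin-1 character needs more than two bytes in UTF-8.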