Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On Wed, 2007-10-03 at 14:15 +0200, Stephane Bortzmeyer wrote:
> On Wed, Oct 03, 2007 at 12:01:50AM +0200,
> Twan van Laarhoven <[EMAIL PROTECTED]> wrote
> a message of 24 lines which said:
>
> > Lots of people wrote:
> > > I want a UTF-8 bikeshed!
> > > No, I want a UTF-16 bikeshed!
>
> Personally, I want a UTF-32 bikeshed. UTF-16 is as lousy as UTF-8
> (for both of them, characters have different sizes, unlike what
> happens in UTF-32).

+1

> > What the heck does it matter what encoding the library uses
> > internally?
>
> +1 It can even use a non-standard encoding scheme if it wants.

+3

jcc

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
> > What the heck does it matter what encoding the library uses
> > internally?
>
> +1 It can even use a non-standard encoding scheme if it wants.

Sounds good to me. I think one of my initial questions was whether the
encoding should be visible in the type of the UnicodeString type or
not. My gut feeling is that having the encoding visible in the type
might make it hard to change the internal representation, but I haven't
yet got a good example to prove this.

-- Johan
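Johan's worry can be made concrete with a small sketch of the two designs. All names here are hypothetical (not from CompactString or any existing library), and the phantom-type design is just one way the encoding could be made visible:

```haskell
import qualified Data.ByteString as B

-- Design 1: the encoding is a phantom type parameter.  Every caller's
-- type signature now mentions the encoding, so swapping the internal
-- representation later is a breaking API change.
data UTF8
data UTF16
newtype EncodedString enc = EncodedString B.ByteString

-- Even an encoding-agnostic operation carries the phantom parameter:
byteLength :: EncodedString enc -> Int
byteLength (EncodedString bs) = B.length bs

-- Design 2: the type is opaque; the library can change the internal
-- encoding without touching any user code.
newtype UnicodeString = UnicodeString B.ByteString
```

With design 1, switching the library's internal encoding changes the types users have written down; with design 2 it is invisible, which is Johan's gut feeling stated in code.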
[Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On Wed, Oct 03, 2007 at 12:01:50AM +0200,
Twan van Laarhoven <[EMAIL PROTECTED]> wrote
a message of 24 lines which said:

> Lots of people wrote:
> > I want a UTF-8 bikeshed!
> > No, I want a UTF-16 bikeshed!

Personally, I want a UTF-32 bikeshed. UTF-16 is as lousy as UTF-8 (for
both of them, characters have different sizes, unlike what happens in
UTF-32).

> What the heck does it matter what encoding the library uses
> internally?

+1 It can even use a non-standard encoding scheme if it wants.
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On Tue, 2007-10-02 at 21:45 -0400, Brandon S. Allbery KF8NH wrote:
> > Due to the additional complexity of handling UTF-8 -- EVEN IF the
> > actual text processed happens all to be US-ASCII -- will UTF-8
> > perhaps be less efficient than UTF-16, or only as fast?
>
> UTF8 will be very slightly faster in the all-ASCII case, but quickly
> blows chunks if you have *any* characters that require multibyte.

What benchmarks are you basing this on? Doubling your data size is
going to cost you if you are doing simple operations (searching, say),
but I don't see UTF-8 being particularly expensive - somebody (forget
who) implemented UTF-8 on top of ByteString, and IIRC, the benchmark
numbers didn't change all that much from the regular Char8.

-k
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On Tue, 2007-10-02 at 14:32 -0700, Stefan O'Rear wrote:
> UTF-8 supports CJK languages too. The only question is efficiency, and
> I believe CJK is still a relatively uncommon case compared to English
> and other Latin-alphabet languages. (That said, I live in a country
> all of whose dominant languages use the Latin alphabet)

As for space efficiency, I guess the argument could be made that since
an ideogram typically conveys a whole word, it is reasonable to spend
more bits on it.

Anyway, I am unsure if I should take part in this discussion, as I'm
not really dealing with text as such in multiple languages. Most of my
data is in ASCII, and when it is not, I'm happy to treat it ("treat"
here meaning "mostly ignore") as Latin-1 bytes (current ByteString) or
UTF-8. The only thing I miss is the ability to use String syntactic
sugar -- but IIUC, that's coming? However, increased space usage is not
acceptable, and I also don't want any conversion layer which could
conceivably modify my data (e.g. by normalizing or error handling).

-k
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On Oct 2, 2007, at 21:12 , Isaac Dupree wrote:
> Stefan O'Rear wrote:
> > On Tue, Oct 02, 2007 at 11:05:38PM +0200, Johan Tibell wrote:
> > > > I do not believe that anyone was seriously advocating multiple
> > > > blessed encodings. The main question is *which* encoding to
> > > > bless. 99+% of text I encounter is in US-ASCII, so I would favor
> > > > UTF-8. Why is UTF-16 better for me?
> > >
> > > All software I write professionally has to support 40 languages
> > > (including CJK ones) so I would prefer UTF-16 in case I could use
> > > Haskell at work some day in the future. I dunno that who uses what
> > > encoding the most is good grounds to pick encoding though. Ease of
> > > implementation and speed on some representative sample set of text
> > > may be.
> >
> > UTF-8 supports CJK languages too. The only question is efficiency
>
> Due to the additional complexity of handling UTF-8 -- EVEN IF the
> actual text processed happens all to be US-ASCII -- will UTF-8
> perhaps be less efficient than UTF-16, or only as fast?

UTF8 will be very slightly faster in the all-ASCII case, but quickly
blows chunks if you have *any* characters that require multibyte.
Given the way UTF8 encoding works, this includes even Latin-1
non-ASCII, never mind CJK. (I think people have been missing that
point. UTF8 is only cheap for 00-7f, *nothing else*.)

-- 
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] [EMAIL PROTECTED]
system administrator [openafs,heimdal,too many hats] [EMAIL PROTECTED]
electrical and computer engineering, carnegie mellon university   KF8NH
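Brandon's "only cheap for 00-7f" point follows directly from the UTF-8 range table. A small sketch (the function name is mine, but the byte counts are the standard UTF-8 ones):

```haskell
import Data.Char (ord)

-- Bytes UTF-8 needs for one code point, per the standard ranges.
utf8Bytes :: Char -> Int
utf8Bytes c
  | n <= 0x7F   = 1  -- US-ASCII only
  | n <= 0x7FF  = 2  -- includes all Latin-1 non-ASCII, e.g. 'é'
  | n <= 0xFFFF = 3  -- most CJK characters (the BMP) fall here
  | otherwise   = 4  -- supplementary planes
  where n = ord c
```

So 'é' already costs two bytes in UTF-8 (same as UTF-16), and a BMP CJK character costs three bytes versus UTF-16's two.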
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
Stefan O'Rear wrote:
> On Tue, Oct 02, 2007 at 11:05:38PM +0200, Johan Tibell wrote:
> > > I do not believe that anyone was seriously advocating multiple
> > > blessed encodings. The main question is *which* encoding to bless.
> > > 99+% of text I encounter is in US-ASCII, so I would favor UTF-8.
> > > Why is UTF-16 better for me?
> >
> > All software I write professionally has to support 40 languages
> > (including CJK ones) so I would prefer UTF-16 in case I could use
> > Haskell at work some day in the future. I dunno that who uses what
> > encoding the most is good grounds to pick encoding though. Ease of
> > implementation and speed on some representative sample set of text
> > may be.
>
> UTF-8 supports CJK languages too. The only question is efficiency

Due to the additional complexity of handling UTF-8 -- EVEN IF the
actual text processed happens all to be US-ASCII -- will UTF-8 perhaps
be less efficient than UTF-16, or only as fast?

Isaac
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On Wed, 2007-10-03 at 00:01 +0200, Twan van Laarhoven wrote:
> Lots of people wrote:
> > I want a UTF-8 bikeshed!
> > No, I want a UTF-16 bikeshed!
>
> What the heck does it matter what encoding the library uses internally?

+1

jcc
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
Lots of people wrote:
> I want a UTF-8 bikeshed!
> No, I want a UTF-16 bikeshed!

What the heck does it matter what encoding the library uses internally?
I expect the interface to be something like (from my own CompactString
library):

> fromByteString :: Encoding -> ByteString -> UnicodeString
> toByteString :: Encoding -> UnicodeString -> ByteString

The only matter is efficiency for a particular encoding.

I would suggest that we get a working library first. Either UTF-8 or
UTF-16 will do, as long as it works. Even better would be to implement
both (and perhaps more encodings), and then benchmark them to get a
sensible default. Then the choice can be made available to the user as
well, in case someone has specific needs. But again: get it working
first!

Twan
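To make the shape of such an interface concrete, here is a toy model of it — not CompactString's actual implementation. The internal store here is plain [Char] and only a Latin-1 codec is filled in, purely to illustrate the two signatures; a real library would use a packed representation, full codecs, and range checking:

```haskell
import qualified Data.ByteString as B
import Data.Char (chr, ord)

-- Only Latin1 is implemented in this sketch.
data Encoding = Latin1 | UTF8 | UTF16BE

-- Toy internal store; the point of the proposal is that callers
-- cannot see (and need not care) what this is.
newtype UnicodeString = UnicodeString String

fromByteString :: Encoding -> B.ByteString -> UnicodeString
fromByteString Latin1 bs =
  UnicodeString (map (chr . fromIntegral) (B.unpack bs))
fromByteString _ _ = error "codec not implemented in this sketch"

toByteString :: Encoding -> UnicodeString -> B.ByteString
toByteString Latin1 (UnicodeString s) =
  -- No range check here: code points above 0xFF would be truncated.
  B.pack (map (fromIntegral . ord) s)
toByteString _ _ = error "codec not implemented in this sketch"
```

Whatever the internal encoding ends up being, decode-then-encode through the same Encoding should round-trip, which is the property the benchmarking Twan suggests would exercise.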
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On Oct 2, 2007, at 3:01 PM, Twan van Laarhoven wrote:
> Lots of people wrote:
> > I want a UTF-8 bikeshed!
> > No, I want a UTF-16 bikeshed!
>
> What the heck does it matter what encoding the library uses
> internally? I expect the interface to be something like (from my own
> CompactString library):
>
> > fromByteString :: Encoding -> ByteString -> UnicodeString
> > toByteString :: Encoding -> UnicodeString -> ByteString

I agree, from an API perspective the internal encoding doesn't matter.

> The only matter is efficiency for a particular encoding.

This matters a lot.

> I would suggest that we get a working library first. Either UTF-8 or
> UTF-16 will do, as long as it works. Even better would be to implement
> both (and perhaps more encodings), and then benchmark them to get a
> sensible default. Then the choice can be made available to the user as
> well, in case someone has specific needs. But again: get it working
> first!

The problem is that the internal encoding can have a big effect on the
implementation of the library. It's better not to have to do it over
again if the first choice is not optimal. I'm just trying to share the
experience of the Unicode Consortium, the ICU library contributors, and
Apple, with the Haskell community. They, and I personally, have many
years of experience implementing support for Unicode.

Anyway, I think we're starting to repeat ourselves...

Deborah
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On Oct 2, 2007, at 8:44 AM, Jonathan Cast wrote:
> I would like to, again, strongly argue against sacrificing
> compatibility with Linux/BSD/etc. for the sake of compatibility with
> OS X or Windows. FFI bindings have to convert data formats in any
> case; Haskell shouldn't gratuitously break Linux support (or make
> life harder on Linux) just to support proprietary operating systems
> better. Now, if /independent of the details of MacOS X/, UTF-16 is
> better (objectively), it can be converted to anything by the FFI. But
> doing it the way Java or MacOS X or Win32 or anyone else does it, at
> the expense of Linux, I am strongly opposed to.

No one is advocating that. Any Unicode support library needs to support
exporting text as UTF-8 since it's so widely used. It's used on Mac OS
X, too, in exactly the same contexts it would be used on Linux.
However, UTF-8 is a poor choice for internal representation.

On Oct 2, 2007, at 2:32 PM, Stefan O'Rear wrote:
> UTF-8 supports CJK languages too. The only question is efficiency, and
> I believe CJK is still a relatively uncommon case compared to English
> and other Latin-alphabet languages. (That said, I live in a country
> all of whose dominant languages use the Latin alphabet)

First of all, non-Latin countries already represent a large fraction of
computer usage and the computer market. It is not at all "relatively
uncommon." Japan alone is a huge market. China is a huge market.
Second, it's not just CJK, but anything that's not mostly ASCII:
Russian, Greek, Thai, Arabic, Hebrew, etc. etc. etc.

UTF-8 is intended for compatibility with existing software that expects
multibyte encodings. It doesn't work well as an internal
representation. Again, no one is saying a Unicode library shouldn't
have full support for input and output of UTF-8 (and other encodings).

If you want to process ASCII text and squeeze out every last ounce of
performance, use byte strings. Unicode strings should be optimized for
representing and processing human language text, a large share of which
is not in the Latin alphabet. Remember, speakers of English and other
Latin-alphabet languages are a minority in the world, though not in the
computer-using world. Yet.

Deborah
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On Tue, Oct 02, 2007 at 11:05:38PM +0200, Johan Tibell wrote:
> > I do not believe that anyone was seriously advocating multiple
> > blessed encodings. The main question is *which* encoding to bless.
> > 99+% of text I encounter is in US-ASCII, so I would favor UTF-8. Why
> > is UTF-16 better for me?
>
> All software I write professionally has to support 40 languages
> (including CJK ones) so I would prefer UTF-16 in case I could use
> Haskell at work some day in the future. I dunno that who uses what
> encoding the most is good grounds to pick encoding though. Ease of
> implementation and speed on some representative sample set of text
> may be.

UTF-8 supports CJK languages too. The only question is efficiency, and
I believe CJK is still a relatively uncommon case compared to English
and other Latin-alphabet languages. (That said, I live in a country all
of whose dominant languages use the Latin alphabet)

Stefan
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
> I do not believe that anyone was seriously advocating multiple blessed
> encodings. The main question is *which* encoding to bless. 99+% of
> text I encounter is in US-ASCII, so I would favor UTF-8. Why is UTF-16
> better for me?

All software I write professionally has to support 40 languages
(including CJK ones), so I would prefer UTF-16 in case I could use
Haskell at work some day in the future. I dunno that who uses what
encoding the most is good grounds to pick an encoding, though. Ease of
implementation and speed on some representative sample set of text may
be.

-- Johan
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On Tue, Oct 02, 2007 at 08:02:30AM -0700, Deborah Goldsmith wrote:
> UTF-16 is the type used in all the APIs. Everything else is considered
> an encoding conversion.
>
> CoreFoundation uses UTF-16 internally except when the string fits
> entirely in a single-byte legacy encoding like MacRoman or
> MacCyrillic. If any kind of Unicode processing needs to be done to the
> string, it is first coerced to UTF-16. If it weren't for backwards
> compatibility issues, I think we'd use UTF-16 all the time as the
> machinery for switching encodings adds complexity. I wouldn't advise
> it for a new library.

I do not believe that anyone was seriously advocating multiple blessed
encodings. The main question is *which* encoding to bless. 99+% of text
I encounter is in US-ASCII, so I would favor UTF-8. Why is UTF-16
better for me?

Stefan
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On Tue, 2007-10-02 at 22:05 +0400, Miguel Mitrofanov wrote:
> > I would like to, again, strongly argue against sacrificing
> > compatibility with Linux/BSD/etc. for the sake of compatibility
> > with OS X or Windows.
>
> Ehm? I used to think MacOS is a sort of BSD...

Cocoa, then.

jcc
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
> I would like to, again, strongly argue against sacrificing
> compatibility with Linux/BSD/etc. for the sake of compatibility with
> OS X or Windows.

Ehm? I used to think MacOS is a sort of BSD...
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On Tue, 2007-10-02 at 08:02 -0700, Deborah Goldsmith wrote:
> On Oct 2, 2007, at 5:11 AM, ChrisK wrote:
> > Deborah Goldsmith wrote:
> > > UTF-16 is the native encoding used for Cocoa, Java, ICU, and
> > > Carbon, and is what appears in the APIs for all of them. UTF-16
> > > is also what's stored in the volume catalog on Mac disks. UTF-8
> > > is only used in BSD APIs for backward compatibility. It's also
> > > used in plain text files (or XML or HTML), again for
> > > compatibility.
> > >
> > > Deborah
> >
> > On OS X, Cocoa and Carbon use Core Foundation, whose API does not
> > have a one-true-encoding internally. Follow the rather long URL for
> > details:
> >
> > http://developer.apple.com/documentation/CoreFoundation/Conceptual/CFStrings/index.html?http://developer.apple.com/documentation/CoreFoundation/Conceptual/CFStrings/Articles/StringStorage.html#//apple_ref/doc/uid/20001179
> >
> > I would vote for an API that not just hides the internal store, but
> > allows different internal stores to be used in a mostly compatible
> > way.
> >
> > However, there is a UniChar typedef on OS X which is the same
> > unsigned 16 bit integer as Java's JNI would use.
>
> UTF-16 is the type used in all the APIs. Everything else is
> considered an encoding conversion.
>
> CoreFoundation uses UTF-16 internally except when the string fits
> entirely in a single-byte legacy encoding like MacRoman or
> MacCyrillic. If any kind of Unicode processing needs to be done to
> the string, it is first coerced to UTF-16. If it weren't for
> backwards compatibility issues, I think we'd use UTF-16 all the time
> as the machinery for switching encodings adds complexity. I wouldn't
> advise it for a new library.

I would like to, again, strongly argue against sacrificing
compatibility with Linux/BSD/etc. for the sake of compatibility with OS
X or Windows. FFI bindings have to convert data formats in any case;
Haskell shouldn't gratuitously break Linux support (or make life harder
on Linux) just to support proprietary operating systems better. Now, if
/independent of the details of MacOS X/, UTF-16 is better
(objectively), it can be converted to anything by the FFI. But doing it
the way Java or MacOS X or Win32 or anyone else does it, at the expense
of Linux, I am strongly opposed to.

jcc
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On Oct 2, 2007, at 5:11 AM, ChrisK wrote:
> Deborah Goldsmith wrote:
> > UTF-16 is the native encoding used for Cocoa, Java, ICU, and Carbon,
> > and is what appears in the APIs for all of them. UTF-16 is also
> > what's stored in the volume catalog on Mac disks. UTF-8 is only used
> > in BSD APIs for backward compatibility. It's also used in plain text
> > files (or XML or HTML), again for compatibility.
> >
> > Deborah
>
> On OS X, Cocoa and Carbon use Core Foundation, whose API does not have
> a one-true-encoding internally. Follow the rather long URL for
> details:
>
> http://developer.apple.com/documentation/CoreFoundation/Conceptual/CFStrings/index.html?http://developer.apple.com/documentation/CoreFoundation/Conceptual/CFStrings/Articles/StringStorage.html#//apple_ref/doc/uid/20001179
>
> I would vote for an API that not just hides the internal store, but
> allows different internal stores to be used in a mostly compatible
> way.
>
> However, there is a UniChar typedef on OS X which is the same unsigned
> 16 bit integer as Java's JNI would use.

UTF-16 is the type used in all the APIs. Everything else is considered
an encoding conversion.

CoreFoundation uses UTF-16 internally except when the string fits
entirely in a single-byte legacy encoding like MacRoman or MacCyrillic.
If any kind of Unicode processing needs to be done to the string, it is
first coerced to UTF-16. If it weren't for backwards compatibility
issues, I think we'd use UTF-16 all the time as the machinery for
switching encodings adds complexity. I wouldn't advise it for a new
library.

Deborah
[Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
Deborah Goldsmith wrote:
> UTF-16 is the native encoding used for Cocoa, Java, ICU, and Carbon,
> and is what appears in the APIs for all of them. UTF-16 is also what's
> stored in the volume catalog on Mac disks. UTF-8 is only used in BSD
> APIs for backward compatibility. It's also used in plain text files
> (or XML or HTML), again for compatibility.
>
> Deborah

On OS X, Cocoa and Carbon use Core Foundation, whose API does not have
a one-true-encoding internally. Follow the rather long URL for details:

http://developer.apple.com/documentation/CoreFoundation/Conceptual/CFStrings/index.html?http://developer.apple.com/documentation/CoreFoundation/Conceptual/CFStrings/Articles/StringStorage.html#//apple_ref/doc/uid/20001179

I would vote for an API that not just hides the internal store, but
allows different internal stores to be used in a mostly compatible way.

However, there is a UniChar typedef on OS X which is the same unsigned
16 bit integer as Java's JNI would use.

-- Chris
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
Sorry for the long delay, work has been really busy...

On Sep 27, 2007, at 12:25 PM, Aaron Denney wrote:
> On 2007-09-27, Aaron Denney <[EMAIL PROTECTED]> wrote:
> > > Well, not so much. As Duncan mentioned, it's a matter of what the
> > > most common case is. UTF-16 is effectively fixed-width for the
> > > majority of text in the majority of languages. Combining
> > > sequences and surrogate pairs are relatively infrequent.
> >
> > Infrequent, but they exist, which means you can't seek x/2 bytes
> > ahead to seek x characters ahead. All such seeking must be linear
> > for both UTF-16 *and* UTF-8.
> >
> > > Speaking as someone who has done a lot of Unicode implementation,
> > > I would say UTF-16 represents the best time/space tradeoff for an
> > > internal representation. As I mentioned, it's what's used in
> > > Windows, Mac OS X, ICU, and Java.
>
> I guess why I'm being something of a pain-in-the-ass here, is that I
> want to use your Unicode implementation expertise to know what these
> time/space tradeoffs are. Are there any algorithmic asymptotic
> complexity differences, or are these all constant factors? The
> constant factors depend on projected workload. And are these actually
> tradeoffs, except between UTF-32 (which uses native word sizes on
> 32-bit platforms) and the other two? Smaller space means smaller
> cache footprint, which can dominate.

Yes, cache footprint is one reason to use UTF-16 rather than UTF-32.
Having no surrogate pairs also doesn't save you anything, because you
need to handle sequences anyway, such as combining marks and clusters.
The best reference for all of this is:

http://www.unicode.org/faq/utf_bom.html

See especially:

http://www.unicode.org/faq/utf_bom.html#10
http://www.unicode.org/faq/utf_bom.html#12

Which data type is best depends on what the purpose is. If the data
will primarily be ASCII with an occasional non-ASCII character, UTF-8
may be best. If the data is general Unicode text, UTF-16 is best. I
would think a Unicode string type would be intended for processing
natural language text, not just ASCII data.

> Simplicity of algorithms is also a concern. Validating a byte
> sequence as UTF-8 is harder than validating a sequence of 16-bit
> values as UTF-16.
>
> (I'd also like to see a reference to the Mac OS X encoding. I know
> that the filesystem interface is UTF-8 (decomposed a certain way). Is
> it just that UTF-16 is a common application choice, or is there some
> common framework or library that uses that?)

UTF-16 is the native encoding used for Cocoa, Java, ICU, and Carbon,
and is what appears in the APIs for all of them. UTF-16 is also what's
stored in the volume catalog on Mac disks. UTF-8 is only used in BSD
APIs for backward compatibility. It's also used in plain text files (or
XML or HTML), again for compatibility.

Deborah
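The "seeking must be linear" point above is easy to see in code. A list-based sketch over raw UTF-16 code units (function name mine):

```haskell
import Data.Word (Word16)

-- Count code points in a well-formed UTF-16 code unit sequence: a
-- high surrogate (0xD800-0xDBFF) together with the following low
-- surrogate encodes a single supplementary code point.  Because some
-- code points take one unit and some take two, the traversal is
-- inherently linear; you cannot find code point n by indexing unit n.
codePoints :: [Word16] -> Int
codePoints [] = 0
codePoints (u:us)
  | u >= 0xD800 && u <= 0xDBFF = 1 + codePoints (drop 1 us)
  | otherwise                  = 1 + codePoints us
```

The same shape of loop, with byte-count tests instead of a surrogate test, is what UTF-8 traversal looks like — which is why both sides of the thread agree seeking is O(n) in either encoding.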
[Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On 2007-09-27, Duncan Coutts <[EMAIL PROTECTED]> wrote:
> In message <[EMAIL PROTECTED]> [EMAIL PROTECTED] writes:
> > On 2007-09-27, Deborah Goldsmith <[EMAIL PROTECTED]> wrote:
> > > On Sep 26, 2007, at 11:06 AM, Aaron Denney wrote:
> > > > > UTF-16 has no advantage over UTF-8 in this respect, because of
> > > > > surrogate pairs and combining characters.
> > > >
> > > > Good point.
> > >
> > > Well, not so much. As Duncan mentioned, it's a matter of what the
> > > most common case is. UTF-16 is effectively fixed-width for the
> > > majority of text in the majority of languages. Combining sequences
> > > and surrogate pairs are relatively infrequent.
> >
> > Infrequent, but they exist, which means you can't seek x/2 bytes
> > ahead to seek x characters ahead. All such seeking must be linear
> > for both UTF-16 *and* UTF-8.
>
> And in [Char] for all these years, yet I don't hear people
> complaining. Most string processing is linear and does not need
> random access to characters.

Yeah. I'm saying the differences between them are going to be in the
constant factors, and that these constant factors will differ between
workloads.

-- Aaron Denney -><-
[Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On 2007-09-27, Aaron Denney <[EMAIL PROTECTED]> wrote:
> On 2007-09-27, Deborah Goldsmith <[EMAIL PROTECTED]> wrote:
> > On Sep 26, 2007, at 11:06 AM, Aaron Denney wrote:
> > > > UTF-16 has no advantage over UTF-8 in this respect, because of
> > > > surrogate pairs and combining characters.
> > >
> > > Good point.
> >
> > Well, not so much. As Duncan mentioned, it's a matter of what the
> > most common case is. UTF-16 is effectively fixed-width for the
> > majority of text in the majority of languages. Combining sequences
> > and surrogate pairs are relatively infrequent.
>
> Infrequent, but they exist, which means you can't seek x/2 bytes ahead
> to seek x characters ahead. All such seeking must be linear for both
> UTF-16 *and* UTF-8.
>
> > Speaking as someone who has done a lot of Unicode implementation, I
> > would say UTF-16 represents the best time/space tradeoff for an
> > internal representation. As I mentioned, it's what's used in
> > Windows, Mac OS X, ICU, and Java.

I guess why I'm being something of a pain-in-the-ass here, is that I
want to use your Unicode implementation expertise to know what these
time/space tradeoffs are. Are there any algorithmic asymptotic
complexity differences, or are these all constant factors? The constant
factors depend on projected workload. And are these actually tradeoffs,
except between UTF-32 (which uses native word sizes on 32-bit
platforms) and the other two? Smaller space means smaller cache
footprint, which can dominate.

Simplicity of algorithms is also a concern. Validating a byte sequence
as UTF-8 is harder than validating a sequence of 16-bit values as
UTF-16.

(I'd also like to see a reference to the Mac OS X encoding. I know that
the filesystem interface is UTF-8 (decomposed a certain way). Is it
just that UTF-16 is a common application choice, or is there some
common framework or library that uses that?)

-- Aaron Denney -><-
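Aaron's validation point can be illustrated: UTF-16 well-formedness is just a surrogate-pairing check, whereas a UTF-8 validator must also reject overlong forms, stray continuation bytes, surrogate code points, and values above U+10FFFF. A sketch of the UTF-16 side (list-based, name mine):

```haskell
import Data.Word (Word16)

-- A sequence of 16-bit units is valid UTF-16 iff every high surrogate
-- is immediately followed by a low surrogate, and no low surrogate
-- appears on its own.  That is the whole check.
validUTF16 :: [Word16] -> Bool
validUTF16 [] = True
validUTF16 (u:us)
  | isHigh u  = case us of
      (v:vs) | isLow v -> validUTF16 vs
      _                -> False
  | isLow u   = False
  | otherwise = validUTF16 us
  where
    isHigh w = w >= 0xD800 && w <= 0xDBFF
    isLow  w = w >= 0xDC00 && w <= 0xDFFF
```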
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
In message <[EMAIL PROTECTED]> Tony Finch <[EMAIL PROTECTED]> writes:
> On Thu, 27 Sep 2007, Ross Paterson wrote:
> >
> > Combining characters are not an issue here, just the surrogate
> > pairs, because we're discussing representations of sequences of
> > Chars (Unicode code points).
>
> I dislike referring to unicode code points as "characters" because
> that tends to imply a lot of invalid simplifications.

Just to be pedantic, Ross did say Char, not character. A Char is
defined in the Haskell report as a Unicode code point. As you say, that
does not directly correspond to what many people think of as a
character, due to combining characters etc.

Duncan
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On Thu, 27 Sep 2007, Ross Paterson wrote:
>
> Combining characters are not an issue here, just the surrogate pairs,
> because we're discussing representations of sequences of Chars
> (Unicode code points).

I dislike referring to unicode code points as "characters" because that
tends to imply a lot of invalid simplifications.

Tony.
-- 
f.a.n.finch <[EMAIL PROTECTED]> http://dotat.at/
IRISH SEA: SOUTHERLY, BACKING NORTHEASTERLY FOR A TIME, 3 OR 4. SLIGHT
OR MODERATE. SHOWERS. MODERATE OR GOOD, OCCASIONALLY POOR.
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
> Well, if you never heard anyone complaining about [Char] and never had
> any problem with its slowness, you're probably not in a field where
> the efficiency of a Unicode library is really a concern, that's for
> sure. (I know that the _main_ problem with [Char] wasn't random
> access, but you must admit [Char] isn't really a good example to speak
> about efficiency problems)

I have problems with [Char] and use ByteString instead, but that forces
me to keep track of the encoding myself, and hence UnicodeString.
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
2007/9/27, Duncan Coutts <[EMAIL PROTECTED]>:
> > Infrequent, but they exist, which means you can't seek x/2 bytes
> > ahead to seek x characters ahead. All such seeking must be linear
> > for both UTF-16 *and* UTF-8.
>
> And in [Char] for all these years, yet I don't hear people
> complaining. Most string processing is linear and does not need
> random access to characters.

Well, if you never heard anyone complaining about [Char] and never had
any problem with its slowness, you're probably not in a field where the
efficiency of a Unicode library is really a concern, that's for sure.
(I know that the _main_ problem with [Char] wasn't random access, but
you must admit [Char] isn't really a good example to speak about
efficiency problems)

-- Jedaï
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
In message <[EMAIL PROTECTED]> [EMAIL PROTECTED] writes:
> On 2007-09-27, Deborah Goldsmith <[EMAIL PROTECTED]> wrote:
> > On Sep 26, 2007, at 11:06 AM, Aaron Denney wrote:
> > > > UTF-16 has no advantage over UTF-8 in this respect, because of
> > > > surrogate pairs and combining characters.
> > >
> > > Good point.
> >
> > Well, not so much. As Duncan mentioned, it's a matter of what the
> > most common case is. UTF-16 is effectively fixed-width for the
> > majority of text in the majority of languages. Combining sequences
> > and surrogate pairs are relatively infrequent.
>
> Infrequent, but they exist, which means you can't seek x/2 bytes ahead
> to seek x characters ahead. All such seeking must be linear for both
> UTF-16 *and* UTF-8.

And in [Char] for all these years, yet I don't hear people complaining.
Most string processing is linear and does not need random access to
characters.

Duncan
[Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On 2007-09-27, Ross Paterson <[EMAIL PROTECTED]> wrote: > On Thu, Sep 27, 2007 at 07:26:07AM +, Aaron Denney wrote: >> On 2007-09-27, Ross Paterson <[EMAIL PROTECTED]> wrote: >> > Combining characters are not an issue here, just the surrogate pairs, >> > because we're discussing representations of sequences of Chars (Unicode >> > code points). >> >> You'll never want to combine combining characters or vice-versa? Never >> want to figure out how much screen space a sequence will take? It _is_ >> an issue. > > It's an issue for a higher layer, not for a compact String representation. Yes, and no. It's not something the lower layer should be doing, but enabling the higher layers to do so efficiently is a concern. -- Aaron Denney -><-
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On Thu, Sep 27, 2007 at 06:39:24AM +, Aaron Denney wrote: > On 2007-09-27, Deborah Goldsmith <[EMAIL PROTECTED]> wrote: > > Well, not so much. As Duncan mentioned, it's a matter of what the most > > common case is. UTF-16 is effectively fixed-width for the majority of > > text in the majority of languages. Combining sequences and surrogate > > pairs are relatively infrequent. > > Infrequent, but they exist, which means you can't seek x/2 bytes ahead > to seek x characters ahead. All such seeking must be linear for both > UTF-16 *and* UTF-8. You could get rapid seeks by ignoring the UTFs and representing strings as sequences of chunks, where each chunk is uniformly 8-bit, 16-bit or 32-bit as required to cover the characters it contains. Hardly anyone would need 32-bit chunks (and some of us would need only the 8-bit ones).
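[Editorial aside: the chunked scheme proposed in the message above can be sketched in a few lines of Haskell. This is a toy, list-based sketch; a real library would presumably use unboxed arrays for the chunks, and all names here are made up for illustration.]

```haskell
import Data.Char (chr)
import Data.Word (Word8, Word16, Word32)

-- Each chunk stores code points at one uniform width, so indexing
-- within a chunk is direct. Lists stand in for unboxed arrays here.
data Chunk
  = W8  [Word8]   -- code points up to 0xFF
  | W16 [Word16]  -- code points up to 0xFFFF (stored as scalar values, not UTF-16 units)
  | W32 [Word32]  -- anything up to 0x10FFFF

chunkLen :: Chunk -> Int
chunkLen (W8  ws) = length ws
chunkLen (W16 ws) = length ws
chunkLen (W32 ws) = length ws

chunkIndex :: Chunk -> Int -> Char
chunkIndex (W8  ws) i = chr (fromIntegral (ws !! i))
chunkIndex (W16 ws) i = chr (fromIntegral (ws !! i))
chunkIndex (W32 ws) i = chr (fromIntegral (ws !! i))

type ChunkedString = [Chunk]

-- Seek to character n by skipping whole chunks, then index directly:
-- O(number of chunks) rather than O(number of characters).
charAt :: ChunkedString -> Int -> Char
charAt (c:cs) n
  | n < chunkLen c = chunkIndex c n
  | otherwise      = charAt cs (n - chunkLen c)
charAt []     _    = error "charAt: index out of range"
```

With arrays for chunks and a cumulative-length index over them, the seek becomes logarithmic in the number of chunks, which is what makes the scheme competitive with the variable-width encodings being debated here.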
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On Thu, Sep 27, 2007 at 07:26:07AM +, Aaron Denney wrote: > On 2007-09-27, Ross Paterson <[EMAIL PROTECTED]> wrote: > > Combining characters are not an issue here, just the surrogate pairs, > > because we're discussing representations of sequences of Chars (Unicode > > code points). > > You'll never want to combine combining characters or vice-versa? Never > want to figure out how much screen space a sequence will take? It _is_ > an issue. It's an issue for a higher layer, not for a compact String representation.
[Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On 2007-09-27, Ross Paterson <[EMAIL PROTECTED]> wrote: > Combining characters are not an issue here, just the surrogate pairs, > because we're discussing representations of sequences of Chars (Unicode > code points). You'll never want to combine combining characters or vice-versa? Never want to figure out how much screen space a sequence will take? It _is_ an issue. -- Aaron Denney -><-
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On Wed, Sep 26, 2007 at 11:25:30AM +0100, Tony Finch wrote: > On Wed, 26 Sep 2007, Aaron Denney wrote: > > It's true that time-wise there are definite issues in finding character > > boundaries. > > UTF-16 has no advantage over UTF-8 in this respect, because of surrogate > pairs and combining characters. Combining characters are not an issue here, just the surrogate pairs, because we're discussing representations of sequences of Chars (Unicode code points).
[Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On 2007-09-27, Deborah Goldsmith <[EMAIL PROTECTED]> wrote: > On Sep 26, 2007, at 11:06 AM, Aaron Denney wrote: >>> UTF-16 has no advantage over UTF-8 in this respect, because of >>> surrogate >>> pairs and combining characters. >> >> Good point. > > Well, not so much. As Duncan mentioned, it's a matter of what the most > common case is. UTF-16 is effectively fixed-width for the majority of > text in the majority of languages. Combining sequences and surrogate > pairs are relatively infrequent. Infrequent, but they exist, which means you can't seek x/2 bytes ahead to seek x characters ahead. All such seeking must be linear for both UTF-16 *and* UTF-8. -- Aaron Denney -><-
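[Editorial aside: the point about linear seeking can be made concrete. In both encodings, only a code unit's bit pattern tells you whether it starts a code point, so finding the n-th character means scanning units. A small Haskell sketch, with illustrative function names not taken from any proposed library:]

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8, Word16)

-- Count code points in UTF-8: a byte starts a code point unless it is
-- a continuation byte of the form 10xxxxxx.
utf8Length :: [Word8] -> Int
utf8Length = length . filter (\b -> b .&. 0xC0 /= 0x80)

-- Count code points in UTF-16: a unit starts a code point unless it is
-- a low (trailing) surrogate in the range 0xDC00-0xDFFF.
utf16Length :: [Word16] -> Int
utf16Length = length . filter (\u -> u < 0xDC00 || u > 0xDFFF)
```

Both counts are necessarily O(n) in the number of code units; seeking to character x is the same scan stopped early, which is exactly why neither encoding offers O(1) indexing.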
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On Sep 26, 2007, at 11:06 AM, Aaron Denney wrote: >> UTF-16 has no advantage over UTF-8 in this respect, because of surrogate >> pairs and combining characters. > Good point. Well, not so much. As Duncan mentioned, it's a matter of what the most common case is. UTF-16 is effectively fixed-width for the majority of text in the majority of languages. Combining sequences and surrogate pairs are relatively infrequent. Speaking as someone who has done a lot of Unicode implementation, I would say UTF-16 represents the best time/space tradeoff for an internal representation. As I mentioned, it's what's used in Windows, Mac OS X, ICU, and Java. Deborah
[Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On 2007-09-26, Tony Finch <[EMAIL PROTECTED]> wrote: > On Wed, 26 Sep 2007, Aaron Denney wrote: >> >> It's true that time-wise there are definite issues in finding character >> boundaries. > > UTF-16 has no advantage over UTF-8 in this respect, because of surrogate > pairs and combining characters. Good point. -- Aaron Denney -><-
[Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On 2007-09-26, Johan Tibell <[EMAIL PROTECTED]> wrote: > On 9/26/07, Aaron Denney <[EMAIL PROTECTED]> wrote: >> On 2007-09-26, Johan Tibell <[EMAIL PROTECTED]> wrote: >> > If UTF-16 is what's used by everyone else (how about Java? Python?) I >> > think that's a strong reason to use it. I don't know Unicode well >> > enough to say otherwise. >> >> The internal representations don't matter except in the case of making >> FFI linkages. The external representations do, and UTF-8 has won on >> that front. > > It could matter for performance. However, you can encode your > UnicodeString into any external representation you want for your I/O > needs, including UTF-8. Right. I was trying to say "other languages' internal representations shouldn't affect the choice of those doing a Haskell implementation." -- Aaron Denney -><-
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On Wed, 26 Sep 2007, Aaron Denney wrote: > > It's true that time-wise there are definite issues in finding character > boundaries. UTF-16 has no advantage over UTF-8 in this respect, because of surrogate pairs and combining characters. Code points, characters, and glyphs are all different things, and it's very difficult to represent the latter two as anything other than a string of code points. Tony. -- f.a.n.finch <[EMAIL PROTECTED]> http://dotat.at/ IRISH SEA: SOUTHERLY, BACKING NORTHEASTERLY FOR A TIME, 3 OR 4. SLIGHT OR MODERATE. SHOWERS. MODERATE OR GOOD, OCCASIONALLY POOR.
Re: [Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On 9/26/07, Aaron Denney <[EMAIL PROTECTED]> wrote: > On 2007-09-26, Johan Tibell <[EMAIL PROTECTED]> wrote: > > If UTF-16 is what's used by everyone else (how about Java? Python?) I > > think that's a strong reason to use it. I don't know Unicode well > > enough to say otherwise. > > The internal representations don't matter except in the case of making > FFI linkages. The external representations do, and UTF-8 has won on > that front. It could matter for performance. However, you can encode your UnicodeString into any external representation you want for your I/O needs, including UTF-8.
[Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On 2007-09-26, Johan Tibell <[EMAIL PROTECTED]> wrote: > If UTF-16 is what's used by everyone else (how about Java? Python?) I > think that's a strong reason to use it. I don't know Unicode well > enough to say otherwise. The internal representations don't matter except in the case of making FFI linkages. The external representations do, and UTF-8 has won on that front. -- Aaron Denney -><-
[Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
On 2007-09-26, Deborah Goldsmith <[EMAIL PROTECTED]> wrote: > From an implementation point of view, UTF-16 is the most efficient > representation for processing Unicode. This depends on the characteristics of the text being processed. Space-wise, English stays 1 byte/char in UTF-8. Most European languages go up to at most 2, and on average only a bit above 1. Greek and Cyrillic are 2 bytes/char. It's really only Asian, African, Arabic, etc., scripts that lose space-wise. It's true that time-wise there are definite issues in finding character boundaries. -- Aaron Denney -><-
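[Editorial aside: the per-script byte counts in the message above follow directly from the UTF-8 length thresholds, which a one-liner makes checkable. The function name is illustrative, not from any proposed library.]

```haskell
import Data.Char (ord)

-- Bytes needed to encode a single code point in UTF-8.
utf8Bytes :: Char -> Int
utf8Bytes c
  | n <= 0x7F   = 1  -- ASCII: English letters and basic punctuation
  | n <= 0x7FF  = 2  -- Latin accents, Greek, Cyrillic, Arabic, Hebrew
  | n <= 0xFFFF = 3  -- most CJK and Indic scripts, among others
  | otherwise   = 4  -- supplementary planes (a surrogate pair in UTF-16)
  where n = ord c
```

So a typical European-language text averages only slightly more than 1 byte/char in UTF-8, while CJK text takes 3 bytes/char against UTF-16's 2, which is the space trade-off being argued over in this thread.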