Re: String != [Char]

2012-03-24 Thread Greg Weber
On Sat, Mar 24, 2012 at 7:26 PM, Gabriel Dos Reis
 wrote:
> On Sat, Mar 24, 2012 at 9:09 PM, Greg Weber  wrote:

>> Problem: we want to write beautiful (and possibly inefficient) code
>> that is easy to explain. If nothing else, this is pedagogically
>> important.
>> The goals of this code are to:
>>  * use list processing pattern matching and functions on a string type
>
> I may have missed this question so I will ask it (apologies if it is a
> repeat):  Why is it believed that list processing pattern matching is
> appropriate or the right tool for text processing?

Nobody said it is the right tool for text processing. In fact, I think
we all agreed it is the wrong tool for many cases. But it is easy for
students to understand since they are already being taught to use
lists for everything else.  It would be great if you can talk with
teachers of Haskell and figure out a better way to teach text
processing.

>
>
> -- Gaby

___
Haskell-prime mailing list
Haskell-prime@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-prime


Re: String != [Char]

2012-03-24 Thread Gabriel Dos Reis
On Sat, Mar 24, 2012 at 9:09 PM, Greg Weber  wrote:
> # Switching to Text by default makes us embarrassed!

Text processing /is/ quick to embarrassment :-)

> Problem: we want to write beautiful (and possibly inefficient) code
> that is easy to explain. If nothing else, this is pedagogically
> important.
> The goals of this code are to:
>  * use list processing pattern matching and functions on a string type

I may have missed this question so I will ask it (apologies if it is a
repeat):  Why is it believed that list processing pattern matching is
appropriate or the right tool for text processing?


-- Gaby



Re: String != [Char]

2012-03-24 Thread Gabriel Dos Reis
On Sat, Mar 24, 2012 at 8:51 PM, Johan Tibell  wrote:
> On Sat, Mar 24, 2012 at 5:54 PM, Gabriel Dos Reis
>  wrote:
>> I think there is a confusion here.  A Unicode character is an abstract
>> entity.  For it to exist in some concrete form in a program, you need
>> an encoding.  The fact that char16_t is 16-bit wide is irrelevant to
>> whether it can be used in a representation of a Unicode text, just like
>> uint8_t (e.g. 'unsigned char') can be used to encode Unicode string
>> despite it being only 8-bit wide.   You do not need to make the
>> character type exactly equal to the type of the individual element
>> in the text representation.
>
> Well, if you have a >21-bit type you can declare its value to be a
> Unicode code point (which are numbered.)

That is correct.  Because not all Unicode code points represent characters,
and not all Unicode code point sequences represent valid characters,
even if you have that >21-bit type T, the list type [T] would still not be a
good string type.

> Using a char* that you claim
> contains utf-8 encoded data is bad for safety, as there is no guarantee
> that that's indeed the case.

Indeed, and that is why a Text should be an abstract datatype, hiding
the concrete implementation away from the user.

>> Note also that an encoding itself (whether UTF-8, UTF-16, etc.) is
>> insufficient as far as text processing goes; you also need a
>> localization at the minimum.  It is the combination of the two that
>> gives some meaning to text representation and operations.
>
> text does that via ICU. Some operations would be possible without
> using the locale, if it wasn't for those Turkish i:s. :/

yeah, 7 bits should be enough for every character ;-)

-- Gaby



Re: String != [Char]

2012-03-24 Thread Greg Weber
# Switching to Text by default makes us embarrassed!

Problem: we want to write beautiful (and possibly inefficient) code
that is easy to explain. If nothing else, this is pedagogically
important.
The goals of this code are to:
  * use list processing pattern matching and functions on a string type
  * avoid embarrassing name clashes and the need for qualified names
(T.split, etc)

The second point is Haskell's festering language design sore rearing
its ugly head.
Let's note that the current state of Haskell is not any more beautiful
than what will happen after this proposal is implemented. It is just
that we currently have partly hidden away a deficiency in Haskell by
only exporting list functions in the Prelude. So our real goal is to
come up with conventions and/or hacks that will allow us to continue
to hide this deficiency of Haskell for the purposes of pedagogy.
If you can't tell, IMHO the issue we are circumventing is Haskell's
biggest issue from a language design perspective. It is a shame that
SPJ's TDNR proposal was shouted down and no alternative has been
given.

But I am not going to hold out hope that this issue will be solved any
time soon. Even limiting the solution to records has proved very
difficult. So on to our hacks for making Text the default string type!


## Option 1: T. prefixing

Using Text functions still requires the T. prefix.
For pedagogy, continue to use [Char], but use an OverloadedText extension.

This is a safe conservative option that puts us in a better place than
we are today.
It just makes us look strange when we build something into the
language that requires a prefix.
Of course, we could try to give every Text function a slightly
different name than the Prelude list functions, but I think that will
make using Haskell more difficult than putting up with prefixes.
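
As a rough sketch of what Option 1 looks like in practice, assuming the
text package and using GHC's existing OverloadedStrings extension as a
stand-in for the OverloadedText extension named above (which is only a
proposal here). The names shout and shoutS are made up for illustration:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Char (toUpper)
import qualified Data.Text as T

-- Text version: every Text function carries the T. prefix, because the
-- unqualified names (map, filter, ...) are the Prelude list functions.
shout :: T.Text -> T.Text
shout = T.map toUpper . T.filter (/= ' ')

-- [Char] version, as it would still be taught: no prefix needed.
shoutS :: String -> String
shoutS = map toUpper . filter (/= ' ')
```

Loaded in GHCi, both functions accept a plain string literal; only the
prefix distinguishes the two definitions.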


## Option 2: TDNR for lists

(Prelude) list functions are resolved in a special way.
For example, we could have 2 different map functions in scope
unqualified: one for lists, and one for Text. The compiler is tasked
with resolving whether the type is a list or not and determining the
appropriate function.

I would much rather add a TDNR construct to the language in a
universal way than go down this route.


## Option 3: implicit List typeclass

We can operate on Text (and other non-list data structures) using a
List typeclass.
We have 2 concerns:
  * list pattern matching ('c':string)
  * requiring the typeclass in the type signature everywhere

I think we can extend the compiler to pattern match characters out of
Text, so let's move on to the second point.
If we don't write type signatures anywhere, we actually won't care about it.
However, if we add sparse annotations, we will need a List constraint.

  listF :: List l => ...

This could get tiresome quickly. It makes pedagogy immediately delve
into an explanation of typeclasses. A simple solution is to special
case the List class.
We declare that List is so fundamental to Haskell that requiring the
List typeclass is not necessary.
The Prelude exports (class List where ...).
If a List typeclass function is used, the compiler inserts the List
typeclass constraint into a type signature automatically.

This option is very attractive because it solves all of our problems
at the cost of 1 easy to explain piece of magic. It also makes it
possible to unify list behavior across different data types without
the hassle of typeclass insertions everywhere.
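
A minimal sketch of the kind of class Option 3 has in mind, written with
today's GHC and the text package. The class is called ListLike here
(a hypothetical name standing in for the proposal's List), it has a
single method, and the constraint is written out by hand, since the
automatic insertion described above is not something any compiler does:

```haskell
{-# LANGUAGE TypeFamilies #-}

import qualified Data.Text as T

-- Hypothetical class: anything that can be taken apart head-first.
class ListLike l where
  type Elem l
  uncons :: l -> Maybe (Elem l, l)

instance ListLike [a] where
  type Elem [a] = a
  uncons []     = Nothing
  uncons (x:xs) = Just (x, xs)

instance ListLike T.Text where
  type Elem T.Text = Char
  uncons = T.uncons

-- Written once, usable on String and Text alike. Under the proposal the
-- "ListLike l =>" part would be supplied by the compiler.
count :: ListLike l => (Elem l -> Bool) -> l -> Int
count p l = case uncons l of
  Nothing        -> 0
  Just (x, rest) -> (if p x then 1 else 0) + count p rest
```

The case expression on uncons plays the role that the ('c':string)
pattern would play if the compiler could match characters out of Text.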



Re: String != [Char]

2012-03-24 Thread Johan Tibell
On Sat, Mar 24, 2012 at 5:54 PM, Gabriel Dos Reis
 wrote:
> I think there is a confusion here.  A Unicode character is an abstract
> entity.  For it to exist in some concrete form in a program, you need
> an encoding.  The fact that char16_t is 16-bit wide is irrelevant to
> whether it can be used in a representation of a Unicode text, just like
> uint8_t (e.g. 'unsigned char') can be used to encode Unicode string
> despite it being only 8-bit wide.   You do not need to make the
> character type exactly equal to the type of the individual element
> in the text representation.

Well, if you have a >21-bit type you can declare its value to be a
Unicode code point (which are numbered.) Using a char* that you claim
contains utf-8 encoded data is bad for safety, as there is no guarantee
that that's indeed the case.

> Note also that an encoding itself (whether UTF-8, UTF-16, etc.) is
> insufficient as far as text processing goes; you also need a
> localization at the minimum.  It is the combination of the two that
> gives some meaning to text representation and operations.

text does that via ICU. Some operations would be possible without
using the locale, if it wasn't for those Turkish i:s. :/
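
To make the Turkish case concrete, here is a small sketch using only
base. The Locale type and toUpperIn are hypothetical names invented for
this illustration; they are not how text or ICU expose locales, and only
the dotted/dotless i distinction is handled:

```haskell
import Data.Char (toUpper)

-- Hypothetical two-value locale, just enough for the example.
data Locale = Default | Turkish
  deriving (Eq, Show)

-- Upper-case a string under a given locale.
toUpperIn :: Locale -> String -> String
toUpperIn Default = map toUpper
toUpperIn Turkish = map turkish
  where
    turkish 'i'     = '\x130'  -- LATIN CAPITAL LETTER I WITH DOT ABOVE
    turkish '\x131' = 'I'      -- dotless i upper-cases to plain I
    turkish c       = toUpper c
```

The same input, "title", upper-cases to "TITLE" under Default and to a
string containing U+0130 under Turkish, which is why the operation
cannot be defined on code points alone.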

-- Johan



Re: String != [Char]

2012-03-24 Thread Greg Weber
Can we all agree that

* Text can now demonstrate both CPU and RAM performance improvements
in benchmarks. Because Text is an opaque type it has a maximum
potential for future performance improvements. Declaring a String to
be a list limits performance improvements
* In a Unicode world, String = [Char] is not always correct: instead
for some operations one must operate on the String as a whole. Using a
[Char] type makes it much more likely for a programmer to  mistakenly
operate on individual characters. Using a Text type allows us to
choose to not expose character manipulation functions.
* The usage of String in the base libraries will continue as long as
Text is not in the language standard. This will continue to make
writing Haskell code a greater chore than is necessary: converting
between types, and working around the inconvenience of defining
typeclasses that operate on both String and [].


These are important enough to *try* to include Text in the standard,
even if there are objections to how it might practically be included.



Re: String != [Char]

2012-03-24 Thread Gabriel Dos Reis
On Sat, Mar 24, 2012 at 7:16 PM, Johan Tibell  wrote:
> On Sat, Mar 24, 2012 at 4:42 PM, Gabriel Dos Reis
>  wrote:
>> Hmm, std::u16string, std::u32string, and std::wstring are C++ standard
>> types to process Unicode texts.
>
> Note that at least u16string is too small to encode all of Unicode and
> wstring might be too, as 16 bits is not enough to encode all of Unicode.
>

I think there is a confusion here.  A Unicode character is an abstract
entity.  For it to exist in some concrete form in a program, you need
an encoding.  The fact that char16_t is 16-bit wide is irrelevant to
whether it can be used in a representation of a Unicode text, just like
uint8_t (e.g. 'unsigned char') can be used to encode Unicode string
despite it being only 8-bit wide.   You do not need to make the
character type exactly equal to the type of the individual element
in the text representation.
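
The point can be sketched in Haskell with base alone: the element type
of the representation (8 or 16 bits) is narrower than a code point, and
the encoding makes up the difference. The function names are invented
for this example, and the encoders are deliberately minimal (they do not
reject surrogate code points):

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word16, Word8)

-- Encode one code point as one to four 8-bit UTF-8 code units.
encodeUtf8Char :: Char -> [Word8]
encodeUtf8Char c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    hi s   = fromIntegral (n `shiftR` s)
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Encode one code point as one or two 16-bit UTF-16 code units.
encodeUtf16Char :: Char -> [Word16]
encodeUtf16Char c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ 0xD800 + fromIntegral (m `shiftR` 10)
                  , 0xDC00 + fromIntegral (m .&. 0x3FF) ]
  where
    n = ord c
    m = n - 0x10000
```

A code point above U+FFFF, such as U+1F600, comes out as four 8-bit
units or two 16-bit units, so neither element type corresponds one to
one with characters.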

Now, if you want to make a one-to-one correspondence between
individual elements in a std::basic_string and a Unicode character,
you would of course go for char32_t, which might be wasteful
depending on the circumstances.  Text processing languages like Perl
have long decided to de-emphasize one-character-at-a-time processing.
For most common cases, it is just inefficient.  But, I also understand
that the efficiency argument may not be strong in the context of Haskell.
However, I believe a particular attention must be paid to the correctness
of the semantics.

Note also that an encoding itself (whether UTF-8, UTF-16, etc.) is
insufficient as far as text processing goes; you also need a
localization at the minimum.  It is the combination of the two that
gives some meaning to text representation and operations.

I have been following the discussion, but I don't see anything said
about locales.

-- Gaby



Re: String != [Char]

2012-03-24 Thread Johan Tibell
On Sat, Mar 24, 2012 at 4:42 PM, Gabriel Dos Reis
 wrote:
> Hmm, std::u16string, std::u32string, and std::wstring are C++ standard
> types to process Unicode texts.

Note that at least u16string is too small to encode all of Unicode and
wstring might be too, as 16 bits is not enough to encode all of Unicode.

-- Johan



Re: String != [Char]

2012-03-24 Thread Gabriel Dos Reis
On Sat, Mar 24, 2012 at 6:00 PM, Johan Tibell  wrote:

> C++'s char* is the moral equivalent of our ByteString, not Text. There's
> no standardized C++ Unicode string type; ICU's UnicodeString is
> perhaps the closest to one.

Hmm, std::u16string, std::u32string, and std::wstring are C++ standard
types to process Unicode texts.

Anyway, my inclination is that having a proper string type in Haskell would
be a Good Thing.  Sometimes it is worth breaking the textbook.

In our local Haskell system for AVR microcontrollers, we explicitly made
String distinct from [Char] -- we cannot afford the memory
inefficiency that [Char] entails, just to represent simple strings.

-- Gaby



Re: String != [Char]

2012-03-24 Thread Gabriel Dos Reis
On Sat, Mar 24, 2012 at 5:33 PM, Freddie Manners  wrote:
> To add my tuppence-worth on this, addressed to no-one in particular:
>
> (1) I think getting hung up on UTF-8 correctness is a distraction here.  I
> can't imagine anyone suggesting that the C/C++ standards removed support for
> (char*) because it wasn't UTF-8 correct: sure, you'd recommend people use a
> different type when it matters, but the language standard itself shouldn't
> be driven by technical issues that don't affect most people most of the
> time.  I'm sure it's good engineering practice to worry about these things,
> but the standard isn't there to encourage good engineering practice.

C++ does not consider 'char*' as the type of a string.

It has a standard template std::basic_string that can be instantiated on
char (giving std::string) or encoding type (of unicode characters) char16_t,
char32_t, and wchar_t giving rise to u16string, u32string, and wstring.
It has a large number of functions to manipulate a string as a sequence
(Haskell's status quo) or as a text thanks to an elaborate
localization machinery.

-- Gaby, back to lurking mode



Re: String != [Char]

2012-03-24 Thread Johan Tibell
On Sat, Mar 24, 2012 at 3:33 PM, Freddie Manners  wrote:
> To add my tuppence-worth on this, addressed to no-one in particular:
>
> (1) I think getting hung up on UTF-8 correctness is a distraction here.  I
> can't imagine anyone suggesting that the C/C++ standards removed support for
> (char*) because it wasn't UTF-8 correct: sure, you'd recommend people use a
> different type when it matters, but the language standard itself shouldn't
> be driven by technical issues that don't affect most people most of the
> time.  I'm sure it's good engineering practice to worry about these things,
> but the standard isn't there to encourage good engineering practice.

(I assume you mean Unicode correctness. UTF-8 is only one possible
encoding. Also I'm not arguing for removing type String = [Char], I'm
arguing why Text is better than String.)

C++'s char* is the moral equivalent of our ByteString, not Text. There's
no standardized C++ Unicode string type; ICU's UnicodeString is
perhaps the closest to one.

-- Johan



Re: String != [Char]

2012-03-24 Thread Thomas Schilling
On 24 March 2012 22:33, Freddie Manners  wrote:
> To add my tuppence-worth on this, addressed to no-one in particular:
>
> (1) I think getting hung up on UTF-8 correctness is a distraction here.  I
> can't imagine anyone suggesting that the C/C++ standards removed support for
> (char*) because it wasn't UTF-8 correct: sure, you'd recommend people use a
> different type when it matters, but the language standard itself shouldn't
> be driven by technical issues that don't affect most people most of the
> time.  I'm sure it's good engineering practice to worry about these things,
> but the standard isn't there to encourage good engineering practice.

It doesn't really have anything to do with UTF-8.  UTF-8 is just a
particular serialisation of a unicode string.

Here's a simple illustration of the problems one faces:  Let's say you
want to search for the string "fix".  Now, the problem is that the
sequence 'f','i' could be represented either as ['f', 'i'] or as [chr
0xfb01] (the "fi" ligature).  The text-icu package provides a function
to normalise a string such that only one of these forms can occur in
each string.  Because the world's languages are rather complex there
are many more such cases which need to be handled properly (if you
don't want to run into weird corner cases).
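
The search problem can be reproduced with base alone. In the sketch
below, expandFi is a hypothetical, hand-rolled stand-in for real
normalisation: it expands only this one ligature, whereas a proper
normaliser such as the one in text-icu covers the general case:

```haskell
import Data.List (isInfixOf)

-- A document that happens to use the single code point U+FB01
-- (LATIN SMALL LIGATURE FI) instead of 'f' followed by 'i'.
doc :: String
doc = "pre\xFB01x"

-- Hypothetical stand-in for normalisation: expands one ligature only.
expandFi :: String -> String
expandFi = concatMap expand
  where
    expand '\xFB01' = "fi"
    expand c        = [c]

-- Searching code point by code point misses the match ...
naiveHit :: Bool
naiveHit = "fix" `isInfixOf` doc

-- ... while searching the expanded form finds it.
normalisedHit :: Bool
normalisedHit = "fix" `isInfixOf` expandFi doc
```

Here naiveHit is False and normalisedHit is True, although a reader
would see the same word in both cases.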

> (2) I'd suggest that a proposal that advocated overloaded string literals --
> of which [Char] was an option -- couldn't be much more confusing from a
> pedagogical perspective than the fact that numeric literals are overloaded.
>  Since that seems to be one of the main biases in favour of [Char] in the
> current standard, that might be a possible incremental fix.

I agree that this proposal should probably include the standardisation
of the OverloadedStrings extension.

>
> Best,
> Freddie
>
>
> On 24 March 2012 22:15, Ian Lynagh  wrote:
>>
>> On Sat, Mar 24, 2012 at 08:38:23PM +, Thomas Schilling wrote:
>> > On 24 March 2012 20:16, Ian Lynagh  wrote:
>> > >
>> > >> Correctness
>> > >> ==
>> > >>
>> > >> Using list-based operations on Strings are almost always wrong
>> > >
>> > > Data.Text seems to think that many of them are worth reimplementing
>> > > for
>> > > Text. It looks like someone's systematically gone through Data.List.
>> >
>> > That's exactly what happened as part of the platform inclusion
>> > process.  In fact, there was quite a bit of bike shedding whether the
>> > Text API should be compatible with the list API or not.  In the end
>> > the decision was made to add all the list functions even if that
>> > encouraged running into unicode issues.  I'm pretty sure you
>> > participated in that discussion.
>>
>> As far as I remember, a few functions were added to text and bytestring
>> during that, but mostly the discussion was about naming.
>>
>> Even in the first 0.1 release of text:
>>
>>  http://hackage.haskell.org/packages/archive/text/0.1/doc/html/Data-Text.html
>> there is a large amount of Data.List covered, e.g. map, transpose,
>> foldl1', minimum, mapAccumR, groupBy.
>>
>>
>> Thanks
>> Ian



-- 
Push the envelope. Watch it bend.



Re: String != [Char]

2012-03-24 Thread Johan Tibell
On Sat, Mar 24, 2012 at 3:45 PM, Isaac Dupree
 wrote:
> How is Text for small strings currently (e.g. one English word, if not one
> character)?  Can we reasonably recommend it for that?
> This recent question suggests it's still not great:
> http://stackoverflow.com/questions/9398572/memory-efficient-strings-in-haskell

It's definitely not as good as it could be with the common case being
2 bytes per code point and then some fixed overhead.

The UTF-8 GSoC project last summer was an attempt to see if we could
do better, but unfortunately GHC does a worse job streaming out of a
byte array containing utf-8 than out of a byte array containing utf-16
(due to bad branch layout.)

This resulted in a mix of performance gains and losses. As there are
other engineering benefits in favor of utf-16 (e.g. being able to use
ICU efficiently) we opted not to switch the encoding. If we can get GHC
to the point where it compiles a utf-8 based Text really well, we could
reconsider this decision.

There's also a design trade-off in Text that favors better asymptotic
complexity for some operations (e.g. taking substrings) that adds 2
words of overhead to every string.
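
The trade-off can be modelled in a few lines. This is a simplified
sketch, not the actual definition in the text package: Slice and its
functions are hypothetical names, and the backing store is a boxed
Data.Array rather than a packed byte array. It only shows why carrying
an offset and a length makes taking substrings cheap:

```haskell
import Data.Array (Array, listArray, (!))

-- A shared backing array plus the two extra words: offset and length.
data Slice = Slice
  { backing :: Array Int Char  -- shared, never copied by takeS/dropS
  , offset  :: !Int
  , len     :: !Int
  }

fromString :: String -> Slice
fromString s = Slice (listArray (0, n - 1) s) 0 n
  where n = length s

-- O(1): only the offset and length change; no characters are copied.
takeS, dropS :: Int -> Slice -> Slice
takeS k (Slice a o l) = Slice a o (max 0 (min k l))
dropS k (Slice a o l) = Slice a (o + k') (l - k')
  where k' = max 0 (min k l)

-- Materialise the slice (linear in the slice length).
toString :: Slice -> String
toString (Slice a o l) = [a ! i | i <- [o .. o + l - 1]]
```

The cost is that even a one-character string pays for the two extra
fields, which is the overhead being discussed for small strings.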

-- Johan



Re: String != [Char]

2012-03-24 Thread Isaac Dupree

On 03/24/2012 02:50 PM, Johan Tibell wrote:

[...]
Furthermore, the memory overhead of Text is smaller, which means that
applications that hold on to many string values will use less heap and
thus experience smaller "freezes" due to major GC collections, which are
linear in the heap size.


How is Text for small strings currently (e.g. one English word, if not 
one character)?  Can we reasonably recommend it for that?

This recent question suggests it's still not great:
http://stackoverflow.com/questions/9398572/memory-efficient-strings-in-haskell



Re: String != [Char]

2012-03-24 Thread Thomas Schilling
On 24 March 2012 22:27, Ian Lynagh  wrote:
> On Sat, Mar 24, 2012 at 05:31:48PM -0400, Brandon Allbery wrote:
>> On Sat, Mar 24, 2012 at 16:16, Ian Lynagh  wrote:
>>
>> > On Sat, Mar 24, 2012 at 11:50:10AM -0700, Johan Tibell wrote:
>> > > Using list-based operations on Strings are almost always wrong
>> >
>> > Data.Text seems to think that many of them are worth reimplementing for
>> > Text. It looks like someone's systematically gone through Data.List.
>> > And in fact, very few functions there /don't/ look like they are
>> > directly equivalent to list functions.
>> >
>>
>> I was under the impression they have been very carefully designed to do the
>> right thing with characters represented by multiple codepoints, which is
>> something the String version *cannot* do.  It would help if Bryan were
>> involved with this discussion, though.  (I'm cc:ing him on this.)  Since
>> the whole point of Data.Text is to handle stuff like this properly I would
>> be surprised if your assertion that
>>
>> > >     upcase :: String -> String
>> > >     upcase = map toUpper
>> >
>> > This is no more incorrect than
>> >     upcase = Data.Text.map toUpper
>>
>> is correct.
>
> I don't see how it could do any better, given both use
>    toUpper :: Char -> Char
> to do the hard work. That's why there is also a
>    Data.Text.toUpper :: Text -> Text
>
> Based on a very quick skim I think that there are only 3 such functions
> in Data.Text (toCaseFold, toLower, toUpper), although the 3
> justification functions may handle double-width characters properly.
>
>
> Anyway, my main point is that I don't think that either text or String
> should make it any easier for people to get things right. It's true that
> currently only text makes correct case-conversions easy, but only
> because no-one's written Data.String.to* yet.

The reason Text uses UTF-16 internally is so that it can be used with
the ICU library (written in C, I think) which implements all the
difficult things (http://hackage.haskell.org/package/text-icu).
Reimplementing all that in Haskell would be a significant undertaking.
You could do the same for String, but that would have to encode and
re-encode on each invocation.

BTW, I checked the version history of the text package and most of the
list functions existed already in Tom Harper's version that text was
based on in 2009.  If you look at the documentation you can see that
many of the list-like functions treat some invalid characters
specially, so they are different.



Re: String != [Char]

2012-03-24 Thread Freddie Manners
To add my tuppence-worth on this, addressed to no-one in particular:

(1) I think getting hung up on UTF-8 correctness is a distraction here.  I
can't imagine anyone suggesting that the C/C++ standards removed support
for (char*) because it wasn't UTF-8 correct: sure, you'd recommend people
use a different type when it matters, but the language standard itself
shouldn't be driven by technical issues that don't affect most people most
of the time.  I'm sure it's good engineering practice to worry about these
things, but the standard isn't there to encourage good engineering practice.

(2) I'd suggest that a proposal that advocated overloaded string literals
-- of which [Char] was an option -- couldn't be much more confusing from a
pedagogical perspective than the fact that numeric literals are overloaded.
 Since that seems to be one of the main biases in favour of [Char] in the
current standard, that might be a possible incremental fix.

Best,
Freddie

On 24 March 2012 22:15, Ian Lynagh  wrote:

> On Sat, Mar 24, 2012 at 08:38:23PM +, Thomas Schilling wrote:
> > On 24 March 2012 20:16, Ian Lynagh  wrote:
> > >
> > >> Correctness
> > >> ==
> > >>
> > >> Using list-based operations on Strings are almost always wrong
> > >
> > > Data.Text seems to think that many of them are worth reimplementing for
> > > Text. It looks like someone's systematically gone through Data.List.
> >
> > That's exactly what happened as part of the platform inclusion
> > process.  In fact, there was quite a bit of bike shedding whether the
> > Text API should be compatible with the list API or not.  In the end
> > the decision was made to add all the list functions even if that
> > encouraged running into unicode issues.  I'm pretty sure you
> > participated in that discussion.
>
> As far as I remember, a few functions were added to text and bytestring
> during that, but mostly the discussion was about naming.
>
> Even in the first 0.1 release of text:
>
> http://hackage.haskell.org/packages/archive/text/0.1/doc/html/Data-Text.html
> there is a large amount of Data.List covered, e.g. map, transpose,
> foldl1', minimum, mapAccumR, groupBy.
>
>
> Thanks
> Ian


Re: String != [Char]

2012-03-24 Thread Ian Lynagh
On Sat, Mar 24, 2012 at 05:31:48PM -0400, Brandon Allbery wrote:
> On Sat, Mar 24, 2012 at 16:16, Ian Lynagh  wrote:
> 
> > On Sat, Mar 24, 2012 at 11:50:10AM -0700, Johan Tibell wrote:
> > > Using list-based operations on Strings are almost always wrong
> >
> > Data.Text seems to think that many of them are worth reimplementing for
> > Text. It looks like someone's systematically gone through Data.List.
> > And in fact, very few functions there /don't/ look like they are
> > directly equivalent to list functions.
> >
> 
> I was under the impression they have been very carefully designed to do the
> right thing with characters represented by multiple codepoints, which is
> something the String version *cannot* do.  It would help if Bryan were
> involved with this discussion, though.  (I'm cc:ing him on this.)  Since
> the whole point of Data.Text is to handle stuff like this properly I would
> be surprised if your assertion that
> 
> > >     upcase :: String -> String
> > >     upcase = map toUpper
> >
> > This is no more incorrect than
> >     upcase = Data.Text.map toUpper
> 
> is correct.

I don't see how it could do any better, given both use
    toUpper :: Char -> Char
to do the hard work. That's why there is also a
    Data.Text.toUpper :: Text -> Text

Based on a very quick skim I think that there are only 3 such functions
in Data.Text (toCaseFold, toLower, toUpper), although the 3
justification functions may handle double-width characters properly.


Anyway, my main point is that I don't think that either text or String
should make it any easier for people to get things right. It's true that
currently only text makes correct case-conversions easy, but only
because no-one's written Data.String.to* yet.


Thanks
Ian




Re: String != [Char]

2012-03-24 Thread Ian Lynagh
On Sat, Mar 24, 2012 at 08:38:23PM +, Thomas Schilling wrote:
> On 24 March 2012 20:16, Ian Lynagh  wrote:
> >
> >> Correctness
> >> ==
> >>
> >> Using list-based operations on Strings are almost always wrong
> >
> > Data.Text seems to think that many of them are worth reimplementing for
> > Text. It looks like someone's systematically gone through Data.List.
> 
> That's exactly what happened as part of the platform inclusion
> process.  In fact, there was quite a bit of bike shedding whether the
> Text API should be compatible with the list API or not.  In the end
> the decision was made to add all the list functions even if that
> encouraged running into unicode issues.  I'm pretty sure you
> participated in that discussion.

As far as I remember, a few functions were added to text and bytestring
during that, but mostly the discussion was about naming.

Even in the first 0.1 release of text:
  http://hackage.haskell.org/packages/archive/text/0.1/doc/html/Data-Text.html
there is a large amount of Data.List covered, e.g. map, transpose,
foldl1', minimum, mapAccumR, groupBy.


Thanks
Ian




Re: String != [Char]

2012-03-24 Thread Johan Tibell
On Sat, Mar 24, 2012 at 2:31 PM, Brandon Allbery  wrote:
> I was under the impression they have been very carefully designed to do the
> right thing with characters represented by multiple codepoints, which is
> something the String version *cannot* do.  It would help if Bryan were
> involved with this discussion, though.  (I'm cc:ing him on this.)  Since the
> whole point of Data.Text is to handle stuff like this properly I would be
> surprised if your assertion that
>
>> >     upcase :: String -> String
>> >     upcase = map toUpper
>>
>> This is no more incorrect than
>>    upcase = Data.Text.map toUpper
>
>
> is correct.

This is simply not possible given the Unicode specification. There's
no single code point that corresponds to the two characters used to
represent an upcased version of the eszett (ß). I think the list-based
API predates Bryan.
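
The difference is easy to observe with the text package. In this sketch
the names word, perChar, perCharText and wholeText are made up for the
example; the functions used (Data.Char.toUpper, Data.Text.map and
Data.Text.toUpper) are the ones discussed in this thread:

```haskell
import Data.Char (toUpper)
import qualified Data.Text as T

eszett :: Char
eszett = '\xDF'  -- LATIN SMALL LETTER SHARP S

word :: String
word = "stra" ++ [eszett] ++ "e"

-- Char-by-Char mapping cannot grow the string, so the eszett survives.
perChar :: String
perChar = map toUpper word

-- The same limitation applies to mapping over a Text.
perCharText :: T.Text
perCharText = T.map toUpper (T.pack word)

-- The Text -> Text function may replace one character with two.
wholeText :: T.Text
wholeText = T.toUpper (T.pack word)
```

wholeText is "STRASSE" (seven characters), while perChar and perCharText
keep six characters and still contain the lower-case eszett.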

-- Johan



Re: String != [Char]

2012-03-24 Thread Johan Tibell
On Sat, Mar 24, 2012 at 1:16 PM, Ian Lynagh  wrote:
> Data.Text seems to think that many of them are worth reimplementing for
> Text. It looks like someone's systematically gone through Data.List.
> And in fact, very few functions there /don't/ look like they are
> directly equivalent to list functions.

I'm not sure why the list-inspired functions are there. It doesn't
really matter. It doesn't change the fact that from a Unicode
perspective they give the wrong result in most situations.

> This is no more incorrect than
>    upcase = Data.Text.map toUpper

No, and that's why Bryan added correct case modification, case
folding, etc. to text.

> There's no reason that there couldn't be a Data.String.toUpper
> corresponding to Data.Text.toUpper.

That's true. But this isn't the point we were discussing. We were
discussing whether the simplification of treating strings as a list is
a good thing (from an educational perspective.) I pointed out that
from a correctness perspective it's wrong.

> I think Heinrich meant 20% performance in a useful program, not a
> micro-benchmark.

I think that's what he meant, and given that "useful program" isn't
defined, the 20% number is completely arbitrary.

-- Johan



Re: String != [Char]

2012-03-24 Thread Brandon Allbery
On Sat, Mar 24, 2012 at 16:16, Ian Lynagh  wrote:

> On Sat, Mar 24, 2012 at 11:50:10AM -0700, Johan Tibell wrote:
> > Using list-based operations on Strings are almost always wrong
>
> Data.Text seems to think that many of them are worth reimplementing for
> Text. It looks like someone's systematically gone through Data.List.
> And in fact, very few functions there /don't/ look like they are
> directly equivalent to list functions.
>

I was under the impression they have been very carefully designed to do the
right thing with characters represented by multiple codepoints, which is
something the String version *cannot* do.  It would help if Bryan were
involved with this discussion, though.  (I'm cc:ing him on this.)  Since
the whole point of Data.Text is to handle stuff like this properly I would
be surprised if your assertion that

> > upcase :: String -> String
> > upcase = map toUpper
>
> This is no more incorrect than
>     upcase = Data.Text.map toUpper
>

is correct.

-- 
brandon s allbery  allber...@gmail.com
wandering unix systems administrator (available) (412) 475-9364 vm/sms


Re: String != [Char]

2012-03-24 Thread Thomas Schilling
On 24 March 2012 20:16, Ian Lynagh  wrote:
>
> Hi Johan,
>
> On Sat, Mar 24, 2012 at 11:50:10AM -0700, Johan Tibell wrote:
>>
>> On Sat, Mar 24, 2012 at 12:39 AM, Heinrich Apfelmus
>>  wrote:
>> > Which brings me to the fundamental question behind this proposal: Why do we
>> > need Text at all? What are its virtues and how do they compare? What is the
>> > trade-off? (I'm not familiar enough with the Text library to answer these.)
>> >
>> > To put it very pointedly: is a %20 performance increase on the current
>> > generation of computers worth the cost in terms of ease-of-use, when the
>> > performance can equally be gained by buying a faster computer or more RAM?
>> > I'm not sure whether I even agree with this statement, but this is the
>> > trade-off we are deciding on.
>>
>> Correctness
>> ==
>>
>> Using list-based operations on Strings are almost always wrong
>
> Data.Text seems to think that many of them are worth reimplementing for
> Text. It looks like someone's systematically gone through Data.List.

That's exactly what happened as part of the platform inclusion
process.  In fact, there was quite a bit of bikeshedding over whether
the Text API should be compatible with the list API or not.  In the end
the decision was made to add all the list functions even if that
encouraged running into Unicode issues.  I'm pretty sure you
participated in that discussion.



>> Performance
>> ===
>>
>> Depending on the benchmark, the difference can be much bigger than
>> 20%. For example, here's a comparison of decoding UTF-8 byte data into
>> a String vs a Text value:
>
> I think Heinrich meant 20% performance in a useful program, not a
> micro-benchmark.

Generating web sites is a huge application area of Haskell and one
where a proper text type is in no way a micro-optimisation.



Re: String != [Char]

2012-03-24 Thread Ian Lynagh

Hi Johan,

On Sat, Mar 24, 2012 at 11:50:10AM -0700, Johan Tibell wrote:
> 
> On Sat, Mar 24, 2012 at 12:39 AM, Heinrich Apfelmus
>  wrote:
> > Which brings me to the fundamental question behind this proposal: Why do we
> > need Text at all? What are its virtues and how do they compare? What is the
> > trade-off? (I'm not familiar enough with the Text library to answer these.)
> >
> > To put it very pointedly: is a %20 performance increase on the current
> > generation of computers worth the cost in terms of ease-of-use, when the
> > performance can equally be gained by buying a faster computer or more RAM?
> > I'm not sure whether I even agree with this statement, but this is the
> > trade-off we are deciding on.
> 
> Correctness
> ==
> 
> Using list-based operations on Strings are almost always wrong

Data.Text seems to think that many of them are worth reimplementing for
Text. It looks like someone's systematically gone through Data.List.
And in fact, very few functions there /don't/ look like they are
directly equivalent to list functions.

> , as
> soon as you move away from English text. You almost always have to
> deal with Unicode strings as blobs, considering several code points at
> once. For example,
> 
> upcase :: String -> String
> upcase = map toUpper

This is no more incorrect than
upcase = Data.Text.map toUpper

There's no reason that there couldn't be a Data.String.toUpper
corresponding to Data.Text.toUpper.

> Performance
> ===
> 
> Depending on the benchmark, the difference can be much bigger than
> 20%. For example, here's a comparison of decoding UTF-8 byte data into
> a String vs a Text value:

I think Heinrich meant 20% performance in a useful program, not a
micro-benchmark.


Thanks
Ian




Re: Long live String = [Char] (Was: Re: String != [Char])

2012-03-24 Thread Thomas Schilling
On 24 March 2012 12:53, Henrik Nilsson  wrote:
> Hi all,
>
> Thomas Schilling wrote:
>
>> I think most here agree that the main advantage of the current
>> definition is only pedagogical.
>
> But that in itself is not a small deal. In fact, it's a pretty
> major advantage.
>
> Moreover, the utter simplicity of String = [Char] is a benefit
> in its own right. Let's not forget that this, in practice,
> across all Haskell applications, works just fine in the vast
> majority of cases.
>
> I get the sense that the proponents for deprecating, and ultimately
> get rid of, String = [Char], are suggesting that this would lead
> to noticeable performance improvements across the board by virtue
> of preventing programmers from accidentally making a poor choice
> of data structure for representing string. But I conjecture that
> the performance impact of switching form e.g. String to Text at
> the level of complete applications would be negligible in most
> cases, simply because most Haskell applications are not dominated
> by heavy-duty string processing. And those that are, probably
> already uses something like Text, and were written be people
> who know a thing or two about appropriate choice of data structures
> anyway.
>
> As to teaching:
>
>> I don't really
>> think that having an abstract type is such a big problem for teaching.
>> You can do string processing by doing (pack . myfunction . unpack)
>
> Here at Nottingham, we're teaching all our 1st-year undergraduates
> Haskell. It works, but it is a challenge, and, alas, far from everyone
> "gets" it. And this is despite the module being taught by one of
> the leading and most experienced Haskell educators (and text book
> author), Graham Hutton.
>
> Without starting an endless discussion about how to best teach
> programming languages in general and Haskell in particular to
> (near) beginners, I dare say that idioms like the one suggested
> above would do nothing to help.
>
> String != [Char] would break no end of code, text books, tutorials,
> lecture slides, would not help with teaching Haskell, all
> for very little if any benefit in the grand scheme of things.

OK, I agree that breaking text books is a big deal.  On the other
hand, the lack of a good Text data type forced text books to teach bad
approaches to dealing with strings.  Haskell should do better.

Johan mentioned both semantic and performance problems with Strings.
A part he didn't stress is that Strings are also a horribly
memory-inefficient way of storing strings.  On 64-bit GHC systems a
single ASCII character needs 16 bytes of memory (i.e., an overhead of
16x). A non-ASCII character (ord c > 255) actually requires 32 bytes.
(The difference is due to a de-duplication optimisation in the GHC GC.)
Other implementations may do better, but an abstract type would still
be better to enable more freedom for implementors.

Correct handling of Unicode strings is a Hard Problem and String =
[Char] is only better if you ignore all the issues (which is certainly
fine in a teaching environment).

I would be happy to have a simplistic String = [Char] coexist with a
Text type if it weren't for the problem that so many things are biased
towards String.  E.g., error takes a String, Show is used everywhere
and produces strings, the pretty printing library uses Strings, Read
parses Strings.
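To illustrate the bias, here is a minimal sketch of the conversions a Text user ends up writing around error, Show and Read. It assumes the text package is available; showText, errorText and readInt are names made up for this example, not functions from any library.

```haskell
import qualified Data.Text as T

-- Show produces a String, so the result has to be packed.
showText :: Show a => a -> T.Text
showText = T.pack . show

-- error takes a String, so a Text message has to be unpacked first.
errorText :: T.Text -> a
errorText msg = error (T.unpack msg)

-- Read parses Strings, so the input has to be unpacked first.
readInt :: T.Text -> Maybe Int
readInt t = case reads (T.unpack t) of
  [(n, "")] -> Just n   -- the whole input was consumed
  _         -> Nothing  -- no parse, or trailing input

main :: IO ()
main = do
  print (showText (42 :: Int) == T.pack "42")
  print (readInt (T.pack "17") == Just 17)
  print (readInt (T.pack "17x") == Nothing)
```

Every one of these wrappers goes through an intermediate String, which is the cost being complained about.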

> On the other hand, a standardised, well thought-out, API for
> high-performance strings and appropriate mechanisms such
> as a measure of overloading to make it easy and palatable to
> use, and that work alongside the present String = [Char], would be a
> good thing.

As I said, while I'm not a huge fan of having two String types
co-exist, I could accept it as a necessary trade-off to keep text
books valid and preserve backwards compatibility.  (There are also
other issues with String.  For example, you can't write an instance
MyClass String in Haskell2010, and even with GHC extensions it seems
wrong and you often end up writing instances that overlap with MyClass
[a].)  I'm using Data.Text a lot, so I can work around the issue, but
unfortunately you run into a lot of issues where the standard library
forces the use of String, and that, I believe, is wrong.
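For what it's worth, the usual workaround within Haskell 2010 is the trick Show itself uses with showList: give the class an extra list method and let the Char instance override it. The sketch below is only an illustration; the class Describe and its methods are hypothetical names.

```haskell
-- Plain Haskell 2010: no language extensions are needed.
class Describe a where
  describe :: a -> String
  -- Hook used by the list instance, in the style of showList.
  describeList :: [a] -> String
  describeList xs = "list of " ++ show (length xs) ++ " items"

instance Describe Bool where
  describe b = if b then "yes" else "no"

instance Describe Char where
  describe c = "char " ++ [c]
  -- String-specific behaviour lives in the Char instance.
  describeList s = "string " ++ s

-- A single list instance covers String and every other list type.
instance Describe a => Describe [a] where
  describe = describeList

main :: IO ()
main = do
  print (describe "abc" == "string abc")
  print (describe [True, False] == "list of 2 items")
```

It works, but it leaks the String special case into the class definition, which is part of why it feels wrong.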

If changing the standard library is the bigger issue, however, then
I'm not sure whether this discussion needs to take place on the
haskell-prime list or on the libraries list.

/ Thomas



Re: String != [Char]

2012-03-24 Thread Johan Tibell
Hi all,

On Sat, Mar 24, 2012 at 12:39 AM, Heinrich Apfelmus
 wrote:
> Which brings me to the fundamental question behind this proposal: Why do we
> need Text at all? What are its virtues and how do they compare? What is the
> trade-off? (I'm not familiar enough with the Text library to answer these.)
>
> To put it very pointedly: is a %20 performance increase on the current
> generation of computers worth the cost in terms of ease-of-use, when the
> performance can equally be gained by buying a faster computer or more RAM?
> I'm not sure whether I even agree with this statement, but this is the
> trade-off we are deciding on.

Correctness
==

Using list-based operations on Strings is almost always wrong, as
soon as you move away from English text. You almost always have to
deal with Unicode strings as blobs, considering several code points at
once. For example,

upcase :: String -> String
upcase = map toUpper

is terse, beautiful, and wrong, as several languages map a single
lowercase character to two uppercase characters (as I'm sure you're
aware.)
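To make this concrete, here is a minimal sketch using the German eszett. It assumes the text package is available; upcaseList, upcaseTextMap and upcaseText are names chosen just for this example.

```haskell
import Data.Char (toUpper)
import qualified Data.Text as T

-- Per-character upcasing on a list: one Char in, one Char out.
upcaseList :: String -> String
upcaseList = map toUpper

-- The same per-character approach on Text has the same limitation.
upcaseTextMap :: T.Text -> T.Text
upcaseTextMap = T.map toUpper

-- Whole-string case conversion from the text package; the result
-- may be longer than the input.
upcaseText :: T.Text -> T.Text
upcaseText = T.toUpper

main :: IO ()
main = do
  let s = "stra\223e"  -- "straße"; \223 is U+00DF, the eszett
  -- The eszett is left untouched by the per-character versions.
  print (upcaseList s == "STRA\223E")
  print (upcaseTextMap (T.pack s) == T.pack "STRA\223E")
  -- The whole-string version expands it to "SS".
  print (upcaseText (T.pack s) == T.pack "STRASSE")
```

All three lines print True: a Char -> Char function can never turn one character into two, whatever container it is mapped over.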

Perhaps this is OK to ignore when teaching students Haskell, but it
really hurts those who want to use Haskell as an engineering language.

Performance
===

Depending on the benchmark, the difference can be much bigger than
20%. For example, here's a comparison of decoding UTF-8 byte data into
a String vs a Text value:

benchmarking Pure/decode/Text
mean: 50.22202 us, lb 50.08306 us, ub 50.37669 us, ci 0.950
std dev: 751.1139 ns, lb 666.2243 ns, ub 865.8246 ns, ci 0.950
variance introduced by outliers: 7.553%
variance is slightly inflated by outliers

benchmarking Pure/decode/String
mean: 188.0507 us, lb 187.4970 us, ub 188.6955 us, ci 0.950
std dev: 3.053076 us, lb 2.647318 us, ub 3.606262 us, ci 0.950
variance introduced by outliers: 9.407%
variance is slightly inflated by outliers

A difference of almost 4x.

Many of the Text vs String benchmarks measure the performance of
operations ignoring both decoding and encoding, while any real
application would have to do both.
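As a rough sketch of what "doing both" means, the following decodes UTF-8 bytes, transforms the text, and encodes the result again. It assumes the text and bytestring packages; upcaseUtf8 is a made-up name, and this is not the benchmark code.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Decode UTF-8 bytes, upcase the text, and encode back to UTF-8.
-- decodeUtf8' returns Left on malformed input instead of throwing.
upcaseUtf8 :: B.ByteString -> Either String B.ByteString
upcaseUtf8 bytes =
  case TE.decodeUtf8' bytes of
    Left err   -> Left (show err)
    Right text -> Right (TE.encodeUtf8 (T.toUpper text))

main :: IO ()
main = do
  -- "straße" encoded as UTF-8; the eszett takes two bytes (C3 9F).
  let input    = B.pack [0x73, 0x74, 0x72, 0x61, 0xC3, 0x9F, 0x65]
      -- "STRASSE" encoded as UTF-8.
      expected = B.pack [0x53, 0x54, 0x52, 0x41, 0x53, 0x53, 0x45]
  print (upcaseUtf8 input == Right expected)
  -- A lone 0xFF byte is not valid UTF-8, so decoding fails.
  print (either (const True) (const False) (upcaseUtf8 (B.pack [0xFF])))
```

A benchmark that only times the middle step leaves out the two conversions at the edges.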

On top of that, String is more or less as optimized as it can be;
benchmarks are almost completely memory bound. Text on the other hand
still has potential for (large) improvements, as GHC doesn't generate
optimal code for tight loops over arrays. For example, we know that
GHC generates bad code for decodeUtf8 as used by Text's stream fusion,
hurting any code that uses fusion.

Furthermore, the memory overhead of Text is smaller, which means that
applications that hold on to many string values will use less heap and
thus experience smaller "freezes" due to major GC collections, which
are linear in the heap size.

Cheers,
Johan



Long live String = [Char] (Was: Re: String != [Char])

2012-03-24 Thread Henrik Nilsson

Hi all,

Thomas Schilling wrote:

> I think most here agree that the main advantage of the current
> definition is only pedagogical.

But that in itself is not a small deal. In fact, it's a pretty
major advantage.

Moreover, the utter simplicity of String = [Char] is a benefit
in its own right. Let's not forget that this, in practice,
across all Haskell applications, works just fine in the vast
majority of cases.

I get the sense that the proponents of deprecating, and ultimately
getting rid of, String = [Char], are suggesting that this would lead
to noticeable performance improvements across the board by virtue
of preventing programmers from accidentally making a poor choice
of data structure for representing strings. But I conjecture that
the performance impact of switching from e.g. String to Text at
the level of complete applications would be negligible in most
cases, simply because most Haskell applications are not dominated
by heavy-duty string processing. And those that are probably
already use something like Text, and were written by people
who know a thing or two about appropriate choice of data structures
anyway.

As to teaching:

> I don't really
> think that having an abstract type is such a big problem for teaching.
> You can do string processing by doing (pack . myfunction . unpack)

Here at Nottingham, we're teaching all our 1st-year undergraduates
Haskell. It works, but it is a challenge, and, alas, far from everyone
"gets" it. And this is despite the module being taught by one of
the leading and most experienced Haskell educators (and text book
author), Graham Hutton.

Without starting an endless discussion about how to best teach
programming languages in general and Haskell in particular to
(near) beginners, I dare say that idioms like the one suggested
above would do nothing to help.
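For readers who have not seen it, the idiom in question looks roughly like the sketch below. It assumes the text package; reverseWords and reverseWordsText are hypothetical names, with reverseWords standing in for "myfunction".

```haskell
import qualified Data.Text as T

-- A list-style function of the kind a beginner might write.
reverseWords :: String -> String
reverseWords = unwords . reverse . words

-- The same function lifted to Text with pack . myfunction . unpack.
reverseWordsText :: T.Text -> T.Text
reverseWordsText = T.pack . reverseWords . T.unpack

main :: IO ()
main =
  print (reverseWordsText (T.pack "one two three") == T.pack "three two one")
```

The student still writes the list code, but now also has to understand two conversion functions and a second string type.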

String != [Char] would break no end of code, text books, tutorials,
lecture slides, would not help with teaching Haskell, all
for very little if any benefit in the grand scheme of things.

So let's not go there.

On the other hand, a standardised, well thought-out, API for
high-performance strings and appropriate mechanisms such
as a measure of overloading to make it easy and palatable to
use, and that work alongside the present String = [Char], would be a
good thing.

All the best,

/Henrik

--
Henrik Nilsson
School of Computer Science
The University of Nottingham
n...@cs.nott.ac.uk



Re: String != [Char]

2012-03-24 Thread Heinrich Apfelmus

Edward Kmett wrote:

> Like I said, my objection to including Text is a lot less strong than
> my feelings on any notion of deprecating String.
>
> [..]
>
> The pedagogical concern is quite real, remember many introductory
> language classes have time to present Haskell and the list data type
> and not much else. Showing parsing through pattern matching on
> strings makes a very powerful tool, it's harder to show that with
> Text.
>
> [..]
>
> The major benefits of Text come from FFI opportunities, but even
> there if you dig into its internals it has to copy out of the array
> to talk to foreign functions because it lives in unpinned memory
> unlike ByteString.


I agree with Edward Kmett on the virtues of  String = [Char]  for 
learning Haskell. I'm teaching beginners regularly and it is simply 
eye-opening for them that they can use the familiar list operations to 
solve real world problems which usually involve textual data.
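As a small sketch of the kind of code meant here, the following checks whether parentheses are balanced, first by pattern matching on a String and then on Text via uncons. It assumes the text package; balanced and balancedText are names invented for this example.

```haskell
import qualified Data.Text as T

-- List version: ordinary pattern matching on (:) and [].
balanced :: String -> Bool
balanced = go 0
  where
    go :: Int -> String -> Bool
    go n []         = n == 0
    go n ('(' : cs) = go (n + 1) cs
    go n (')' : cs) = n > 0 && go (n - 1) cs
    go n (_ : cs)   = go n cs

-- Text version: the same logic, but every step goes through T.uncons.
balancedText :: T.Text -> Bool
balancedText = go 0
  where
    go :: Int -> T.Text -> Bool
    go n t = case T.uncons t of
      Nothing        -> n == 0
      Just ('(', cs) -> go (n + 1) cs
      Just (')', cs) -> n > 0 && go (n - 1) cs
      Just (_, cs)   -> go n cs

main :: IO ()
main = do
  print (balanced "(a(b)c)")
  print (not (balanced ")("))
  print (balancedText (T.pack "(a(b)c)"))
  print (not (balancedText (T.pack "(()")))
```

The first version uses nothing beyond what students already know about lists; the second needs Maybe, tuples, and a library function before the first character can be inspected.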


Which brings me to the fundamental question behind this proposal: Why do 
we need Text at all? What are its virtues and how do they compare? What 
is the trade-off? (I'm not familiar enough with the Text library to 
answer these.)


To put it very pointedly: is a 20% performance increase on the current
generation of computers worth the cost in terms of ease-of-use, when the 
performance can equally be gained by buying a faster computer or more 
RAM? I'm not sure whether I even agree with this statement, but this is 
the trade-off we are deciding on.



Best regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com

