Re: String != [Char]
On Sat, Mar 24, 2012 at 7:26 PM, Gabriel Dos Reis wrote:
> On Sat, Mar 24, 2012 at 9:09 PM, Greg Weber wrote:
>> Problem: we want to write beautiful (and possibly inefficient) code
>> that is easy to explain. If nothing else, this is pedagogically
>> important.
>> The goals of this code are to:
>> * use list processing pattern matching and functions on a string type
>
> I may have missed this question so I will ask it (apologies if it is a
> repeat): Why is it believed that list processing pattern matching is
> appropriate or the right tool for text processing?

Nobody said it is the right tool for text processing. In fact, I think we all agreed it is the wrong tool in many cases. But it is easy for students to understand, since they are already being taught to use lists for everything else. It would be great if you could talk with teachers of Haskell and figure out a better way to teach text processing.

> -- Gaby

___
Haskell-prime mailing list
Haskell-prime@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-prime
Re: String != [Char]
On Sat, Mar 24, 2012 at 9:09 PM, Greg Weber wrote:
> # Switching to Text by default makes us embarrassed!

Text processing /is/ quick to embarrassment :-)

> Problem: we want to write beautiful (and possibly inefficient) code
> that is easy to explain. If nothing else, this is pedagogically
> important.
> The goals of this code are to:
> * use list processing pattern matching and functions on a string type

I may have missed this question so I will ask it (apologies if it is a repeat): Why is it believed that list processing pattern matching is appropriate or the right tool for text processing?

-- Gaby
Re: String != [Char]
On Sat, Mar 24, 2012 at 8:51 PM, Johan Tibell wrote:
> On Sat, Mar 24, 2012 at 5:54 PM, Gabriel Dos Reis wrote:
>> I think there is a confusion here. A Unicode character is an abstract
>> entity. For it to exist in some concrete form in a program, you need
>> an encoding. The fact that char16_t is 16-bit wide is irrelevant to
>> whether it can be used in a representation of a Unicode text, just like
>> uint8_t (e.g. 'unsigned char') can be used to encode Unicode string
>> despite it being only 8-bit wide. You do not need to make the
>> character type exactly equal to the type of the individual element
>> in the text representation.
>
> Well, if you have a >21-bit type you can declare its value to be a
> Unicode code point (which are numbered.)

That is correct. Because not all Unicode code points represent characters, and not all Unicode code point sequences represent valid characters, even if you have that >21-bit type T, the list type [T] would still not be a good string type.

> Using a char* that you claim
> contain utf-8 encoded data is bad for safety, as there is no guarantee
> that that's indeed the case.

Indeed, and that is why a Text should be an abstract datatype, hiding the concrete implementation away from the user.

>> Note also that an encoding itself (whether UTF-8, UTF-16, etc.) is
>> insufficient as far as text processing goes; you also need a
>> localization at the minimum. It is the combination of the two that
>> gives some meaning to text representation and operations.
>
> text does that via ICU. Some operations would be possible without
> using the locale, if it wasn't for those Turkish i:s. :/

yeah, 7 bits should be enough for every character ;-)

-- Gaby
Re: String != [Char]
# Switching to Text by default makes us embarrassed!

Problem: we want to write beautiful (and possibly inefficient) code that is easy to explain. If nothing else, this is pedagogically important.

The goals of this code are to:
* use list processing pattern matching and functions on a string type
* avoid embarrassing name clashes and the need for qualified names (T.split, etc)

The second point is Haskell's festering language design sore rearing its ugly head. Let's note that the current state of Haskell is not any more beautiful than what will happen after this proposal is implemented. It is just that we currently have partly hidden away a deficiency in Haskell by only exporting list functions in the Prelude. So our real goal is to come up with conventions and/or hacks that will allow us to continue to hide this deficiency of Haskell for the purposes of pedagogy.

If you can't tell, IMHO the issue we are circumventing is Haskell's biggest issue from a language design perspective. It is a shame that SPJ's TDNR proposal was shouted down and no alternative has been given. But I am not going to hold out hope that this issue will be solved any time soon. Just limiting solving this to records has proved very difficult. So on to our hacks for making Text the default string type!

## Option 1: T. prefixing

Using Text functions still requires the T. prefix.
For pedagogy, continue to use [Char], but use an OverloadedText extension.

This is a safe, conservative option that puts us in a better place than we are today. It just makes us look strange when we build something into the language that requires a prefix. Of course, we could try to give every Text function a slightly different name than the Prelude list functions, but I think that will make using Haskell more difficult than putting up with prefixes.

## Option 2: TDNR for lists

(Prelude) list functions are resolved in a special way. For example, we could have 2 different map functions in scope unqualified: one for lists, and one for Text. The compiler is tasked with resolving whether the type is a list or not and determining the appropriate function. I would much rather add a TDNR construct to the language in a universal way than go down this route.

## Option 3: implicit List typeclass

We can operate on Text (and other non-list data structures) using a List typeclass. We have 2 concerns:
* list pattern matching ('c':string)
* requiring the typeclass in the type signature everywhere

I think we can extend the compiler to pattern match characters out of Text, so let's move on to the second point. If we don't write type signatures anywhere, we actually won't care about it. However, if we add sparse annotations, we will need a List constraint.

    listF :: List l => ...

This could get tiresome quickly. It makes pedagogy immediately delve into an explanation of typeclasses. A simple solution is to special-case the List class. We declare that List is so fundamental to Haskell that requiring the List typeclass is not necessary. The Prelude exports (class List where ...). If a List typeclass function is used, the compiler inserts the List typeclass constraint into a type signature automatically.

This option is very attractive because it solves all of our problems at the cost of 1 easy-to-explain piece of magic. It also makes it possible to unify list behavior across different data types without the hassle of typeclass insertions everywhere.
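A rough sketch of what the List typeclass of Option 3 might look like, written with today's language rather than the proposed compiler magic. The class name `ListLike` and its methods `lEmpty`, `lCons`, `lUncons`, and `lMap` are hypothetical names made up for illustration; they are not part of base or text, and the constraint still has to be written out by hand here. It assumes the text package is installed.

```haskell
{-# LANGUAGE TypeFamilies #-}
module Main where

import Data.Char (toUpper)
import qualified Data.Text as T

-- Hypothetical class (an assumption for illustration, not an existing API):
-- the minimum needed to build and take apart a sequence one element at a time.
class ListLike l where
  type Elem l
  lEmpty  :: l
  lCons   :: Elem l -> l -> l
  lUncons :: l -> Maybe (Elem l, l)

instance ListLike [a] where
  type Elem [a] = a
  lEmpty = []
  lCons  = (:)
  lUncons []       = Nothing
  lUncons (x : xs) = Just (x, xs)

instance ListLike T.Text where
  type Elem T.Text = Char
  lEmpty  = T.empty
  lCons   = T.cons      -- O(n) per call, so this is "beautiful but inefficient"
  lUncons = T.uncons

-- One definition of map that works for both lists and Text.
-- The explicit ListLike constraint is what Option 3 would have the
-- compiler insert automatically.
lMap :: ListLike l => (Elem l -> Elem l) -> l -> l
lMap f l = case lUncons l of
  Nothing        -> lEmpty
  Just (x, rest) -> lCons (f x) (lMap f rest)

main :: IO ()
main = do
  putStrLn (lMap toUpper "hello")                      -- on [Char]
  putStrLn (T.unpack (lMap toUpper (T.pack "hello")))  -- on Text
```

The `lUncons` method stands in for the `('c':string)` pattern match; the proposal's first concern is exactly that real pattern syntax would need compiler support.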
Re: String != [Char]
On Sat, Mar 24, 2012 at 5:54 PM, Gabriel Dos Reis wrote:
> I think there is a confusion here. A Unicode character is an abstract
> entity. For it to exist in some concrete form in a program, you need
> an encoding. The fact that char16_t is 16-bit wide is irrelevant to
> whether it can be used in a representation of a Unicode text, just like
> uint8_t (e.g. 'unsigned char') can be used to encode Unicode string
> despite it being only 8-bit wide. You do not need to make the
> character type exactly equal to the type of the individual element
> in the text representation.

Well, if you have a >21-bit type you can declare its value to be a Unicode code point (which are numbered.) Using a char* that you claim contains utf-8 encoded data is bad for safety, as there is no guarantee that that's indeed the case.

> Note also that an encoding itself (whether UTF-8, UTF-16, etc.) is
> insufficient as far as text processing goes; you also need a
> localization at the minimum. It is the combination of the two that
> gives some meaning to text representation and operations.

text does that via ICU. Some operations would be possible without using the locale, if it wasn't for those Turkish i:s. :/

-- Johan
Re: String != [Char]
Can we all agree that

* Text can now demonstrate both CPU and RAM performance improvements in benchmarks. Because Text is an opaque type, it has maximum potential for future performance improvements. Declaring a String to be a list limits performance improvements.
* In a Unicode world, String = [Char] is not always correct: for some operations one must instead operate on the String as a whole. Using a [Char] type makes it much more likely for a programmer to mistakenly operate on individual characters. Using a Text type allows us to choose not to expose character manipulation functions.
* The usage of String in the base libraries will continue as long as Text is not in the language standard. This will continue to make writing Haskell code a greater chore than is necessary: converting between types, and working around the inconvenience of defining typeclasses that operate on both String and [].

These are important enough to *try* to include Text in the standard, even if there are objections to how it might practically be included.
Re: String != [Char]
On Sat, Mar 24, 2012 at 7:16 PM, Johan Tibell wrote:
> On Sat, Mar 24, 2012 at 4:42 PM, Gabriel Dos Reis wrote:
>> Hmm, std::u16string, std::u32string, and std::wstring are C++ standard
>> types to process Unicode texts.
>
> Note that at least u16string is too small to encode all of Unicode and
> wstring might be as 16 bits is not enough to encode all of Unicode.
>

I think there is a confusion here. A Unicode character is an abstract entity. For it to exist in some concrete form in a program, you need an encoding. The fact that char16_t is 16-bit wide is irrelevant to whether it can be used in a representation of a Unicode text, just like uint8_t (e.g. 'unsigned char') can be used to encode Unicode string despite it being only 8-bit wide. You do not need to make the character type exactly equal to the type of the individual element in the text representation.

Now, if you want to make a one-to-one correspondence between individual elements in a std::basic_string and a Unicode character, you would of course go for char32_t, which might be wasteful depending on the circumstances.

Text processing languages like Perl have long decided to de-emphasize one-character-at-a-time processing. For most common cases, it is just inefficient. But I also understand that the efficiency argument may not be strong in the context of Haskell. However, I believe particular attention must be paid to the correctness of the semantics.

Note also that an encoding itself (whether UTF-8, UTF-16, etc.) is insufficient as far as text processing goes; you also need a localization at the minimum. It is the combination of the two that gives some meaning to text representation and operations. I have been following the discussion, but I don't see anything said about locales.

-- Gaby
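To make the distinction between abstract characters and their encodings concrete, here is a small sketch that counts the same three code points in three ways. It assumes the text and bytestring packages; the sample string and the helper names `codePoints`, `utf8Bytes`, and `utf16Units` are chosen purely for illustration.

```haskell
module Main where

import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Three code points: 'a' (U+0061), the euro sign (U+20AC),
-- and the musical G clef (U+1D11E, outside the 16-bit range).
sample :: T.Text
sample = T.pack "a\x20AC\x1D11E"

-- Number of Unicode code points, independent of any encoding.
codePoints :: T.Text -> Int
codePoints = T.length

-- Number of 8-bit code units once encoded as UTF-8.
utf8Bytes :: T.Text -> Int
utf8Bytes = BS.length . TE.encodeUtf8

-- Number of 16-bit code units once encoded as UTF-16
-- (two bytes per unit, so halve the byte count).
utf16Units :: T.Text -> Int
utf16Units t = BS.length (TE.encodeUtf16LE t) `div` 2

main :: IO ()
main = print (codePoints sample, utf8Bytes sample, utf16Units sample)
-- prints (3,8,4)
```

The element type of the representation (8-bit or 16-bit units) is narrower than a code point in both encodings, yet both represent the full text, which is the point made above about char16_t and uint8_t.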
Re: String != [Char]
On Sat, Mar 24, 2012 at 4:42 PM, Gabriel Dos Reis wrote:
> Hmm, std::u16string, std::u32string, and std::wstring are C++ standard
> types to process Unicode texts.

Note that at least u16string is too small to encode all of Unicode, and wstring might be too, as 16 bits is not enough to encode all of Unicode.

-- Johan
Re: String != [Char]
On Sat, Mar 24, 2012 at 6:00 PM, Johan Tibell wrote:
> C++'s char* is morally equivalent of our ByteString, not Text. There's
> no standardized C++ Unicode string type, ICU's UnicodeString is
> perhaps the closest to one.

Hmm, std::u16string, std::u32string, and std::wstring are C++ standard types to process Unicode texts.

Anyway, my inclination is that having a proper string type in Haskell would be a Good Thing. Sometimes it is worth breaking the textbook. In our local Haskell system for AVR microcontrollers, we explicitly made String distinct from [Char] -- we cannot afford the memory inefficiency that [Char] entails, just to represent simple strings.

-- Gaby
Re: String != [Char]
On Sat, Mar 24, 2012 at 5:33 PM, Freddie Manners wrote:
> To add my tuppence-worth on this, addressed to no-one in particular:
>
> (1) I think getting hung up on UTF-8 correctness is a distraction here. I
> can't imagine anyone suggesting that the C/C++ standards removed support for
> (char*) because it wasn't UTF-8 correct: sure, you'd recommend people use a
> different type when it matters, but the language standard itself shouldn't
> be driven by technical issues that don't affect most people most of the
> time. I'm sure it's good engineering practice to worry about these things,
> but the standard isn't there to encourage good engineering practice.

C++ does not consider 'char*' as the type of a string. It has a standard template std::basic_string that can be instantiated on char (giving std::string) or on the encoding types (of Unicode characters) char16_t, char32_t, and wchar_t, giving rise to u16string, u32string, and wstring. It has a large number of functions to manipulate a string as a sequence (Haskell's status quo) or as a text, thanks to an elaborate localization machinery.

-- Gaby, back to lurking mode
Re: String != [Char]
On Sat, Mar 24, 2012 at 3:33 PM, Freddie Manners wrote:
> To add my tuppence-worth on this, addressed to no-one in particular:
>
> (1) I think getting hung up on UTF-8 correctness is a distraction here. I
> can't imagine anyone suggesting that the C/C++ standards removed support for
> (char*) because it wasn't UTF-8 correct: sure, you'd recommend people use a
> different type when it matters, but the language standard itself shouldn't
> be driven by technical issues that don't affect most people most of the
> time. I'm sure it's good engineering practice to worry about these things,
> but the standard isn't there to encourage good engineering practice.

(I assume you mean Unicode correctness. UTF-8 is only one possible encoding. Also, I'm not arguing for removing type String = [Char]; I am arguing why Text is better than String.)

C++'s char* is morally equivalent to our ByteString, not Text. There's no standardized C++ Unicode string type; ICU's UnicodeString is perhaps the closest to one.

-- Johan
Re: String != [Char]
On 24 March 2012 22:33, Freddie Manners wrote:
> To add my tuppence-worth on this, addressed to no-one in particular:
>
> (1) I think getting hung up on UTF-8 correctness is a distraction here. I
> can't imagine anyone suggesting that the C/C++ standards removed support for
> (char*) because it wasn't UTF-8 correct: sure, you'd recommend people use a
> different type when it matters, but the language standard itself shouldn't
> be driven by technical issues that don't affect most people most of the
> time. I'm sure it's good engineering practice to worry about these things,
> but the standard isn't there to encourage good engineering practice.

It doesn't really have anything to do with UTF-8. UTF-8 is just a particular serialisation of a Unicode string. Here's a simple illustration of the problems one faces:

Let's say you want to search for the string "fix". Now, the problem is that the sequence 'f','i' could be represented both as ['f', 'i'] or as [chr 0xfb01] (the "fi" ligature). The text-icu package provides a function to normalise a string such that only one of these forms can occur in each string. Because the world's languages are rather complex, there are many more such cases which need to be handled properly (if you don't want to run into weird corner cases).

> (2) I'd suggest that a proposal that advocated overloaded string literals --
> of which [Char] was an option -- couldn't be much more confusing from a
> pedagogical perspective than the fact that numeric literals are overloaded.
> Since that seems to be one of the main biases in favour of [Char] in the
> current standard, that might be a possible incremental fix.

I agree that this proposal should probably include the standardisation of the OverloadedStrings extension.

--
Push the envelope. Watch it bend.
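The "fix" search problem described above can be reproduced with nothing but base. The sketch below is only an illustration: `expandLigatures` is a hypothetical toy that handles the single code point U+FB01 and stands in for real Unicode normalisation of the kind text-icu provides; it is not how that package works.

```haskell
module Main where

import Data.List (isInfixOf)

-- "prefix" spelled with the single-code-point "fi" ligature (U+FB01).
withLigature :: String
withLigature = "pre\xFB01x"

-- Toy stand-in for normalisation (an assumption for illustration):
-- expands only U+FB01 and leaves every other character alone.
expandLigatures :: String -> String
expandLigatures = concatMap expand
  where
    expand '\xFB01' = "fi"
    expand c        = [c]

main :: IO ()
main = do
  -- Element-by-element comparison does not find "fix" ...
  print ("fix" `isInfixOf` withLigature)                  -- prints False
  -- ... but it does once both forms are brought to one representation.
  print ("fix" `isInfixOf` expandLigatures withLigature)  -- prints True
```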
Re: String != [Char]
On Sat, Mar 24, 2012 at 3:45 PM, Isaac Dupree wrote:
> How is Text for small strings currently (e.g. one English word, if not one
> character)? Can we reasonably recommend it for that?
> This recent question suggests it's still not great:
> http://stackoverflow.com/questions/9398572/memory-efficient-strings-in-haskell

It's definitely not as good as it could be, with the common case being 2 bytes per code point and then some fixed overhead. The UTF-8 GSoC project last summer was an attempt to see if we could do better, but unfortunately GHC does a worse job streaming out of a byte array containing utf-8 than out of a byte array containing utf-16 (due to bad branch layout.) The result was a mix of performance gains and performance losses. As there are other engineering benefits in favor of utf-16 (e.g. being able to use ICU efficiently), we opted for not switching the encoding. If we can get GHC to the point where it compiles a utf-8 based Text really well, we could reconsider this decision.

There's also a design trade-off in Text that favors better asymptotic complexity for some operations (e.g. taking substrings) and that adds 2 words of overhead to every string.

-- Johan
Re: String != [Char]
On 03/24/2012 02:50 PM, Johan Tibell wrote:
> [...] Furthermore, the memory overhead of Text is smaller, which means
> that applications that hold on to many string values will use less heap
> and thus experience smaller "freezes" due to major GC collections, which
> are linear in the heap size.

How is Text for small strings currently (e.g. one English word, if not one character)? Can we reasonably recommend it for that?
This recent question suggests it's still not great:
http://stackoverflow.com/questions/9398572/memory-efficient-strings-in-haskell
Re: String != [Char]
On 24 March 2012 22:27, Ian Lynagh wrote:
> On Sat, Mar 24, 2012 at 05:31:48PM -0400, Brandon Allbery wrote:
>> On Sat, Mar 24, 2012 at 16:16, Ian Lynagh wrote:
>>
>> > On Sat, Mar 24, 2012 at 11:50:10AM -0700, Johan Tibell wrote:
>> > > Using list-based operations on Strings are almost always wrong
>> >
>> > Data.Text seems to think that many of them are worth reimplementing for
>> > Text. It looks like someone's systematically gone through Data.List.
>> > And in fact, very few functions there /don't/ look like they are
>> > directly equivalent to list functions.
>> >
>>
>> I was under the impression they have been very carefully designed to do the
>> right thing with characters represented by multiple codepoints, which is
>> something the String version *cannot* do. It would help if Bryan were
>> involved with this discussion, though. (I'm cc:ing him on this.) Since
>> the whole point of Data.Text is to handle stuff like this properly I would
>> be surprised if your assertion that
>>
>> > > upcase :: String -> String
>> > > upcase = map toUpper
>> >
>> > This is no more incorrect than
>> >    upcase = Data.Text.map toUpper
>>
>> is correct.
>
> I don't see how it could do any better, given both use
>    toUpper :: Char -> Char
> to do the hard work. That's why there is also a
>    Data.Text.toUpper :: Text -> Text
>
> Based on a very quick skim I think that there are only 3 such functions
> in Data.Text (toCaseFold, toLower, toUpper), although the 3
> justification functions may handle double-width characters properly.
>
>
> Anyway, my main point is that I don't think that either text or String
> should make it any easier for people to get things right. It's true that
> currently only text makes correct case-conversions easy, but only
> because no-one's written Data.String.to* yet.

The reason Text uses UTF-16 internally is so that it can be used with the ICU library (written in C, I think), which implements all the difficult things (http://hackage.haskell.org/package/text-icu). Reimplementing all that in Haskell would be a significant undertaking. You could do the same for String, but that would have to encode and re-encode on each invocation.

BTW, I checked the version history of the text package, and most of the list functions already existed in Tom Harper's version, on which text was based, in 2009. If you look at the documentation you can see that many of the list-like functions treat some invalid characters specially, so they are different.
Re: String != [Char]
To add my tuppence-worth on this, addressed to no-one in particular:

(1) I think getting hung up on UTF-8 correctness is a distraction here. I can't imagine anyone suggesting that the C/C++ standards removed support for (char*) because it wasn't UTF-8 correct: sure, you'd recommend people use a different type when it matters, but the language standard itself shouldn't be driven by technical issues that don't affect most people most of the time. I'm sure it's good engineering practice to worry about these things, but the standard isn't there to encourage good engineering practice.

(2) I'd suggest that a proposal that advocated overloaded string literals -- of which [Char] was an option -- couldn't be much more confusing from a pedagogical perspective than the fact that numeric literals are overloaded. Since that seems to be one of the main biases in favour of [Char] in the current standard, that might be a possible incremental fix.

Best,
Freddie

On 24 March 2012 22:15, Ian Lynagh wrote:
> On Sat, Mar 24, 2012 at 08:38:23PM +, Thomas Schilling wrote:
> > On 24 March 2012 20:16, Ian Lynagh wrote:
> > >
> > >> Correctness
> > >> ==
> > >>
> > >> Using list-based operations on Strings are almost always wrong
> > >
> > > Data.Text seems to think that many of them are worth reimplementing for
> > > Text. It looks like someone's systematically gone through Data.List.
> >
> > That's exactly what happened as part of the platform inclusion
> > process. In fact, there was quite a bit of bike shedding whether the
> > Text API should be compatible with the list API or not. In the end
> > the decision was made to add all the list functions even if that
> > encouraged running into unicode issues. I'm pretty sure you
> > participated in that discussion.
>
> As far as I remember, a few functions were added to text and bytestring
> during that, but mostly the discussion was about naming.
>
> Even in the first 0.1 release of bytestring:
>
> http://hackage.haskell.org/packages/archive/text/0.1/doc/html/Data-Text.html
> there is a large amount of Data.List covered, e.g. map, transpose,
> foldl1', minimum, mapAccumR, groupBy.
>
>
> Thanks
> Ian
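As a small illustration of point (2), this sketch puts overloaded numeric literals next to overloaded string literals under GHC's OverloadedStrings extension. It assumes the text package; the binding names are made up for the example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.Text as T

-- Numeric literals are already overloaded in standard Haskell:
-- the same literal 5 is an Int here and a Double below.
five :: Int
five = 5

fiveAndAHalf :: Double
fiveAndAHalf = 5 + 0.5

-- With OverloadedStrings the same holds for string literals:
-- the literal "hello" is a [Char] here and a Text below.
asList :: String
asList = "hello"

asText :: T.Text
asText = "hello"

main :: IO ()
main = do
  print (five, fiveAndAHalf)
  print (length asList, T.length asText)
  print (T.unpack asText == asList)
```

In both cases the type annotation (or inference) selects the representation, so [Char] remains one of the available choices.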
Re: String != [Char]
On Sat, Mar 24, 2012 at 05:31:48PM -0400, Brandon Allbery wrote:
> On Sat, Mar 24, 2012 at 16:16, Ian Lynagh wrote:
>
> > On Sat, Mar 24, 2012 at 11:50:10AM -0700, Johan Tibell wrote:
> > > Using list-based operations on Strings are almost always wrong
> >
> > Data.Text seems to think that many of them are worth reimplementing for
> > Text. It looks like someone's systematically gone through Data.List.
> > And in fact, very few functions there /don't/ look like they are
> > directly equivalent to list functions.
> >
>
> I was under the impression they have been very carefully designed to do the
> right thing with characters represented by multiple codepoints, which is
> something the String version *cannot* do. It would help if Bryan were
> involved with this discussion, though. (I'm cc:ing him on this.) Since
> the whole point of Data.Text is to handle stuff like this properly I would
> be surprised if your assertion that
>
> > > upcase :: String -> String
> > > upcase = map toUpper
> >
> > This is no more incorrect than
> >    upcase = Data.Text.map toUpper
>
> is correct.

I don't see how it could do any better, given both use
    toUpper :: Char -> Char
to do the hard work. That's why there is also a
    Data.Text.toUpper :: Text -> Text

Based on a very quick skim I think that there are only 3 such functions in Data.Text (toCaseFold, toLower, toUpper), although the 3 justification functions may handle double-width characters properly.

Anyway, my main point is that I don't think that either text or String should make it any easier for people to get things right. It's true that currently only text makes correct case-conversions easy, but only because no-one's written Data.String.to* yet.

Thanks
Ian
Re: String != [Char]
On Sat, Mar 24, 2012 at 08:38:23PM +, Thomas Schilling wrote:
> On 24 March 2012 20:16, Ian Lynagh wrote:
> >
> >> Correctness
> >> ==
> >>
> >> Using list-based operations on Strings are almost always wrong
> >
> > Data.Text seems to think that many of them are worth reimplementing for
> > Text. It looks like someone's systematically gone through Data.List.
>
> That's exactly what happened as part of the platform inclusion
> process. In fact, there was quite a bit of bike shedding whether the
> Text API should be compatible with the list API or not. In the end
> the decision was made to add all the list functions even if that
> encouraged running into unicode issues. I'm pretty sure you
> participated in that discussion.

As far as I remember, a few functions were added to text and bytestring during that, but mostly the discussion was about naming.

Even in the first 0.1 release of bytestring:
    http://hackage.haskell.org/packages/archive/text/0.1/doc/html/Data-Text.html
there is a large amount of Data.List covered, e.g. map, transpose, foldl1', minimum, mapAccumR, groupBy.

Thanks
Ian
Re: String != [Char]
On Sat, Mar 24, 2012 at 2:31 PM, Brandon Allbery wrote:
> I was under the impression they have been very carefully designed to do the
> right thing with characters represented by multiple codepoints, which is
> something the String version *cannot* do. It would help if Bryan were
> involved with this discussion, though. (I'm cc:ing him on this.) Since the
> whole point of Data.Text is to handle stuff like this properly I would be
> surprised if your assertion that
>
>> > upcase :: String -> String
>> > upcase = map toUpper
>>
>> This is no more incorrect than
>> upcase = Data.Text.map toUpper
>
>
> is correct.

This is simply not possible given the Unicode specification. There's no single code point that corresponds to the two characters used to represent an upcased version of the eszett (ß), so no Char -> Char function can produce it.

I think the list based API predates Bryan.

-- Johan
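The eszett case can be checked directly. This sketch assumes the text package and compares the three definitions discussed in the thread; the names `upcaseList`, `upcaseTextMap`, and `upcaseText` are chosen for the example.

```haskell
module Main where

import Data.Char (toUpper)
import qualified Data.Text as T

-- Per-character upcasing on a list of Char.
upcaseList :: String -> String
upcaseList = map toUpper

-- Per-character upcasing on Text: same limitation, because it
-- still goes through toUpper :: Char -> Char.
upcaseTextMap :: T.Text -> T.Text
upcaseTextMap = T.map toUpper

-- Whole-string upcasing: one input character may become several.
upcaseText :: T.Text -> T.Text
upcaseText = T.toUpper

main :: IO ()
main = do
  let word = "straße"
  -- Both per-character versions leave the eszett untouched.
  print (upcaseList word == "STRAßE")
  print (upcaseTextMap (T.pack word) == T.pack "STRAßE")
  -- The Text -> Text version expands it to "SS", growing the string.
  print (upcaseText (T.pack word) == T.pack "STRASSE")
  print (T.length (T.pack word), T.length (upcaseText (T.pack word)))
```

All three comparisons print True, and the last line shows the length changing from 6 to 7, which no element-wise map can do.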
Re: String != [Char]
On Sat, Mar 24, 2012 at 1:16 PM, Ian Lynagh wrote:
> Data.Text seems to think that many of them are worth reimplementing for
> Text. It looks like someone's systematically gone through Data.List.
> And in fact, very few functions there /don't/ look like they are
> directly equivalent to list functions.

I'm not sure why the list-inspired functions are there. It doesn't really matter. It doesn't change the fact that from a Unicode perspective they give the wrong result in most situations.

> This is no more incorrect than
> upcase = Data.Text.map toUpper

No, and that's why Bryan added correct case modification, case folding, etc. to text.

> There's no reason that there couldn't be a Data.String.toUpper
> corresponding to Data.Text.toUpper.

That's true. But this isn't the point we were discussing. We were discussing whether the simplification of treating strings as a list is a good thing (from an educational perspective.) I pointed out that from a correctness perspective it's wrong.

> I think Heinrich meant 20% performance in a useful program, not a
> micro-benchmark.

If that's what he meant, then given that "useful program" isn't defined, the 20% number is completely arbitrary.

-- Johan
Re: String != [Char]
On Sat, Mar 24, 2012 at 16:16, Ian Lynagh wrote:
> On Sat, Mar 24, 2012 at 11:50:10AM -0700, Johan Tibell wrote:
> > Using list-based operations on Strings are almost always wrong
>
> Data.Text seems to think that many of them are worth reimplementing for
> Text. It looks like someone's systematically gone through Data.List.
> And in fact, very few functions there /don't/ look like they are
> directly equivalent to list functions.
>

I was under the impression they have been very carefully designed to do the right thing with characters represented by multiple codepoints, which is something the String version *cannot* do. It would help if Bryan were involved with this discussion, though. (I'm cc:ing him on this.) Since the whole point of Data.Text is to handle stuff like this properly I would be surprised if your assertion that

> > upcase :: String -> String
> > upcase = map toUpper
>
> This is no more incorrect than
>    upcase = Data.Text.map toUpper
>

is correct.

--
brandon s allbery allber...@gmail.com
wandering unix systems administrator (available) (412) 475-9364 vm/sms
Re: String != [Char]
On 24 March 2012 20:16, Ian Lynagh wrote:
>
> Hi Johan,
>
> On Sat, Mar 24, 2012 at 11:50:10AM -0700, Johan Tibell wrote:
>>
>> On Sat, Mar 24, 2012 at 12:39 AM, Heinrich Apfelmus
>> wrote:
>> > Which brings me to the fundamental question behind this proposal: Why do we
>> > need Text at all? What are its virtues and how do they compare? What is the
>> > trade-off? (I'm not familiar enough with the Text library to answer these.)
>> >
>> > To put it very pointedly: is a 20% performance increase on the current
>> > generation of computers worth the cost in terms of ease-of-use, when the
>> > performance can equally be gained by buying a faster computer or more RAM?
>> > I'm not sure whether I even agree with this statement, but this is the
>> > trade-off we are deciding on.
>>
>> Correctness
>> ===========
>>
>> Using list-based operations on Strings is almost always wrong
>
> Data.Text seems to think that many of them are worth reimplementing for
> Text. It looks like someone's systematically gone through Data.List.

That's exactly what happened as part of the platform inclusion process. In fact, there was quite a bit of bikeshedding about whether the Text API should be compatible with the list API or not. In the end the decision was made to add all the list functions, even if that encouraged running into Unicode issues. I'm pretty sure you participated in that discussion.

>> Performance
>> ===========
>>
>> Depending on the benchmark, the difference can be much bigger than
>> 20%. For example, here's a comparison of decoding UTF-8 byte data into
>> a String vs a Text value:
>
> I think Heinrich meant 20% performance in a useful program, not a
> micro-benchmark.

Generating web sites is a huge application area of Haskell and one where a proper text type is in no way a micro-optimisation.
Re: String != [Char]
Hi Johan,

On Sat, Mar 24, 2012 at 11:50:10AM -0700, Johan Tibell wrote:
>
> On Sat, Mar 24, 2012 at 12:39 AM, Heinrich Apfelmus
> wrote:
> > Which brings me to the fundamental question behind this proposal: Why do we
> > need Text at all? What are its virtues and how do they compare? What is the
> > trade-off? (I'm not familiar enough with the Text library to answer these.)
> >
> > To put it very pointedly: is a 20% performance increase on the current
> > generation of computers worth the cost in terms of ease-of-use, when the
> > performance can equally be gained by buying a faster computer or more RAM?
> > I'm not sure whether I even agree with this statement, but this is the
> > trade-off we are deciding on.
>
> Correctness
> ===========
>
> Using list-based operations on Strings is almost always wrong

Data.Text seems to think that many of them are worth reimplementing for Text. It looks like someone's systematically gone through Data.List. And in fact, very few functions there /don't/ look like they are directly equivalent to list functions.

> , as
> soon as you move away from English text. You almost always have to
> deal with Unicode strings as blobs, considering several code points at
> once. For example,
>
> upcase :: String -> String
> upcase = map toUpper

This is no more incorrect than

    upcase = Data.Text.map toUpper

There's no reason that there couldn't be a Data.String.toUpper corresponding to Data.Text.toUpper.

> Performance
> ===========
>
> Depending on the benchmark, the difference can be much bigger than
> 20%. For example, here's a comparison of decoding UTF-8 byte data into
> a String vs a Text value:

I think Heinrich meant 20% performance in a useful program, not a micro-benchmark.

Thanks
Ian
Re: Long live String = [Char] (Was: Re: String != [Char])
On 24 March 2012 12:53, Henrik Nilsson wrote:
> Hi all,
>
> Thomas Schilling wrote:
>
>> I think most here agree that the main advantage of the current
>> definition is only pedagogical.
>
> But that in itself is not a small deal. In fact, it's a pretty
> major advantage.
>
> Moreover, the utter simplicity of String = [Char] is a benefit
> in its own right. Let's not forget that this, in practice,
> across all Haskell applications, works just fine in the vast
> majority of cases.
>
> I get the sense that the proponents of deprecating, and ultimately
> getting rid of, String = [Char], are suggesting that this would lead
> to noticeable performance improvements across the board by virtue
> of preventing programmers from accidentally making a poor choice
> of data structure for representing strings. But I conjecture that
> the performance impact of switching from e.g. String to Text at
> the level of complete applications would be negligible in most
> cases, simply because most Haskell applications are not dominated
> by heavy-duty string processing. And those that are probably
> already use something like Text, and were written by people
> who know a thing or two about appropriate choice of data structures
> anyway.
>
> As to teaching:
>
>> I don't really
>> think that having an abstract type is such a big problem for teaching.
>> You can do string processing by doing (pack . myfunction . unpack)
>
> Here at Nottingham, we're teaching all our 1st-year undergraduates
> Haskell. It works, but it is a challenge, and, alas, far from everyone
> "gets" it. And this is despite the module being taught by one of
> the leading and most experienced Haskell educators (and text book
> author), Graham Hutton.
>
> Without starting an endless discussion about how to best teach
> programming languages in general and Haskell in particular to
> (near) beginners, I dare say that idioms like the one suggested
> above would do nothing to help.
>
> String != [Char] would break no end of code, text books, tutorials,
> lecture slides, would not help with teaching Haskell, all
> for very little if any benefit in the grand scheme of things.

OK, I agree that breaking text books is a big deal. On the other hand, the lack of a good Text data type forced text books to teach bad approaches to dealing with strings. Haskell should do better.

Johan mentioned both semantic and performance problems with Strings. A part he didn't stress is that Strings are also a horribly memory-inefficient way of storing strings. On 64 bit GHC systems a single ASCII character needs 16 bytes of memory (i.e., an overhead of 16x). A non-ASCII character (ord c > 255) actually requires 32 bytes. (This is due to a de-duplication optimisation in the GHC GC.) Other implementations may do better, but an abstract type would still be better to enable more freedom for implementors.

Correct handling of Unicode strings is a Hard Problem, and String = [Char] is only better if you ignore all the issues (which is certainly fine in a teaching environment).

I would be happy to have a simplistic String = [Char] coexist with a Text type if it weren't for the problem that so many things are biased towards String. E.g., error takes a String, Show is used everywhere and produces Strings, the pretty printing library uses Strings, Read parses Strings.

> On the other hand, a standardised, well thought-out, API for
> high-performance strings and appropriate mechanisms such
> as a measure of overloading to make it easy and palatable to
> use, and that work alongside the present String = [Char], would be a
> good thing.

As I said, while I'm not a huge fan of having two String types co-exist, I could accept it as a necessary trade-off to keep text books valid and preserve backwards compatibility. (There are also other issues with String. For example, you can't write an instance MyClass String in Haskell2010, and even with GHC extensions it seems wrong and you often end up writing instances that overlap with MyClass [a].)

I'm using Data.Text a lot, so I can work around the issue, but unfortunately you run into a lot of issues where the standard library forces the use of String, and that, I believe, is wrong. If changing the standard library is the bigger issue, however, then I'm not sure whether this discussion needs to take place on the haskell-prime list or on the libraries list.

/ Thomas
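The instance problem mentioned above can be sketched as follows. This is only an illustration: the class Describe and its method are hypothetical names, not anything from the thread or from a library. The sketch assumes the text package is available.

```haskell
import qualified Data.Text as T

-- Hypothetical class, used only for illustration.
class Describe a where
  describe :: a -> String

-- Accepted by Haskell 2010: the instance head is a type constructor
-- applied to a type variable.
instance Show a => Describe [a] where
  describe xs = "list of " ++ show (length xs) ++ " elements"

-- Also accepted by Haskell 2010: Text is an abstract type of its own,
-- not a synonym for a list.
instance Describe T.Text where
  describe t = "text: " ++ T.unpack t

-- Rejected by Haskell 2010, because String is a synonym for [Char].
-- With the FlexibleInstances extension it is accepted on its own, but it
-- then overlaps with the [a] instance above:
--
-- instance Describe String where
--   describe s = "string: " ++ s
```

With these definitions a String argument is handled by the generic list instance, so describe "abc" reports a list of three elements rather than treating the value as text.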
Re: String != [Char]
Hi all,

On Sat, Mar 24, 2012 at 12:39 AM, Heinrich Apfelmus wrote:
> Which brings me to the fundamental question behind this proposal: Why do we
> need Text at all? What are its virtues and how do they compare? What is the
> trade-off? (I'm not familiar enough with the Text library to answer these.)
>
> To put it very pointedly: is a 20% performance increase on the current
> generation of computers worth the cost in terms of ease-of-use, when the
> performance can equally be gained by buying a faster computer or more RAM?
> I'm not sure whether I even agree with this statement, but this is the
> trade-off we are deciding on.

Correctness
===========

Using list-based operations on Strings is almost always wrong, as soon as you move away from English text. You almost always have to deal with Unicode strings as blobs, considering several code points at once. For example,

    upcase :: String -> String
    upcase = map toUpper

is terse, beautiful, and wrong, as several languages map a single lowercase character to two uppercase characters (as I'm sure you're aware). Perhaps this is OK to ignore when teaching students Haskell, but it really hurts those who want to use Haskell as an engineering language.

Performance
===========

Depending on the benchmark, the difference can be much bigger than 20%. For example, here's a comparison of decoding UTF-8 byte data into a String vs a Text value:

    benchmarking Pure/decode/Text
    mean: 50.22202 us, lb 50.08306 us, ub 50.37669 us, ci 0.950
    std dev: 751.1139 ns, lb 666.2243 ns, ub 865.8246 ns, ci 0.950
    variance introduced by outliers: 7.553%
    variance is slightly inflated by outliers

    benchmarking Pure/decode/String
    mean: 188.0507 us, lb 187.4970 us, ub 188.6955 us, ci 0.950
    std dev: 3.053076 us, lb 2.647318 us, ub 3.606262 us, ci 0.950
    variance introduced by outliers: 9.407%
    variance is slightly inflated by outliers

A difference of almost 4x. Many of the Text vs String benchmarks measure the performance of operations ignoring both decoding and encoding, while any real application would have to do both.

On top of that, String is more or less as optimized as it can be; benchmarks are almost completely memory bound. Text, on the other hand, still has potential for (large) improvements, as GHC doesn't generate optimal code for tight loops over arrays. For example, we know that GHC generates bad code for decodeUtf8 as used by Text's stream fusion, hurting any code that uses fusion.

Furthermore, the memory overhead of Text is smaller, which means that applications that hold on to many string values will use less heap and thus experience smaller "freezes" due to major GC collections, which are linear in the heap size.

Cheers,
Johan
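A minimal sketch of the upcase point, assuming the text package is available. The names upcaseList, upcaseTextMap, upcaseText and sample are made up for this illustration. It contrasts the per-character mapping with the full case conversion in Data.Text, using the German sharp s, whose uppercase form is the two letters "SS".

```haskell
import           Data.Char (toUpper)
import qualified Data.Text as T

-- The list-based version from the message: one Char in, one Char out.
upcaseList :: String -> String
upcaseList = map toUpper

-- The same per-character mapping over Text; it shares the limitation.
upcaseTextMap :: T.Text -> T.Text
upcaseTextMap = T.map toUpper

-- Full case conversion from the text package; the result may be longer
-- than the input.
upcaseText :: T.Text -> T.Text
upcaseText = T.toUpper

-- "strasse" spelled with a sharp s; \223 is U+00DF.
sample :: String
sample = "stra\223e"
```

Here upcaseList and upcaseTextMap leave the sharp s unchanged, because no single uppercase Char exists for Data.Char.toUpper to return, while upcaseText yields "STRASSE" and so grows the text from six to seven characters.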
Long live String = [Char] (Was: Re: String != [Char])
Hi all,

Thomas Schilling wrote:

> I think most here agree that the main advantage of the current
> definition is only pedagogical.

But that in itself is not a small deal. In fact, it's a pretty major advantage.

Moreover, the utter simplicity of String = [Char] is a benefit in its own right. Let's not forget that this, in practice, across all Haskell applications, works just fine in the vast majority of cases.

I get the sense that the proponents of deprecating, and ultimately getting rid of, String = [Char], are suggesting that this would lead to noticeable performance improvements across the board by virtue of preventing programmers from accidentally making a poor choice of data structure for representing strings. But I conjecture that the performance impact of switching from e.g. String to Text at the level of complete applications would be negligible in most cases, simply because most Haskell applications are not dominated by heavy-duty string processing. And those that are probably already use something like Text, and were written by people who know a thing or two about appropriate choice of data structures anyway.

As to teaching:

> I don't really
> think that having an abstract type is such a big problem for teaching.
> You can do string processing by doing (pack . myfunction . unpack)

Here at Nottingham, we're teaching all our 1st-year undergraduates Haskell. It works, but it is a challenge, and, alas, far from everyone "gets" it. And this is despite the module being taught by one of the leading and most experienced Haskell educators (and text book author), Graham Hutton.

Without starting an endless discussion about how to best teach programming languages in general and Haskell in particular to (near) beginners, I dare say that idioms like the one suggested above would do nothing to help.

String != [Char] would break no end of code, text books, tutorials, lecture slides, would not help with teaching Haskell, all for very little if any benefit in the grand scheme of things. So let's not go there.

On the other hand, a standardised, well thought-out, API for high-performance strings and appropriate mechanisms such as a measure of overloading to make it easy and palatable to use, and that work alongside the present String = [Char], would be a good thing.

All the best,

/Henrik

--
Henrik Nilsson
School of Computer Science
The University of Nottingham
n...@cs.nott.ac.uk
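For readers who have not seen the (pack . myfunction . unpack) idiom under discussion, here is a small sketch of it, assuming the text package. The names trimLeft, viaList and trimLeftText are hypothetical and chosen only for this example.

```haskell
import           Data.Char (isSpace)
import qualified Data.Text as T

-- A beginner-style list function written with pattern matching.
trimLeft :: String -> String
trimLeft (c:cs) | isSpace c = trimLeft cs
trimLeft s                  = s

-- The idiom: convert to a list, run the list function, convert back.
viaList :: (String -> String) -> T.Text -> T.Text
viaList f = T.pack . f . T.unpack

-- The list function lifted to Text.
trimLeftText :: T.Text -> T.Text
trimLeftText = viaList trimLeft
```

The lifted function builds an intermediate list on every call, so it keeps the list-style definition at the cost of the performance and memory advantages argued for elsewhere in the thread. For this particular task the text package already offers T.stripStart.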
Re: String != [Char]
Edward Kmett wrote:
> Like I said, my objection to including Text is a lot less strong than my
> feelings on any notion of deprecating String.
>
> [..]
>
> The pedagogical concern is quite real, remember many introductory language
> classes have time to present Haskell and the list data type and not much
> else. Showing parsing through pattern matching on strings makes a very
> powerful tool, it's harder to show that with Text.
>
> [..]
>
> The major benefits of Text come from FFI opportunities, but even there if
> you dig into its internals it has to copy out of the array to talk to
> foreign functions because it lives in unpinned memory unlike ByteString.

I agree with Edward Kmett on the virtues of String = [Char] for learning Haskell. I'm teaching beginners regularly and it is simply eye-opening for them that they can use the familiar list operations to solve real world problems, which usually involve textual data.

Which brings me to the fundamental question behind this proposal: Why do we need Text at all? What are its virtues and how do they compare? What is the trade-off? (I'm not familiar enough with the Text library to answer these.)

To put it very pointedly: is a 20% performance increase on the current generation of computers worth the cost in terms of ease-of-use, when the performance can equally be gained by buying a faster computer or more RAM? I'm not sure whether I even agree with this statement, but this is the trade-off we are deciding on.

Best regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com
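To make the pedagogical point about parsing by pattern matching concrete, here is a small sketch that reads a leading natural number, first on String and then on Text. The function names parseNat and parseNatT are invented for this illustration, and the Text version assumes the text package and uses T.uncons in place of the (c:cs) pattern.

```haskell
import           Data.Char (digitToInt, isDigit)
import qualified Data.Text as T

-- String version: the structure of the input is visible in the patterns.
parseNat :: String -> Maybe (Int, String)
parseNat (c:cs) | isDigit c = Just (go (digitToInt c) cs)
  where
    go acc (d:ds) | isDigit d = go (acc * 10 + digitToInt d) ds
    go acc rest               = (acc, rest)
parseNat _ = Nothing

-- Text version: the same logic, but each step goes through T.uncons
-- because Text is abstract and has no constructors to match on.
parseNatT :: T.Text -> Maybe (Int, T.Text)
parseNatT t = case T.uncons t of
  Just (c, rest) | isDigit c -> Just (go (digitToInt c) rest)
  _                          -> Nothing
  where
    go acc s = case T.uncons s of
      Just (d, ds) | isDigit d -> go (acc * 10 + digitToInt d) ds
      _                        -> (acc, s)
```

Both functions return the parsed number together with the unconsumed input. The Text version is not much longer, but it replaces list patterns with case expressions over T.uncons, which is the extra concept a beginner would have to learn.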