Re: String != [Char]
On Mon, Mar 19, 2012 at 2:55 PM, Daniel Peebles wrote:
> If the input is specified to be UTF-8, wouldn't it be better to call the
> method unpackUTF8 or something like that?

Sure.

-- Johan
___
Haskell-prime mailing list
Haskell-prime@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-prime
Re: String != [Char]
If the input is specified to be UTF-8, wouldn't it be better to call the method unpackUTF8 or something like that?

On Mon, Mar 19, 2012 at 12:59 PM, Johan Tibell wrote:
> On Mon, Mar 19, 2012 at 9:02 AM, Christian Siefkes wrote:
> > On 03/19/2012 04:53 PM, Johan Tibell wrote:
> >> I've been thinking about this question as well. How about
> >>
> >> class IsString s where
> >>   unpackCString :: Ptr Word8 -> CSize -> s
> >
> > What's the Ptr Word8 supposed to contain? A UTF-8 encoded string?
>
> Yes.
>
> We could make a distinction between byte and Unicode literals and have:
>
> class IsBytes a where
>   unpackBytes :: Ptr Word8 -> Int -> a
>
> class IsText a where
>   unpackText :: Ptr Word8 -> Int -> a
>
> In the latter the caller guarantees that the passed in pointer points
> to wellformed UTF-8 data.
>
> -- Johan
Re: String != [Char]
On Mon, Mar 19, 2012 at 15:39, Simon Peyton-Jones wrote:
> Don't forget that with -XOverloadedStrings we already have an IsString
> class. (That's not a Haskell Prime extension though.)

I think that's exactly the point; currently it uses [Char] as the initial format and converts at runtime, which is rather unfortunate given the inefficiency of [Char]. If it has to be done at runtime, it would be nice to at least do it from a more efficient initial format.

-- brandon s allbery
RE: String != [Char]
Don't forget that with -XOverloadedStrings we already have an IsString class. (That's not a Haskell Prime extension though.)

class IsString a where
  fromString :: String -> a

Simon

| -----Original Message-----
| From: haskell-prime-boun...@haskell.org [mailto:haskell-prime-boun...@haskell.org] On Behalf Of Johan Tibell
| Sent: 19 March 2012 15:54
| To: Thomas Schilling
| Cc: haskell-prime@haskell.org
| Subject: Re: String != [Char]
|
| On Mon, Mar 19, 2012 at 8:45 AM, Thomas Schilling wrote:
| > Regarding the type class for converting to and from that type, there
| > is a perhaps more complicated question: The current fromString method
| > uses String as the source type which causes unnecessary overhead. This
| > is unfortunate since GHC's built-in mechanism actually uses
| > unpackCString[Utf8]# which constructs the inefficient String
| > representation from a compact memory representation. I think it would
| > be best if the new fromString/fromText class allowed an efficient
| > mechanism like that. unpackCString# has type Addr# -> [Char] which is
| > obviously GHC-specific.
|
| I've been thinking about this question as well. How about
|
| class IsString s where
|   unpackCString :: Ptr Word8 -> CSize -> s
|
| It's morally equivalent to unpackCString#, but uses standard Haskell types.
|
| -- Johan
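For readers who have not used the extension, here is a minimal sketch of how the existing IsString class is used with -XOverloadedStrings. The Name type and the listName binding are hypothetical, made up for this illustration only; the class and its fromString method are the ones from Data.String in base.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.String (IsString (..))

-- Hypothetical string-like type; here it simply wraps [Char].
newtype Name = Name String
  deriving (Eq, Show)

-- The literal is first built as a String and only then converted,
-- which is the overhead discussed in this thread.
instance IsString Name where
  fromString = Name

-- With OverloadedStrings this literal means: fromString "haskell-prime".
listName :: Name
listName = "haskell-prime"
```

The conversion happens at runtime through fromString, so a type with a compact representation still pays for the intermediate [Char].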
Re: String != [Char]
This is the best I can do with Bryan's blog posts, but none of the graphs (which contain all the information) show up:

http://web.archive.org/web/20100222031602/http://www.serpentine.com/blog/2009/12/10/the-performance-of-data-text/

If someone has some benchmarks that can be run, that would be helpful.

On Mon, Mar 19, 2012 at 7:51 AM, Johan Tibell wrote:
> Hi Greg,
>
> There are a few blog posts on Bryan's blog. Here are two of them:
>
> http://www.serpentine.com/blog/2009/10/09/announcing-a-major-revision-of-the-haskell-text-library/
> http://www.serpentine.com/blog/2009/12/10/the-performance-of-data-text/
>
> Unfortunately the blog seems partly broken. Images are missing and
> some articles are missing altogether (i.e. the article is there but
> the actual body text is gone.)
>
> -- Johan
Re: String != [Char]
On Mon, Mar 19, 2012 at 9:02 AM, Christian Siefkes wrote:
> On 03/19/2012 04:53 PM, Johan Tibell wrote:
>> I've been thinking about this question as well. How about
>>
>> class IsString s where
>>   unpackCString :: Ptr Word8 -> CSize -> s
>
> What's the Ptr Word8 supposed to contain? A UTF-8 encoded string?

Yes.

We could make a distinction between byte and Unicode literals and have:

class IsBytes a where
  unpackBytes :: Ptr Word8 -> Int -> a

class IsText a where
  unpackText :: Ptr Word8 -> Int -> a

In the latter, the caller guarantees that the passed-in pointer points to well-formed UTF-8 data.

-- Johan
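To make the two-class proposal concrete, here is a rough sketch of what instances might look like. The class names and method signatures are the ones proposed in this message; everything else is an assumption for illustration: the Bytes and Chars newtypes, the demoBytes and demoText helpers, and the use of unsafePerformIO, which is only defensible if the buffer is immutable and outlives the result (as compiler-emitted literal data would). The UTF-8 decoding relies on GHC.Foreign.peekCStringLen, so this sketch is GHC-specific.

```haskell
import Control.Exception (evaluate)
import Data.Word (Word8)
import Foreign.Marshal.Array (peekArray, withArrayLen)
import Foreign.Ptr (Ptr, castPtr)
import qualified GHC.Foreign as GHC
import System.IO (utf8)
import System.IO.Unsafe (unsafePerformIO)

-- The two classes as proposed in the thread.
class IsBytes a where
  unpackBytes :: Ptr Word8 -> Int -> a

class IsText a where
  unpackText :: Ptr Word8 -> Int -> a

-- Hypothetical target types, kept as newtypes so the instances are Haskell 2010.
newtype Bytes = Bytes [Word8] deriving (Eq, Show)
newtype Chars = Chars String deriving (Eq, Show)

-- Assumption: the buffer is immutable and outlives the result.
instance IsBytes Bytes where
  unpackBytes p n = Bytes (unsafePerformIO (peekArray n p))

-- Assumption: the caller guarantees well-formed UTF-8, as the proposal requires.
instance IsText Chars where
  unpackText p n =
    Chars (unsafePerformIO (GHC.peekCStringLen utf8 (castPtr p, n)))

-- Stand-ins for compiler-generated literal data: the two bytes C3 A9 are the
-- UTF-8 encoding of U+00E9. The result is forced before the temporary
-- buffer is released.
demoBytes :: IO Bytes
demoBytes = withArrayLen [0xC3, 0xA9] $ \n p -> evaluate (unpackBytes p n)

demoText :: IO Chars
demoText = withArrayLen [0xC3, 0xA9] $ \n p -> evaluate (unpackText p n)
```

The same two bytes yield a two-element byte sequence through IsBytes and a single character through IsText, which is the distinction between byte and Unicode literals that the message draws.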
Re: String != [Char]
On 03/19/2012 04:53 PM, Johan Tibell wrote:
> I've been thinking about this question as well. How about
>
> class IsString s where
>   unpackCString :: Ptr Word8 -> CSize -> s

What's the Ptr Word8 supposed to contain? A UTF-8 encoded string?

Best regards
Christian
Re: String != [Char]
On Mon, Mar 19, 2012 at 8:45 AM, Thomas Schilling wrote:
> Regarding the type class for converting to and from that type, there
> is a perhaps more complicated question: The current fromString method
> uses String as the source type which causes unnecessary overhead. This
> is unfortunate since GHC's built-in mechanism actually uses
> unpackCString[Utf8]# which constructs the inefficient String
> representation from a compact memory representation. I think it would
> be best if the new fromString/fromText class allowed an efficient
> mechanism like that. unpackCString# has type Addr# -> [Char] which is
> obviously GHC-specific.

I've been thinking about this question as well. How about

class IsString s where
  unpackCString :: Ptr Word8 -> CSize -> s

It's morally equivalent to unpackCString#, but uses standard Haskell types.

-- Johan
Re: String != [Char]
On 18 March 2012 19:29, ARJANEN Loïc Jean David wrote:
> Good point, but rather than specifying in the standard that the new string
> type should be the Text datatype, maybe the new definition should be that
> String is a newtype with suitable operations defined on it, and perhaps a
> typeclass to convert to and from this newtype. The reason of my remark is
> although most implementations compile to native code, an implementation
> compiling to, for example, JavaScript might wish to use JavaScript's string
> type rather than forcing its users to have a native library installed.

I agree that the language standard should not prescribe the implementation of a Text datatype. It should instead require an abstract data type (which may just be a newtype wrapper for [Char] in some implementations) and a (minimal) set of operations on it.

Regarding the type class for converting to and from that type, there is a perhaps more complicated question: The current fromString method uses String as the source type which causes unnecessary overhead. This is unfortunate since GHC's built-in mechanism actually uses unpackCString[Utf8]# which constructs the inefficient String representation from a compact memory representation. I think it would be best if the new fromString/fromText class allowed an efficient mechanism like that. unpackCString# has type Addr# -> [Char] which is obviously GHC-specific.

--
Push the envelope. Watch it bend.
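As a small illustration of the "abstract data type plus a minimal set of operations" idea, the following sketch shows the simplest possible implementation, a newtype over [Char]. The names Str, fromChars, toChars, append and lengthStr are hypothetical and not part of any proposal in this thread; in a real library the module would export Str without its constructor, so that another implementation could substitute a packed or platform-native representation.

```haskell
-- Hypothetical abstract string type. A real module would hide the Str
-- constructor in its export list so the representation can change.
newtype Str = Str [Char]
  deriving (Eq, Ord)

-- Conversions to and from the legacy representation.
fromChars :: [Char] -> Str
fromChars = Str

toChars :: Str -> [Char]
toChars (Str cs) = cs

-- A deliberately small set of operations defined on the abstract type.
append :: Str -> Str -> Str
append (Str a) (Str b) = Str (a ++ b)

lengthStr :: Str -> Int
lengthStr (Str cs) = length cs
```

Client code written only against these functions would not notice whether Str is a list of characters, a packed array, or a JavaScript string.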
Re: String != [Char]
On 17 March 2012 01:44, Greg Weber wrote:
> the text library and Text data type have shown the worth in real world
> Haskell usage with GHC.
> I try to avoid String whenever possible, but I still have to deal with
> conversions and other issues.
> There is a lot of real work to be done to convert away from [Char],
> but I think we need to take it out of the language definition as a
> first step.

I'm pretty sure the majority of people would agree that if we were making the Haskell standard nowadays we'd make the String type abstract.

Unfortunately I fear making the change now will be quite disruptive, though I don't think we've collectively put much effort yet into working out just how disruptive.

In principle I'd support changing to reduce the number of string types used in interfaces. From painful professional experience, I think that one of the biggest places where C++ went wrong was not having a single string type that everyone would use (I once had to write a C++ component integrating code that used 5 different string types). Like Python 3, we should have two common string types used in interfaces: string and bytes (with implementations like our current Text and ByteString).

BTW, I don't think taking it out of the language would be a helpful step. We actually want to tell people "use *this* string type in interfaces", not leave everyone to make their own choice. I think taking it out of the language would tend to encourage everyone to make their own choice.

Duncan
Re: String != [Char]
Hi Greg,

There are a few blog posts on Bryan's blog. Here are two of them:

http://www.serpentine.com/blog/2009/10/09/announcing-a-major-revision-of-the-haskell-text-library/
http://www.serpentine.com/blog/2009/12/10/the-performance-of-data-text/

Unfortunately the blog seems partly broken. Images are missing and some articles are missing altogether (i.e. the article is there but the actual body text is gone.)

-- Johan
Re: String != [Char]
I actually was not able to successfully google for Text vs. String benchmarks. If someone can point one out, that would be very helpful.

On Sat, Mar 17, 2012 at 1:52 AM, Christopher Done wrote:
> On 17 March 2012 05:30, Tony Morris wrote:
>> Do you know if there is a good write-up of the benefits of Data.Text
>> over String? I'm aware of the advantages just by my own usage; hoping
>> someone has documented it rather than in our heads.
>
> Good point, it would be good to collate the experience and wisdom of
> this decision with some benchmark results on the HaskellWiki as The
> Place to link to when justifying it.
Re: What is a punctuation character?
Iavor> report? My understanding is that the intention is that the
Iavor> alphabet is unicode codepoints (sometimes referred to as
Iavor> unicode characters).

Unicode characters are not the same as Unicode codepoints. What we want is Unicode characters. We don't want to be able to write a Unicode codepoint, as that would permit writing half of a surrogate pair, which is malformed Unicode.

--
Colin Adams
Preston Lancashire
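The distinction can be checked mechanically. The sketch below assumes GHC's base library, where Char ranges over all code points up to U+10FFFF, surrogates included, and Data.Char.generalCategory reports the Surrogate category for them. The name isScalarValue is made up for this example.

```haskell
import Data.Char (GeneralCategory (Surrogate), generalCategory)

-- A Unicode scalar value is any code point that is not a surrogate
-- (that is, anything outside U+D800..U+DFFF).
isScalarValue :: Char -> Bool
isScalarValue c = generalCategory c /= Surrogate
```

Under that assumption, isScalarValue '\xD800' is False while isScalarValue 'a' is True, so a lexer that wanted to reject lone surrogates could filter on this predicate.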
Re: What is a punctuation character?
On Mon, Mar 19, 2012 at 5:36 AM, Brandon Allbery wrote:
> On Mon, Mar 19, 2012 at 05:56, Gabriel Dos Reis wrote:
>>
>> The fact that the Report is silent about encoding used to
>> represent concrete Haskell programs in text files adds
>> a certain level of non-portability (and confusion.) I found
>
> Specifying the encoding can *also* limit portability, if you specify an
> encoding that is not widely supported on some target platform.

That is why I find the pragma suggestion attractive.

-- Gaby
Re: What is a punctuation character?
On Mon, Mar 19, 2012 at 05:56, Gabriel Dos Reis wrote:
> The fact that the Report is silent about encoding used to
> represent concrete Haskell programs in text files adds
> a certain level of non-portability (and confusion.) I found

Specifying the encoding can *also* limit portability, if you specify an encoding that is not widely supported on some target platform. (Please try to remember that the universe is not composed solely of Windows and Linux. The fact that those are the only ones you care about is not relevant to the standard; nor is the list of platforms that GHC or any other implementation supports.)

Encoding does not belong in the language standard; it is an aspect of implementing the language standard on a given platform.

-- brandon s allbery
Re: What is a punctuation character?
On Mon, Mar 19, 2012 at 4:34 AM, Simon Marlow wrote:
>> On Fri, Mar 16, 2012 at 6:49 PM, Ian Lynagh wrote:
>> > Hi Gaby,
>> >
>> > On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:
>> >>
>> >> OK, thanks! I guess a take away from this discussion is that what is
>> >> a punctuation is far less well defined than it appears...
>> >
>> > I'm not really sure what you're asking. Haskell's uniSymbol includes
>> > all Unicode characters (should that be codepoints? I'm not a Unicode
>> > expert) in the punctuation category; I'm not sure what the best
>> > reference is, but e.g. table 12 in
>> > http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values
>> > lists a number of Px categories, and a meta-category P "Punctuation".
>> >
>> > Thanks
>> > Ian
>>
>> Hi Ian,
>>
>> I guess what I am asking was partly summarized in Iavor's message.
>>
>> For me, the issue started with bullet number 4 in section 1.1
>>
>> http://www.haskell.org/onlinereport/intro.html#sect1.1
>>
>> which states that:
>>
>>    The lexical structure captures the concrete representation
>>    of Haskell programs in text files.
>>
>> That combined with the opening section 2.1 (e.g. example of terminal
>> syntax) and the fact that the grammar routinely described two non-terminals
>> ascXXX (for ASCII characters) and uniXXX for (Unicode character)
>> suggested that the concrete syntax of Haskell programs in text files is in
>> ASCII charset. Note this does not conflict with the general statement
>> that Haskell programs use the Unicode character because the uniXXX could
>> use the ASCII charset to introduce Unicode characters -- this is not
>> uncommon practice for programming languages using Unicode characters; see
>> the link I gave earlier.
>>
>> However, if I understand Malcolm's message correctly, this is not the
>> case.
>> Contrary to what I quoted above, Chapter 2 does NOT specify the concrete
>> representation of Haskell programs in text files. What it does is to
>> capture the structure of what is obtained from interpreting, *in some
>> unspecified encoding or unspecified alphabet*, the concrete
>> representation of Haskell programs in text files. This conclusion is
>> unfortunate, but I believe it is correct.
>> Since the encoding or the alphabet is unspecified, it is no longer
>> necessarily the case that two Haskell implementations would agree on the
>> same lexical interpretation when presented with the same exact text file
>> containing a Haskell program.
>>
>> In its current form, you are correct that the Report should say
>> "codepoint" instead of characters.
>>
>> I join Iavor's request in clarifying the alphabet used in the grammar.
>
> The report gives meaning to a sequence of codepoints only, it says nothing
> about how that sequence of codepoints is represented as a string of bytes in
> a file, nor does it say anything about what those files are called, or even
> whether there are files at all.

Thanks, Simon.

The fact that the Report is silent about encoding used to represent concrete Haskell programs in text files adds a certain level of non-portability (and confusion.) I found last night that a proposal has been made to add some support for encoding specification

http://hackage.haskell.org/trac/haskell-prime/wiki/UnicodeInHaskellSource

I believe that is a good start. What are the odds of it being considered for Haskell 2012?

I suspect the pragma proposal works only if something is said about the position of that pragma in the source file (e.g. it must be the first line, or within the first N bytes of the source file); otherwise we have an infinite descent.

> Perhaps some clarification is in order in a future revision, and we should
> use the correct terminology where appropriate. We should also clarify that
> "punctuation" means exactly the Punctuation class.

That would be great. Do you have any comment about the UnicodeInHaskellSource proposal?

> With regards to normalisation and equivalence, my understanding is that
> Haskell does not support either: two identifiers are equal if and only if
> they are represented by the same sequence of codepoints. Again, we could add
> a clarifying sentence to the report.

Ugh. Writing a parser for Haskell was an interesting exercise :-)

-- Gaby
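The point about normalisation can be illustrated with ordinary Haskell strings, which also compare code point by code point. This is only a sketch: the names precomposed, decomposed and sameIdentifier are hypothetical, and the example models identifiers as String values rather than calling into any real lexer.

```haskell
-- "é" written as the single code point U+00E9.
precomposed :: String
precomposed = "\233"

-- "é" written as U+0065 followed by the combining acute accent U+0301.
decomposed :: String
decomposed = "e\769"

-- Equality with no normalisation: the two spellings above are different
-- identifiers, although they render identically.
sameIdentifier :: String -> String -> Bool
sameIdentifier = (==)
```

Two source files that look the same on screen could therefore define distinct names if one editor emits the precomposed form and another the decomposed form.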
RE: What is a punctuation character?
> On Fri, Mar 16, 2012 at 6:49 PM, Ian Lynagh wrote:
> > Hi Gaby,
> >
> > On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:
> >>
> >> OK, thanks! I guess a take away from this discussion is that what is
> >> a punctuation is far less well defined than it appears...
> >
> > I'm not really sure what you're asking. Haskell's uniSymbol includes
> > all Unicode characters (should that be codepoints? I'm not a Unicode
> > expert) in the punctuation category; I'm not sure what the best
> > reference is, but e.g. table 12 in
> > http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values
> > lists a number of Px categories, and a meta-category P "Punctuation".
> >
> > Thanks
> > Ian
>
> Hi Ian,
>
> I guess what I am asking was partly summarized in Iavor's message.
>
> For me, the issue started with bullet number 4 in section 1.1
>
> http://www.haskell.org/onlinereport/intro.html#sect1.1
>
> which states that:
>
>    The lexical structure captures the concrete representation
>    of Haskell programs in text files.
>
> That combined with the opening section 2.1 (e.g. example of terminal
> syntax) and the fact that the grammar routinely described two non-terminals
> ascXXX (for ASCII characters) and uniXXX for (Unicode character)
> suggested that the concrete syntax of Haskell programs in text files is in
> ASCII charset. Note this does not conflict with the general statement
> that Haskell programs use the Unicode character because the uniXXX could
> use the ASCII charset to introduce Unicode characters -- this is not
> uncommon practice for programming languages using Unicode characters; see
> the link I gave earlier.
>
> However, if I understand Malcolm's message correctly, this is not the
> case.
> Contrary to what I quoted above, Chapter 2 does NOT specify the concrete
> representation of Haskell programs in text files. What it does is to
> capture the structure of what is obtained from interpreting, *in some
> unspecified encoding or unspecified alphabet*, the concrete
> representation of Haskell programs in text files. This conclusion is
> unfortunate, but I believe it is correct.
> Since the encoding or the alphabet is unspecified, it is no longer
> necessarily the case that two Haskell implementations would agree on the
> same lexical interpretation when presented with the same exact text file
> containing a Haskell program.
>
> In its current form, you are correct that the Report should say
> "codepoint" instead of characters.
>
> I join Iavor's request in clarifying the alphabet used in the grammar.

The report gives meaning to a sequence of codepoints only, it says nothing about how that sequence of codepoints is represented as a string of bytes in a file, nor does it say anything about what those files are called, or even whether there are files at all.

Perhaps some clarification is in order in a future revision, and we should use the correct terminology where appropriate. We should also clarify that "punctuation" means exactly the Punctuation class.

With regards to normalisation and equivalence, my understanding is that Haskell does not support either: two identifiers are equal if and only if they are represented by the same sequence of codepoints. Again, we could add a clarifying sentence to the report.

Cheers,
Simon
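One way to pin down "exactly the Punctuation class" is to enumerate the Px general categories explicitly. The sketch below uses the GeneralCategory constructors from Data.Char in base; the names punctuationCategories, inPunctuationClass and agreesWithBase are hypothetical, and the claim that this matches what a given compiler's lexer accepts is an assumption, not something the thread establishes.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isPunctuation)

-- The Px categories grouped under the meta-category P ("Punctuation").
punctuationCategories :: [GeneralCategory]
punctuationCategories =
  [ ConnectorPunctuation
  , DashPunctuation
  , OpenPunctuation
  , ClosePunctuation
  , InitialQuote
  , FinalQuote
  , OtherPunctuation
  ]

-- True when the character's general category is one of the Px categories.
inPunctuationClass :: Char -> Bool
inPunctuationClass c = generalCategory c `elem` punctuationCategories

-- Sanity check against base's own predicate over the ASCII range.
agreesWithBase :: Bool
agreesWithBase =
  all (\c -> inPunctuationClass c == isPunctuation c) ['\0' .. '\127']
```

Note that '+' falls in the Math Symbol category rather than any Px category, so it is a symbol but not punctuation under this reading, whereas '!' and '_' are both punctuation.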