Re: What is a punctuation character?
On Tue, Mar 20, 2012 at 5:37 PM, Iavor Diatchki wrote:
> Hello,
>
> So I looked at what GHC does with Unicode and to me it seems quite
> reasonable:
>
> * The alphabet is Unicode code points, so a valid Haskell program is
>   simply a list of those.
> * Combining characters are not allowed in identifiers, so no need for
>   complex normalization rules: programs should always use the "short"
>   version of a character, or be rejected.
> * Combining characters may appear in string literals, and there they
>   are left "as is" without any modification (so some string literals may
>   be longer than what's displayed in a text editor.)
>
> Perhaps this is simply what the report already states (I haven't
> checked, for which I apologize) but, if not, perhaps we should clarify
> things.
>
> -Iavor
> PS: I don't think that there is any need to specify a particular
> representation for the unicode code-points (e.g., utf-8 etc.) in the
> language standard.

Thanks Iavor.

If the report intended to talk about code points only (and indeed
ruling out normalization suggests that), then the Report needs to be
clarified. As you know, there is a distinction between a Unicode code
point and a Unicode character

    http://www.unicode.org/versions/Unicode6.0.0/ch02.pdf#G25564

Until I sent my original query, I had been reading the Report as meaning
Unicode characters (as the grammar seemed to suggest), but now it is
clear to me that only code points were intended. That seemed to be
confirmed by your investigation of the GHC code base.

-- Gaby

___
Haskell-prime mailing list
Haskell-prime@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-prime
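To make the code point versus displayed character distinction concrete, here is a minimal sketch using only Data.Char from base. The names precomposed, decomposed and codePoints are made up for this illustration. It assumes, as described above, that string literals are kept as plain lists of code points with no normalization applied.

```haskell
import Data.Char (isMark, ord)

-- Precomposed form: the single code point U+00F1
-- (LATIN SMALL LETTER N WITH TILDE).
precomposed :: String
precomposed = "\x00F1"

-- Decomposed form: 'n' followed by U+0303 (COMBINING TILDE).
-- An editor typically renders this the same as the precomposed form.
decomposed :: String
decomposed = "n\x0303"

-- The code point value of every element of a String
-- (hypothetical helper, not part of any library).
codePoints :: String -> [Int]
codePoints = map ord
```

Under that assumption, length precomposed is 1 and length decomposed is 2, and the two strings compare as unequal even though they usually look identical on screen. Data.Char.isMark selects the combining code point in the second string.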
Re: What is a punctuation character?
Hello,

So I looked at what GHC does with Unicode and to me it seems quite
reasonable:

* The alphabet is Unicode code points, so a valid Haskell program is
  simply a list of those.
* Combining characters are not allowed in identifiers, so no need for
  complex normalization rules: programs should always use the "short"
  version of a character, or be rejected.
* Combining characters may appear in string literals, and there they
  are left "as is" without any modification (so some string literals may
  be longer than what's displayed in a text editor.)

Perhaps this is simply what the report already states (I haven't
checked, for which I apologize) but, if not, perhaps we should clarify
things.

-Iavor
PS: I don't think that there is any need to specify a particular
representation for the unicode code-points (e.g., utf-8 etc.) in the
language standard.

On Fri, Mar 16, 2012 at 6:23 PM, Iavor Diatchki wrote:
> Hello,
> I am also not an expert but I got curious and did a bit of Wikipedia
> reading. Based on what I understood, here are two (related) questions
> that it might be nice to clarify in a future version of the report:
>
> 1. What is the alphabet used by the grammar in the Haskell report? My
> understanding is that the intention is that the alphabet is unicode
> codepoints (sometimes referred to as unicode characters). There is no
> way to refer to specific code-points by escaping as in Java (the link
> that Gaby shared), you just have to write the code-points directly
> (and there are plenty of encodings for doing that, e.g. UTF-8 etc.)
>
> 2. Do we respect "unicode equivalence"
> (http://en.wikipedia.org/wiki/Canonical_equivalence) in Haskell source
> code? The issue here is that, apparently, some sequences of unicode
> code points/characters are supposed to be morally the same. For
> example, it would appear that there are two different ways to write
> the Spanish letter ñ: it has its own number, but it can also be made
> by writing "n" followed by a modifier to put the wavy sign on top.
>
> I would guess that implementing "unicode equivalence" would not be
> too hard---supposedly the unicode standard specifies a "text
> normalization procedure". However, this would complicate the report
> specification, because now the alphabet becomes not just unicode
> code-points, but equivalence classes of code points.
>
> Thoughts?
>
> -Iavor
>
> On Fri, Mar 16, 2012 at 4:49 PM, Ian Lynagh wrote:
>>
>> Hi Gaby,
>>
>> On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:
>>>
>>> OK, thanks! I guess a take away from this discussion is that what
>>> is a punctuation is far less well defined than it appears...
>>
>> I'm not really sure what you're asking. Haskell's uniSymbol includes all
>> Unicode characters (should that be codepoints? I'm not a Unicode expert)
>> in the punctuation category; I'm not sure what the best reference is,
>> but e.g. table 12 in
>> http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values
>> lists a number of Px categories, and a meta-category P "Punctuation".
>>
>> Thanks
>> Ian
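For anyone who wants to check which code points fall into the Px categories Ian mentions, here is a small sketch using Data.Char from base. The names punctuationCategories, inPunctuationCategory and symbolOrPunctuation are invented for this note. The answers depend on the Unicode tables shipped with the compiler's base library, so treat this as an illustration and not as a statement of what the Report requires.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isPunctuation, isSymbol)

-- The seven general categories grouped under the meta-category
-- P "Punctuation": Pc, Pd, Ps, Pe, Pi, Pf and Po.
punctuationCategories :: [GeneralCategory]
punctuationCategories =
  [ ConnectorPunctuation
  , DashPunctuation
  , OpenPunctuation
  , ClosePunctuation
  , InitialQuote
  , FinalQuote
  , OtherPunctuation
  ]

-- True when the code point's general category is one of the Px categories.
inPunctuationCategory :: Char -> Bool
inPunctuationCategory c = generalCategory c `elem` punctuationCategories

-- A rough reading of "symbol or punctuation" (hypothetical helper);
-- it ignores the exclusions the grammar makes for special characters.
symbolOrPunctuation :: Char -> Bool
symbolOrPunctuation c = isSymbol c || isPunctuation c
```

With this, '!' and '-' count as punctuation, while '+' is a math symbol that only symbolOrPunctuation accepts. All of these functions take a single Char, that is, a single code point.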
Re: String != [Char]
Hi,

Thomas Schilling wrote:
> I agree that the language standard should not prescribe the
> implementation of a Text datatype. It should instead require an
> abstract data type (which may just be a newtype wrapper for [Char] in
> some implementations) and a (minimal) set of operations on it.
>
> Regarding the type class for converting to and from that type, there
> is a perhaps more complicated question: The current fromString method
> uses String as the source type which causes unnecessary overhead.

Is this still a problem if String were replaced by an
implementation-dependent newtype? Presumably, GHC would use a more
efficient representation behind the newtype, so the following would be
efficient in practice (or not?)

    newtype String = ...

    class IsString a where
      fromString :: String -> a

The standard could even prescribe that an instance for [Char] exists:

    explode :: String -> [Char]
    explode = ...

    instance IsString [Char] where
      fromString = explode

  Tillmann
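A compilable version of this idea might look as follows. This is only a sketch under several assumptions: the newtype is the trivial [Char] wrapper (an implementation would presumably pick something more compact), the constructor MkString and the helper implode are hypothetical names, Prelude's own String synonym is hidden to free the name, and GHC's FlexibleInstances extension is switched on because an instance for [Char] is not a legal Haskell 2010 instance head.

```haskell
{-# LANGUAGE FlexibleInstances #-}

import Prelude hiding (String)

-- Abstract string type. Here it is just a wrapper around [Char];
-- the representation is meant to be implementation-dependent.
newtype String = MkString [Char]

-- Conversion class with the abstract type as its source.
class IsString a where
  fromString :: String -> a

-- Convert the abstract string to a list of characters.
explode :: String -> [Char]
explode (MkString cs) = cs

-- Hypothetical inverse of explode, used to build values for testing.
implode :: [Char] -> String
implode = MkString

-- The instance the standard could prescribe.
instance IsString [Char] where
  fromString = explode
```

For example, fromString (implode "abc") :: [Char] gives back "abc".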
Re: String != [Char]
On Tue, Mar 20, 2012 at 2:25 AM, Simon Marlow wrote:
> Is there a reason not to put all these methods in the IsString class, with
> appropriate default definitions? You would need a UTF-8 encoder (& decoder)
> of course, but it would reduce the burden on clients and improve backwards
> compatibility.

That sounds fine to me. I'm leaning towards only having
unpackUTF8String (in addition to the existing method), as in the
absence of proper byte literals we would have literals which change
types, depending on which bytes they contain*. Ugh!

* Is it even possible to create non-UTF8 literals without using
escape sequences?

-- Johan
RE: String != [Char]
> On Mon, Mar 19, 2012 at 9:02 AM, Christian Siefkes wrote:
> > On 03/19/2012 04:53 PM, Johan Tibell wrote:
> >> I've been thinking about this question as well. How about
> >>
> >> class IsString s where
> >>     unpackCString :: Ptr Word8 -> CSize -> s
> >
> > What's the Ptr Word8 supposed to contain? A UTF-8 encoded string?
>
> Yes.
>
> We could make a distinction between byte and Unicode literals and have:
>
> class IsBytes a where
>     unpackBytes :: Ptr Word8 -> Int -> a
>
> class IsText a where
>     unpackText :: Ptr Word8 -> Int -> a
>
> In the latter the caller guarantees that the passed-in pointer points to
> well-formed UTF-8 data.

Is there a reason not to put all these methods in the IsString class, with
appropriate default definitions? You would need a UTF-8 encoder (& decoder)
of course, but it would reduce the burden on clients and improve backwards
compatibility.

Cheers,
Simon
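One way the "single class with default definitions" idea could look is sketched below. Several things here are assumptions made for illustration: a plain [Word8] list stands in for the Ptr Word8 plus length pair, the method name unpackUTF8 and the types Bytes and Chars are hypothetical, and the encoder and decoder are deliberately simple (the decoder does not reject overlong encodings or surrogates, and the encoder does not reject surrogates).

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Encode a list of code points as UTF-8 bytes (1 to 4 bytes each).
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeChar
  where
    encodeChar c
      | n < 0x80    = [fromIntegral n]
      | n < 0x800   = [0xC0 .|. lead 6, cont 0]
      | n < 0x10000 = [0xE0 .|. lead 12, cont 6, cont 0]
      | otherwise   = [0xF0 .|. lead 18, cont 12, cont 6, cont 0]
      where
        n      = ord c
        lead s = fromIntegral (n `shiftR` s)
        cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Decode UTF-8 bytes; Nothing on truncated or structurally malformed input.
decodeUtf8 :: [Word8] -> Maybe String
decodeUtf8 [] = Just []
decodeUtf8 (b : bs)
  | b < 0x80           = (chr (fromIntegral b) :) <$> decodeUtf8 bs
  | b .&. 0xE0 == 0xC0 = multi 1 (b .&. 0x1F)
  | b .&. 0xF0 == 0xE0 = multi 2 (b .&. 0x0F)
  | b .&. 0xF8 == 0xF0 = multi 3 (b .&. 0x07)
  | otherwise          = Nothing
  where
    -- Read k continuation bytes and combine them with the lead bits.
    multi k leadBits =
      let (conts, rest) = splitAt k bs
          wellFormed = length conts == k && all isCont conts
          n = foldl step (fromIntegral leadBits) conts
      in if wellFormed && n <= 0x10FFFF
           then (chr n :) <$> decodeUtf8 rest
           else Nothing
    isCont x = x .&. 0xC0 == 0x80
    step :: Int -> Word8 -> Int
    step acc x = (acc `shiftL` 6) .|. fromIntegral (x .&. 0x3F)

-- Both methods live in one class; each has a default in terms of the other.
class IsString a where
  fromString :: String -> a
  fromString = unpackUTF8 . encodeUtf8

  -- The caller is expected to pass well-formed UTF-8.
  unpackUTF8 :: [Word8] -> a
  unpackUTF8 = maybe (error "unpackUTF8: malformed input") fromString . decodeUtf8
  {-# MINIMAL fromString | unpackUTF8 #-}

-- A byte-oriented client only defines unpackUTF8.
newtype Bytes = Bytes [Word8] deriving (Eq, Show)
instance IsString Bytes where
  unpackUTF8 = Bytes

-- A character-oriented client only defines fromString.
newtype Chars = Chars String deriving (Eq, Show)
instance IsString Chars where
  fromString = Chars
```

Existing instances that only define fromString keep compiling, which is the backwards-compatibility point, while byte-oriented types can skip the intermediate String by defining unpackUTF8 directly.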