Re: What is a punctuation character?

2012-03-20 Thread Gabriel Dos Reis
On Tue, Mar 20, 2012 at 5:37 PM, Iavor Diatchki
 wrote:
> Hello,
>
> So I looked at what GHC does with Unicode and to me it seems quite
> reasonable:
>
> * The alphabet is Unicode code points, so a valid Haskell program is
> simply a list of those.
> * Combining characters are not allowed in identifiers, so no need for
> complex normalization rules: programs should always use the "short"
> version of a character, or be rejected.
> * Combining characters may appear in string literals, and there they
> are left "as is" without any modification (so some string literals may
> be longer than what's displayed in a text editor.)
>
> Perhaps this is simply what the report already states (I haven't
> checked, for which I apologize) but, if not, perhaps we should clarify
> things.
>
> -Iavor
> PS:  I don't think that there is any need to specify a particular
> representation for the unicode code-points (e.g., utf-8 etc.) in the
> language standard.

Thanks Iavor.

If the report intended to talk about code points only (and indeed ruling
out normalization suggests that), then the Report needs to be
clarified.  As you know, there is a distinction between a Unicode code
point and a Unicode character

http://www.unicode.org/versions/Unicode6.0.0/ch02.pdf#G25564

Until I sent my original query, I had been reading the Report as meaning
Unicode characters (as the grammar seemed to suggest), but now it is
clear to me that only code points were intended.  That seemed to be
confirmed by your investigation of the GHC code base.

-- Gaby



Re: What is a punctuation character?

2012-03-20 Thread Iavor Diatchki
Hello,

So I looked at what GHC does with Unicode and to me it seems quite
reasonable:

* The alphabet is Unicode code points, so a valid Haskell program is
simply a list of those.
* Combining characters are not allowed in identifiers, so no need for
complex normalization rules: programs should always use the "short"
version of a character, or be rejected (see the sketch below this list).
* Combining characters may appear in string literals, and there they
are left "as is" without any modification (so some string literals may
be longer than what's displayed in a text editor.)
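
(A minimal sketch of the identifier rule above, using Data.Char from
base; it only illustrates the policy and is not GHC's actual lexer:)

  import Data.Char (isMark)

  -- An identifier is rejected if any of its characters is a combining
  -- mark (general categories Mn, Mc, Me); accented letters must be
  -- written in their precomposed ("short") form.
  hasCombiningMark :: String -> Bool
  hasCombiningMark = any isMark

  -- hasCombiningMark "e\x0301" == True    (e + COMBINING ACUTE ACCENT)
  -- hasCombiningMark "\x00E9"  == False   (precomposed e with acute)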

Perhaps this is simply what the report already states (I haven't
checked, for which I apologize) but, if not, perhaps we should clarify
things.

-Iavor
PS:  I don't think that there is any need to specify a particular
representation for the unicode code-points (e.g., utf-8 etc.) in the
language standard.





On Fri, Mar 16, 2012 at 6:23 PM, Iavor Diatchki
 wrote:
> Hello,
> I am also not an expert but I got curious and did a bit of Wikipedia
> reading.  Based on what I understood, here are two (related) questions
> that it might be nice to clarify in a future version of the report:
>
> 1. What is the alphabet used by the grammar in the Haskell report?  My
> understanding is that the intention is that the alphabet is unicode
> codepoints (sometimes referred to as unicode characters).  There is no
> way to refer to specific code-points by escaping as in Java (the link
> that Gaby shared), you just have to write the code-points directly
> (and there are plenty of encodings for doing that, e.g. UTF-8 etc.)
>
> 2. Do we respect "unicode equivalence"
> (http://en.wikipedia.org/wiki/Canonical_equivalence) in Haskell source
> code?  The issue here is that, apparently, some sequences of unicode
> code points/characters are supposed to be morally the same.  For
> example, it would appear that there are two different ways to write
> the Spanish letter ñ: it has its own number, but it can also be made
> by writing "n" followed by a modifier to put the wavy sign on top.
>
> I would guess that implementing "unicode equivalence"  would not be
> too hard---supposedly the unicode standard specifies a "text
> normalization procedure".  However, this would complicate the report
> specification, because now the alphabet becomes not just unicode
> code-points, but equivalence classes of code points.
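
(To make the equivalence issue concrete, a small sketch using only
standard character escapes: the two spellings of ñ are distinct
code-point sequences, so without normalization they are different
strings.)

  precomposed, decomposed :: String
  precomposed = "\x00F1"    -- LATIN SMALL LETTER N WITH TILDE, one code point
  decomposed  = "n\x0303"   -- 'n' followed by COMBINING TILDE, two code points

  -- Both render as ñ, but without normalization
  -- precomposed == decomposed evaluates to False.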
>
> Thoughts?
>
> -Iavor
>
>
>
>
>
>
> On Fri, Mar 16, 2012 at 4:49 PM, Ian Lynagh  wrote:
>>
>> Hi Gaby,
>>
>> On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:
>>>
>>> OK, thanks!  I guess a takeaway from this discussion is that what counts
>>> as a punctuation character is far less well defined than it appears...
>>
>> I'm not really sure what you're asking. Haskell's uniSymbol includes all
>> Unicode characters (should that be codepoints? I'm not a Unicode expert)
>> in the punctuation category; I'm not sure what the best reference is,
>> but e.g. table 12 in
>>    http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values
>> lists a number of Px categories, and a meta-category P "Punctuation".
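
(In GHC terms that membership test might look like the following
sketch, using the general-category constructors from base's Data.Char;
it covers only the punctuation part of uniSymbol, not the full
production:)

  import Data.Char (generalCategory, GeneralCategory(..))

  -- True iff the character's Unicode general category is one of the
  -- P* ("Punctuation") classes from table 12 of TR44.
  isUnicodePunctuation :: Char -> Bool
  isUnicodePunctuation c = generalCategory c `elem`
    [ ConnectorPunctuation, DashPunctuation, OpenPunctuation
    , ClosePunctuation, InitialQuote, FinalQuote, OtherPunctuation ]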
>>
>>
>> Thanks
>> Ian
>>
>>



Re: String != [Char]

2012-03-20 Thread Tillmann Rendel

Hi,

Thomas Schilling wrote:

I agree that the language standard should not prescribe the
implementation of a Text datatype.  It should instead require an
abstract data type (which may just be a newtype wrapper for [Char] in
some implementations) and a (minimal) set of operations on it.

Regarding the type class for converting to and from that type, there
is a perhaps more complicated question: The current fromString method
uses String as the source type which causes unnecessary overhead.


Is this still a problem if String were replaced by an 
implementation-dependent newtype? Presumably, GHC would use a more 
efficient representation behind the newtype, so the following would be 
efficient in practice (or not?)


  newtype String
    = ...

  class IsString a where
    fromString :: String -> a

The standard could even prescribe that an instance for [Char] exists:

  explode :: String -> [Char]
  explode = ...

  instance IsString [Char] where
    fromString = explode
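
(For instance, a sketch of one possible implementation choice, assuming
the text package's Data.Text as the packed representation; the names
are illustrative only:)

  {-# LANGUAGE FlexibleInstances #-}
  import Prelude hiding (String)
  import qualified Data.Text as T

  -- One possible representation behind the abstract type; the standard
  -- would expose only the newtype and its operations.
  newtype String = String T.Text

  explode :: String -> [Char]
  explode (String t) = T.unpack t

  class IsString a where
    fromString :: String -> a

  instance IsString [Char] where
    fromString = explode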

Tillmann



Re: String != [Char]

2012-03-20 Thread Johan Tibell
On Tue, Mar 20, 2012 at 2:25 AM, Simon Marlow  wrote:
> Is there a reason not to put all these methods in the IsString class, with 
> appropriate default definitions?  You would need a UTF-8 encoder (& decoder) 
> of course, but it would reduce the burden on clients and improve backwards 
> compatibility.

That sounds fine to me. I'm leaning towards only having
unpackUTF8String (in addition to the existing method), as in the
absence of proper byte literals we would have literals which change
types, depending on which bytes they contain*. Ugh!

* Is it even possible to create non-UTF8 literals without using
escaped sequences?

-- Johan



RE: String != [Char]

2012-03-20 Thread Simon Marlow
> On Mon, Mar 19, 2012 at 9:02 AM, Christian Siefkes 
> wrote:
> > On 03/19/2012 04:53 PM, Johan Tibell wrote:
> >> I've been thinking about this question as well. How about
> >>
> >> class IsString s where
> >>     unpackCString :: Ptr Word8 -> CSize -> s
> >
> > What's the Ptr Word8 supposed to contain? A UTF-8 encoded string?
> 
> Yes.
> 
> We could make a distinction between byte and Unicode literals and have:
> 
> class IsBytes a where
>     unpackBytes :: Ptr Word8 -> Int -> a
> 
> class IsText a where
>     unpackText :: Ptr Word8 -> Int -> a
> 
> In the latter, the caller guarantees that the passed-in pointer points to
> well-formed UTF-8 data.

Is there a reason not to put all these methods in the IsString class, with 
appropriate default definitions?  You would need a UTF-8 encoder (& decoder) of 
course, but it would reduce the burden on clients and improve backwards 
compatibility.
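
(A sketch of what that could look like; the extra method name and the
UTF-8 decoder below are purely illustrative, not an agreed API:)

  import Foreign.Ptr (Ptr)
  import Data.Word (Word8)

  class IsString s where
    fromString :: String -> s

    -- New method for packed literals: a pointer to the literal's bytes
    -- (guaranteed to be well-formed UTF-8) and their length.  The
    -- default keeps existing instances working unchanged.
    unpackUTF8String :: Ptr Word8 -> Int -> s
    unpackUTF8String p n = fromString (decodeUTF8 p n)

  -- Hypothetical helper: decode n bytes of UTF-8 starting at p into a
  -- [Char]; a real implementation would live in base.
  decodeUTF8 :: Ptr Word8 -> Int -> String
  decodeUTF8 = error "decodeUTF8: elided in this sketch"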

Cheers,
Simon


