Re: What is a punctuation character?

2012-03-20 Thread Iavor Diatchki
Hello,

So I looked at what GHC does with Unicode, and to me it seems quite
reasonable:

* The alphabet is Unicode code points, so a valid Haskell program is
simply a list of those.
* Combining characters are not allowed in identifiers, so no need for
complex normalization rules: programs should always use the short
version of a character, or be rejected.
* Combining characters may appear in string literals, and there they
are left as is without any modification (so some string literals may
be longer than what's displayed in a text editor.)
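
A minimal sketch of the string-literal point above, assuming GHC and only
the base library. The names precomposed and decomposed are made up for
illustration, and the sketch does not test what GHC does with combining
characters in identifiers:

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- "ñ" written as the single code point U+00F1.
precomposed :: String
precomposed = "\x00F1"

-- "ñ" written as 'n' followed by U+0303 COMBINING TILDE.
decomposed :: String
decomposed = "n\x0303"

main :: IO ()
main = do
  -- The literal is kept code point by code point, so the lengths differ
  -- even though both strings usually render as one glyph.
  print (length precomposed)
  print (length decomposed)
  -- No normalization is applied, so the two strings are not equal.
  print (precomposed == decomposed)
  -- The second code point of the decomposed form is a combining mark.
  print (map generalCategory decomposed)
```

Run as written, this should report lengths 1 and 2, an inequality, and the
categories LowercaseLetter and NonSpacingMark.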

Perhaps this is simply what the report already states (I haven't
checked, for which I apologize) but, if not, perhaps we should clarify
things.

-Iavor
PS:  I don't think that there is any need to specify a particular
representation for the unicode code-points (e.g., utf-8 etc.) in the
language standard.





On Fri, Mar 16, 2012 at 6:23 PM, Iavor Diatchki
iavor.diatc...@gmail.com wrote:
 Hello,
 I am also not an expert but I got curious and did a bit of Wikipedia
 reading.  Based on what I understood, here are two (related) questions
 that it might be nice to clarify in a future version of the report:

 1. What is the alphabet used by the grammar in the Haskell report?  My
 understanding is that the intention is that the alphabet is unicode
 codepoints (sometimes referred to as unicode characters).  There is no
 way to refer to specific code-points by escaping as in Java (the link
 that Gaby shared), you just have to write the code-points directly
 (and there are plenty of encodings for doing that, e.g. UTF-8 etc.)

 2. Do we respect unicode equivalence
 (http://en.wikipedia.org/wiki/Canonical_equivalence) in Haskell source
 code?  The issue here is that, apparently, some sequences of unicode
 code points/characters are supposed to be morally the same.  For
 example, it would appear that there are two different ways to write
 the Spanish letter ñ: it has its own number, but it can also be made
 by writing n followed by a modifier to put the wavy sign on top.

 I would guess that implementing unicode equivalence  would not be
 too hard---supposedly the unicode standard specifies a text
 normalization procedure.  However, this would complicate the report
 specification, because now the alphabet becomes not just unicode
 code-points, but equivalence classes of code points.

 Thoughts?

 -Iavor






 On Fri, Mar 16, 2012 at 4:49 PM, Ian Lynagh ig...@earth.li wrote:

 Hi Gaby,

 On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:

 OK, thanks!  I guess a take away from this discussion is that what
 is a punctuation is far less well defined than it appears...

 I'm not really sure what you're asking. Haskell's uniSymbol includes all
 Unicode characters (should that be codepoints? I'm not a Unicode expert)
 in the punctuation category; I'm not sure what the best reference is,
 but e.g. table 12 in
    http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values
 lists a number of Px categories, and a meta-category P Punctuation.


 Thanks
 Ian


 ___
 Haskell-prime mailing list
 Haskell-prime@haskell.org
 http://www.haskell.org/mailman/listinfo/haskell-prime

___
Haskell-prime mailing list
Haskell-prime@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-prime


Re: What is a punctuation character?

2012-03-20 Thread Gabriel Dos Reis
On Tue, Mar 20, 2012 at 5:37 PM, Iavor Diatchki
iavor.diatc...@gmail.com wrote:
 Hello,

 So I looked at what GHC does with Unicode, and to me it seems quite
 reasonable:

 * The alphabet is Unicode code points, so a valid Haskell program is
 simply a list of those.
 * Combining characters are not allowed in identifiers, so no need for
 complex normalization rules: programs should always use the short
 version of a character, or be rejected.
 * Combining characters may appear in string literals, and there they
 are left as is without any modification (so some string literals may
 be longer than what's displayed in a text editor.)

 Perhaps this is simply what the report already states (I haven't
 checked, for which I apologize) but, if not, perhaps we should clarify
 things.

 -Iavor
 PS:  I don't think that there is any need to specify a particular
 representation for the unicode code-points (e.g., utf-8 etc.) in the
 language standard.

Thanks Iavor.

If the report intended to talk about code points only (and indeed ruling
out normalization suggests that), then the Report needs to be
clarified.  As you know, there is a distinction between a Unicode code
point and a Unicode character:

http://www.unicode.org/versions/Unicode6.0.0/ch02.pdf#G25564

Until I sent my original query, I had been reading the Report as meaning
Unicode characters (as the grammar seemed to suggest), but now it is
clear to me that only code points were intended.  That seemed to be
confirmed by your investigation of the GHC code base.

-- Gaby



RE: What is a punctuation character?

2012-03-19 Thread Simon Marlow
 On Fri, Mar 16, 2012 at 6:49 PM, Ian Lynagh ig...@earth.li wrote:
  Hi Gaby,
 
  On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:
 
  OK, thanks!  I guess a take away from this discussion is that what is
  a punctuation is far less well defined than it appears...
 
  I'm not really sure what you're asking. Haskell's uniSymbol includes
  all Unicode characters (should that be codepoints? I'm not a Unicode
  expert) in the punctuation category; I'm not sure what the best
  reference is, but e.g. table 12 in
     http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values
  lists a number of Px categories, and a meta-category P Punctuation.
 
 
  Thanks
  Ian
 
 
 Hi Ian,
 
 I guess what I am asking was partly summarized in Iavor's message.
 
 For me, the issue started with bullet number 4 in section 1.1
 
  http://www.haskell.org/onlinereport/intro.html#sect1.1
 
 which states that:
 
The lexical structure captures the concrete representation
of Haskell programs in text files.
 
 That combined with the opening section 2.1 (e.g. example of terminal
 syntax) and the fact that the grammar routinely describes two families of
 non-terminals, ascXXX (for ASCII characters) and uniXXX (for Unicode
 characters), suggested that the concrete syntax of Haskell programs in text
 files is in the ASCII charset.  Note this does not conflict with the general
 statement that Haskell programs use the Unicode character set because the uniXXX could
 use the ASCII charset to introduce Unicode characters -- this is not
 uncommon practice for programming languages using Unicode characters; see
 the link I gave earlier.
 
 However, if I understand Malcolm's message correctly, this is not the
 case.
 Contrary to what I quoted above, Chapter 2 does NOT specify the concrete
 representation of Haskell programs in text files.  What it does is to
 capture the structure of what is obtained from interpreting, *in some
 unspecified encoding or unspecified alphabet*,  the concrete
 representation of Haskell programs in text files.  This conclusion is
 unfortunate, but I believe it is correct.
 Since the encoding or the alphabet is unspecified, it is no longer
 necessarily the case that two Haskell implementations would agree on the
 same lexical interpretation when presented with the same exact text file
 containing  a Haskell program.
 
 In its current form, you are correct that the Report should say codepoint
 instead of characters.
 
 I join Iavor's request in clarifying the alphabet used in the grammar.

The report gives meaning to a sequence of codepoints only; it says nothing
about how that sequence of codepoints is represented as a string of bytes in a
file, nor does it say anything about what those files are called, or even
whether there are files at all.
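
To make the codepoints-versus-bytes distinction concrete, here is a small
sketch using only the base library. The functions utf8 and utf16be are
hand-written for illustration (they are not library calls), and they show
that one code point corresponds to different byte strings under different
encodings:

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one code point as UTF-8 bytes.
-- (Illustrative only: surrogate code points are not rejected here.)
utf8 :: Char -> [Word8]
utf8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading bits that go into the first byte.
    hi s = fromIntegral (n `shiftR` s)
    -- A continuation byte carrying six payload bits.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Encode one code point as big-endian UTF-16 bytes.
utf16be :: Char -> [Word8]
utf16be c
  | n < 0x10000 = unit n
  | otherwise   = unit (0xD800 + (m `shiftR` 10)) ++ unit (0xDC00 + (m .&. 0x3FF))
  where
    n = ord c
    m = n - 0x10000
    -- Split one 16-bit code unit into two bytes.
    unit u = [fromIntegral (u `shiftR` 8), fromIntegral (u .&. 0xFF)]

main :: IO ()
main = do
  print (utf8 '\x00F1')
  print (utf16be '\x00F1')
```

Both results describe the same code point, U+00F1; which byte string appears
in a source file is the part the Report leaves to the implementation.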

Perhaps some clarification is in order in a future revision, and we should use 
the correct terminology where appropriate.  We should also clarify that 
punctuation means exactly the Punctuation class.
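
As a sketch of what "exactly the Punctuation class" could mean in code, the
predicate below checks for the seven P* general categories and compares the
result with isPunctuation from Data.Char in GHC's base library. The name
inPunctuationClass is made up for illustration:

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isPunctuation)

-- True for the seven punctuation categories: Pc, Pd, Ps, Pe, Pi, Pf, Po.
inPunctuationClass :: Char -> Bool
inPunctuationClass c =
  generalCategory c `elem`
    [ ConnectorPunctuation
    , DashPunctuation
    , OpenPunctuation
    , ClosePunctuation
    , InitialQuote
    , FinalQuote
    , OtherPunctuation
    ]

main :: IO ()
main = do
  -- '!' , '_' and '(' are punctuation; '+' is a math symbol; 'a' is a letter.
  print (map inPunctuationClass "!_(+a")
  -- On this sample range the hand-written test agrees with isPunctuation.
  print (all (\c -> inPunctuationClass c == isPunctuation c) ['\0' .. '\x2FFF'])
```

Note that characters such as '+' and '$' fall under the Symbol categories,
not Punctuation, which is why uniSymbol mentions both.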

With regards to normalisation and equivalence, my understanding is that Haskell 
does not support either: two identifiers are equal if and only if they are 
represented by the same sequence of codepoints.  Again, we could add a 
clarifying sentence to the report.

Cheers,
Simon





Re: What is a punctuation character?

2012-03-19 Thread Gabriel Dos Reis
On Mon, Mar 19, 2012 at 4:34 AM, Simon Marlow simon...@microsoft.com wrote:
 On Fri, Mar 16, 2012 at 6:49 PM, Ian Lynagh ig...@earth.li wrote:
  Hi Gaby,
 
  On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:
 
  OK, thanks!  I guess a take away from this discussion is that what is
  a punctuation is far less well defined than it appears...
 
  I'm not really sure what you're asking. Haskell's uniSymbol includes
  all Unicode characters (should that be codepoints? I'm not a Unicode
  expert) in the punctuation category; I'm not sure what the best
  reference is, but e.g. table 12 in
     http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values
  lists a number of Px categories, and a meta-category P Punctuation.
 
 
  Thanks
  Ian
 

 Hi Ian,

 I guess what I am asking was partly summarized in Iavor's message.

 For me, the issue started with bullet number 4 in section 1.1

      http://www.haskell.org/onlinereport/intro.html#sect1.1

 which states that:

        The lexical structure captures the concrete representation
        of Haskell programs in text files.

 That combined with the opening section 2.1 (e.g. example of terminal
 syntax) and the fact that the grammar routinely describes two families of
 non-terminals, ascXXX (for ASCII characters) and uniXXX (for Unicode
 characters), suggested that the concrete syntax of Haskell programs in text
 files is in the ASCII charset.  Note this does not conflict with the general
 statement that Haskell programs use the Unicode character set because the uniXXX could
 use the ASCII charset to introduce Unicode characters -- this is not
 uncommon practice for programming languages using Unicode characters; see
 the link I gave earlier.

 However, if I understand Malcolm's message correctly, this is not the
 case.
 Contrary to what I quoted above, Chapter 2 does NOT specify the concrete
 representation of Haskell programs in text files.  What it does is to
 capture the structure of what is obtained from interpreting, *in some
 unspecified encoding or unspecified alphabet*,  the concrete
 representation of Haskell programs in text files.  This conclusion is
 unfortunate, but I believe it is correct.
 Since the encoding or the alphabet is unspecified, it is no longer
 necessarily the case that two Haskell implementations would agree on the
 same lexical interpretation when presented with the same exact text file
 containing  a Haskell program.

 In its current form, you are correct that the Report should say codepoint
 instead of characters.

 I join Iavor's request in clarifying the alphabet used in the grammar.

 The report gives meaning to a sequence of codepoints only, it says nothing 
 about how that sequence of codepoints is represented as a string of bytes in 
 a file, nor does it say anything about what those files are called, or even 
 whether there are files at all.

Thanks, Simon.

The fact that the Report is silent about the encoding used to
represent concrete Haskell programs in text files adds
a certain level of non-portability (and confusion).  I found
last night that a proposal has been made to add some
support for encoding specification:

http://hackage.haskell.org/trac/haskell-prime/wiki/UnicodeInHaskellSource

I believe that is a good start.  What are the odds of it being considered
for Haskell 2012?  I suspect the pragma proposal works only if something
is said about the position of that pragma in the source file (e.g. it
must be on the first line, or within the first N bytes of the source file);
otherwise we have an infinite descent.
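
One way to picture the positional constraint is a sniffer that only looks at
the first line and only accepts ASCII bytes there, so it can run before any
encoding is known. This is a sketch under loud assumptions: the pragma
spelling {-# ENCODING name #-} is invented here and is not taken from the
proposal, and sniffEncoding and asciiBytes are hypothetical names:

```haskell
import Control.Monad (guard)
import Data.Char (chr, isSpace, ord)
import Data.List (stripPrefix)
import Data.Word (Word8)

-- Look for a hypothetical "{-# ENCODING name #-}" pragma on the first line.
-- The first line must be pure ASCII, so it can be read before the real
-- encoding of the rest of the file is known.
sniffEncoding :: [Word8] -> Maybe String
sniffEncoding bytes = do
  let firstLine = takeWhile (/= 10) bytes -- everything before the first LF
  guard (all (< 0x80) firstLine)
  rest <- stripPrefix "{-# ENCODING " (map (chr . fromIntegral) firstLine)
  let (name, close) = break isSpace rest
  guard (not (null name) && close == " #-}")
  pure name

-- Helper for the demo: view an ASCII-only String as bytes.
asciiBytes :: String -> [Word8]
asciiBytes = map (fromIntegral . ord)

main :: IO ()
main = do
  print (sniffEncoding (asciiBytes "{-# ENCODING utf-8 #-}\nmodule Main where"))
  print (sniffEncoding (asciiBytes "module Main where"))
```

The design choice is that the declaration itself is restricted to a fixed
position and an ASCII-compatible byte range, which is what stops the regress.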



 Perhaps some clarification is in order in a future revision, and we should 
 use the correct terminology where appropriate.  We should also clarify that 
 punctuation means exactly the Punctuation class.

That would be great.  Do you have any comment about the
UnicodeInHaskellSource proposal?

 With regards to normalisation and equivalence, my understanding is that 
 Haskell does not support either: two identifiers are equal if and only if 
 they are represented by the same sequence of codepoints.  Again, we could add 
 a clarifying sentence to the report.


Ugh.

Writing a parser for Haskell was an interesting exercise :-)

-- Gaby



Re: What is a punctuation character?

2012-03-19 Thread Brandon Allbery
On Mon, Mar 19, 2012 at 05:56, Gabriel Dos Reis 
g...@integrable-solutions.net wrote:

 The fact that the Report is silent about encoding used to
 represent concrete Haskell programs in text files adds
 a certain level of non-portability (and confusion.)  I found


Specifying the encoding can *also* limit portability, if you specify an
encoding that is not widely supported on some target platform.  (Please try
to remember that the universe is not composed solely of Windows and Linux.
 The fact that those are the only ones you care about is not relevant to
the standard; nor is the list of platforms that GHC or any other
implementation supports.)

Encoding does not belong in the language standard; it is an aspect of
implementing the language standard on a given platform.

-- 
brandon s allbery  allber...@gmail.com
wandering unix systems administrator (available) (412) 475-9364 vm/sms


Re: What is a punctuation character?

2012-03-19 Thread Gabriel Dos Reis
On Mon, Mar 19, 2012 at 5:36 AM, Brandon Allbery allber...@gmail.com wrote:
 On Mon, Mar 19, 2012 at 05:56, Gabriel Dos Reis
 g...@integrable-solutions.net wrote:

 The fact that the Report is silent about encoding used to
 represent concrete Haskell programs in text files adds
 a certain level of non-portability (and confusion.)  I found


 Specifying the encoding can *also* limit portability, if you specify an
 encoding that is not widely supported on some target platform.

That is why I find the pragma suggestion attractive.

-- Gaby



Re: What is a punctuation character?

2012-03-19 Thread Colin Paul Adams


Iavor> report?  My understanding is that the intention is that the
Iavor> alphabet is unicode codepoints (sometimes referred to as
Iavor> unicode characters).

Unicode characters are not the same as Unicode codepoints. What we want
is Unicode characters.

We don't want to be able to write a Unicode codepoint, as that would
permit writing half of a surrogate pair, which is malformed Unicode.
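
A small sketch of the surrogate concern, assuming GHC and the base library.
In GHC a Char can hold any code point up to U+10FFFF, including a lone
surrogate, so a check has to be written explicitly. The name isScalarValue
is made up for illustration:

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- True for code points that are not surrogates (U+D800 to U+DFFF).
isScalarValue :: Char -> Bool
isScalarValue c = generalCategory c /= Surrogate

-- Half of a surrogate pair, written as a code point on its own.
loneSurrogate :: Char
loneSurrogate = '\xD800'

main :: IO ()
main = do
  print (generalCategory loneSurrogate)
  print (map isScalarValue ['a', '\x00F1', loneSurrogate])
```

A definition of the alphabet that wants to rule out malformed input could
filter with a predicate along these lines.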
-- 
Colin Adams
Preston Lancashire
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments



Re: What is a punctuation character?

2012-03-17 Thread Gabriel Dos Reis
On Fri, Mar 16, 2012 at 6:49 PM, Ian Lynagh ig...@earth.li wrote:
 Hi Gaby,

 On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:

 OK, thanks!  I guess a take away from this discussion is that what
 is a punctuation is far less well defined than it appears...

 I'm not really sure what you're asking. Haskell's uniSymbol includes all
 Unicode characters (should that be codepoints? I'm not a Unicode expert)
 in the punctuation category; I'm not sure what the best reference is,
 but e.g. table 12 in
    http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values
 lists a number of Px categories, and a meta-category P Punctuation.


 Thanks
 Ian


Hi Ian,

I guess what I am asking was partly summarized in Iavor's message.

For me, the issue started with bullet number 4 in section 1.1

 http://www.haskell.org/onlinereport/intro.html#sect1.1

which states that:

   The lexical structure captures the concrete representation
   of Haskell programs in text files.

That combined with the opening section 2.1 (e.g. example of terminal syntax)
and the fact that the grammar routinely describes two families of
non-terminals, ascXXX (for ASCII characters) and uniXXX (for Unicode
characters), suggested that the concrete syntax of Haskell programs in
text files is in the ASCII charset.  Note this does not conflict with the
general statement that Haskell programs use the Unicode character set
because the uniXXX could use the ASCII charset to introduce Unicode
characters -- this is not uncommon practice for programming languages
using Unicode characters; see the link I gave earlier.

However, if I understand Malcolm's message correctly, this is not the case.
Contrary to what I quoted above, Chapter 2 does NOT specify the concrete
representation of Haskell programs in text files.  What it does is to capture
the structure of what is obtained from interpreting, *in some unspecified
encoding or unspecified alphabet*,  the concrete representation of Haskell
programs in text files.  This conclusion is unfortunate, but I believe
it is correct.
Since the encoding or the alphabet is unspecified, it is no longer necessarily
the case that two Haskell implementations would agree on the same lexical
interpretation when presented with the same exact text file containing
 a Haskell program.

In its current form, you are correct that the Report should say codepoint
instead of characters.

I join Iavor's request in clarifying the alphabet used in the grammar.

Thanks,

-- Gaby



Re: What is a punctuation character?

2012-03-16 Thread Gabriel Dos Reis
On Fri, Mar 16, 2012 at 1:18 PM, Brandon Allbery allber...@gmail.com wrote:
 On Fri, Mar 16, 2012 at 14:08, Gabriel Dos Reis
 g...@integrable-solutions.net wrote:

 The lexical structure chapter defines the non-terminal uniSymbol as

     uniSymbol ::= any Unicode symbol or punctuation

 There is a slight ambiguity here: is that description supposed to
 be parsed as:
   (a) Unicode (symbol or punctuation), or
   (b) (Unicode symbol) or punctuation?


 (a) and I thought the report specified that the language's lexemes are
 defined in terms of Unicode properties so (a) is the only meaningful
 interpretation.  (b) is not particularly meaningful, as your own question
 demonstrates.

It is not clear what "the language's lexemes are defined in terms of
Unicode properties" really means.  Why would you need ascSmall (and similar
ASCII character categories) when you already have uniSmall and associates?

It is not clear that (b) is all that "not particularly meaningful".
Have a look at the production symbol: it excludes double quote (") and
apostrophe (') from uniSymbol.
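
For illustration, reading (a) and the subtraction made by the symbol
production can be sketched with Data.Char. The excluded set below follows my
reading of the Haskell 2010 grammar (special, underscore, double quote and
apostrophe) and should be treated as an assumption; the function names are
made up, and the ascSymbol alternative is left out:

```haskell
import Data.Char (isPunctuation, isSymbol)

-- Reading (a): Unicode (symbol or punctuation).
isUniSymbol :: Char -> Bool
isUniSymbol c = isSymbol c || isPunctuation c

-- Characters subtracted from uniSymbol by the symbol production
-- (assumed: the `special` characters plus '_', '"' and '\'').
excluded :: String
excluded = "(),;[]`{}_\"'"

-- Only the uniSymbol branch of `symbol`.
isUniSymbolOperatorChar :: Char -> Bool
isUniSymbolOperatorChar c = isUniSymbol c && c `notElem` excluded

main :: IO ()
main = do
  -- Both quote characters are Unicode punctuation, hence the exclusion.
  print (map isUniSymbol "\"'")
  -- Quotes are rejected; U+2192 (rightwards arrow) and '+' are accepted.
  print (map isUniSymbolOperatorChar "\"'\x2192+")
```

Under reading (a) the explicit exclusion is needed precisely because the
double quote and apostrophe are themselves in the punctuation category.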

-- Gaby



Re: What is a punctuation character?

2012-03-16 Thread Brandon Allbery
On Fri, Mar 16, 2012 at 14:30, Gabriel Dos Reis 
g...@integrable-solutions.net wrote:

 It is not clear what "the language's lexemes are defined in terms of
 Unicode properties" really means.  Why would you need ascSmall (and similar
 ASCII character categories) when you already have uniSmall and associates?


I have to assume that is a leftover from an earlier version of the report,
because it is indeed already included.

See in section 2.1:

    Haskell uses the Unicode [11] character set. However, source programs
    are currently biased toward the ASCII character set used in earlier
    versions of Haskell.

    [11] http://www.haskell.org/onlinereport/haskell.html#$unicode

I understand this to indicate that Unicode character classes are intended,
and it does indeed hint that references to ASCII are references to older
versions of the language (and should probably be considered fossils, as
ASCII itself is; the American Standard Code for Information Interchange was
obsoleted by ISO 8859, and modern references to ASCII usually should be
taken to mean ISO 8859/1).


 It is not clear that (b) is all that "not particularly meaningful".
 Have a look at the production symbol: it excludes double quote (") and
 apostrophe (') from uniSymbol.


The notion of symbol with certain lexicals that have other meanings *that
are specified elsewhere in the report* is not precise enough?  It may be
difficult to characterize things with your required precision, since every
general statement will necessarily have to carry part or potentially all of
the entire Report within it if it is not sufficient to use the statement's
context (as describing some part of the Report).

-- 
brandon s allbery  allber...@gmail.com
wandering unix systems administrator (available) (412) 475-9364 vm/sms


Re: What is a punctuation character?

2012-03-16 Thread Brandon Allbery
On Fri, Mar 16, 2012 at 15:20, Gabriel Dos Reis 
g...@integrable-solutions.net wrote:

 I believe this part has seen very little change from the Revised
 Haskell 98 Report.


I was in fact looking at the Haskell 98 report at the time.


 It is not clear that it is an unintended leftover.  Section 2.1 that


Nothing is ever clear.  This useless pedanticism being stipulated, there is
no purpose to a completely overlapping category unless it is intended to
relate to an earlier standard (say Haskell 1.4).

 Unicode support is clearly intended.  Also clearly, ASCII support is
 intended.
 However, the Report does not say what the concrete syntax of a Unicode
 character
 should be. (At least I have been unable to find it from the report.)


Maybe what needs to be pedantically specified is that the link to the
Unicode standard is intended to be inclusion of that standard by reference
(the [11] in the section I quoted is an endnote referencing the Unicode
standard) and not merely informational.  Or are you insisting we are not
precise enough unless we enumerate all the Unicode characters explicitly in
the Haskell standard?

-- 
brandon s allbery  allber...@gmail.com
wandering unix systems administrator (available) (412) 475-9364 vm/sms


Re: What is a punctuation character?

2012-03-16 Thread Gabriel Dos Reis
On Fri, Mar 16, 2012 at 3:22 PM, Brandon Allbery allber...@gmail.com wrote:
 On Fri, Mar 16, 2012 at 15:20, Gabriel Dos Reis
 g...@integrable-solutions.net wrote:

 I believe this part has seen very little change from the Revised
 Haskell 98 Report.


 I was in fact looking at the Haskell 98 report at the time.


 It is not clear that it is an unintended leftover.  Section 2.1 that


 Nothing is ever clear.  This useless pedanticism being stipulated, there is

I very much appreciate any clarification you have on the topic.  However, I
believe we do best when we leave phrases like "useless pedanticism"
or "pedantically" out.  They are rarely constructive and add no substance to an
otherwise informative discussion.  At best, they would distract us.

(In matters of programming language definition, pedanticism should be the
least of our worries -- and it probably should not come with a modifier
such as "useless"; we should probably wear it as a badge of honor.)

 no purpose to a completely overlapping category unless it is intended to
 relate to an earlier standard (say Haskell 1.4).

which in itself is not an unambiguous interpretation :-)


 Unicode support is clearly intended.  Also clearly, ASCII support is
 intended.
 However, the Report does not say what the concrete syntax of a Unicode
 character
 should be. (At least I have been unable to find it from the report.)


 Maybe what needs to be pedantically specified is that the link to the
 Unicode standard is intended to be inclusion of that standard by reference
 (the [11] in the section I quoted is an endnote referencing the Unicode
 standard) and not merely informational.  Or are you insisting we are not
 precise enough unless we enumerate all the Unicode characters explicitly in
 the Haskell standard?

Giving a link to the Unicode standard does not really help with the
original questions.  I know where to find the Unicode standard; that
wasn't the issue.

One of the underlying questions is: what is the concrete syntax of a
Unicode character in a Haskell program?  Note that Chapter 2 goes to
great pains to specify the ASCII concrete syntax.

To put things in perspective, have a look at this specification of
programs supposed to be written using Unicode characters:

   http://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html#jls-3.2

-- Gaby



Re: What is a punctuation character?

2012-03-16 Thread Malcolm Wallace
 no purpose to a completely overlapping category unless it is intended to
 relate to an earlier standard (say Haskell 1.4).

I believe all Haskell Reports, even since 1.0, have specified that the language 
uses Unicode.  If it helps to bring perspective to this discussion, it is my 
impression that the initial designers of Haskell did not know very much about 
Unicode, but wanted to avoid the trap of being stuck with ASCII-only, and so 
decided to reference whatever Unicode does, as the most obvious and 
unambiguous way of not having to think about (or specify) these lexical issues 
themselves.

 One of the underlying questions is: what is the concrete syntax of a
 Unicode character in a Haskell program?  Note that Chapter 2 goes to great
 pains to specify the ASCII concrete syntax.

In my view, the Haskell Report is deliberately agnostic on concrete syntax for 
Unicode, believing that to be outside the scope of a programming language 
standard, whilst entirely within the scope of the Unicode standards body.  
Seeing as there are (in practice) numerous concrete representations of Unicode 
(UTF-8 and other encodings), it is largely up to individual compiler 
implementations which encodings they support for (a) source text, and (b) 
input/output at runtime.

Regards,
Malcolm



Re: What is a punctuation character?

2012-03-16 Thread Gabriel Dos Reis
On Fri, Mar 16, 2012 at 6:00 PM, Malcolm Wallace malcolm.wall...@me.com wrote:
 no purpose to a completely overlapping category unless it is intended to
 relate to an earlier standard (say Haskell 1.4).

 I believe all Haskell Reports, even since 1.0, have specified that the 
 language uses Unicode.  If it helps to bring perspective to this 
 discussion, it is my impression that the initial designers of Haskell did not 
 know very much about Unicode, but wanted to avoid the trap of being stuck 
 with ASCII-only, and so decided to reference whatever Unicode does, as the 
 most obvious and unambiguous way of not having to think about (or specify) 
 these lexical issues themselves.


OK.

 One of the underlying questions is: what is the concrete syntax of a
 Unicode character in a Haskell program?  Note that Chapter 2 goes to great
 pains to specify the ASCII concrete syntax.

 In my view, the Haskell Report is deliberately agnostic on concrete syntax 
 for Unicode, believing that to be outside the scope of a programming language 
 standard, whilst entirely within the scope of the Unicode standards body.

The trouble is that the Unicode standards body believes that the concrete
syntax is entirely within the scope of the programming language definition
(or of any client using Unicode characters), whilst largely restricting
itself to talking about code points, which are more abstract.  So the trick
of referencing the Unicode standard is not satisfactory :-(

 Seeing as there are (in practice) numerous concrete representations of 
 Unicode (UTF-8 and other encodings), it is largely up to individual compiler 
 implementations which encodings they support for (a) source text, and (b) 
 input/output at runtime.

OK, thanks!  I guess a take away from this discussion is that what
is a punctuation is far less well defined than it appears...

A common practice (exemplified by the link I gave earlier) is to restrict the
concrete -syntax- of the input program to the ASCII charset, and use Unicode
escape sequences to include the entire Unicode charset.  It is common to use
\uNN or \UNN to introduce Unicode characters, but I suspect that is
out of the question for Haskell programs because it would clash with
lambda abstraction.
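
The clash can be seen in a short sketch (names made up for illustration).
Inside a string or character literal Haskell already has numeric escapes,
but outside a literal a backslash begins a lambda, so something shaped like
a Java-style escape is read as a lambda binding:

```haskell
-- Inside a literal, a numeric escape names a code point: "\x41" is "A".
capitalA :: String
capitalA = "\x41"

-- Outside a literal, `\u0041` is not an escape. It is a lambda whose
-- bound variable happens to be called u0041.
addOne :: Int -> Int
addOne = \u0041 -> u0041 + 1

main :: IO ()
main = do
  putStrLn capitalA
  print (addOne 2)
```

So a source-level \u escape would change the meaning of programs that are
valid today, which supports the suspicion voiced above.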

-- Gaby



Re: What is a punctuation character?

2012-03-16 Thread Iavor Diatchki
Hello,
I am also not an expert but I got curious and did a bit of Wikipedia
reading.  Based on what I understood, here are two (related) questions
that it might be nice to clarify in a future version of the report:

1. What is the alphabet used by the grammar in the Haskell report?  My
understanding is that the intention is that the alphabet is unicode
codepoints (sometimes referred to as unicode characters).  There is no
way to refer to specific code-points by escaping as in Java (the link
that Gaby shared), you just have to write the code-points directly
(and there are plenty of encodings for doing that, e.g. UTF-8 etc.)

2. Do we respect unicode equivalence
(http://en.wikipedia.org/wiki/Canonical_equivalence) in Haskell source
code?  The issue here is that, apparently, some sequences of unicode
code points/characters are supposed to be morally the same.  For
example, it would appear that there are two different ways to write
the Spanish letter ñ: it has its own number, but it can also be made
by writing n followed by a modifier to put the wavy sign on top.

I would guess that implementing unicode equivalence  would not be
too hard---supposedly the unicode standard specifies a text
normalization procedure.  However, this would complicate the report
specification, because now the alphabet becomes not just unicode
code-points, but equivalence classes of code points.
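
To illustrate what comparing "equivalence classes" would involve, here is a
toy sketch. It is not the Unicode normalization algorithm: the composition
table has two hand-written entries, and the names composePair, toyCompose
and equivalent are made up for illustration:

```haskell
-- A deliberately tiny composition table; the real data is far larger.
composePair :: Char -> Char -> Maybe Char
composePair 'n' '\x0303' = Just '\x00F1' -- n + combining tilde       -> ñ
composePair 'e' '\x0301' = Just '\x00E9' -- e + combining acute accent -> é
composePair _ _ = Nothing

-- Replace every base+mark pair found in the table by its single code point.
toyCompose :: String -> String
toyCompose (a : b : rest)
  | Just c <- composePair a b = c : toyCompose rest
toyCompose (a : rest) = a : toyCompose rest
toyCompose [] = []

-- Compare strings after bringing both to the composed form.
equivalent :: String -> String -> Bool
equivalent x y = toyCompose x == toyCompose y

main :: IO ()
main = do
  -- Code point by code point, the two spellings differ.
  print ("a\x00F1o" == "an\x0303o")
  -- After (toy) normalization they are treated as the same text.
  print (equivalent "a\x00F1o" "an\x0303o")
```

A lexer that respected equivalence would have to apply a step like
toyCompose, using the full Unicode data, before comparing identifiers.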

Thoughts?

-Iavor






On Fri, Mar 16, 2012 at 4:49 PM, Ian Lynagh ig...@earth.li wrote:

 Hi Gaby,

 On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:

 OK, thanks!  I guess a take away from this discussion is that what
 is a punctuation is far less well defined than it appears...

 I'm not really sure what you're asking. Haskell's uniSymbol includes all
 Unicode characters (should that be codepoints? I'm not a Unicode expert)
 in the punctuation category; I'm not sure what the best reference is,
 but e.g. table 12 in
    http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values
 lists a number of Px categories, and a meta-category P Punctuation.


 Thanks
 Ian


