Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

2015-05-07 Thread David Carlisle
On 7 May 2015 at 02:07, Ross Moore ross.mo...@mq.edu.au wrote:

 Hi David,

 ..



 No disagreement to this.


OK:-)


 In the current versions ^^^^d835^^^^dc00 is two characters in luatex
 and one character in xetex
 as the implementation detail that xetex's underlying storage is mostly
 UTF-16 is exposed.


 This seems to be premature of XeTeX then.
 It seems to be making an assumption on how those bytes
 will ultimately be used.



I don't think it's so much assuming that; it's just that choosing UTF-16
as an internal string format tends to lead that way. Unlike UTF-8, UTF-16
cannot represent all code points in the 0-10FFFF range.
Take Java(Script) notation, which defines numeric escapes as UTF-16
units rather than Unicode code points. If you do not make it an error,
you can encode an isolated surrogate such as \ud835, but there is no way
to store the two-character sequence U+D835 U+DC00: \ud835\udc00 is the
single character U+1D400. So you can only store such a character sequence
if you store each text block as a sequence of separate strings, keeping
the unpaired surrogates apart (\ud835, \udc00), which is a lot of effort
to support input that should never appear.
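
The storage point can be illustrated with a small Python sketch. This is not XeTeX code; the variable names are made up for illustration, and Python's `surrogatepass` error handler is used only to force lone surrogates into UTF-16.

```python
# Code-point storage (one entry per code point): two unpaired
# surrogates stay two separate items.
as_code_points = [0xD835, 0xDC00]

# UTF-16 storage: each surrogate is written as one 16-bit unit ...
text = "".join(chr(c) for c in as_code_points)
utf16 = text.encode("utf-16-be", errors="surrogatepass")

# ... but reading the units back cannot tell "two lone surrogates"
# from "one surrogate pair", so the decoder returns one character.
round_tripped = utf16.decode("utf-16-be")

print(len(as_code_points))                      # 2
print(utf16.hex())                              # d835dc00
print([hex(ord(ch)) for ch in round_tripped])   # ['0x1d400']
```

Once the two units sit next to each other in a UTF-16 buffer, the information that they were entered separately is gone.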


 If it is
 not possible to prevent ^^^ or utf8 encoded surrogate pairs combining
 then it is better to
 prevent them being formed.


 Hmm.
 What if you have an entirely different purpose in mind for those bytes?
 You still need to be able to create them and do further processing with
 them.


luatex has a different mechanism for this: it allows utf8 encoding and ^^^
numeric references to access the first 256 slots _above_ 10FFFF:

quoting the luatex manual:

Output in byte-sized chunks can be achieved by using characters just
 outside of the valid Unicode range, starting at the value 1,114,112
 (0x110000). When the time comes to print a character c >= 1,114,112,
 LuaTeX will actually print the single byte corresponding to c minus
 1,114,112.


This allows explicit byte-level access to file writing (so you can write
binary data such as images)
 without having to second guess and invert the character encoding the
system uses to write characters to a file.
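
A rough Python model of the rule quoted from the manual, for illustration only. The function name `write_chars` is hypothetical, and writing ordinary characters as UTF-8 is an assumption of this sketch, not a statement about LuaTeX internals.

```python
BYTE_BASE = 0x110000  # 1,114,112: first value past the Unicode range


def write_chars(codes):
    """Values in BYTE_BASE..BYTE_BASE+255 become one raw byte each;
    ordinary code points are written as UTF-8 (assumed here)."""
    out = bytearray()
    for c in codes:
        if BYTE_BASE <= c < BYTE_BASE + 256:
            out.append(c - BYTE_BASE)
        else:
            out += chr(c).encode("utf-8")
    return bytes(out)


# 0xE9 written as a character gives two UTF-8 bytes; written through
# the byte slot it gives exactly one byte.
print(write_chars([0xE9]).hex())              # c3a9
print(write_chars([BYTE_BASE + 0xE9]).hex())  # e9
```

The byte slots make the written value independent of the output encoding, which is what binary data needs.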


David


--
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex


Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

2015-05-06 Thread Arthur Reutenauer
  While working on these bugs, we also discussed how surrogate
characters were handled in XeTeX.  Surrogate characters are the 2048
code points that are used in UTF-16 to encode characters with code
points above 65536: a pair of them makes up one Unicode character;
however they're not meant to be used in isolation, even though they have
code points like other characters (they're not just byte sequences).

  Right now, XeTeX allows isolated surrogate characters, and also
combines sequences such as ^^^^d835^^^^dc00 into one Unicode character.
We want to flag the former case but are not sure how: should we make the
characters invalid (with catcode 15)?  Or we could map them to the
standard unknown character (U+FFFD).  The latter case is more nasty
and should definitely be forbidden -- the ^^ notation should only be
used for proper characters (so instead of the above, the Unicode code
point of the resulting Unicode character should be used, in this case
^^^^^1d400).
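
For reference, the standard UTF-16 arithmetic behind the example can be sketched in Python. The function names are illustrative only.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as a pair")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(high, low):
    """Combine a high and a low surrogate into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print([hex(u) for u in to_surrogates(0x1D400)])  # ['0xd835', '0xdc00']
print(hex(from_surrogates(0xD835, 0xDC00)))      # 0x1d400
print(0xDFFF - 0xD800 + 1)                       # 2048
```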

  Any thoughts?

Best,

Arthur




Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

2015-05-06 Thread David Carlisle
On 6 May 2015 at 23:04, Arthur Reutenauer
arthur.reutena...@normalesup.org wrote:
   While working on these bugs, we also discussed how surrogate
 characters were handled in XeTeX.  Surrogate characters are the 2048
 code points that are used in UTF-16 to encode characters with code
 points above 65536: a pair of them makes up one Unicode character;
 however they're not meant to be used in isolation, even though they have
 code points like other characters (they're not just byte sequences).

   Right now, XeTeX allows isolated surrogate characters, and also
 combines sequences such as ^^^^d835^^^^dc00 into one Unicode character.
 We want to flag the former case but are not sure how: should we make the
 characters invalid (with catcode 15)?  Or we could map them to the
 standard unknown character (U+FFFD).  The latter case is more nasty
 and should definitely be forbidden -- the ^^ notation should only be
 used for proper characters (so instead of the above, the Unicode code
 point of the resulting Unicode character should be used, in this case
 ^^^^^1d400).

   Any thoughts?


A major difference between using catcode 15 and the engine's input
filter substituting
U+FFFD is that the former could be over-ridden at the macro layer.
Whether that's a good thing
or not depends a bit on what happens if a document puts the catcodes
back to (say) 12.

If you just get undefined characters and missing glyphs, then you get
what you ask for
and probably it should be allowed just because.  If the internals
can't reliably deal with an
unpaired surrogate (e.g. it crashes some font library API) then the
engine had better ensure
it doesn't easily happen, and FFFD is probably as good as anything.

If you do go for catcode 15, then (as suggested in the thread on
unicode-letters.def)
it could be set in the macro layer or the engine could initialise
these catcodes.
Doing it at the macro layer is probably more in the spirit of the
traditional catcode initialisation
which is very minimalist.

As you say, combining ^^^^d835^^^^dc00 into one token is just wrong,
and I think it should do (twice) whatever you decide to do for
unpaired surrogates.
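
A small Python sketch of the U+FFFD option, applied per code point so that a pair is handled "twice". This models the idea only; it is not how the XeTeX input filter is written, and the function names are hypothetical.

```python
REPLACEMENT = 0xFFFD


def is_surrogate(cp):
    return 0xD800 <= cp <= 0xDFFF


def filter_input(code_points):
    """Replace each surrogate on its own; never merge neighbours."""
    return [REPLACEMENT if is_surrogate(cp) else cp for cp in code_points]


print([hex(c) for c in filter_input([0x41, 0xD835, 0x42])])
# ['0x41', '0xfffd', '0x42']
print([hex(c) for c in filter_input([0xD835, 0xDC00])])
# ['0xfffd', '0xfffd']
print([hex(c) for c in filter_input([0x1D400])])
# ['0x1d400']
```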

David




Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

2015-05-06 Thread David Carlisle
 The character itself, as bytes that is, is not wrong and users should be able 
 to create these.
 But preferably through macros that ensure that they come correctly paired.

placing two character tokens representing a surrogate pair should not
though magically turn itself
into a single character. The UTF-8 or ^^^^ encoding should refer to
the unicode code point not
to the UTF-16 encoding.

In the current versions ^^^^d835^^^^dc00 is two characters in luatex
and one character in xetex
as the implementation detail that xetex's underlying storage is mostly
UTF-16 is exposed. If it is
not possible to prevent ^^^ or utf8 encoded surrogate pairs combining
then it is better to
prevent them being formed.

this is no different to XML where &#xd835;&#xdc00; always refers to
two (invalid) characters not
to &#x1d400;
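
The XML behaviour can be checked with Python's standard `xml.etree.ElementTree` parser. In this sketch the helper `parse_text` is hypothetical. The parser rejects the two surrogate references instead of combining them, while the single reference is accepted.

```python
import xml.etree.ElementTree as ET


def parse_text(xml_source):
    """Return the element text, or None if the parser rejects the input."""
    try:
        return ET.fromstring(xml_source).text
    except ET.ParseError:
        return None


pair_of_refs = parse_text("<t>&#xd835;&#xdc00;</t>")
single_ref = parse_text("<t>&#x1d400;</t>")

print(pair_of_refs)           # None
print(hex(ord(single_ref)))   # 0x1d400
```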

David




Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

2015-05-06 Thread Ross Moore
Hi Arthur,

On 07/05/2015, at 8:04, Arthur Reutenauer arthur.reutena...@normalesup.org 
wrote:

  While working on these bugs, we also discussed how surrogate
 characters were handled in XeTeX.  Surrogate characters are the 2048
 code points that are used in UTF-16 to encode characters with code
 points above 65536: a pair of them makes up one Unicode character;
 however they're not meant to be used in isolation, even though they have
 code points like other characters (they're not just byte sequences).
 
  Right now, XeTeX allows isolated surrogate characters, and also
 combines sequences such as ^^^^d835^^^^dc00 into one Unicode character.
 We want to flag the former case but are not sure how: should we make the
 characters invalid (with catcode 15)?  

That would definitely be wrong.
The character itself, as bytes that is, is not wrong and users should be able 
to create these.
But preferably through macros that ensure that they come correctly paired.

IMHO, this is a macro issue, not an engine issue.

The same kind of thing applies with combining accents and diacritics.
I've written macros that take an argument and follow it with a combining 
character.
This is useful for generating correct UTF8 bytes to put into XML packets, as 
needed for the XMP Metadata that is required in PDF files that must validate 
for ISO specifications.

Similar macros could be used to construct upper-plane characters from 
surrogates, given only the math style and Latin letter. For these, single 
surrogate characters will be needed in the macro definitions, with the ultimate 
matching pair to be determined algorithmically, probably using an \ifcase  
instance. Single characters thus need to be able to be input, so as to create 
the macro definition.
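
To make the idea concrete, here is a Python sketch (not TeX macros) of choosing an upper-plane character from a math style and a Latin capital, then listing its UTF-16 units. The table `STYLE_BASE` and both function names are illustrative. Only three contiguous capital-letter runs are listed, since alphabets with gaps would need extra cases.

```python
# First code point of some Mathematical Alphanumeric capital-letter runs.
STYLE_BASE = {
    "bold": 0x1D400,
    "italic": 0x1D434,
    "bold-italic": 0x1D468,
}


def math_capital(style, letter):
    """Code point of the styled capital for a single letter A..Z."""
    if len(letter) != 1 or not "A" <= letter <= "Z":
        raise ValueError("expected a single capital letter A..Z")
    return STYLE_BASE[style] + ord(letter) - ord("A")


def utf16_units(cp):
    """UTF-16 code units of one code point, as integers."""
    data = chr(cp).encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


cp = math_capital("bold", "A")
print(hex(cp))                            # 0x1d400
print([hex(u) for u in utf16_units(cp)])  # ['0xd835', '0xdc00']
```

The dictionary lookup plays the role that an \ifcase on the style would play in a macro.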

OK, a clever macro programmer can change the catcodes to become valid local to 
the macro definition. But that is really complicating things.


 Or we could map them to the
 standard unknown character (U+FFFD).  The latter case is more nasty
 and should definitely be forbidden -- the ^^ notation should only be
 used for proper characters (so instead of the above, the Unicode code
 point of the resulting Unicode character should be used, in this case
 ^^^^^1d400).

I disagree. 
The ^^ notation can be used in macros to create the required bytes, for writing 
out into a file other than the  .dvi  or .pdf  output.
pdfTeX (or other engine) then can cause that file to become embedded as a file 
object stream in the final PDF.


 
  Any thoughts?
 
Best,
 
Arthur


Hope this helps,

Ross






Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

2015-05-06 Thread Ross Moore
Hi David,

On 07/05/2015, at 9:26 AM, David Carlisle wrote:

 The character itself, as bytes that is, is not wrong and users should be 
 able to create these.
 But preferably through macros that ensure that they come correctly paired.
 
 placing two character tokens representing a surrogate pair should not
 though magically turn itself
 into a single character.

Agreed.
You don't know whether you want a single character until 
you know what kind of output is being generated.
That need not be known on input.

 The UTF-8 or ^^^^ encoding should refer to
 the unicode code point not
 to the UTF-16 encoding,

No disagreement to this.

 
 In the current versions ^^^^d835^^^^dc00 is two characters in luatex
 and one character in xetex
 as the implementation detail that xetex's underlying storage is mostly
 UTF-16 is exposed.

This seems to be premature of XeTeX then.
It seems to be making an assumption on how those bytes 
will ultimately be used.

 If it is
 not possible to prevent ^^^ or utf8 encoded surrogate pairs combining
 then it is better to
 prevent them being formed.

Hmm. 
What if you have an entirely different purpose in mind for those bytes?
You still need to be able to create them and do further processing with them.

Maybe there should be a primitive that sets a flag controlling what
happens to surrogates' bytes on input?
It may well be that XeTeX's current behaviour is best for putting
content into PDF pages; but not best in other situations. So a macro
programmer should have a means to change this, when needed.

 
 this is no different to XML where &#xd835;&#xdc00; always refers to
 two (invalid) characters not
 to &#x1d400;

Seems fine to me.
If application software wants/needs to combine them, it can do so.

 
 David


Cheers,

Ross


Ross Moore

Senior Lecturer
Mathematics Department  |   Level 2, E7A 
Macquarie University, NSW 2109, Australia
T: +61 2 9850 8955   |  F: +61 2 9850 8114
M: +61 407 288 255  |  http://www.maths.mq.edu.au/






Re: [XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

2015-05-05 Thread David Carlisle
On 4 May 2015 at 16:27, Jonathan Kew jfkth...@gmail.com wrote:

 ...

 A fix for this bug, so that \string generates single Unicode characters
 even for values above U+FFFF, is currently on the utf16-issues branch in
 the XeTeX repository on sourceforge.[1]

 A bug with characters above U+FFFF within \scantokens[2] is also fixed on
 this branch.


 There are also a couple of new primitives available in this branch:

 (1) \Uchar number

 where number is a number in the range 0..10FFFF

 is an expandable command that produces a character token with the given
 Unicode value, and catcode=12 (other character). This is different from
 TeX's \char primitive from a macro-programming point of view, in that it
 expands to a character token rather than being a typesetting command.

 (I believe this is similar to the \Uchar command available in luatex.)


 (2) \Ucharcat number1 number2

 where number1 is a number in the range 0..10FFFF
 and number2 is a number in the ranges 1..4, 6..8, 10..12

 is an expandable command that produces a character token with Unicode
 value number1 and catcode number2. This allows macro programmers to
 create character tokens with various catcode assignments much more easily
 than is otherwise possible.


 Feedback and testing is invited; but note that currently this will require
 pulling the code from sourceforge and building the new xetex, as binary
 packages are not available.

 If testing in the next day or two doesn't uncover any alarming problems,
 these fixes/features will be merged to the master branch and to TeXLive, in
 preparation for the TL2015 release.

 JK



Thanks for this!

I've built the version from this branch and it does appear to address all
the test cases I had for characters above U+FFFF, and \Uchar(cat) will be
incredibly useful in defining expandable operations on token lists, and for
code that should be compatible with both luatex and xetex.

David




[XeTeX] Bug fixes and new features related to Unicode character codes, surrogates, etc

2015-05-04 Thread Jonathan Kew

On 23/4/15 20:59, David Carlisle wrote:

I can confirm that \string does convert character tokens
to two tokens giving the UTF-16 representation.

With the attached file luatex produces

90,33
34,33
233,33
233,33
65530,33
65537,33
65537,33


which is in each case the unicode value of the character followed by
that of !

xetex produces

90,33
34,33
233,33
233,33
65530,33
55296,56321
55296,56321


where the last two lines show that \string has generated U+D800 U+DC01
which does correspond to the UTF-16 encoding of U+10001 confirming
that \string on a character token has produced two tokens that have been
picked up separately as #1 and #2 of the \test macro.
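
The two listings can be reproduced outside TeX with a short Python sketch. It compares the first two items of "character followed by !" when read as whole code points and when read as UTF-16 code units. The helper names are illustrative, and this models the numbers only, not the engines.

```python
def code_points(s):
    return [ord(ch) for ch in s]


def utf16_units(s):
    data = s.encode("utf-16-be", errors="surrogatepass")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


for cp in (90, 233, 65530, 65537):
    s = chr(cp) + "!"
    # first two items seen as code points vs. as UTF-16 units
    print(code_points(s)[:2], utf16_units(s)[:2])
# [90, 33] [90, 33]
# [233, 33] [233, 33]
# [65530, 33] [65530, 33]
# [65537, 33] [55296, 56321]
```

Only the last case differs, because 65537 (U+10001) is the only value above U+FFFF.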


A fix for this bug, so that \string generates single Unicode characters 
even for values above U+FFFF, is currently on the utf16-issues branch in 
the XeTeX repository on sourceforge.[1]


A bug with characters above U+FFFF within \scantokens[2] is also fixed 
on this branch.



There are also a couple of new primitives available in this branch:

(1) \Uchar number

where number is a number in the range 0..10FFFF

is an expandable command that produces a character token with the given 
Unicode value, and catcode=12 (other character). This is different from 
TeX's \char primitive from a macro-programming point of view, in that it 
expands to a character token rather than being a typesetting command.


(I believe this is similar to the \Uchar command available in luatex.)


(2) \Ucharcat number1 number2

where number1 is a number in the range 0..10FFFF
and number2 is a number in the ranges 1..4, 6..8, 10..12

is an expandable command that produces a character token with Unicode 
value number1 and catcode number2. This allows macro programmers to 
create character tokens with various catcode assignments much more 
easily than is otherwise possible.



Feedback and testing is invited; but note that currently this will 
require pulling the code from sourceforge and building the new xetex, as 
binary packages are not available.


If testing in the next day or two doesn't uncover any alarming problems, 
these fixes/features will be merged to the master branch and to TeXLive, 
in preparation for the TL2015 release.


JK


[1] https://sourceforge.net/p/xetex/code/ci/utf16-issues/tree/
[2] https://sourceforge.net/p/xetex/bugs/80/


