
Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-07 Thread Apostolos Syropoulos

> pieces of software, so is cross-compatible with other stuff. Third, as a
> non-Greek I can't comment on the technical correctness of what you say!

Obviously we are educated people here and I would not dare to say that you
do not speak English...


> Is there some place I could see this discussed in detail? (I'm a bit
> confused as to what 'GREEK CAPITAL LETTER EPSILON WITH PSILI' represents
> if it's not the upper case of 'GREEK SMALL LETTER EPSILON WITH PSILI': I
> notice in xgreek you map U+1F18 to U+0395 for upper casing and U+1F10
> for lower casing.)
 
First of all, let me say that Greek is not English and therefore there are
things here that you do not encounter in English. Now, one must realize that
the acute accent (tonos in Unicode parlance) is not part of the letter, as
the umlaut is in the case of the letter ö. The tonos is there to indicate where
the stress should go in a word. The tonos is dropped in words where all
letters are capital or uppercase. As in English and other languages, the
first letter of the word that follows a period is always in uppercase
form. However, the rest of the word is in lowercase form, and this means that
accents should not be dropped. This is exactly the reason why you write

Έλα! Άρη, έλα! (Do you see the accent in Epsilon and Alpha?) 


Naturally, the same principle applies to texts that are written in the polytonic
version of Greek. Now, there is an exception to this general rule. In Greek
the word ή means "or". However, the feminine article is η, so in order to avoid
confusion, ή retains its accent when it is uppercased.



Now, I think it was Jonathan who brought into the discussion the "problem" with
the dialytika that appear when a word is transformed into uppercase. This
problem was solved in Omega by using an Omega Translation Process. However, I
solved the same problem in an OpenType font that included Caps & Small Caps.
The following text


Ο άυλος αυλός ή η Αγγελική με το σκάι;

should be transformed as follows:

Ο ΑΫΛΟΣ ΑΥΛΟΣ Ή Η ΑΓΓΕΛΙΚΗ ΜΕ ΤΟ ΣΚΑΪ;

A.S.



--
Apostolos Syropoulos
Xanthi, Greece



--
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex

Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Philip Taylor

David Carlisle wrote:

> I don't think that's the right question. Even if everyone, including 
> the Unicode technical committee, agreed some properties are
> incorrect for some characters, it isn't clear we should change them
> at this level.

You are (inadvertently) conflating my question with earlier discussions.

My question was asked solely in the context of Apostolos's suggestion that :

> somewhere it is explained why this is not correct. Otherwise, people 
> would see strange things and might wonder why they see them.

and I was trying to ascertain how best this explanation might be cast.
If it is the case that (for UNIV) all agree that Unicode is wrong and
Apostolos is correct, then a simple explanation that 'Unicode is wrong'
would be all that is needed. But if (say) 50% of UNIV agree that Unicode
is correct, then the explanation would have to be cast bearing this in mind.

** Phil.


Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Julian Bradfield

On 2015-05-06, Apostolos Syropoulos  wrote:
> I checked a bit the file and I have noticed that 
> \L 1F10 1F18 1F10 % 
> while xgreek.sty defines 
> \global\lccode"1F10="1F10 \global\uccode"1F10="0395
>
> You see the uppercase of 'GREEK SMALL LETTER EPSILON WITH PSILI'
> is 'GREEK LETTER EPSILON' and not 'GREEK LETTER EPSILON WITH PSILI'.

Not in standard representations of Ancient Greek it isn't, and
polytonic Greek is mostly used for that.

I thought you didn't even use the psili at all in Modern Greek?



Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Joseph Wright

On 06/05/2015 21:06, David Carlisle wrote:
> On 6 May 2015 at 20:15, Philip Taylor  wrote:
>>
>>
>> Apostolos Syropoulos wrote:
>>
>>> It seems to me that most people have no idea what Unicode is and what is
>>> really involved.
>>
>> OK, so if we restrict the Universe of Discourse to the set of native
>> Hellenic speakers who know what Unicode is, know the importance of being
>> able to use it to identify the correct upper case of (for example)
>> 'GREEK SMALL LETTER EPSILON WITH PSILI', and hold an informed opinion on
>> the matter, would you expect that 100% of these would agree that the
>> uppercase is 'GREEK LETTER EPSILON' and not 'GREEK LETTER EPSILON WITH
>> PSILI', or would you expect that some percentage (perhaps small) would
>> hold the opposite point of view ?
>>
>> ** Phil.
>>
> 
> I don't think that's the right question. Even if everyone, including
> the Unicode technical committee, agreed some properties are incorrect
> for some characters, it isn't clear we should change them at this level.
> 
> I think that unicode-letters.def makes most sense as a
> fully automated representation of the UCD data files in TeX syntax.
> 
> That way everyone knows what data is in there.
> 
> Individual language packages have far fewer characters to worry about
> and can over-ride the base settings where appropriate.

Indeed: provided hyphenation is correct then we are OK. (LuaTeX of
course is rather more flexible there than XeTeX.)
--
Joseph Wright




Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Joseph Wright

On 06/05/2015 16:04, Apostolos Syropoulos wrote:
> Hello,
> 
> I checked a bit the file and I have noticed that 
> 
> 
> \L 1F10 1F18 1F10 % 
> 
> while xgreek.sty defines 
> 
> 
> \global\lccode"1F10="1F10 \global\uccode"1F10="0395
> 
> You see the uppercase of 'GREEK SMALL LETTER EPSILON WITH PSILI'
> is 'GREEK LETTER EPSILON' and not 'GREEK LETTER EPSILON WITH PSILI'.
> 
> Some time ago I reported this to the Unicode people and they told me
> something like "we cannot change it now" (I do not remember the exact
> wording but the essence remains the same.) Naturally, all \lccodes and
> \uccodes for Greek letters are wrong and I suspect many more are wrong.

This is slightly at a tangent from my original question (whether we are
processing the Unicode data in the right way), but is worth
consideration. It also has some impact on expl3 code related to case
changing (which does not use \lccode/\uccode).

I guess one could imagine deviating from the Unicode data but there are
issues. First, the current position is at least easy to explain. Second,
the current approach is the same position taken by I guess many other
pieces of software, so is cross-compatible with other stuff. Third, as a
non-Greek I can't comment on the technical correctness of what you say!
Is there some place I could see this discussed in detail? (I'm a bit
confused as to what 'GREEK CAPITAL LETTER EPSILON WITH PSILI' represents
if it's not the upper case of 'GREEK SMALL LETTER EPSILON WITH PSILI': I
notice in xgreek you map U+1F18 to U+0395 for upper casing and U+1F10
for lower casing.)
--
Joseph Wright



Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Joseph Wright

On 06/05/2015 15:09, Jonathan Kew wrote:
> On 6/5/15 14:14, Joseph Wright wrote:
> 
>> Based on the current files, we have a block to set \XeTeXcharclass,
>> which only applies to XeTeX. The logic followed in that code is that
>> characters in the file LineBreak.txt which have class "ID" (ideographs)
>> not only set the \XeTeXcharclass class to 1 but also set the \catcode of
>> the code point to 11. That leads to a difference between the two Unicode
>> engines. My current feeling is that the data file should split this
>> process such that the category code change applies to both XeTeX and
>> LuaTeX, with the XeTeX-specific code separate. Does this make sense and
>> indeed does the current assignment make sense?
>>
> 
> ISTM that the most appropriate (default) \catcode for characters with
> class ID is clearly letter (11), and would suggest that LuaTeX should
> follow XeTeX in this.

Well for LaTeX at least the team get to make the call here and I think
we will pull everything into line.

> So yes, splitting out the XeTeX-specific code and having LuaTeX share
> the catcode assignments makes sense.

OK, if there are no objections I have a plan on this (I'll actually keep
all of the data, I think, and alter the assignment code).

> After all, if users can write control sequences such as
> 
>   \hello
>   \halló
>   \Здравствуйте
>   \ሰላም
>   \सलाम
> 
> they should equally well be able to write
> 
>   \你好
>   \こんにちわ
> 
> and have each of these treated as single control sequences, too. This
> will not work if category ID characters are given catcode 12.

Entirely reasonable.

> If you're making improvements to unicode-letters.def, I would suggest
> also adding a section that assigns catcode 15 (invalid) to the code
> values "D800 - "DFFF (i.e. the UTF-16 surrogates, which should never be
> used in isolation as characters).

Noted: easy enough to add.
--
Joseph Wright





Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread David Carlisle

On 6 May 2015 at 20:15, Philip Taylor  wrote:
>
>
> Apostolos Syropoulos wrote:
>
>> It seems to me that most people have no idea what Unicode is and what is
>> really involved.
>
> OK, so if we restrict the Universe of Discourse to the set of native
> Hellenic speakers who know what Unicode is, know the importance of being
> able to use it to identify the correct upper case of (for example)
> 'GREEK SMALL LETTER EPSILON WITH PSILI', and hold an informed opinion on
> the matter, would you expect that 100% of these would agree that the
> uppercase is 'GREEK LETTER EPSILON' and not 'GREEK LETTER EPSILON WITH
> PSILI', or would you expect that some percentage (perhaps small) would
> hold the opposite point of view ?
>
> ** Phil.
>

I don't think that's the right question. Even if everyone, including
the Unicode technical committee, agreed some properties are incorrect
for some characters, it isn't clear we should change them at this level.

I think that unicode-letters.def makes most sense as a
fully automated representation of the UCD data files in TeX syntax.

That way everyone knows what data is in there.

Individual language packages have far fewer characters to worry about
and can over-ride the base settings where appropriate.
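
A minimal sketch of such an override, using only the code points quoted in
this thread. It assumes (this is not confirmed here) that the three fields of
the generated \L line are code point, uppercase and lowercase. The U+1F18
lines follow Joseph's description of xgreek and are likewise an assumption,
not a copy of xgreek.sty.

```tex
% Sketch only. Assumed default derived from "\L 1F10 1F18 1F10":
%   \lccode"1F10="1F10  \uccode"1F10="1F18
% Override for U+1F10 as quoted from xgreek.sty earlier in the thread:
\global\lccode"1F10="1F10
\global\uccode"1F10="0395
% Assumed matching entry for the capital U+1F18 (per Joseph's description):
\global\lccode"1F18="1F10
\global\uccode"1F18="0395
```

With these values in force, \uppercase applied to U+1F10 yields U+0395
rather than U+1F18, while the generated data file itself stays untouched.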

David

[Joseph's original message was cross-posted to the luatex list;
is there a particular reason that has been dropped?
It seems unfortunate, as a major part of the question was
how to arrange to get the same settings on both systems.]


Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Philip Taylor

Apostolos Syropoulos wrote:

> It seems to me that most people have no idea what Unicode is and what is
> really involved.

OK, so if we restrict the Universe of Discourse to the set of native
Hellenic speakers who know what Unicode is, know the importance of being
able to use it to identify the correct upper case of (for example)
'GREEK SMALL LETTER EPSILON WITH PSILI', and hold an informed opinion on
the matter, would you expect that 100% of these would agree that the
uppercase is 'GREEK LETTER EPSILON' and not 'GREEK LETTER EPSILON WITH
PSILI', or would you expect that some percentage (perhaps small) would
hold the opposite point of view ?

** Phil.


Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Apostolos Syropoulos

>
> How united is the Hellenic-speaking world about this, Apostolos ?  Is it
> a universal truth, universally accepted, or are there some (even just a
> few) who maintain that Unicode is right and everyone else is wrong ?
> 


It seems to me that most people have no idea what Unicode is and what is really
involved. 


A.S.


--
Apostolos Syropoulos
Xanthi, Greece



Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Philip Taylor

Apostolos Syropoulos wrote:
>> I'd suggest that the basic (Xe|Lua)TeX formats should simply follow
>> Unicode properties.
> 
> In addition, I would suggest that somewhere it is explained why this
> is not correct. Otherwise, people would see strange things and might 
> wonder why they see them.

How united is the Hellenic-speaking world about this, Apostolos ?  Is it
a universal truth, universally accepted, or are there some (even just a
few) who maintain that Unicode is right and everyone else is wrong ?

** Phil.


Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Apostolos Syropoulos

> I'd suggest that the basic (Xe|Lua)TeX formats should simply follow
> Unicode properties.

In addition, I would suggest that somewhere it is explained why this
is not correct. Otherwise, people would see strange things and might
wonder why they see them.

A.S.


--

Apostolos Syropoulos
Xanthi, Greece



Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Jonathan Kew


On 6/5/15 16:29, Philip Taylor wrote:
> Apostolos Syropoulos wrote:
>
>> the uppercase of 'GREEK SMALL LETTER EPSILON WITH PSILI'
>> is 'GREEK LETTER EPSILON' and not 'GREEK LETTER EPSILON WITH PSILI'.
>>
>> Some time ago I reported this to the Unicode people and they told me
>> something like "we cannot change it now" (I do not remember the exact
>> wording but the essence remains the same.) Naturally, all \lccodes and
>> \uccodes for Greek letters are wrong and I suspect many more are wrong.
>
> Nasty.  In that case I would propose a user-selectable option :
>
> \Unicodecompliance
>
> with possible values
>
> "strict" (as per current Unicode standard)
>
> and
>
> "loose" (as advised by consensus of native speakers)
>
> One might need to factor this out by language, as in :
>
> \Unicodecompliance {Greek} {strict}
> \Unicodecompliance {Greek} {loose}
>
> or perhaps
>
> \Unicodecompliance (Greek=loose, Turkish=strict, ...)

I'd suggest that the basic (Xe|Lua)TeX formats should simply follow 
Unicode properties. A package designed to support any particular 
language is of course free to offer other options and make whatever 
adjustments may be appropriate.


JK




Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Philip Taylor



Apostolos Syropoulos wrote:

> the uppercase of 'GREEK SMALL LETTER EPSILON WITH PSILI'
> is 'GREEK LETTER EPSILON' and not 'GREEK LETTER EPSILON WITH PSILI'.
> 
> Some time ago I reported this to the Unicode people and they told me
> something like "we cannot change it now" (I do not remember the exact
> wording but the essence remains the same.) Naturally, all \lccodes and
> \uccodes for Greek letters are wrong and I suspect many more are wrong.

Nasty.  In that case I would propose a user-selectable option :

\Unicodecompliance

with possible values

"strict" (as per current Unicode standard)

and

"loose" (as advised by consensus of native speakers)

One might need to factor this out by language, as in :


\Unicodecompliance {Greek} {strict}
\Unicodecompliance {Greek} {loose}

or perhaps

\Unicodecompliance (Greek=loose, Turkish=strict, ...)

Philip Taylor





Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Apostolos Syropoulos

Hello,

I checked a bit the file and I have noticed that 


\L 1F10 1F18 1F10 % 

while xgreek.sty defines 


\global\lccode"1F10="1F10 \global\uccode"1F10="0395

You see the uppercase of 'GREEK SMALL LETTER EPSILON WITH PSILI'
is 'GREEK LETTER EPSILON' and not 'GREEK LETTER EPSILON WITH PSILI'.

Some time ago I reported this to the Unicode people and they told me
something like "we cannot change it now" (I do not remember the exact
wording but the essence remains the same.) Naturally, all \lccodes and
\uccodes for Greek letters are wrong and I suspect many more are wrong.


A.S.

PS Of course people who use the xgreek package have no problem.

 --
Apostolos Syropoulos
Xanthi, Greece



Re: [XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Jonathan Kew


On 6/5/15 14:14, Joseph Wright wrote:

> Based on the current files, we have a block to set \XeTeXcharclass,
> which only applies to XeTeX. The logic followed in that code is that
> characters in the file LineBreak.txt which have class "ID" (ideographs)
> not only set the \XeTeXcharclass class to 1 but also set the \catcode of
> the code point to 11. That leads to a difference between the two Unicode
> engines. My current feeling is that the data file should split this
> process such that the category code change applies to both XeTeX and
> LuaTeX, with the XeTeX-specific code separate. Does this make sense and
> indeed does the current assignment make sense?



ISTM that the most appropriate (default) \catcode for characters with 
class ID is clearly letter (11), and would suggest that LuaTeX should 
follow XeTeX in this.


So yes, splitting out the XeTeX-specific code and having LuaTeX share 
the catcode assignments makes sense.


After all, if users can write control sequences such as

  \hello
  \halló
  \Здравствуйте
  \ሰላም
  \सलाम

they should equally well be able to write

  \你好
  \こんにちわ

and have each of these treated as single control sequences, too. This 
will not work if category ID characters are given catcode 12.


If you're making improvements to unicode-letters.def, I would suggest 
also adding a section that assigns catcode 15 (invalid) to the code 
values "D800 - "DFFF (i.e. the UTF-16 surrogates, which should never be 
used in isolation as characters).
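
The surrogate suggestion could be sketched roughly as follows, assuming a
plain-TeX-style \loop is available and using a hypothetical scratch counter
\surrogatecount. This is an illustration only, not the code that ended up in
unicode-letters.def.

```tex
% Sketch only: mark the UTF-16 surrogates "D800-"DFFF as invalid (catcode 15).
% \surrogatecount is a hypothetical scratch register introduced for this example.
\newcount\surrogatecount
\surrogatecount="D800
\loop
  \global\catcode\surrogatecount=15
  \ifnum\surrogatecount<"DFFF
    \advance\surrogatecount by 1
\repeat
```

The test sits after the assignment so that "DFFF itself is also covered
before the loop stops.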


JK




[XeTeX] Assignment of codes (particularly \catcode) based on Unicode data

2015-05-06 Thread Joseph Wright

Hello all,

As some people will have seen, the LaTeX team have recently integrated
setting of codes (\catcode, \lccode, etc.) for the entire Unicode range
into the kernel when XeTeX/LuaTeX are in use. This is not a functional
change for end users but does mean that the team now have some control
over these important settings. Notably, the new data file we have
created (unicode-letters.def) is compatible with plain TeX and works
with both XeTeX and LuaTeX. We are therefore hopeful that it will
prove useful not only to LaTeX users but also to those using
plain-based formats.

For the initial pass we have adopted the settings applied by
unicode-letters.tex (XeTeX)/luatex-unicode-letters.tex (LuaTeX) as-is.
We have constructed a new (TeX) script to generate this data from the
raw Unicode data files.

Most of the settings are straightforward and shared between XeTeX and
LuaTeX. For example, characters marked by Unicode as letters have
\catcode 11, \lccode and \uccode are set up based on case relationships,
etc. However, we would like to raise one area that may need revision.

Based on the current files, we have a block to set \XeTeXcharclass,
which only applies to XeTeX. The logic followed in that code is that
characters in the file LineBreak.txt which have class "ID" (ideographs)
not only set the \XeTeXcharclass class to 1 but also set the \catcode of
the code point to 11. That leads to a difference between the two Unicode
engines. My current feeling is that the data file should split this
process such that the category code change applies to both XeTeX and
LuaTeX, with the XeTeX-specific code separate. Does this make sense and
indeed does the current assignment make sense?
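
One possible shape for that split, as a sketch under stated assumptions: the
helper name \IDchar is hypothetical and is not the macro used in the data
file, and the two sample code points are simply the ideographs U+4F60 and
U+597D.

```tex
% Sketch only: \IDchar is a hypothetical helper, not part of unicode-letters.def.
% The \catcode change is applied on both engines; the \XeTeXcharclass
% assignment is made only where that primitive exists (i.e. under XeTeX).
\def\IDchar#1{%
  \global\catcode"#1=11 % letter, shared by XeTeX and LuaTeX
  \ifx\XeTeXcharclass\undefined\else
    \global\XeTeXcharclass"#1=1 % XeTeX-specific class for ID characters
  \fi}
\IDchar{4F60}% U+4F60
\IDchar{597D}% U+597D
```

Keeping the engine test inside one helper means the data lines themselves
stay identical for both engines.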

We are very keen to hear about any other logic changes that may be
required in the data file. This is a complex area and we have at present
done little other than copy the current logic.
--
Joseph Wright

