Re: Code pages and Unicode

2011-08-25 Thread Asmus Freytag

On 8/24/2011 7:45 PM, Richard Wordingham wrote:


Which earlier coding system supported Welsh?  (I'm thinking of 'W WITH
CIRCUMFLEX', U+0174 and U+0175.)  How was the use of the canonical
decompositions incompatible with the character encodings of legacy
systems?  Latin-1 has the same codes as ISO-8859-1, but that's as far
as having the same codes goes. Was the use of combining jamo
incompatible with legacy Hangul encodings?


See how time flies.

Early adopters were interested in 1:1 transcoding, using a single 256-entry 
table per 8-bit character set, with guaranteed, predictable length. Early 
designs of Unicode (and 10646) attempted to address these concerns, because 
ignoring them would have created severe impediments to migration.
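
By way of illustration, the model in question is the 1:1, fixed-length round 
trip that a single 256-entry table gives you; a minimal Python sketch, using 
Latin-1 (whose byte values coincide with U+0000..U+00FF) as the legacy set:

# Minimal sketch of the 1:1 transcoding model early adopters relied on:
# one 256-entry table per 8-bit legacy character set, giving a fixed-length,
# fully reversible mapping. Latin-1 is the trivial case, since its byte
# values coincide with U+0000..U+00FF.

LATIN1_TO_UNICODE = [chr(b) for b in range(256)]            # the 256-entry table
UNICODE_TO_LATIN1 = {c: i for i, c in enumerate(LATIN1_TO_UNICODE)}

def latin1_decode(data: bytes) -> str:
    return "".join(LATIN1_TO_UNICODE[b] for b in data)      # length-preserving

def latin1_encode(text: str) -> bytes:
    return bytes(UNICODE_TO_LATIN1[c] for c in text)        # KeyError outside the table

assert latin1_encode(latin1_decode(b"caf\xe9")) == b"caf\xe9"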


Some characters were included as part of the merger, without the same 
rigorous process that is in force for new characters today. At that time, 
scuttling the deal over a few characters here or there would not have 
been a reasonable action. So you will always find some exceptions to 
many of the principles - which doesn't make them less valid.


Obviously D800 D800 000E DC00 is non-conformant with current UTF-16. 
Remembering that there is a guarantee that there will be no more 
surrogate points, an extension form has to be non-conformant with 
current UTF-16! 


And that's the reason why there's no interest in this part of the 
discussion. Nobody will need an extension next Tuesday, or in a decade, 
or even in several decades - or ever. I haven't seen Morse code upgraded 
to handle Unicode recently, for example. Technology has a way of 
moving on.


So the best thing is to drop this silly discussion, and let those future 
people who might face a real *requirement* use their good judgment 
to arrive at a technical solution appropriate to their time - instead of 
wasting collective cycles discussing how to make 1990s technology 
work for an unknown future requirement. It's just bad engineering.

Everyone should know how to extend UTF-8 and UTF-32 to cover the 31-bit
range.


I disagree (as would anyone with a bit of long-term perspective). Nobody 
needs to look into this for decades, so let it rest.
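
For reference, the pre-2003 definition of UTF-8 in RFC 2279 already covered 
code points up to U+7FFFFFFF with five- and six-byte sequences; a minimal 
sketch of that legacy scheme (illustration only, not conformant modern UTF-8):

# Sketch of the legacy (RFC 2279) UTF-8 encoding, which covered the full
# 31-bit range with sequences of up to six bytes. Modern UTF-8 (RFC 3629)
# is restricted to U+10FFFF and four bytes.

def encode_utf8_31bit(cp: int) -> bytes:
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:
        return bytes([cp])
    # (upper limit, lead-byte prefix, number of continuation bytes)
    for limit, prefix, cont in ((0x800, 0xC0, 1), (0x10000, 0xE0, 2),
                                (0x200000, 0xF0, 3), (0x4000000, 0xF8, 4),
                                (0x80000000, 0xFC, 5)):
        if cp < limit:
            out = [prefix | (cp >> (6 * cont))]
            out += [0x80 | ((cp >> (6 * i)) & 0x3F) for i in range(cont - 1, -1, -1)]
            return bytes(out)
    raise ValueError("code point exceeds 31 bits")

# Example: the largest 31-bit value, well beyond U+10FFFF
print(encode_utf8_31bit(0x7FFFFFFF).hex())   # 'fdbfbfbfbfbf'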


A./



RE: Code pages and Unicode

2011-08-25 Thread Erkki I Kolehmainen
+1

I'm also guilty of pushing through one particular proposal (much to Ken's 
dislike) that I most certainly would no longer even try, but, alas, times 
were different.

Sincerely, Erkki 

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] 
On Behalf Of Asmus Freytag
Sent: 25 August 2011 9:00
To: Richard Wordingham
Cc: Ken Whistler; unicode@unicode.org
Subject: Re: Code pages and Unicode






Re: RTL PUA?

2011-08-25 Thread Philippe Verdy
2011/8/25 Peter Constable peter...@microsoft.com:
 From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On 
 Behalf Of Philippe Verdy

 But I suspect that the strong opposition given by Peter Constable...

 Yet again, I think you're putting words in my mouth. The only thing I think 
 I've explicitly spoken against in this thread is changing the default bidi 
 category of PUA characters to ON.

That is something that would break all existing implementations without
solving the problem; it would merely reduce the number of bidi controls
needed in the text. BC=ON only means that the resolved direction of PUA
characters is taken from the resolved direction of the preceding (non-PUA)
characters, and it does not work at the beginning of a paragraph. The
actual direction property should instead be overridable to a *strong* RTL
value different from the default, rather than changed to something
extremely weak and contextual.
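
A much-simplified sketch of the neutral-resolution rules (UBA N1/N2,
ignoring embeddings, brackets and weak types) shows why bc=ON resolves from
context and falls back to the paragraph direction at a paragraph boundary;
the PUA use case here is hypothetical:

# Simplified UBA rules N1/N2: a run of neutrals (ON) between two strong
# types of the same direction takes that direction; otherwise it takes the
# embedding (paragraph) direction. Illustration only.

def resolve_neutrals(types, para_dir):
    """types: list of 'L', 'R' or 'ON'; para_dir: 'L' or 'R'."""
    resolved = list(types)
    i = 0
    while i < len(resolved):
        if resolved[i] == 'ON':
            j = i
            while j < len(resolved) and resolved[j] == 'ON':
                j += 1
            before = resolved[i - 1] if i > 0 else para_dir          # sos
            after = resolved[j] if j < len(resolved) else para_dir   # eos
            fill = before if before == after else para_dir           # N1 / N2
            resolved[i:j] = [fill] * (j - i)
            i = j
        else:
            i += 1
    return resolved

# A PUA character (ON) surrounded by strong RTL letters resolves to R...
print(resolve_neutrals(['R', 'ON', 'R'], 'R'))   # ['R', 'R', 'R']
# ...but at the start of an LTR paragraph it resolves to L, which is the
# behaviour objected to above.
print(resolve_neutrals(['ON', 'R', 'R'], 'L'))   # ['L', 'R', 'R']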

 In fact when Peter says that the Bidi processing and the OpenType layout
 engine are in separate layers (so that the OpenType layout works in a lower
 layer and all BiDi processing is done before any font details are inspected),
 I think that this is a perfect lie:

 The Unicode Bidi Algorithm uses _character_ properties and operates on 
 _characters_. OpenType Layout tables deal only with glyphs.

You're repeating something that I already know and have used in my own
arguments. I have never stated that the bidi algorithm operates at the
glyph level; I have clearly said the opposite. You are looking for a
contradiction that simply isn't there.

 At least the Uniscribe layout already has to inspect the content of any 
 OpenType
 font, at least to process its cmap and implement the font fallback 
 mechanism,
 just to see which font will match the characters in the input string to 
 render.

 If it can do that, it can also inspect later a table in the selected font to 
 see which
 PUAs are RTL or LTR. And it can do that as a source of information for BiDi 
 ...

 In theory, that could be done. A huge problem with your suggestion, though, 
 is that the bidi algorithm deals only with characters and makes no references 
 whatsoever to font data, and for that reason -- I would hazard to guess -- 
 most implementations of the Unicode bidi algorithm do not rely in any way on 
 font data and would need significant re-engineering to do so.

You repeat again an argument that I have not contradicted, but it has
nothing to do with what I want to express. In any case, some
re-engineering will be needed for every proposed solution (unless we are
prepared to encode bidi controls around those PUA characters, something
we want to avoid just as much as we avoid them for non-PUA characters).

The bidi algorithm itself is not changed in any way; it still uses the
character properties. The only difference is that the source of the
property values for PUA characters should be overridable (not limited to
the standard UCD), as the Unicode Standard already permits, since it
assigns them only *default* property values.
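
A sketch of what such an overridable property source could look like in
practice; the override table and the choice of Python's unicodedata as the
UCD fallback are illustrative assumptions, not part of any standard:

# The algorithm keeps consuming Bidi_Class values; only the lookup changes:
# a per-document (or per-font) override map is consulted before falling back
# to the UCD defaults.
import unicodedata

PUA_BC_OVERRIDES = {0xE000: 'R', 0xE001: 'R'}   # hypothetical: two PUA letters used as RTL

def bidi_class(ch: str) -> str:
    cp = ord(ch)
    if (0xE000 <= cp <= 0xF8FF or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD):      # the three PUA ranges
        override = PUA_BC_OVERRIDES.get(cp)
        if override is not None:
            return override
    return unicodedata.bidirectional(ch)         # UCD default (PUA defaults to 'L')

print(bidi_class('\ue000'))   # 'R' via the override
print(bidi_class('\ue002'))   # 'L', the UCD default for PUA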

If a bidi algorithm implementation does not allow such overrides, it is
already broken and has to be fixed, because it was insufficiently
engineered. The fact that it cannot process font data at the step
specified in the OpenType specifications is a defect of that
specification, which is incomplete. But even if you don't want to add
such a data table to fonts, the external data will have to come from
somewhere else; otherwise only the default property values will be used.




Re: RTL PUA?

2011-08-25 Thread Philippe Verdy
2011/8/25 Peter Constable peter...@microsoft.com:
 From: ver...@gmail.com [mailto:ver...@gmail.com] On Behalf Of Philippe Verdy

2011/8/22 Joó Ádám a...@jooadam.hu:
 Speaking of actual implementation, I’m convinced that this format
 should be the same as it is for encoded characters ...

 As well, the small properties files can be embedded, in a very compact
 form, in the PUA font.

 In one sense having data regarding PUA character properties embedded within a 
 font could make sense since the interpretation of instances of those PUA 
 characters will be tied to particular fonts.

 However, I don't see this as really being workable: rendering implementations 
 will typically do certain types of processes without access to any font data.

Remove the future "will" from your sentence: you are assuming how future
implementations will work.

And "certain types of processes" is extremely fuzzy. Those who want to
use PUA code points as RTL characters will never be satisfied with that;
they want access to property data beyond what the UCD provides.

But you're right about one thing: the font is not expected to contain all
those properties. I am still convinced that it is the best place for
Bidi_Class property values, which are tied to the font for rendering
purposes. Only those properties of PUA characters that have absolutely no
use in rendering should stay out of fonts (for example collation weights,
case mappings, or custom character name aliases if one wants them).
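
As a sketch only: there is no standardized OpenType table for PUA
properties, so the table tag 'PUAp', its record format, and the use of
fontTools below are all hypothetical, meant just to show how a layout
engine might pull such per-font bidi data out of the selected font:

# Hypothetical private table 'PUAp': a sequence of (start, end, bidi_class)
# records, each packed as two 32-bit code points and one byte.
import struct
from fontTools.ttLib import TTFont

def load_pua_bidi_overrides(font_path: str) -> dict:
    font = TTFont(font_path)
    if 'PUAp' not in font:                       # hypothetical table tag
        return {}
    data = font['PUAp'].data                     # raw bytes of an unregistered table
    overrides = {}
    for off in range(0, len(data), 9):
        start, end, bc = struct.unpack_from('>IIB', data, off)
        for cp in range(start, end + 1):
            overrides[cp] = 'R' if bc == 1 else 'L'
    return overrides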

Some other properties may be needed for rendering purposes, notably text
segmentation data for handling line breaks (many PUA code points are
currently used for custom sinograms in the Han script, which lets line
breaks occur before and after each of them, but that behavior would not
be perceived as correct for most scripts).

However, I don't think that line-breaking property data fits very well in
fonts, because such segmentation is not needed only for rendering. Then
again, for most of those non-rendering purposes (e.g. plain-text search)
we generally don't want the result to depend on soft line breaks. Soft
line breaks are meant only for rendering, so this breakability could also
come under the control of the font.

Hard line breaks, on the other hand, are controlled by existing non-PUA
control characters, so they are not a problem and don't need to be
overridden. Those hard line breaks are very often expected to be
searchable, unlike soft line breaks, which should remain invisible in
plain-text searches since they are only the result of some rendering
process.




searching for PUA characters

2011-08-25 Thread Lorna Priest
The recent discussion on PUA characters reminded me of a question I've 
had. I am wondering if anyone has a tool with which we could search all 
documents on a local computer (or server) that use PUA code points. I 
suppose what I'd like is to be able to specify beginning and ending 
code points to search for, such as F130..F32F or something along those 
lines.


SIL has a corporate PUA; however, many (most) of the characters are now 
in Unicode, and I'd like to be able to help people identify which 
documents need converting to the official USVs.
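
A rough sketch of such a tool (not an existing one): it walks a directory,
decodes each file as UTF-8, and reports files containing code points in a
given range; the range below is the one suggested above.

# Walk a directory tree and report files containing code points in the
# chosen range. Files that are not valid UTF-8 text are simply skipped.
import os, sys

START, END = 0xF130, 0xF32F     # beginning and ending code points to search for

def scan(root):
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding='utf-8') as f:
                    text = f.read()
            except (UnicodeDecodeError, OSError):
                continue
            hits = sorted({ord(c) for c in text if START <= ord(c) <= END})
            if hits:
                print(path, ' '.join('U+%04X' % cp for cp in hits))

if __name__ == '__main__':
    scan(sys.argv[1] if len(sys.argv) > 1 else '.')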


Lorna Priest




Re: searching for PUA characters

2011-08-25 Thread Bill Poser
On Thu, Aug 25, 2011 at 1:17 PM, Lorna Priest lorna_pri...@sil.org wrote:

 The recent discussion on PUA characters reminded me of a question I've had.
 I am wondering if anyone has a tool whereby we could search for all
 documents on a local computer (or server) that use PUA codepoints. I suppose
 what I'd like is to be able to identify beginning and ending codepoints to
 search for, such as F130..F32F or something along that line.


I have a utility called unidesc, part of my uniutils package (
http://billposer.org/Software/unidesc.html), that identifies the ranges to
which characters belong. You could run it on the various files and check
the output for Private Use Area. To obtain a sorted list of the ranges
found in a file (rather than the default listing of the range to which each
portion of the file belongs), use the -r option. It runs on Linux and BSD
systems, so it can probably be compiled for MacOS too; I don't know about
MS Windows.