Re: Code pages and Unicode
On 8/24/2011 7:45 PM, Richard Wordingham wrote:

> Which earlier coding system supported Welsh? (I'm thinking of 'W WITH
> CIRCUMFLEX', U+0174 and U+0175.) How was the use of the canonical
> decompositions incompatible with the character encodings of legacy
> systems? Latin-1 has the same codes as ISO-8859-1, but that's as far as
> having the same codes goes. Was the use of combining jamo incompatible
> with legacy Hangul encodings?

See how time flies. Early adopters were interested in 1:1 transcoding, using a single 256-entry table per 8-bit character set, with guaranteed, predictable length. Early designs of Unicode (and 10646) attempted to address these concerns, because ignoring them would have posed severe impediments to migration. Some characters were included as part of the merger without the same rigorous process as is in force for characters today. At that time, scuttling the deal over a few characters here or there would not have been a reasonable action. So you will always find some exceptions to many of the principles - which doesn't make the principles less valid.

> Obviously D800 D800 000E DC00 is non-conformant with current UTF-16.

Remembering that there is a guarantee that there will be no more surrogate code points, *any* extension form has to be non-conformant with current UTF-16! And that's the reason why there's no interest in this part of the discussion. Nobody will need an extension next Tuesday, or in a decade, or even in several decades - or ever. I haven't seen an upgrade to Morse code to handle Unicode recently, for example. Technology has a way of moving on. So the best thing is to drop this silly discussion, and let those future people who might be facing a real *requirement* use their good judgment to come to a technical solution appropriate to their time - instead of wasting collective cycles discussing how to make 1990s technology work for an unknown future requirement. It's just bad engineering.

> Everyone should know how to extend UTF-8 and UTF-32 to cover the 31-bit
> range.

I disagree (as would anyone with a bit of long-term perspective). Nobody needs to look into this for decades, so let it rest.

A./
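For context, the 31-bit extension alluded to above is simply the original ISO/IEC 10646 form of UTF-8, which allowed 5- and 6-byte sequences for code points up to U+7FFFFFFF before RFC 3629 restricted UTF-8 to U+10FFFF. A minimal sketch of that historical encoder (the function name is mine, not from any standard API):

```python
def utf8_31bit(cp: int) -> bytes:
    """Encode cp (0..0x7FFFFFFF) with the original, pre-RFC-3629 UTF-8
    scheme, which permitted 5- and 6-byte sequences."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point out of 31-bit range")
    if cp < 0x80:
        return bytes([cp])
    # (upper limit, number of continuation bytes, lead-byte template)
    for limit, cont, lead in ((0x800, 1, 0xC0), (0x10000, 2, 0xE0),
                              (0x200000, 3, 0xF0), (0x4000000, 4, 0xF8),
                              (0x80000000, 5, 0xFC)):
        if cp < limit:
            out = [lead | (cp >> (6 * cont))]
            for i in range(cont - 1, -1, -1):
                out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
            return bytes(out)
```

For code points up to U+10FFFF it agrees byte-for-byte with standard UTF-8; beyond that it produces the 5- and 6-byte forms that current conformant decoders must reject.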
RE: Code pages and Unicode
+1

I'm also guilty of pushing through one particular proposal (much to Ken's disliking) that I most certainly would no longer even try, but, alas, times were different.

Sincerely,
Erkki

-Original message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On behalf of Asmus Freytag
Sent: 25 August 2011 9:00
To: Richard Wordingham
Cc: Ken Whistler; unicode@unicode.org
Subject: Re: Code pages and Unicode

[snip - full text of the message quoted above]
Re: RTL PUA?
2011/8/25 Peter Constable peter...@microsoft.com:

>> But I suspect that the strong opposition given by Peter Constable...
>
> Yet again, I think you're putting words in my mouth. The only thing I
> think I've explicitly spoken against in this thread is changing the
> default bidi category of PUA characters to ON.

That change would break all existing implementations, yet would not solve the problem; it would just reduce the number of Bidi controls needed in texts. BC=ON only means that the resolved direction of PUA characters is taken from the resolved direction of the preceding (non-PUA) characters, and it does not work at the beginning of paragraphs. The actual direction properties should instead be overridable to another *strong* direction, such as RTL, rather than being changed to an extremely weak, contextual one.

In fact, when Peter says that Bidi processing and the OpenType layout engine are in separate layers (so that OpenType layout works in a lower layer and all BiDi processing is done before any font details are inspected), I think that is simply not true.

> The Unicode Bidi Algorithm uses _character_ properties and operates on
> _characters_. OpenType Layout tables deal only with glyphs.

You're repeating what I also know and have used in my own arguments. I have never stated that the Bidi algorithm operates at the glyph level; I have clearly said the opposite. You are looking for a contradiction that does not exist.

At the very least, the Uniscribe layout engine already has to inspect the content of any OpenType font, if only to process its cmap and implement the font fallback mechanism, just to determine which font will match the characters in the input string to render. If it can do that, it can also inspect a table in the selected font to see which PUA characters are RTL or LTR, and use that as a source of information for BiDi.

> In theory, that could be done. A huge problem with your suggestion,
> though, is that the bidi algorithm deals only with characters and makes
> no references whatsoever to font data, and for that reason -- I would
> hazard to guess -- most implementations of the Unicode bidi algorithm
> do not rely in any way on font data and would need significant
> re-engineering to do so.

You repeat again an argument that I have not contradicted, but it has nothing to do with what I want to express. In any case, re-engineering will be needed for all the proposed solutions (except encoding Bidi controls around those PUA characters, something that we really want to avoid, just as we avoid them for non-PUA characters). The Bidi algorithm is not changed in any way; it still uses the character properties. The only change is that the source of the property values for PUA characters should be overridable (not taken only from the standard UCD), as already permitted by the Unicode standard, which assigns them only *default* property values. If a Bidi algorithm implementation does not allow such overrides, it is already broken and has to be fixed, because it was insufficiently engineered. The fact that it cannot process font data at the step specified in the OpenType specifications is a defect of those specifications, which are incomplete. But even if you don't want to add such a data table to fonts, the external data will have to come from somewhere else; otherwise only the default property values will be used.
Re: RTL PUA?
2011/8/25 Peter Constable peter...@microsoft.com:

> From: ver...@gmail.com [mailto:ver...@gmail.com] On Behalf Of Philippe Verdy
>
>> 2011/8/22 Joó Ádám a...@jooadam.hu:
>>
>>> Speaking of actual implementation, I'm convinced that this format
>>> should be the same as it is for encoded characters ...
>>
>> As well, the small properties files can be embedded, in a very compact
>> form, in the PUA font.
>
> In one sense, having data regarding PUA character properties embedded
> within a font could make sense, since the interpretation of instances
> of those PUA characters will be tied to particular fonts. However, I
> don't see this as really being workable: rendering implementations will
> typically do certain types of processes without access to any font data.

Remove the future-tense "will" from your sentence: you're assuming how future implementations will work. And the phrase "certain types of processes" is extremely fuzzy. Those who want to use PUA characters as RTL characters will never be satisfied otherwise; they want access to property data beyond what the UCD provides.

But you're right about one thing: the font is not expected to contain all those properties. I am still convinced that it is the best place for BC property values, which are tied to the font for rendering purposes. Only the properties of PUA characters that have absolutely no use in rendering should be kept out of fonts (for example collation weights, case mappings, or custom character name aliases, if one wants them). Some other properties may be needed for rendering purposes, notably text segmentation data for handling line breaks. Many PUA characters are currently used for custom sinograms in the Han script, which allows line breaks to occur before and after each of them; but this behavior would not be perceived as correct for most scripts. That said, I don't think line-breaking property data fits well in fonts, because such segmentation is not needed only for rendering.

For most of those non-rendering purposes (e.g. plain-text search), we generally don't want the search result to depend on soft line breaks. Soft line breaks are only meant for rendering, and so this breakability may also come under the control of the font. By contrast, hard line breaks are controlled by existing non-PUA control characters, so they are not a problem and don't need to be overridden. Hard line breaks are very often expected to be searchable, unlike soft line breaks, which should remain invisible in plain-text searches as they are only the result of some rendering process.
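The "small properties files" idea can be illustrated with a tiny, hypothetical text format (the syntax and the choice of properties below are invented for the example, not an existing standard): one record per line, giving a code point or range, a Bidi_Class, and a Line_Break class.

```python
def load_pua_properties(lines):
    """Parse records of the form 'CP[..CP];Bidi_Class;Line_Break'.
    Returns {code_point: (bidi_class, line_break_class)}.
    '#' starts a comment; blank lines are ignored."""
    props = {}
    for raw in lines:
        record = raw.split('#', 1)[0].strip()
        if not record:
            continue
        rng, bc, lb = (field.strip() for field in record.split(';'))
        lo, _, hi = rng.partition('..')
        lo, hi = int(lo, 16), int(hi or lo, 16)
        for cp in range(lo, hi + 1):
            props[cp] = (bc, lb)
    return props
```

Such a table is compact enough to ship alongside (or inside) a PUA font, and a renderer could merge it over the UCD defaults before running bidi resolution or line breaking.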
searching for PUA characters
The recent discussion on PUA characters reminded me of a question I've had. I am wondering if anyone has a tool with which we could search for all documents on a local computer (or server) that use PUA codepoints. I suppose what I'd like is to be able to specify beginning and ending codepoints to search for, such as F130..F32F or something along those lines.

SIL has a corporate PUA; however, many (most) of the characters are now in Unicode, and I'd like to be able to help people identify which documents need converting to the official USVs.

Lorna Priest
Re: searching for PUA characters
On Thu, Aug 25, 2011 at 1:17 PM, Lorna Priest lorna_pri...@sil.org wrote:

> The recent discussion on PUA characters reminded me of a question I've
> had. I am wondering if anyone has a tool whereby we could search for
> all documents on a local computer (or server) that use PUA codepoints.
> I suppose what I'd like is to be able to identify beginning and ending
> codepoints to search for, such as F130..F32F or something along that
> line.

I have a utility called unidesc, part of my uniutils package (http://billposer.org/Software/unidesc.html), that identifies the ranges to which characters belong. You could run it on the various files and check the output for "Private Use Area". To obtain a sorted list of the ranges found in a file (rather than the default output, the range to which each portion of the file belongs), use the -r option. It runs on Linux and BSD systems, so it can probably be compiled for MacOS too; I don't know about MS Windows.
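If a ready-made tool isn't to hand, the scan Lorna describes is also only a few lines of Python (a sketch: it assumes UTF-8 text files, takes the F130..F32F range from her example as its default, and simply skips files that don't decode):

```python
import pathlib

def pua_codepoints(text, lo=0xF130, hi=0xF32F):
    """Sorted list of the code points in [lo, hi] that occur in text."""
    return sorted({ord(c) for c in text if lo <= ord(c) <= hi})

def scan_tree(root, lo=0xF130, hi=0xF32F):
    """Map each file under root to the target-range code points it uses."""
    hits = {}
    for path in pathlib.Path(root).rglob('*'):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding='utf-8')
        except (UnicodeDecodeError, OSError):
            continue  # not readable UTF-8 text; skip it
        found = pua_codepoints(text, lo, hi)
        if found:
            hits[str(path)] = found
    return hits
```

Passing lo=0xE000, hi=0xF8FF instead would catch anything in the whole BMP Private Use Area.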