Re: RTL PUA?
What is needed is a way to specify the properties in a platform-independent way, where platform means not only OS but also font technology. The font formats used by all smart font technologies (OT, AAT, Graphite) are based on the TrueType font file format, which allows you to add any number of custom tables. If the people responsible for the OT, AAT, and Graphite specs agreed on it amongst themselves, it might be possible to specify an embedded table of properties for PUA characters that all the different rendering engines could read and make use of. That might not be completely font-technology independent, but pretty close. - C
Re: RTL PUA?
2011/8/25 Peter Constable peter...@microsoft.com:

From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Philippe Verdy

But I suspect that the strong opposition given by Peter Constable...

Yet again, I think you're putting words in my mouth. The only thing I think I've explicitly spoken against in this thread is changing the default bidi category of PUA characters to ON.

Something that will break all existing implementations, but will not solve the problem; it will just reduce the number of Bidi controls needed in texts: BC=ON only means that the resolved direction of PUA characters will come from the resolved direction of previous (non-PUA) characters. It does not work at the beginning of paragraphs. The actual direction properties should be overridable to a *strong* RTL direction instead of the default, rather than changed to be extremely weak and contextual.

In fact when Peter says that the Bidi processing and the OpenType layout engine are in separate layers (so that the OpenType layout works in a lower layer and all BiDi processing is done before any font details are inspected), I think that this is a perfect lie:

The Unicode Bidi Algorithm uses _character_ properties and operates on _characters_. OpenType Layout tables deal only with glyphs.

You're repeating again what I also know and used in my arguments. I have never stated that the Bidi algorithm operates at the glyph level; I have clearly said the opposite. You are only searching for a contradiction which does not even appear.

At least the Uniscribe layout already has to inspect the content of any OpenType font, at least to process its cmap and implement the font fallback mechanism, just to see which font will match the characters in the input string to render. If it can do that, it can also inspect later a table in the selected font to see which PUAs are RTL or LTR. And it can do that as a source of information for BiDi ...

In theory, that could be done.
A huge problem with your suggestion, though, is that the bidi algorithm deals only with characters and makes no references whatsoever to font data, and for that reason -- I would hazard to guess -- most implementations of the Unicode bidi algorithm do not rely in any way on font data and would need significant re-engineering to do so.

You repeat again your argument that I have not contradicted. But this has nothing to do with what I want to express. And anyway, re-engineering will be needed in all the proposed solutions (except if we have to encode the Bidi controls around those PUAs, something that we really want to avoid as often as we avoid them for non-PUA characters). The Bidi algorithm is not changed in any way; it still uses the character properties, except that the source of the property values for PUA characters should be overridable (not taken only from the standard UCD), as already permitted in the Unicode standard, which just assigns them *default* property values. If a Bidi algorithm implementation does not allow such overrides, it is already broken and has to be fixed, because it was insufficiently engineered. The fact that it cannot process font data at the step specified in the OpenType specifications is a defect of this specification, which is incomplete. But even if you don't want to add such a data table in fonts, the external data will have to come from somewhere else. Otherwise only the default property values will be used.
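The point about overridable property sources can be made concrete with a small sketch. It is only an illustration under stated assumptions: the function names and the `pua_overrides` dictionary are hypothetical, and nothing here says where the override data comes from (a font table, a side file, or anything else). The bidi algorithm itself would be untouched; only the lookup of Bidi_Class for private-use code points changes.

```python
import unicodedata


def is_private_use(cp: int) -> bool:
    """True for the BMP Private Use Area and the two private-use planes."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)


def bidi_class(ch: str, pua_overrides=None) -> str:
    """Return the Bidi_Class to feed into an unchanged bidi algorithm.

    pua_overrides is a hypothetical mapping {code point: bidi class}.
    Only private-use code points may be overridden; without an override
    they keep the default strong LTR value "L". All other characters
    use the normative value from the UCD.
    """
    cp = ord(ch)
    if is_private_use(cp):
        return (pua_overrides or {}).get(cp, "L")
    return unicodedata.bidirectional(ch)
```

For example, `bidi_class("\ue000", {0xE000: "R"})` yields a strong RTL class for U+E000, while the same call without overrides falls back to the default.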
Re: RTL PUA?
2011/8/25 Peter Constable peter...@microsoft.com:

From: ver...@gmail.com [mailto:ver...@gmail.com] On Behalf Of Philippe Verdy

2011/8/22 Joó Ádám a...@jooadam.hu: Speaking of actual implementation, I’m convinced that this format should be the same as it is for encoded characters ...

As well, the small properties files can be embedded, in a very compact form, in the PUA font.

In one sense having data regarding PUA character properties embedded within a font could make sense since the interpretation of instances of those PUA characters will be tied to particular fonts. However, I don't see this as really being workable: rendering implementations will typically do certain types of processes without access to any font data.

Remove the future "will" in your sentence... you're assuming how future implementations will work. And the "certain types of processes" element is extremely fuzzy. Those who want to use PUA as RTL characters will never be satisfied; they want access to some property data that are not only those from the UCD.

But you're right in one thing: the font is not expected to contain all those properties. I am still convinced that this is the best place for BC property values, which are tied to the font, for rendering purposes. Only the properties for PUA characters that have absolutely no use in rendering should not be in fonts (for example collation weights, case mappings, custom character name aliases if one wants).

Some other properties may be needed for rendering purposes: notably text segmentation data for handling line breaks (many PUA characters are currently used for custom sinograms in the Han script, which allows a line break to occur before and after each of them; but this behavior would not be perceived as correct for most scripts). However, I don't think that line breaking property data fit very well in fonts, because such segmentation is not needed only for rendering.

However, for most of those non-rendering purposes (e.g. plain-text search), we generally don't want to have the search result depending on soft line breaks. Soft line breaks are only meant for rendering purposes, and so this breakability may also come under the control of the font.

In contrast, hard line breaks are controlled by existing non-PUA control characters, so they are not a problem and don't need to be overridden. Those hard line breaks are very often expected to be searchable, unlike soft line breaks, which should remain invisible in plain-text searches as they are only the result of some rendering process.
Re: RTL PUA?
2011/8/24 John Hudson j...@tiro.ca:

Philippe, I'll need to think about this some more and try to get a better grasp of what you're suggesting. But some immediate thoughts come to mind: If BiDi is to be applied to shaped glyph strings, surely that means needing to step backwards through the processing that arrived at those shaped glyph strings in order to correctly identify their relationship to underlying character codes, since it is the characters, not the glyphs, that have directional properties. There's nothing in an OT font that says e.g. GID 456 /lam_alif.fina/ is an RTL glyph, so the directionality has to be processed at the character level and mapped up through the GSUB features to the glyphs.

No backward stepping is needed: process the text using grapheme cluster boundaries as a minimum unit of processing: apply normalization, try to cmap all their characters from the same font (use fallback fonts if needed), then if this fails try to cmap their individual character components to find a font match. This done, each character is now mapped to a definitive font and a putative (incompletely resolved) glyph id in that font. Note that PUAs will be isolated at this point (they form their own grapheme cluster). You can then check if the font provides an override for the BC property, from the default strong LTR value. Then independently:

- you can process the list of glyphs one by one, trying to match all applicable GSUB lookups only if they occur in the same font as the font associated with the previous character. You can also easily select the typographic variants of that font, for a single glyph.
- you can update the current Bidi level of the character, using the BC property value overrides specified in the font containing the PUA, or the normative value for non-PUA, otherwise the default BC property value for PUA.
If finally the remaining glyph id's are no longer substitutable, you can then apply GPOS rules (or legacy tables for base-to-base kerning) reliably, because you also know if the BiDi level is even (LTR) or odd (RTL). You can then consider the glyph metrics to accumulate widths in order to detect if an automatic line-break can occur. When a forced or automatic linebreak does occur, you can then adjust the justification of glyph ids. Because you also know at that point what is the directionality of all characters (including the first glyph of the line, and if this line starts a paragraph, from which you have determined what is the main direction of the baseline). You can also automatically adjust the widths of kashidas (or even automatically insert them for microjustification of glyphs, according to the joining properties of the associated characters). Then you can reorder the glyph ids that are in runs opposed to the main direction of the baseline for the paragraph. Some more refinements are needed for handling some text decorations (such as underlines which is not necessarily continuous in all styles and may need to avoid cutting through strokes; but this would require some metrics from the font, associated to glyphs with descenders). All the above can be done in parallel (i.e. character per character, each one being handled glyph id by glyph id, as long as there are matchable GSUB or GPOS). The memory requirement is limited to as many glyphs that can fit in the margin of a single line; Finally the line can be fully drawn with the reordered glyphs (you may need to clip the kashidas to their autojustified width, to avoid them to overlap too far away the surrounding joined characters).
RE: Designing a format for research use of the PUA in a RTL mode (from Re: RTL PUA?)
Thank you to Doug and to Asmus for replying. Originally I was thinking of the format simply being so as to help to level the infrastructural ground as between a PUA (Private Use Area) application using left-to-right characters and a PUA application using right-to-left characters. However, the research needs to proceed in the best direction so as to get the best possible result, so I am happy for my original idea to be augmented and changed if that is what is needed. Do any people who would like to use PUA applications that use right-to-left characters have any views on a format please? Is such a format regarded as useful? What does it need to do? What would be the features of a very minimal RTL constructed script that would exhibit all of the features for which a researcher might want to use the Private Use Area for research with a real-world RTL script please? I am thinking of making a small font with some characters that consist of a leftward pointing arrow with a broad tail with the tail having markings to give a clue to the sound. These markings would be based on the hatching system used for representing colours in monochrome. For example, vertical lines for r because that is red or rouge, horizontal lines for b because that is blue or bleu. I thought of having an o as an o drawn with a left arrow attached to it. I could then produce a glyph for a br ligature and maybe a rb ligature. I am thinking that the ligature glyphs could be wider, have only one leftward pointing arrow yet have two types of markings on the tail of the arrow, side by side. Would that and a space be enough for a constructed script that would exhibit the needed properties for a demonstration or would some more glyphs be needed? My thinking is that the font, complete with its PUA.RTL assignment statement, could be a benchmark test font for testing a special researcher's edition of a wordprocessing application or a desktop publishing application. 
By using a font for a minimal constructed script, the task of producing and testing the special researcher's edition of a software application could be separated from the complexities of a full real script, perhaps therefore increasing the chances of the special researcher's edition of a software package being produced. I feel that I could make the font as a TrueType font. In order to produce an OpenType font I would need to consolidate what I have started to learn about OpenType fonts, though I would be happy for the TrueType font to be adapted by other people if they so wish. William Overington 24 August 2011
Re: RTL PUA?
John Hudson 於 2011年8月23日 下午9:08 寫道: I think you may be right that quite a lot of existing OTL functionality wouldn't be affected by applying BiDi after glyph shaping: logical order and resolved order are often identical in terms of GSUB input. But it is in the cases where they are not identical that there needs to be a clearly defined and standard way to do things on which font developers can rely. [A parallel is canonical combining class ordering and GPOS mark positioning: there are huge numbers of instances, even for quite complicated combinations of base plus multiple marks, in which it really doesn't matter what order the marks are in for the typeform to display correctly; but there are some instances in which you absolutely need to have a particular mark sequence.] And this is really the key point. There really isn't anything inherent to OpenType that absolutely *requires* the bidi algorithm be run in character space. It would theoretically be possible to manage things in a fashion so that it's run afterwards, à la AAT. But font designers *must* know which way it's being done in practice, and, in practice, all OT engines run the bidi algorithm in character space and not in glyph space. At this point, trying to arrange things so that it can be done in glyph space instead is a practical impossibility. = Hoani H. Tinikini John H. Jenkins jenk...@apple.com
Re: RTL PUA?
2011/8/24 John H. Jenkins jenk...@apple.com: John Hudson 於 2011年8月23日 下午9:08 寫道: I think you may be right that quite a lot of existing OTL functionality wouldn't be affected by applying BiDi after glyph shaping: logical order and resolved order are often identical in terms of GSUB input. But it is in the cases where they are not identical that there needs to be a clearly defined and standard way to do things on which font developers can rely. [A parallel is canonical combining class ordering and GPOS mark positioning: there are huge numbers of instances, even for quite complicated combinations of base plus multiple marks, in which it really doesn't matter what order the marks are in for the typeform to display correctly; but there are some instances in which you absolutely need to have a particular mark sequence.] And this is really the key point. There really isn't anything inherent to OpenType that absolutely *requires* the bidi algorithm be run in character space. It would theoretically be possible to manage things in a fashion so that it's run afterwards, à la AAT. But font designers *must* know which way it's being done in practice, and, in practice, all OT engines run the bidi algorithm in character space and not in glyph space. At this point, trying to arrange things so that it can be done in glyph space instead is a practical impossibility. One problem of interpretation: I have never suggested that the Bidi algorithm would need to run in the glyph space. You can still run it in the character space. Reread my suggestions where I clearly and explicitly spoke about how boundaries between runs of characters that are in a resolved direction and runs of glyphs that are in the same resolved direction just has to be kept. 
The only borderline case occurs if one wants to create some ligaturing feature (substitution and/or positioning) between glyphs belonging to distinct successive runs, something that is still unsupported for now, even though it is visually possible (and may even be wanted, notably for kerning or microjustification of lines displaying runs in both directions). This does not even mean that glyph ids will be reordered for RTL runs of glyphs or RTL runs of characters.

In OpenType, there is clearly the need in all cases to maintain at least a mapping from positions in the character stream for each directional run to the positions in the glyph stream. But such a mapping is evidently not needed for each character or even for each grapheme cluster, and it does not have to be bijectively reversible (for example, distinct positions of directional runs in the character stream may map to the same position in the glyph stream). And this surjective(*) mapping does not even have to be monotonic(*) between each character or grapheme cluster, but only strictly monotonic(*) between non-empty directional runs (otherwise it would be impossible, in the final drawing step, to compute the relative positions of runs in the rendered line, because it would be impossible to sort these non-empty runs along the baseline axis; note also that empty runs that may occur in the glyph space can be skipped, and in fact must be skipped to assert the condition of strict monotonicity).

Note (*): mathematical meaning of these terms. For example, most Indic scripts exhibit a *non-monotonic* surjective function that maps the positions of successive grapheme clusters in the character stream to the positions in the glyph stream (but given that Indic scripts are only strong LTR, this is not a limitation: streams of Indic characters or streams of Indic glyphs will never include in their middle any boundary between non-empty runs with opposite resolved directions).
RE: RTL PUA?
From: ver...@gmail.com [mailto:ver...@gmail.com] On Behalf Of Philippe Verdy 2011/8/22 Joó Ádám a...@jooadam.hu: Speaking of actual implementation, I’m convinced that this format should be the same as it is for encoded characters ... As well, the small properties files can be embedded, in a very compact form, in the PUA font. In one sense having data regarding PUA character properties embedded within a font could make sense since the interpretation of instances of those PUA characters will be tied to particular fonts. However, I don't see this as really being workable: rendering implementations will typically do certain types of processes without access to any font data. Peter
RE: RTL PUA?
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Philippe Verdy

Lookup tables in fonts (at least OpenType) do not work at the character level, but at the glyph level: they substitute glyph ids by other glyph ids.

That much is true.

Sequences of glyph ids are already reordered in visual order by the layout engine when they are searched in OpenType lookups, should they be RTL glyphs, or Indic glyphs with special reordering requirements (independent of the logical ordering of characters/code points).

OpenType lookup tables are agnostic wrt LTR or RTL; sequences of glyph IDs in a lookup are from start to finish. For Indic scripts, some re-orderings are assumed to have been applied before lookups are processed. As for bidi, it is _not_ the case that a glyph sequence in a lookup table is ordered in LTR visual order, as Philippe's statement suggests. Rather, they are ordered from start to finish. One might choose to perceive that in LTR/RTL terms; you certainly don't have to, though which way you perceive it will have to correlate with whether you think of an implementation as actually having done some level reordering before OpenType Layout tables are processed--which certainly is not mandatory for implementations.

The only lookup table in fonts that works at the character/code point level is their cmap

Note that the 'cmap' is not typically referred to as a lookup table since there is a distinct set of data structures in OpenType that are formally called Lookup tables.

Not all fonts need a cmap; for some of them, a default cmap may be implied or automatically constructed -- for example Symbol fonts in Windows, that are implicitly mapped in a PUA range;

Not true. All OpenType fonts require a cmap table. This is true even of symbol encoded fonts. Strictly speaking, symbol-encoded fonts are not encoded using Unicode, and so are not mapped in a PUA range.
It is true, though, that they use 16-bit code points and that in many symbol-encoded fonts the code point range used does have numerical values that correlate to those of Unicode PUA characters in the BMP. But years ago Bob Hallissy and I confirmed that symbol-encoded fonts could work with code points in other numerical ranges. Peter
RE: RTL PUA?
From: ver...@gmail.com [mailto:ver...@gmail.com] On Behalf Of Philippe Verdy

2011/8/22 Peter Constable peter...@microsoft.com: Of course _OpenType_ cannot, but any rendering engine that uses OpenType _must_ resolve the bidi level of _all_ characters in a sequence that it is given to render. Given our current situation, a default rendering implementation would resolve PUA characters to an even (LTR) level unless, of course, bidi control characters -- particularly RLO -- are used to override the directionality of the character, as you mention. [snip]

So now I perceive your opinion:
- you don't want the solution proposed by Michael Everson (simply adding a range of RTL PUA), that I also think is not necessary, but is clearly a possible solution.

You're putting words in my mouth: I don't think I've expressed any opinion in this thread for or against Michael's proposal. All I've commented on is that the OpenType spec has a model for glyph mirroring that can work for mirroring of PUA characters as well, that bidi category is one of several properties that can affect text processing, and that people shouldn’t expect PUA characters to behave as they desire in all software titles.

- you propose to use BiDi overrides.

Again, you're putting words in my mouth. All I said is that default behaviour can be expected to observe bidi categories unless bidi controls are used to override the assigned categories for characters in a run.

Peter
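As an illustration of the bidi-control route mentioned above (not a recommendation from anyone in the thread), a plain-text workaround is to bracket runs of private-use characters with RLO (U+202E) and PDF (U+202C). The sketch below is an assumption-laden example: the function name is hypothetical, it covers only the BMP Private Use Area, and it treats every PUA run as RTL, which a real document might not want.

```python
import re

RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

# Maximal runs of BMP private-use code points (U+E000..U+F8FF).
_BMP_PUA_RUN = re.compile("[\uE000-\uF8FF]+")


def wrap_pua_runs(text: str) -> str:
    """Bracket each run of BMP PUA characters with RLO ... PDF."""
    return _BMP_PUA_RUN.sub(lambda m: RLO + m.group(0) + PDF, text)
```

This is exactly the kind of extra markup in the text that other participants in the thread would prefer to avoid, which is why the discussion keeps returning to property overrides instead.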
Re: RTL PUA?
On 22 August 2011 22:40, John Hudson j...@tiro.ca wrote:

Glyph ID inputs for OTL processing are according to reading/resolved order. This is typically the same as logical order, but the term logical order really applies to character strings, not glyph strings, which are much more malleable. The order of input strings in GSUB lookups or contexts is dependent not only on the underlying character order, but also on the results of previous GSUB lookups. So while, unlike AAT and Graphite, OpenType Layout doesn't explicitly provide for glyph re-ordering, some kinds of glyph reordering are possible using sequences of contextual lookups to duplicate a glyph in a second location in the string and then remove the first instance. We use this in some Devanagari fonts to enable subsequent ligation of short ikar variants to the left of a consonant base with reph marks to the right of that base. र्क्मि JH

Open Font Format GSUB tables contain glyphs in visual (or reading) order and not in logical order. The logical order of characters is transformed to a visual order of characters before the font layer. Thus for Devanagari, the I matra is moved left and the reph is moved right. It is this visual syllable of characters that is transformed to a visual syllable of glyph ids using the cmap table in the Open font.

Consider Devanagari RKMI. The logical order of characters is Ra Halant Ka Halant Ma IMatra. This logical order is transformed to the visual order of characters IMatra Ka Halant Ma Ra Halant, then to the visual order of GIds IMatraGId KaGId HalantGId MaGId RaGId HalantGId. GSUB tables will transform KaGId HalantGId to HalfKaGId. RaGId HalantGId will be transformed to rephGId. So finally, the visual order of GIds will be IMatraGId HalfKaGId MaGId rephGId. This can be further beautified but is already in a readable form. It will appear as र्क्मि

So no reordering is needed in the Open font for the general case.
For special cases (for example RDMI, logical order Ra Halant Da Halant Ma IMatra), reordering in the font is needed. Here, DaGId HalantGId MaGId does not reduce to HalfDaGId MaGId in most Hindi fonts, but remains as DaGId HalantGId MaGId. The visual order of GIds for the syllable is IMatraGId DaGId HalantGId MaGId RaGId HalantGId. It is converted to IMatraGId DaGId HalantGId MaGId rephGId by the font. Now the font has to reorder the IMatraGId and the rephGId over the retained HalantGId. The correct glyph sequence would be DaGId rephGId HalantGId IMatraGId MaGId र्द,मि (Using Gmail I have created RDMI. The halant is shown by the comma. How does one create Halant?)

Comparing RKMI and RDMI we can see that font-level reordering of glyph ids is needed only for RDMI.

vinod kumar -- पृथिवी सस्यशालिनी the earth be green
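The RKMI and RDMI walk-throughs above can be traced with a toy sketch. It is not a shaping engine: the glyph names are plain strings taken from the message, the reordering handles only a leading Ra+Halant and a trailing IMatra, and the two substitution rules stand in for whatever GSUB lookups a real font would carry. All function names are hypothetical.

```python
def to_visual(syllable):
    """Reorder one logical syllable the way the message describes:
    a leading Ra+Halant moves to the end, a trailing IMatra moves to the front."""
    s = list(syllable)
    reph_part = []
    if len(s) > 2 and s[:2] == ["Ra", "Halant"]:
        reph_part, s = s[:2], s[2:]
    pre = [s.pop()] if s and s[-1] == "IMatra" else []
    return pre + s + reph_part


def apply_pair_rules(glyphs, rules):
    """Single left-to-right pass replacing adjacent pairs found in rules."""
    out, i = [], 0
    while i < len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if pair in rules:
            out.append(rules[pair])
            i += 2
        else:
            out.append(glyphs[i])
            i += 1
    return out


# Toy stand-ins for the font's substitutions; note there is no half form for Da.
RULES = {("Ka", "Halant"): "HalfKa", ("Ra", "Halant"): "reph"}

rkmi = apply_pair_rules(to_visual(["Ra", "Halant", "Ka", "Halant", "Ma", "IMatra"]), RULES)
rdmi = apply_pair_rules(to_visual(["Ra", "Halant", "Da", "Halant", "Ma", "IMatra"]), RULES)
```

With these rules `rkmi` comes out as IMatra HalfKa Ma reph, already in display order, while `rdmi` keeps its explicit Halant (IMatra Da Halant Ma reph), which is the sequence the message says the font still has to reorder.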
Re: RTL PUA?
On Mon, 22 Aug 2011 20:58:23 +0200 Philippe Verdy verd...@wanadoo.fr wrote: The computing order of features should not then be: - BiDi algorithm for reordering grapheme clusters (I trust you mean the ordering of clusters relative to one another, not the ordering within clusters.) - font search and font fallback (using cmap) - GSUB (lookups of ligatures or discretionary glyph variants) - GPOS but really: - font lookup and font fallback (using cmap) - GSUB (lookups of ligatures or discretionary glyph variants) - BiDi algorithm for reordering glyphs representing the grapheme clusters or ligatured grapheme clusters - GPOS You've forgotten the conversion from encoding order to mechanical typing order. That is done before GSUB, but needs some assistance from GSUB for multipart characters (typically circumposed vowels). The BiDi algorithm absolutely does not have to be changed. But you have to remember that preposed combining marks (and fragments) must inherit the BiDi class of the base letter. I'm glad you know what a circumposed Indic vowel looks like when subject to a right-to-left override. Richard.
Re: RTL PUA?
2011/8/23 Richard Wordingham richard.wording...@ntlworld.com:

The BiDi algorithm absolutely does not have to be changed. But you have to remember that preposed combining marks (and fragments) must inherit the BiDi class of the base letter. I'm glad you know what a circumposed Indic vowel looks like when subject to a right-to-left override.

Yes, I know the case of preposed Indic vowels, e.g. vowel I in Devanagari, as well as vowels that are split in two parts (one before the consonant cluster and one after it). However, this applies here to an LTR script, for which we don't have BiDi issues. PUA Indic characters can safely be represented using existing PUA characters without needing any directionality property override from their default strong LTR value.

The case of RTL PUA will in fact be much rarer than other PUAs (except if someone creates an RTL conscript). Typically, it will be used for rare or special characters that are not encoded (or won't be encoded, such as specific glyph variants of letters in an existing RTL script, including Arabic, for which a text author wants a PUA to maintain a distinction that he cannot manage by other means just in the encoded text, or because the character has not yet demonstrated sufficient proof of usage, due to extremely rare usage, found for example in a single old book or manuscript, or characters that were invented specifically by some author, for esoteric reasons).

It could also apply to the need for encoding things that we don't consider as characters (for example if someone wants to add some custom decorating swash to Arabic text that has no logical or phonetic reading and no other semantics by itself). Note that in this case, the PUA will act as a custom variation sequence (variation sequences must be assigned if we want to use something other than a PUA) or as a custom diacritic...
Designing a format for research use of the PUA in a RTL mode (from Re: RTL PUA?)
On Monday 22 August 2011, William_J_G Overington wjgo_10...@btinternet.com wrote:

Would a third option work? In the Description section of the Macintosh Roman section of a TrueType font, include a line of text in a plain text format of which the following line of text is an example.

PUA.RTL=$E000-$E1FF,$E440-$E447,$E541,$E549,$E57C,$EA00-$EA0F,$EC07;

One could specify precisely which Private Use Area characters were to become RTL when using that particular font. One would need rendering software that looked for such a string of text in the font file, yet, as far as I am aware, no approval from any committee in order to put this solution into practical use.

Thinking further on this, I am putting forward the following suggestion for discussion, in the hope that it might be of use.

Suppose that a special researcher's edition of a wordprocessing application or a desktop publishing application at start up looks in a specified directory for a file with the following file name.

pua_major.txt

If pua_major.txt exists, then it is opened and it is searched for a PUA.RTL assignment statement. If a PUA.RTL assignment statement is not found in the file, it is taken as if the following had been included in the file.

PUA.RTL=;

If pua_major.txt is found, then that is an end of the searching process and no search for PUA.RTL would take place in a font file.

If pua_major.txt is not found, then the application looks in a specified directory for a file with the following file name.

pua_minor.txt

If pua_minor.txt exists, then it is opened and it is searched for a PUA.RTL assignment statement. If a PUA.RTL assignment statement is not found in the file, it is taken as if the following had been included in the file.

PUA.RTL=;

Also, if the file is not found, the PUA.RTL assignment statement is taken as the following.

PUA.RTL=;

However, the value of PUA.RTL thus determined would be kept in reserve and only used if there were no PUA.RTL assignment statement in the font that is being used.
This method would allow the choice of where to specify right-to-left directionality for some Private Use characters to be made either as being in a font file or in a text file, with the choice of whether the text file is an override or a backup of any such information within a font. Would such a format solve the needs of those who want to use right-to-left Private Use characters? If not, could people say what other features are needed please in the hope that a suitable system can be specified by consensus within this thread? William Overington 23 August 2011
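The lookup order described in this message can be sketched as follows. This is a minimal illustration, not an implementation anyone in the thread has proposed: the function names are hypothetical, the three inputs are simply the text of pua_major.txt, pua_minor.txt and the font's description string (or None when absent), and reading those from disk or from a font file is left out.

```python
import re

# One assignment statement, e.g. PUA.RTL=$E000-$E1FF,$E541;
_STATEMENT = re.compile(r"PUA\.RTL=([^;]*);")


def parse_pua_rtl(text):
    """Return the set of RTL code points, or None if no statement is present."""
    match = _STATEMENT.search(text)
    if match is None:
        return None
    code_points = set()
    for item in match.group(1).split(","):
        item = item.strip()
        if not item:
            continue  # "PUA.RTL=;" yields the empty set
        first, _, last = item.partition("-")
        lo = int(first.lstrip("$"), 16)
        hi = int(last.lstrip("$"), 16) if last else lo
        code_points.update(range(lo, hi + 1))
    return code_points


def resolve_pua_rtl(major_text, minor_text, font_text):
    """Apply the precedence from the message: major file overrides the font,
    minor file is a backup used only when the font has no statement."""
    if major_text is not None:
        return parse_pua_rtl(major_text) or set()
    font_value = parse_pua_rtl(font_text) if font_text is not None else None
    if font_value is not None:
        return font_value
    if minor_text is not None:
        return parse_pua_rtl(minor_text) or set()
    return set()
```

For instance, `resolve_pua_rtl(None, "PUA.RTL=$E001;", "PUA.RTL=$E000;")` returns `{0xE000}`, because the font's own statement wins over the reserve value from pua_minor.txt.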
RE: Designing a format for research use of the PUA in a RTL mode (from Re: RTL PUA?)
William_J_G Overington wjgo underscore 10009 at btinternet dot com wrote:

Suppose that a special researcher's edition of a wordprocessing application or a desktop publishing application at start up looks in a specified directory for a file with the following file name. pua_major.txt If pua_major.txt exists, then it is opened and it is searched for a PUA.RTL assignment statement. If a PUA.RTL assignment statement is not found in the file, it is taken as if the following had been included in the file. PUA.RTL=; ...

Of all applications, a word processor or DTP application would want to know more about the properties of characters than just whether they are RTL. Line breaking, word breaking, and case mapping come to mind. I would think the format used by standard UCD files, or the XML equivalent, would be preferable to making one up:

E100;ENGSVANYALI LETTER P;Lo;0;R;N;
E101;ENGSVANYALI LETTER B;Lo;0;R;N;
E102;ENGSVANYALI LETTER M;Lo;0;R;N;
...

-- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
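Reading lines in that semicolon-separated style takes only a few lines of code. The sketch below is an assumption-laden illustration: the function name is hypothetical, it reads only the first five fields (code point, name, General_Category, combining class, Bidi_Class) as they appear in the sample above, and it ignores the remaining fields because the sample abbreviates them.

```python
def parse_ucd_style(lines):
    """Parse UnicodeData-style lines into {code point: property dict}.

    Only the first five semicolon-separated fields are interpreted.
    Blank lines, '#' comments and lines with fewer fields are skipped.
    """
    props = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split(";")
        if len(fields) < 5:
            continue
        props[int(fields[0], 16)] = {
            "name": fields[1],
            "gc": fields[2],
            "ccc": int(fields[3]),
            "bc": fields[4],
        }
    return props


sample = [
    "E100;ENGSVANYALI LETTER P;Lo;0;R;N;",
    "E101;ENGSVANYALI LETTER B;Lo;0;R;N;",
    "E102;ENGSVANYALI LETTER M;Lo;0;R;N;",
]
pua_props = parse_ucd_style(sample)
```

Here `pua_props[0xE100]["bc"]` is `"R"`, which is the value an application would use in place of the default `"L"` for that private-use character.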
Re: RTL PUA?
Philippe Verdy verd...@wanadoo.fr wrote: The computing order of features should not then be: - BiDi algorithm for reordering grapheme clusters - font search and font fallback (using cmap) - GSUB (lookups of ligatures or discretionary glyph variants) - GPOS but really: - font lookup and font fallback (using cmap) - GSUB (lookups of ligatures or discretionary glyph variants) - BiDi algorithm for reordering glyphs representing the grapheme clusters or ligatured grapheme clusters - GPOS I can see the advantages of such an approach -- performing GSUB prior to BiDi would enable cross-directional contextual substitutions, which are currently impossible -- but the existing model in which BiDi is applied to characters *not glyphs* isn't likely to change. Switching from processing GSUB lookups in logical order rather than reading order would break too many things. JH -- Tiro Typeworkswww.tiro.com Gulf Islands, BC t...@tiro.com The criminologist's definition of 'public order crimes' comes perilously close to the historian's description of 'working-class leisure-time activity.' - Sidney Harring, _Policing a Class Society_
Re: Designing a format for research use of the PUA in a RTL mode (from Re: RTL PUA?)
On 8/23/2011 7:22 AM, Doug Ewell wrote: Of all applications, a word processor or DTP application would want to know more about the properties of characters than just whether they are RTL. Line breaking, word breaking, and case mapping come to mind. I would think the format used by standard UCD files, or the XML equivalent, would be preferable to making one up: The right answer would follow the XML format of the UCD. That's the only format that allows all necessary information to be contained in one file, and it would leverage any effort that users of the main UCD have made in parsing the XML format. An XML format should also be flexible in that you can add/remove not just characters, but properties as needed. The worst thing to do, other than designing something from scratch, would be to replicate the UnicodeData.txt layout with its random, but fixed collection of properties and insanely many semi-colons. None of the existing UCD txt files carries all the needed data in a single file. A./
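To make the XML suggestion concrete, here is a rough sketch of reading a PUA override file written in the style of the UCD in XML. It assumes the ucdxml conventions of a char element with cp (or first-cp/last-cp) and short property attributes such as gc and bc; the sample data and the function name load_overrides are invented for illustration, and nothing in this thread defines such a file.

```python
import xml.etree.ElementTree as ET

# Namespace used by the UCD in XML; assumed to be reused for the override file.
UCD_NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"

# Invented sample data: one single character and one range.
SAMPLE = """<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
  <repertoire>
    <char cp="E100" na="ENGSVANYALI LETTER P" gc="Lo" bc="R"/>
    <char first-cp="E200" last-cp="E2FF" gc="Lo" bc="AL"/>
  </repertoire>
</ucd>"""

def load_overrides(xml_text):
    """Return {code point: {property: value}} for every char element."""
    overrides = {}
    root = ET.fromstring(xml_text)
    for char in root.iter(UCD_NS + "char"):
        attrs = char.attrib
        props = {k: v for k, v in attrs.items()
                 if k not in ("cp", "first-cp", "last-cp")}
        if "cp" in attrs:
            first = last = int(attrs["cp"], 16)
        else:
            first = int(attrs["first-cp"], 16)
            last = int(attrs["last-cp"], 16)
        for cp in range(first, last + 1):
            overrides[cp] = props
    return overrides

overrides = load_overrides(SAMPLE)
print(overrides[0xE100]["bc"], overrides[0xE2FF]["bc"])
```

Because every property is just an attribute, adding line-breaking or case-mapping data later would not change the reader.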
RE: Designing a format for research use of the PUA in a RTL mode (from Re: RTL PUA?)
Asmus Freytag asmusf at ix dot netcom dot com wrote: The right answer would follow the XML format of the UCD. Question: Since the ucdxml formats became available, has any consensus emerged as to whether the flat or grouped formats are preferred? Obviously they both contain the same data, but one is much smaller and the other might be more convenient in some ways. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
Re: RTL PUA?
Behdad Esfahbod wrote: I can see the advantages of such an approach -- performing GSUB prior to BiDi would enable cross-directional contextual substitutions, which are currently impossible -- but the existing model in which BiDi is applied to characters *not glyphs* isn't likely to change. Switching from processing GSUB lookups in logical order rather than reading order would break too many things. You can't get cross-directional-run GSUB either way because by definition GSUB in an RTL run runs RTL, and GSUB in an LTR run runs LTR. If you do it before Bidi, you get, eg, kerning between two glyphs which end up being reordered far apart from eachother. You really want GSUB to be applied on the visual glyph string, but which direction it runs is a different issue. Kerning is GPOS, not GSUB. But generally I agree. My point was that Philippe's suggestion, although it could be the basis of an alternative form of layout that might have some benefits if fully worked out, is a radical departure from how OpenType works. J. -- Tiro Typeworkswww.tiro.com Gulf Islands, BC t...@tiro.com The criminologist's definition of 'public order crimes' comes perilously close to the historian's description of 'working-class leisure-time activity.' - Sidney Harring, _Policing a Class Society_
Re: RTL PUA?
John Hudson 於 2011年8月23日 下午2:33 寫道: Behdad Esfahbod wrote: I can see the advantages of such an approach -- performing GSUB prior to BiDi would enable cross-directional contextual substitutions, which are currently impossible -- but the existing model in which BiDi is applied to characters *not glyphs* isn't likely to change. Switching from processing GSUB lookups in logical order rather than reading order would break too many things. You can't get cross-directional-run GSUB either way because by definition GSUB in an RTL run runs RTL, and GSUB in an LTR run runs LTR. If you do it before Bidi, you get, eg, kerning between two glyphs which end up being reordered far apart from eachother. You really want GSUB to be applied on the visual glyph string, but which direction it runs is a different issue. Kerning is GPOS, not GSUB. But generally I agree. My point was that Philippe's suggestion, although it could be the basis of an alternative form of layout that might have some benefits if fully worked out, is a radical departure from how OpenType works. I'll toss in my obligatory, That's how AAT does it reference. It has advantages and disadvantages—but, as you say, OT would have to be heavily redesigned to do it. = John H. Jenkins 井作恆 Жбь А. ЖЩэпЮьц jenk...@apple.com
Re: RTL PUA?
2011/8/23 John Hudson j...@tiro.ca: Behdad Esfahbod wrote: I can see the advantages of such an approach -- performing GSUB prior to BiDi would enable cross-directional contextual substitutions, which are currently impossible -- but the existing model in which BiDi is applied to characters *not glyphs* isn't likely to change. Switching from processing GSUB lookups in logical order rather than reading order would break too many things. You can't get cross-directional-run GSUB either way because by definition GSUB in an RTL run runs RTL, and GSUB in an LTR run runs LTR. If you do it before Bidi, you get, eg, kerning between two glyphs which end up being reordered far apart from each other. You really want GSUB to be applied on the visual glyph string, but which direction it runs is a different issue. Kerning is GPOS, not GSUB. But generally I agree. My point was that Philippe's suggestion, although it could be the basis of an alternative form of layout that might have some benefits if fully worked out, is a radical departure from how OpenType works. Rereading closely the OpenType spec, in fact I don't see any major problem even if the Bidi algorithm is applied last, after applying not only the GSUB lookups (ligaturing, custom Indic reordering of multipart vowels or ra forms), but also the GPOS lookups (yes, these are for kerning, i.e. base-to-base, but also for mark-to-base and mark-to-mark positioning). I admit that this would violate some existing rules implied in some implementations, but at least it would offer some additional benefits. However, if one really wants to implement kerning between LTR runs and RTL runs (e.g. between an Arabic letter and a Latin letter), one would need to make sure that Bidi reordering has been performed before GPOS (and this is really the case...). Processing such kerning pairs would require another convention than the resolved direction. 
It would require that such kerning pairs are scanned only so that the first item of the pair will always be the left-most. GPOS is in fact more powerful than that because it can also involve more than simple pairs, using longer contexts on both the right and the left of the tested glyphs. But the existence of such complex positioning rules would create difficulties for the actual readers of the rendered text, because they will not know from which side they must start to read a word that displays, for example, a run of Latin letters on one side and a run of Arabic letters on the other side. Let's say that the reader starts with the Arabic part, in normal order: how should the Latin part of this strange «word» then be read? It is still not a stupid case: such positioning problems occur at the boundaries of words, where there are whitespaces. Once you have resolved the direction of those whitespaces, there's then a boundary with the next word which may use another direction. What happens on those whitespaces is that you may find typographic elements (such as swashes) which should not overflow on the next part. Currently it is assumed that writers will use a larger whitespace character if needed, to avoid collisions. But if the whitespace is very narrow, or is zero-width, the problem of kerning immediately resurfaces, in its traditional typographic definition, which is to improve the legibility of the rendered text, to exhibit a visually constant spacing between words and between letters, so that inter-letter separation will not be confused with interword separation. 
I admit that this (extremely rare) problem is much less critical with the Arabic script (because it is always cursive and most letters in the same word are joined), but this means that the problem may be more significant between Latin and Hebrew, or more probably between Greek and Hebrew (in very old historic texts, where even the Greek script did not have a strong LTR directionality, and where whitespace was not always used between words).
Re: RTL PUA?
Philippe Verdy wrote: Rereading closely the OpenType spec... I suggest you read also the script-specific OT layout specifications. http://www.microsoft.com/typography/SpecificationsOverview.mspx You'll note, for example, that the Arabic font spec doesn't even mention BiDi, because it is assumed that this has been resolved before glyph runs for OTL processing are even identified. This makes sense to me because BiDi is a character-centric operation. The Microsoft font specs describe what Uniscribe (and DWrite) do with text and fonts for particular scripts, and there may be some differences in other implementations. For example, Uniscribe performs invalid mark sequence checks that others, preferring to see this as a task for spellcheckers, do not. But the glyph selection and positioning results should be the same across implementations. Font makers need to know how text is processed and OTL features applied in order to make fonts that work with resulting glyph runs and input strings. Changing the point in the glyph string resolution when BiDi is applied breaks everything. It's a complete non-starter. JH -- Tiro Typeworks www.tiro.com Gulf Islands, BC t...@tiro.com The criminologist's definition of 'public order crimes' comes perilously close to the historian's description of 'working-class leisure-time activity.' - Sidney Harring, _Policing a Class Society_
Re: RTL PUA?
2011/8/24 John Hudson j...@tiro.ca: Philippe Verdy wrote: Rereading closely the OpenType spec... I suggest you read also the script-specific OT layout specifications. http://www.microsoft.com/typography/SpecificationsOverview.mspx You'll note, for example, that the Arabic font spec doesn't even mention BiDi, because it is assumed that this has been resolved before glyph runs for OTL processing are even identified. This makes sense to me because BiDi is a character-centric operation. The Microsoft font specs describe what Uniscribe (and DWrite) do with text and fonts for particular scripts, and there may be some differences in other implementations. For example, Uniscribe performs s invalid mark sequence checks that others, preferring to see this as a task for spellcheckers, do not. But the glyph selection and positioning results should be the same across implementations. Font makers need to know how text is processed and OTL features applied in order to make fonts that work with resulting glyph runs and input strings. Changing the point in the glyph string resolution when BiDi is applied breaks everything. It's a complete non-starter. I had already read this subspecs. And I think you're wrong, because the list of glyphs is in resolved order, even after all ligature substitution, glyph breaking (for Indic scripts) has a completely independant order from the logical reading of characters. You can perfectly run the BiDi algorithm after the glyph substitutions. All what the Bidi algorithm is to delimit runs of characters that are to be rendered in one direction or the other. The same limits will also be boundaries across the associated runs of glyph ids. There's in fact absolutely no need of the Bidi algorithm to process all glyph substitutions, because they will be performed exactly the same way. The two algorithms are in fact completely independant of each other, at least if you don't need to apply substitutions that span distinct runs. 
However there's a dependency between the BiDi algorithm and the glyph positioning, because each RTL or LTR run needs to have its own left-side bearing, and its own right-side bearing, in order to mutually space these runs correctly. It also influences the direction by which you'll advance the coordinates along the baseline for positioning the fully resolved glyph ids. This then requires knowing the principal direction of each run of glyph ids. In fact you have not demonstrated at all that this concept would break anything, except ligatures between RTL and LTR characters, i.e. between resolved RTL and LTR glyphs, something that can only occur over a boundary between a resolved RTL run of glyph ids and a resolved LTR run of glyph ids. But I was told that OpenType layout does not support such a thing, or that this possible behavior is for now undocumented in OpenType specs, but this is not the case of AAT layout and Graphite layout (though I admit that this would cause problems on how to position such ligature glyphs that would have an ambiguous direction, because they would then belong to two successive directional runs at the character level). As the above paragraph may not be very clear, let's suppose that you wanted to create a GSUB ligature between ARABIC LAM (resolved to RTL at the character level) and LATIN CAPITAL LETTER A (resolved to LTR at the character level, in the Bidi algorithm). You would map this ligature to a LAM_A glyph id. Technically, nothing in OpenType GSUB forbids you to do that in your font. 
But the OpenType engine that needs to maintain an equivalence of boundaries between runs of characters (from Bidi) and runs of glyph ids (from the cmap, then after GSUB substitutions) will not know if the LAM_A glyph belongs to the first run (terminated by the RTL character LAM) or the second run (starting by the LTR character A) without providing *with each* GSUB rule an indication of where to place the new direction boundary if there was a direction boundary in the middle of the source list of glyphs, before its substitution. Yes this is a very borderline case, because I have never seen it or needed it in practice. Unicode prefers reencoding a new similar character with the opposite strong direction (for example the HEBREW ALEF SYMBOL for maths, which is very similar to the Hebrew letter but has a opposite direction ; but here I wonder how it would create a ligature with another strong LTR character that is also not a diacritic, even if there's an evidence that such pair can be GPOS'itionned, i.e. kerned). What is only assumed is that GSUB will preserve the boundaries between runs of characters that are in the same direction; but of course it does not always preserve the boundaries between the logical character clusters. This may explain your concern that this could potentially break something, but only if you don't care about preserving unambiguously the boundaries between directional runs, and
Re: RTL PUA?
Philippe, I'll need to think about this some more and try to get a better grasp of what you're suggesting. But some immediate thoughts come to mind: If BiDi is to be applied to shaped glyph strings, surely that means needing to step backwards through the processing that arrived at those shaped glyph strings in order to correctly identify their relationship to underlying character codes, since it is the characters, not the glyphs, that have directional properties. There's nothing in an OT font that says e.g. GID 456 /lam_alif.fina/ is an RTL glyph, so the directionality has to be processed at the character level and mapped up through the GSUB features to the glyphs. I think you may be right that quite a lot of existing OTL functionality wouldn't be affected by applying BiDi after glyph shaping: logical order and resolved order are often identical in terms of GSUB input. But it is in the cases where they are not identical that there needs to be a clearly defined and standard way to do things on which font developers can rely. [A parallel is canonical combining class ordering and GPOS mark positioning: there are huge numbers of instances, even for quite complicated combinations of base plus multiple marks, in which it really doesn't matter what order the marks are in for the typeform to display correctly; but there are some instances in which you absolutely need to have a particular mark sequence.] I've lost track of what the putative benefit of processing BiDi post glyph shaping is. I think I missed part of your earlier exchange with Behdad. JH -- Tiro Typeworkswww.tiro.com Gulf Islands, BC t...@tiro.com The criminologist's definition of 'public order crimes' comes perilously close to the historian's description of 'working-class leisure-time activity.' - Sidney Harring, _Policing a Class Society_
Re: RTL PUA?
On 8/21/2011 7:34 PM, Doug Ewell wrote: So what you are asking about is a directional control character that would assign subsequent characters a BC of 'AL', right? You don't want to call this a LANGUAGE MARK or anything else that implies language identification, because of the existence of real language identification mechanisms and the history of Unicode and language tagging. An ARM (Arabic RTL Mark) would be a sensible addition to the standard. It would close a small gap in design that currently prevents a fully faithful plain text export of bidi text from rich text (higher level protocol) formats. In a HLP you can assign any run to behave as if it was following a character with bidi property AL. When you export this text as plain text, unless there is an actual AL character, you cannot get the same behavior (other than by the heavy-handed method of completely overriding the directionality, making your plain text less editable). So, yes, there's a bit of a use case for such a mark. (Its effect is limited to treatment of numeric expressions, so it's not an Arabic language mark, but one that triggers the same bidi context as the presence of an Arabic Script (AL) character.) A./ -- Doug Ewell • d...@ewellic.org Sent via BlackBerry by ATT -Original Message- From: Richard Wordingham richard.wording...@ntlworld.com Sender: unicode-bou...@unicode.org Date: Mon, 22 Aug 2011 03:19:39 To: Unicode Mailing List unicode@unicode.org Subject: Re: RTL PUA? On Sun, 21 Aug 2011 23:55:46 + Doug Ewell d...@ewellic.org wrote: What's a LANGUAGE MARK? There are *three* strong directionalities - 'L' left-to-right, 'AL' right-to-left as in Arabic, 'R' right-to-left (as in Hebrew, I suspect). 'AL' and 'R' have different effects on certain characters next to digits - it's the mind-numbing part of the BiDi algorithm. With one a $ sign after a string of European (or is it Arabic?) digits appears on the left and in the other it appears on the right. 
I can't remember whether 'higher-level protocols' have an effect on this logic. LRM has a BC of L, RLM has a BC of R, but no invisible character has a BC of AL. That's why I tentatively raised the notion of ARABIC LANGUAGE MARK. Incidentally, an RLO gives characters with a temporary BC of R, not AL. Richard.
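The bidi classes mentioned here can be checked with Python's standard unicodedata module. This is only a quick lookup of default property values, not an implementation of the BiDi algorithm, and the helper name bidi_class_table is made up for this sketch. The values depend on the Unicode version bundled with the interpreter, although these particular ones are long-standing.

```python
import unicodedata

def bidi_class_table():
    """Return the default Bidi_Class of a few characters discussed in this thread."""
    samples = {
        "LRM U+200E": "\u200e",          # LEFT-TO-RIGHT MARK
        "RLM U+200F": "\u200f",          # RIGHT-TO-LEFT MARK
        "HEBREW ALEF U+05D0": "\u05d0",
        "ARABIC ALEF U+0627": "\u0627",
        "PUA U+E000": "\ue000",          # first BMP Private Use character
    }
    return {label: unicodedata.bidirectional(ch) for label, ch in samples.items()}

for label, bidi_class in bidi_class_table().items():
    print(f"{label}: {bidi_class}")
```

The output shows the three strong classes side by side: L for LRM and for the Private Use character, R for RLM and Hebrew ALEF, and AL for Arabic ALEF.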
RE: RTL PUA?
I don't buy the assumption that all the world is either AAT, Graphite or Uniscribe. Anyhow, this discussion is going off topic, the issue is should Unicode specify an RTL PUA area, not whether some products, however respectable, provide a bypass. Jony -Original Message- From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Shriramana Sharma Sent: Monday, August 22, 2011 8:12 AM To: unicode@unicode.org Subject: Re: RTL PUA? On 08/22/2011 08:24 AM, Peter Constable wrote: I'm not saying that there shouldn't be _some_ software that can do what you expect. But there will likely be some different views on what ought to be included within that some. Peter, given that both AAT and Graphite have provisions for assigning custom properties including BC to PUA characters, it seems Uniscribe is the only one missing out. Those advocating RTL PUA areas seem to reject AAT and Graphite as hacks or wow *one* application [*]. [* = LibreOffice is the *only* multipurpose application running on /Windows/ to support Graphite and I'm not counting SIL WorldPad. On *nix platforms, *any* number of applications that use HB-NG for rendering will be able to handle Graphite in the near future because HB-Graphite integration is already done. That is to say, once GTK and Qt fully switch to HB-NG.] Anyhow, if you Microsoft guys added support in Uniscribe for ascribing custom properties including BC to PUA characters (or have you already done it?) it would be what would satisfy these PUA RTL users and convince them that no RTL PUA zones are needed, it seems. The suggestion has been made that fonts should be able to carry some additional custom tables specifying custom properties for PUA characters, which seems reasonable. I'm not sure if the OT GDEF table or the AAT PROP table completely satisfies this requirement. People interested in using custom properties for the PUA (which includes me for Indic script) should then sit up and formulate the syntax for such tables. 
If Uniscribe, AAT, and Harfbuzz then provided generic support for parsing such tables and rendering PUA characters accordingly, it would be an all-around solution both for RTL PUA as well as Indic PUA, I suppose. (But I'm not sure how such a custom table would interact with the innate ability of Graphite to handle custom properties. It should probably be either the new proposed custom table or Graphite.) [sigh] -- Shriramana Sharma
Re: RTL PUA?
On 22 Aug 2011, at 03:57, Peter Constable wrote: From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Asmus Freytag Treating PUA characters as ON is very problematic As would be changing the default property of PUA characters from L to ON. Which is why that will not be proposed. Michael Everson * http://www.evertype.com/
Re: RTL PUA?
On 22 Aug 2011, at 05:53, Shriramana Sharma wrote: While I don't know much about RTL scripts, if the logic order is ALEF + LAMED, but the presentation order is LAMED + ALEF *because of the RTL nature* do you write the rule as ALEF + LAMED = ALEF_LAMED_LIGATURE or LAMED + ALEF = ALEF_LAMED_LIGATURE ? The specific shape of that ligature is not a result of the directionality property. Michael Everson * http://www.evertype.com/
Re: RTL PUA?
On Mon, Aug 22, 2011 at 10:42:05AM +0530, Shriramana Sharma wrote: On 08/22/2011 08:24 AM, Peter Constable wrote: I'm not saying that there shouldn't be _some_ software that can do what you expect. But there will likely be some different views on what ought to be included within that some. Peter, given that both AAT and Graphite have provisions for assigning custom properties including BC to PUA characters, it seems Uniscribe is the only one missing out. Those advocating RTL PUA areas seem to reject AAT and Graphite as hacks or wow *one* application [*]. I personally would say to make some blocks in Plane 16 default to R, some AL and some ON. For fonts based on rendering engines that don't allow fonts to change character properties this would be crucial; for those engines that are capable of changing the properties it would present no problem (the font can change these properties arbitrarily even if the default is RTL...). [* = LibreOffice is the *only* multipurpose application running on /Windows/ to support Graphite and I'm not counting SIL WorldPad. On *nix platforms, *any* number of applications that use HB-NG for rendering will be able to handle Graphite in the near future because HB-Graphite integration is already done. That is to say, once GTK and Qt fully switch to HB-NG.] That said, HarfBuzz-ng itself (i.e. its own engine) tries to imitate Uniscribe. Most probably, Graphite fonts will still be an exception on these systems... [sigh] -- Shriramana Sharma -- Petr Tomasek http://www.etf.cuni.cz/~tomasek Jabber: but...@jabbim.cz EA 355:001 DU DU DU DU EA 355:002 TU TU TU TU EA 355:003 NU NU NU NU NU NU NU EA 355:004 NA NA NA NA NA
Re: RTL PUA?
On 08/22/2011 04:34 PM, Behdad Esfahbod wrote: On 08/22/11 06:53, Shriramana Sharma wrote: While I don't know much about RTL scripts, if the logic order is ALEF + LAMED, but the presentation order is LAMED + ALEF*because of the RTL nature* do you write the rule as ALEF + LAMED = ALEF_LAMED_LIGATURE or LAMED + ALEF = ALEF_LAMED_LIGATURE ? Depends on your specific shaping engine logic. OpenType assumes native direction per script. So if you have Arabic text between LRO/PDF, you have to reverse the order then apply OpenType shaping. Other engines may decide to handle these differently. But the general statement is true: ligatures are visual artifacts and hence only form in one direction, not the other (except if it's, say, the ff ligature). Hi Behdad. I only asked whether the OT *tables* would contain the entries in the logical order or the visual order. Clearly it would still be the visual order (but Philippe Verdy seemed to imagine/suggest otherwise). It is clear that in the *script itself* the ligature would form in the direction of writing. -- Shriramana Sharma
Re: RTL PUA?
On 08/22/2011 05:26 PM, Behdad Esfahbod wrote: OpenType tables contain entries in the logical order of the script in question. Ie. Arabic tables are always RTL. Yes I understand, but still, to clarify: The font tables themselves contain only ASCII characters I presume. In it do you write: ALEF + LAMED = ALEF_LAMED_LIGATURE or LAMED + ALEF = ALEF_LAMED_LIGATURE ? IIUC, in logical order ALEF precedes LAMED, and in visual order, ALEF stands to the right of LAMED. -- Shriramana Sharma
Re: RTL PUA?
On 08/22/2011 12:21 PM, Jonathan Rosenne wrote: I don't buy the assumption that all the world is either AAT, Graphite or Uniscribe. Nobody asserted that either. It is only pointed out that major implementations are able to provide what you seek. Anyhow, this discussion is going off topic, the issue is should Unicode specify an RTL PUA area, not whether some products, however respectable, provide a bypass. I don't see why you call it a *bypass*. Only if the road in front of you presents obstacles and does not allow you to proceed further, you need to take a bypass. If we are considering the Standard as the road which we need to take, the road doesn't present any obstacle to using PUA characters as RTL, so Graphite etc are not providing a *bypass* but in fact just being good generous implementations that allow custom properties for the PUA as the Standard allows. The request being made to allocate BC=R areas in the PUA is sure to generate an impression that conformant implementations should consider such a property normative, which then would violate the definition of the PUA that conformant implementations need not treat any property of the PUA as normative. Returning to your concerns, it is being asserted that since implementations are *already* able to provide for custom properties for the PUA, there is *no* need for Unicode to specify an RTL PUA area and furthermore as such a specification would violate the definition of the PUA, it should also *not* be done. One both *need* not do it and *should* not do it. -- Shriramana Sharma
Re: RTL PUA?
Um... Computers are hardware, and don't understand a thing. What I think you mean is computer _software_. (I know, I'm being pedantic, but with good reason.) Sorry, I just can’t resist pointing out that the difference between hardware and software is only the fact that the former is material, with all the consequences that follow. In any other way they are completely interchangeable. As for the other part of your mail, Peter, sorry, but it really doesn’t make any sense to me. As John has pointed out, you can adjust the properties of private use characters on Apple computers. Perhaps there is a way to do so on Windows, Unix and other systems as well. What Philippe and Doug are proposing, and I also strongly agree with, is to have a standard way to interchange these properties. I don’t think it is necessary to go into the advantages of standards. Speaking of actual implementation, I’m convinced that this format should be the same as it is for encoded characters (whether it is the plain text format of the Unicode Character Database, XML or anything else). Rendering engines should – maybe they already do so – accept multiple files containing character properties, which could make upgrades to the newer versions of the standard a matter of downloading the new standard set, and provide a way of overriding private use (or even standard if one is so inclined) characters’ properties. Introduction of unencoded scripts would therefore become a matter of distributing a small properties file and the corresponding fonts. Á
Re: RTL PUA?
On 08/22/2011 08:26 AM, Shriramana Sharma wrote: On 08/22/2011 05:26 PM, Behdad Esfahbod wrote: OpenType tables contain entries in the logical order of the script in question. Ie. Arabic tables are always RTL. Yes I understand, but still, to clarify: The font tables themselves contain only ASCII characters I presume. In it do you write: ALEF + LAMED = ALEF_LAMED_LIGATURE or LAMED + ALEF = ALEF_LAMED_LIGATURE ? IIUC, in logical order ALEF precedes LAMED, and in visual order, ALEF stands to the right of LAMED. In the ligature tables, it's recorded as ALEF + LAMED = ALEF_LAMED_LIGATURE. The font tables are concerned with what happens when this character follows that one, not what happens when this character stands on the right of that one. So it's stored in logical order. ~mark
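As a toy illustration of the point that the rule is keyed on logical order, the sketch below applies a ligature table to a glyph sequence given in the order the characters are stored. The glyph names come from the question above; the table layout and the function apply_ligatures are invented for this example and do not reflect the binary structure of a real GSUB table.

```python
# Hypothetical ligature table keyed on logical (storage) order:
# "this glyph followed by that glyph", not "this glyph to the right of that one".
LIGATURES = {("ALEF", "LAMED"): "ALEF_LAMED_LIGATURE"}

def apply_ligatures(glyphs, ligatures=LIGATURES):
    """Replace each matching adjacent pair, scanning in logical order."""
    out = []
    i = 0
    while i < len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if pair in ligatures:
            out.append(ligatures[pair])
            i += 2
        else:
            out.append(glyphs[i])
            i += 1
    return out

print(apply_ligatures(["ALEF", "LAMED", "BET"]))  # pair matches in logical order
print(apply_ligatures(["LAMED", "ALEF", "BET"]))  # reversed pair does not match
```

Only the first call forms the ligature, because the rule says nothing about where the glyphs end up visually.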
Re: RTL PUA?
2011/8/22 Peter Constable peter...@microsoft.com: From: ver...@gmail.com [mailto:ver...@gmail.com] On Behalf Of Philippe Verdy As I explained in an earlier message, the layout engine doesn't use the default property value but the resolved bidi level. Once again, you refuse to understand my arguments. I don't think I'm refusing to understand anything. I'm merely taking your assertions _as stated_ and evaluating whether I think they are accurate or not. Perhaps what you intend to convey assumes things not clear in what you've stated, since you think I'm not understanding you. What I'm saying is that OpenType CANNOT resolve the bidi level of PUAs (with the exception where we use additional BiDi controls, Of course _OpenType_ cannot, but any rendering engine that uses OpenType _must_ resolve the bidi level of _all_ characters in a sequence that it is given to render. Given our current situation, a default rendering implementation would resolve PUA characters to an even (LTR) level unless, of course, bidi control characters -- particularly RLO -- are used to override the directionality of the character, as you mention. which remains a hack, because it adds unnecessary unvisible markup around the encoded texts, and complexifies the use of strings and substrings). We'll, depending on how you define hack, some might reasonably suggest that any usage of PUA is a hack. (Of course, some who may not use the term in the same way might argue that it is certainly not a hack.) You can turn the problem as you want, but PUAs (as well as unknown characters) still have default properties that, in fine, will get used in absence of a more precise definition (i.e. an explicit override) of the actual BiDi property needed for the character. So now I perceive your opinion : - you don't want the solution proposed by Michael Everson (simply adding a range of RTL PUA), that I also think is not necessary, but is clearly a possible solution. - you propose to use BiDi overrrides. 
I also think (like Michael Everson) that this is an impractical hack. Michael Everson, who has to work with old scripts, or with many new unencoded characters to add to existing scripts (notably Arabic), trying to encode them, finding various ways to represent them, and *testing* his solutions, will certainly think that embedding each occurrence of a PUA substring in BiDi controls, including in the middle of Arabic words, is a very bad hack. - He must certainly think (and I think so too) that PUA characters are NOT hacks. They are architectural to the well-being of the UCS, essential in various situations to preserve the software conformance with the standard. In fact, for old and rare scripts, using PUAs will remain essential for a long time, because those scripts will need more and more time now to get encoded, requiring more extensive research, more collaboration with less technically aware people who cannot understand why they'll have to test the proposed solutions using test fonts and test input methods that require them to enter BiDi controls around all those PUA characters. The only problem here is the strong LTR property of all existing PUAs, as if they were only needed for rare Han sinograms, or for symbols. Note that, for using a PUA for rare letters found in Arabic, it is impossible to embed the whole Arabic text in Bidi overrides: this would completely break the normal behavior of the non-PUA characters found in the text, notably sequences of Arabic digits, because the BiDi controls are effectively disabling the BiDi algorithm so that it will return a single RTL run for all the text in these controls. IF BiDi controls are used, they have to be inserted ONLY around the subranges containing the PUAs, and only those. The solution proposed by Michael (a new block of RTL PUAs, probably in plane 14) still has an advantage: no BiDi controls are needed at all. The BiDi algorithm does not have to be disabled. 
All other aspects of RTL scripts (or mixed RTL/LTR scripts) are preserved, including mirroring behaviors for auto-LTR characters (at the beginning of paragraphs) and characters whose directionality depends on the resolved direction of the preceding text. I don't think this is necessary though: I see no reason why implementations *have to* keep the strong LTR property of existing PUAs. This strong LTR property is only the consequence of the fact that this is only the *default* value of those PUAs, and applications should not be restricted from changing this property as they want, especially for PUAs. But to change this property value, we need an explicit PUA agreement about their usage, in such a way that it can be understood by a computer. This means an external source of character properties. My opinion is that this need is most often sufficient if it solves just the problem of correct display order. Given that the encoded texts (using those existing strong LTR PUAs that we want to adopt a RTL
Re: RTL PUA?
2011/8/22 Peter Constable peter...@microsoft.com: From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Asmus Freytag Treating PUA characters as ON is very problematic As would be changing the default property of PUA characters from L to ON. I also agree with that. This is a bad option that would break compatibility (the solution advocated by Michael Everson seems better, in that perspective, because it does not change any existing property given to existing assigned PUAs). Anyway, when I spoke about a computer, note that I did not use the definite article. It is evidently implied that there's also a need for software changes as well (so this does not mean *all* computers, but this could reach someday *most* computers with their installed or upgraded software). Your last remark in another message of this thread was really pedantic.
Re: RTL PUA?
2011/8/22 Shriramana Sharma samj...@gmail.com: On 08/22/2011 12:01 AM, Peter Constable wrote: If you mean a rule to substitute [g1 g2] with [g3] won't apply if the sequence processed by the OpenType Layout lookup processor is [g2 g1], Peter, actually I suspect Philippe is thinking that in the case of RTL, the *glyphs* are placed in reverse order and then he is asking how can the ligation take place. No, I've not said anything about ligation. But yes the problem is related to the expected reverse order of glyphs, for some PUAs, but not necessarily all of them (not the LTR runs of PUAs, after Bidi resolution). Ligation is a completely orthogonal problem (not really a problem because it is already solved).
RE: RTL PUA?
It's actually quite easy to convince Uniscribe to treat specific characters as RTL, others as LTR, and, in general, with whatever classifications you desire. Pass a preprocessed string to Uniscribe's ScriptItemize(). RichEdit has used that approach to some degree starting with RichEdit 3.0 (Windows/Office 2000). It's also a handy way to force all operators to be treated as LTR in an LTR math zone and as RTL in an RTL math zone (aside from numeric contexts for '.' and ','). And you can force IRIs to display LTR or RTL that way by classifying the delimiters such as the dots in the domain name accordingly. Some of my blog posts on http://blogs.msdn.com/b/murrays/ discuss this in greater detail. So there's no need to change the properties of the PUA to establish PUA RTL conventions. They won't be generally interchangeable, but that's the nature of the PUA. You also have to implement such choices using rich/structured text. Plain text doesn't have a place to store the necessary properties. Most text is rich text anyway grin. Murray
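The idea of handing the itemizer a preprocessed classification can be illustrated without any Windows API at all. The following is only a sketch in Python, not Uniscribe code: the PUA range, the function names and the run format are all assumptions made up for the illustration. It assigns each character a bidi class, lets a private override take precedence over the standard property, and groups the text into runs of equal class.

```python
import unicodedata
from itertools import groupby

# Hypothetical private agreement: treat this PUA range as strong RTL.
RTL_PUA_RANGES = [(0xE000, 0xE1FF)]

def bidi_class(ch, rtl_ranges=RTL_PUA_RANGES):
    """Return the bidi class to use for ch, applying the private override first."""
    cp = ord(ch)
    if any(lo <= cp <= hi for lo, hi in rtl_ranges):
        return "R"
    # Fall back to the standard property; treat a missing value as the default L.
    return unicodedata.bidirectional(ch) or "L"

def classify(text, rtl_ranges=RTL_PUA_RANGES):
    """Group text into (bidi_class, substring) runs, standing in for the
    preprocessing that would precede itemization."""
    keyed = ((bidi_class(ch, rtl_ranges), ch) for ch in text)
    return [(cls, "".join(c for _, c in grp))
            for cls, grp in groupby(keyed, key=lambda pair: pair[0])]
```

As noted above, the override lives outside the plain text: the caller has to carry the range list alongside the string, which is the rich-text requirement in miniature.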
Re: RTL PUA?
2011/8/22 Mark E. Shoulson m...@kli.org: I'm not certain I understand the question, but if I have it right... The logical order is ALEF + LAMED, and the presentation... places those in a right-to-left sequence, shall we say (since talking about the presentation *order* is confusing here). The font table contains the lookup that ALEF + LAMED = ALEF_LAMED_LIGATURE. It all goes according to the logical order, since the presentation order isn't really an order, it's just a direction. (this is different from things like devanagari short-i vowel, which moves with respect to the other letters in the script.) Lookup tables in fonts (at least OpenType) do not work at the character level, but at the glyph level: they substitute glyph ids by other glyph ids. Sequences of glyph ids are already reordered in visual order by the layout engine when they are searched in OpenType lookups, should they be RTL glyphs, or Indic glyphs with special reordering requirements (independent of the logical ordering of characters/code points). In addition, the same sequence of characters may be sometimes searched in several distinct sequences of glyph ids (this depends on the kind of OpenType table being consulted, as well as on character properties which also determine which lookup table will be searched and the relative order of successive lookups). The only lookup table in fonts that works at the character/code point level is the cmap (which maps a default glyph id from each encoded character, independently of their logical or visual ordering, as well as independently of the script/language in which those characters or glyphs are used, but possibly depending on the encoding used and the software platform supporting that encoding). 
Not all fonts need a cmap; for some of them, a default cmap may be implied or automatically constructed -- for example Symbol fonts in Windows, that are implicitly mapped in a PUA range; another example is Type1 or CFF fonts that have a default standardEncoding inherited from PostScript, based on glyph names (rather than glyph ids or code points) that may have themselves an implicit mapping to UCS codepoints (if these names are those defined in the AGL). Not all these mappings are 1-to-1, which means that they are not reversible, in the general case.
Re: RTL PUA?
2011/8/22 Shriramana Sharma samj...@gmail.com: Hi Behdad. I only asked whether the OT *tables* would contain the entries in the logical order or the visual order. Clearly it would still be the visual order (but Philippe Verdy seemed to imagine/suggest otherwise). No ! I've not imagined that. You incorrectly reinterpret imaginatively another incorrect imaginative reinterpretation, made by someone else, of what I wrote, which did not even suggest that.
Re: RTL PUA?
2011/8/22 Shriramana Sharma samj...@gmail.com: On 08/22/2011 05:26 PM, Behdad Esfahbod wrote: OpenType tables contain entries in the logical order of the script in question. Ie. Arabic tables are always RTL. Yes I understand, but still, to clarify: The font tables themselves contain only ASCII characters I presume. No. The lookup tables contain sequences of numeric glyph ids (16 bit integers in TrueType and OpenType). Which are also not the code point values, and not the character names or glyph names. you write: ALEF + LAMED = ALEF_LAMED_LIGATURE or LAMED + ALEF = ALEF_LAMED_LIGATURE ? Let's say that: - the LAMED character is cmap'ped (by its code point value in a cmap for Unicode, or by its code position in a cmap for another legacy 8-bit encoding) to the glyph id 1012, - and the ALEF character is cmapped to the glyph id 1001 (the values of glyph ids are not important, not even their relative order or differences, they don't need to obey any standard), - and the ALEF-LAMED ligature is in glyph id 1540 (the ALEF-LAMED character of the UCS may also be cmapped separately, but this is not a requirement) Then the lookup to perform the ligature will contain: (1012, 1001) -> (1540). Glyph id's are presented and scanned in the lookup table, in sequences preordered in visual order by the text layout/shaping engine. However, given that the ALEF-LAMED is also a character of the UCS, the text layout/shaping engine that knows the Arabic script can also perform a character-based substitution itself, even in absence of the lookup of glyph ids in fonts; then it can render the ligature character according to the glyph id to which it is cmapped in that font.
Re: RTL PUA?
2011/8/22 Joó Ádám a...@jooadam.hu: Um... Computers are hardware, and don't understand a thing. What I think you mean is computer _software_. (I know, I'm being pedantic, but with good reason.) Sorry, I just can’t resist pointing out that difference between hardware and software is only the fact that the former is material, with all the consequences that follow. In any other way they are completely interchangeable. Same opinion for me. As for the other part of your mail, Peter, sorry, but it really doesn’t make any sense to me. As John has pointed out, you can adjust the properties of private use characters on Apple computers. Perhaps there is a way to do so on Windows, Unix and other systems as well. What Philippe and Doug are proposing, and I also strongly agree with, is to have a standard way of interchange of these properties. I don’t think it is necessary to go into the advantages of standards. Speaking of actual implementation, I’m convinced that this format should be the same as it is for encoded characters (whether it is the plain text format of the Unicode Character Database, XML or anything else). Rendering engines should – maybe they already do so – accept multiple files containing character properties, which could make upgrades to the newer versions of the standard a matter of downloading the new standard set, and provide a way of overriding private use (or even standard if one is so inclined) characters’ properties. Introduction of unencoded scripts would therefore become a matter of distributing a small properties file and the corresponding fonts. As well, the small properties files can be embedded, in a very compact form, in the PUA font. This small table can be limited to just listing the ranges of PUA code points that are strong RTL instead of LTR. 
Most often, there will be only one range, and this just requires a couple of integers in that embedded table (possibly more, only if you want to represent more properties), without requiring a complex XML parser or a complex parser for the tabulated ASCII format used in the UCD, which is overkill for just the few properties that are needed for correct display. So the duplication in each font is not a real problem (note that there won't be a lot of fonts, most often there will be only one that matches the PUA agreement and that is suitable to render the UCS-encoded PUA text).
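To give a feel for how small such an embedded table could be, here is a sketch in Python. The byte layout is invented purely for illustration (a 16-bit count followed by pairs of 32-bit code points, big-endian like the rest of a TrueType file); no such table is defined in any font specification, and the function names are hypothetical.

```python
import struct

def pack_rtl_ranges(ranges):
    """Serialize (first, last) code point pairs as a uint16 count followed by
    big-endian uint32 pairs. The layout is invented for illustration only."""
    data = struct.pack(">H", len(ranges))
    for first, last in ranges:
        data += struct.pack(">II", first, last)
    return data

def unpack_rtl_ranges(data):
    """Read back the list of (first, last) pairs written by pack_rtl_ranges."""
    (count,) = struct.unpack_from(">H", data, 0)
    return [struct.unpack_from(">II", data, 2 + 8 * i) for i in range(count)]
```

With a single range the whole payload is ten bytes, which matches the "couple of integers" estimate; reading it needs nothing beyond fixed-width integer decoding.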
RE: RTL PUA?
Philippe Verdy verdy underscore p at wanadoo dot fr wrote: As well, the small properties files can be embedded, in a very compact form, in the PUA font. As soon as you embed all the information in the font, you require different solutions for systems that use different font technologies. I was thinking of something more portable. This small table can be limited to just listing the ranges of PUA code points that are strong RTL instead of LTR. Most often, there will be only one range, and this just requires a couple of integers in that embedded table (possibly more, only if you want to represent more properties), without requiring a complex XML parser or a complex parser for the tabulated ASCII format used in the UCD, which is overkill for just the few properties that are needed for correct display. I generally assume there is more to character handling than display. So the duplication in each font is not a real problem (note that there won't be a lot of fonts, most often there will be only one that matches the PUA agreement and that is suitable to render the UCS-encoded PUA text). Depending on how you count, there are already two to four fonts that support Ewellic in the PUA. There are probably many more that support Tengwar or Cirth or Klingon. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
Re: RTL PUA?
On 08/20/2011 10:54 AM, Shriramana Sharma wrote: On 08/19/2011 10:05 PM, Mark Davis ☕ wrote: All of the property assignments to PUA characters (except the GC) are purely informative. I just now noticed that you had excepted the GC in the above. Why is that? How are applications supposed to handle combining marks etc if in the PUA? Mark, can you please reply to the above -- It seems that while it is true that GC=Co should be retained *in the standard* to clearly identify the character as a PUA character, the applications will still be changing that GC to Lo, Mc, Mn, No etc for their internal private-agreement processing. So what is the exact nature of your excepting the GC in your statement above? -- Shriramana Sharma
Re: RTL PUA?
On 08/22/2011 05:20 PM, Shriramana Sharma wrote: Hi Behdad. I only asked whether the OT *tables* would contain the entries in the logical order or the visual order. Clearly it would still be the visual order My mistake: I should have said *logical* order. (but Philippe Verdy seemed to imagine/suggest otherwise). This one is correct w.r.t. what I had *intended* to say above: i.e. Philippe thinks the entries contain the glyphs in *visual* order. See other mail replying to Philippe pointing this out. -- Shriramana Sharma
Re: RTL PUA?
On 08/22/2011 09:00 PM, Philippe Verdy wrote: The font tables themselves contain only ASCII characters I presume. No. The lookup tables contain sequences of numeric glyph ids (16 bit integers in TrueType and OpenType). Which are also not the code point values, and not the character names or glyph names. And numeric glyph IDs are still ASCII aren't they? I was just noting that the glyph tables themselves don't *use* the actual codepoints of the characters getting ligated (while they *refer* to them). Let's say that: - the LAMED character is cmap'ped (by its code point value in a cmap for Unicode, or by its code position in a cmap for another legacy 8-bit encoding) to the glyph id 1012, - and the ALEF character is cmapped to the glyph id 1001 (the values of glyph ids are not important, not even their relative order or differences, they don't need to obey any standard), - and the ALEF-LAMED ligature is in glyph id 1540 (the ALEF-LAMED character of the UCS may also be cmapped separately, but this is not a requirement) Then the lookup to perform the ligature will contain: (1012, 1001) -> (1540). No! See Behdad's post -- it is clearly said that the lookup will still be in logical order (1001, 1012) -> (1540) and not in visual order as you say. See? This is what I meant in the other mail by your suggesting that the tables contain the characters in visual order and not in logical order, to which you replied (without much real explanation I'm afraid): "No ! I've not imagined that. You incorrectly reinterpret imaginatively another incorrect imaginative reinterpretation, made by someone else, of what I wrote, which did not even suggest that." Glyph id's are presented and scanned in the lookup table, in sequences preordered in visual order by the text layout/shaping engine. Nope -- they are placed in the lookup table in *logical* order. IIUC the entire sequence of glyphs is only reordered from RTL at the very end. Peter or Behdad, can you corroborate this? 
-- Shriramana Sharma
Re: RTL PUA?
On 08/22/2011 09:31 PM, Doug Ewell wrote: Philippe Verdy verdy underscore p at wanadoo dot fr wrote: As well, the small properties files can be embedded, in a very compact form, in the PUA font. As soon as you embed all the information in the font, you require different solutions for systems that use different font technologies. Why? In the end all the systems base upon the character properties specified by the standard. For the PUA characters in question, what is needed is a table of properties to override the default ones. The systems would then handle those new properties in the same way that they would handle the regular ones. Granted, if the renderers hardcode the properties (as most OT ones do) then some parsing is required to import all the override data provided by the extra font table into a struct or such -- after which (I presume) it would be possible (to a large extent?) to treat it the same as an encoded script. [Actually, this seems quite difficult to implement in OT, where the philosophy is to explicitly hardcode the properties, but Graphite and AAT should be fine I guess.] I generally assume there is more to character handling than display. True -- so if someone wanted a PUA script to be handled properly in sorting etc one would have to prepare collation tables which would obviously go *outside* the font. -- Shriramana Sharma
RE: RTL PUA?
Shriramana Sharma samjnaa at gmail dot com wrote: As soon as you embed all the information in the font, you require different solutions for systems that use different font technologies. Why? In the end all the systems base upon the character properties specified by the standard. For the PUA characters in question, what is needed for a table of properties to override the default ones. The systems would then handle those new properties in the same way that they would handle the regular ones. Right, so if you embed that table in an OT font, the information is not available to a system that uses a font technology other than OT. What is needed is a way to specify the properties in a platform-independent way, where platform means not only OS but also font technology. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
Re: RTL PUA?
On Mon, Aug 22, 2011 at 07:51:22AM -0700, Doug Ewell wrote: Some PUA properties, like glyph shapes and maybe directionality, can be stored in a font. Others, like numeric values and casing, might not or cannot. An interchangeable format needs to be agreed upon for the Why not? P.T. -- Petr Tomasek http://www.etf.cuni.cz/~tomasek Jabber: but...@jabbim.cz EA 355:001 DU DU DU DU EA 355:002 TU TU TU TU EA 355:003 NU NU NU NU NU NU NU EA 355:004 NA NA NA NA NA
Re: RTL PUA?
On 08/22/2011 10:12 PM, Doug Ewell wrote: Right, so if you embed that table in an OT font, the information is not available to a system that uses a font technology other than OT. I don't understand why you would say so -- assuming we are all talking about TrueType fonts, AAT just uses some tables, OT others and Graphite still others. They are all just tables appended to the TrueType font data. Any software that is able to read TT font data can also read the tables. So what's the problem? -- Shriramana Sharma
Re: RTL PUA?
Shriramana Sharma wrote: The font tables themselves contain only ASCII characters I presume. OpenType Layout tables use Glyph IDs. OTL development tools typically use glyph names, which may be particular to the tool or the same names used in the post or CFF tables. OTL tables work on glyphs, not characters, and bidi will have been resolved prior to application of OTL substitution and positioning. Input glyph strings for substitution lookups are always in the resolved direction of the glyph run, so Arabic and Hebrew alphabetic runs are processed right-to-left, i.e. alef lamed -> alef_lamed *not* lamed alef -> alef_lamed Similarly, context strings for glyph positioning (if present) will be right-to-left, although anchor attachment positions on individual glyphs are relative to the 0,0 coordinate, i.e. the left sidebearing. JH -- Tiro Typeworks www.tiro.com Gulf Islands, BC t...@tiro.com The criminologist's definition of 'public order crimes' comes perilously close to the historian's description of 'working-class leisure-time activity.' - Sidney Harring, _Policing a Class Society_
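The point that a ligature rule is matched against glyph IDs in reading order can be sketched in a few lines of Python. This is a deliberately simplified model, not the GSUB lookup format: the glyph IDs 1001, 1012 and 1540 are the hypothetical values used elsewhere in this thread, and the rule table and function name are assumptions for the illustration.

```python
# Hypothetical glyph IDs, taken from the example used elsewhere in the thread.
ALEF, LAMED, ALEF_LAMED = 1001, 1012, 1540

# Ligature rules keyed by the input glyph sequence in resolved (reading) order.
LIGATURES = {(ALEF, LAMED): ALEF_LAMED}

def apply_ligatures(glyphs, ligatures=LIGATURES):
    """Replace each matching glyph sequence with its ligature glyph,
    scanning the run in the order the glyphs are read."""
    out = []
    i = 0
    while i < len(glyphs):
        for seq, lig in ligatures.items():
            if tuple(glyphs[i:i + len(seq)]) == seq:
                out.append(lig)
                i += len(seq)
                break
        else:
            # No rule starts at this position: copy the glyph through unchanged.
            out.append(glyphs[i])
            i += 1
    return out
```

Feeding the run as [1001, 1012] yields [1540], while [1012, 1001] is left untouched, which is the distinction between the two candidate rule orders being discussed.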
RE: RTL PUA?
Petr Tomasek tomasek at etf dot cuni dot cz wrote: Some PUA properties, like glyph shapes and maybe directionality, can be stored in a font. Others, like numeric values and casing, might not or cannot. An interchangeable format needs to be agreed upon for Why not? Where does one store numeric values in a font? Maybe this should be taken off-list. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
Re: RTL PUA?
Shriramana Sharma wrote: I was just noting that the glyph tables themselves don't *use* the actual codepoints of the characters getting ligated (while they *refer* to them). Characters are mapped to glyph IDs in the font cmap tables. Glyph IDs are mapped to other glyph IDs (one-to-one, one-to-many, many-to-one, or one-to-one-of-many) in the layout GSUB table. No! See Behdad's post -- it is clearly said that the lookup will still be in logical order (1001, 1012) -> (1540) and not in visual order as you say. I think there may be some confusion in this discussion over what constitutes 'visual order'. I try to avoid the term because it is difficult for right-to-left readers to accustom themselves to thinking of visual order as anything other than right-to-left. I prefer the term 'reading order' or 'resolved order', i.e. resolved bidi and script shaping order, which may have involved integrated reordering (reordering within the glyph processing) as in the case of Indic scripts. Nope -- they are placed in the lookup table in *logical* order. IIUC the entire sequence of glyphs is only reordered from RTL at the very end. Peter or Behdad, can you corroborate this? Glyph ID inputs for OTL processing are according to reading/resolved order. This is typically the same as logical order, but the term logical order really applies to character strings, not glyph strings, which are much more malleable. The order of input strings in GSUB lookups or contexts is dependent not only on the underlying character order, but also on the results of previous GSUB lookups. So while, unlike AAT and Graphite, OpenType Layout doesn't explicitly provide for glyph re-ordering, some kinds of glyph reordering are possible using sequences of contextual lookups to duplicate a glyph in a second location in the string and then remove the first instance. 
We use this in some Devanagari fonts to enable subsequent ligation of short ikar variants to the left of a consonant base with reph marks to the right of that base. JH -- Tiro Typeworks www.tiro.com Gulf Islands, BC t...@tiro.com The criminologist's definition of 'public order crimes' comes perilously close to the historian's description of 'working-class leisure-time activity.' - Sidney Harring, _Policing a Class Society_
Re: RTL PUA?
On Monday 22 August 2011, Philippe Verdy verd...@wanadoo.fr wrote: So there are only two options: [snipped] ... : this requires an approval either by the UTC WG2 (solution 1) or by the OpenType working group (solution 2). Would a third option work? In the Description section of the Macintosh Roman section of a TrueType font, include a line of text in a plain text format of which the following line of text is an example. PUA.RTL=$E000-$E1FF,$E440-$E447,$E541,$E549,$E57C,$EA00-$EA0F,$EC07; One could specify precisely which Private Use Area characters were to become RTL when using that particular font. One would need rendering software that looked for such a string of text in the font file, yet, as far as I am aware, no approval from any committee in order to put this solution into practical use. William Overington 22 August 2011
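A reader for a line in this format is short. The sketch below, in Python, assumes exactly the syntax of the example line (a PUA.RTL= key, comma-separated items that are either a single $-prefixed hexadecimal code point or a first-last pair, and a closing semicolon); the function names are made up for the illustration, and how a rendering application would locate the string inside the font is left out.

```python
def parse_pua_rtl(line):
    """Parse 'PUA.RTL=$E000-$E1FF,$E541;' into a list of (first, last) pairs."""
    key, _, value = line.strip().partition("=")
    if key != "PUA.RTL" or not value.endswith(";"):
        raise ValueError("not a PUA.RTL line")
    ranges = []
    for item in value[:-1].split(","):
        first, _, last = item.partition("-")
        lo = int(first.lstrip("$"), 16)
        # A single code point is stored as a range of length one.
        hi = int(last.lstrip("$"), 16) if last else lo
        ranges.append((lo, hi))
    return ranges

def is_rtl(cp, ranges):
    """True if code point cp falls inside any of the parsed ranges."""
    return any(lo <= cp <= hi for lo, hi in ranges)
```

For the example line above, the parser yields seven ranges, the first being (0xE000, 0xE1FF), and a code point such as U+E445 tests as RTL while U+E448 does not.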
Re: RTL PUA?
Doug Ewell 於 2011年8月22日 上午10:59 寫道: Petr Tomasek tomasek at etf dot cuni dot cz wrote: Some PUA properties, like glyph shapes and maybe directionality, can be stored in a font. Others, like numeric values and casing, might not or cannot. An interchangeable format needs to be agreed upon for Why not? Where does one store numeric values in a font? Maybe this should be taken off-list. This is actually a relevant point. The major TrueType variants all work primarily with glyphs, not characters. Using them as a place to store information about the *characters* in the text is therefore not a reliable way to provide an override for default system behavior. By the time the rendering engine consults the fonts for layout specifics, large chunks of the text processing will already be completed. OpenType, for example, expects that the bidi algorithm is largely run in character space, not glyph space, and therefore without regard for the specific font involved. (AAT does almost everything in glyph space, including bidi. I'm not sure about Graphite.) The net result is that a font is an unreliable way of storing character-specific information useful on multiple platforms. This is one reason why embedding the existing directionality controls within the text itself is currently the most reliable way of getting the behavior one might want in a platform-agnostic way. = Siôn ap-Rhisiart John H. Jenkins jenk...@apple.com
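For concreteness, embedding the existing controls around PUA runs only, rather than around the whole text, might look like the following Python sketch. The PUA range and the function name are assumptions for the illustration; U+202E (RLO) and U+202C (PDF) are the standard control characters.

```python
import re

RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

def wrap_rtl_pua(text, first=0xE000, last=0xE1FF):
    """Wrap each maximal run of code points in [first, last] in RLO ... PDF,
    leaving all other text untouched."""
    pattern = re.compile(
        "[{}-{}]+".format(re.escape(chr(first)), re.escape(chr(last))))
    return pattern.sub(lambda m: RLO + m.group(0) + PDF, text)
```

The cost raised elsewhere in the thread is visible here: the stored string gains two invisible characters per PUA run, so substring and length operations on it no longer line up with the original text.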
Re: RTL PUA?
True -- so if someone wanted a PUA script to be handled properly in sorting etc one would have to prepare collation tables which would obviously go *outside* the font. If a proper definition of an unencoded script needs additional properties which cannot be stored in the font anyway, why would you want to store part of it in OT tables? It’s just not the right place. Fonts’ sole purpose is to display already defined characters, not to define them. Tails shouldn’t be made wagging dogs. Á
Re: RTL PUA?
On 08/22/2011 10:55 PM, Joó Ádám wrote: If a proper definition of an unencoded script needs additional properties which cannot be stored in the font anyway, why would you want to store part of it in OT tables? It’s just not the right place. Fonts’ sole purpose is to display already defined characters, not to define them. Tails shouldn’t be made wagging dogs. True, but we are only trying to help those who find themselves unable to even *display* PUA characters as RTL (or as Indic with reordering, which can be handled by IndicMatraCategory). Since collation never cares about whether the script is LTR or RTL or Indic (with the exception of Thai etc where the encoding is as per visual order and not logical order) the collation data can be outside the font, since it is not needed for display. -- Shriramana Sharma
Re: RTL PUA?
William_J_G Overington 於 2011年8月22日 上午10:49 寫道: In the Description section of the Macintosh Roman section of a TrueType font, include a line of text in a plain text format of which the following line of text is an example. PUA.RTL=$E000-$E1FF,$E440-$E447,$E541,$E549,$E57C,$EA00-$EA0F,$EC07; Forgive my asking, but this reference to the description section of the Macintosh Roman section of a TrueType font has me puzzled, because I don't know what you're talking about. What table contains this string? = 井作恆 John H. Jenkins jenk...@apple.com
Re: RTL PUA?
On Monday 22 August 2011, John H. Jenkins jenk...@apple.com wrote: Forgive my asking, but this reference to the description section of the Macintosh Roman section of a TrueType font has me puzzled, because I don't know what you're talking about. What table contains this string? When I use FontCreator, made by High-Logic, http://www.high-logic.com is the webspace: with a font file open, I can select Format from the menu bar and then select Naming... from the drop down menu. That leads to a dialogue panel. From that dialogue panel one may select, for an ordinary, basic Unicode font, either of two platforms, namely Macintosh Roman and Microsoft Unicode BMP only. Having selected a platform, one may view the text content of various fields for that platform, such as font family name and copyright notice, version string and postscript name. There is then a button that is labelled Advanced... that, if clicked, opens another dialogue panel with various other text fields, including Font Designer and Description, which are the two that I often use. Now, when the text values in the fields are stored in the font file, the values for the Macintosh Roman platform are stored in plain text and the values for the Microsoft Unicode BMP only platform are stored in some encoded format. So, if one opens a TrueType font file in WordPad and one searches for an item of plain text that is in one of the fields of the font, then the text that is in the Macintosh platform can be found, yet the text that is in the Microsoft Unicode BMP only platform cannot be found. So, I thought that if a manufacturer of a wordprocessing application or a desktop publishing application decided to make a special researcher's edition of the software, then that software could, when a font is selected, first scan the font for a PUA.RTL string and, if one is found, override the left-to-right nature of the identified characters to be a right-to-left nature, just while that font is selected. 
Whether such a software package ever becomes available is something that only time will tell, yet it seems to me that it is a method that could be used without needing any changes by any committee. William Overington 22 August 2011
Re: RTL PUA?
2011/8/22 Doug Ewell d...@ewellic.org: Depending on how you count, there are already two to four fonts that support Ewellic in the PUA. There are probably many more that support Tengwar or Cirth or Klingon. First, these fonts can work fine with the default LTR directionality. So there's no need for additional data for them. Second, even if they were RTL, the needed info for each of these fonts, embedded in them, would be extremely small, reduced to just specifying the range of RTL characters they need to contain. So I don't see that as a problem. Those fonts do exist and are used exactly because there was no problem for rendering them with texts encoded in logical order (the same as the visual order). It's still strange that we can have several fonts for esoteric scripts that have been used effectively by very few people, when there are centuries of traditions, and many interested users (but spread in very small communities worldwide) that cannot use computer technologies to render their favorite scripts, or that want to teach them, or make books and other publications to expose them, as an important human cultural heritage, even if this was only to translate them or transcribe them in a more modern script.
Re: RTL PUA?
2011/8/22 Shriramana Sharma samj...@gmail.com: On 08/22/2011 09:00 PM, Philippe Verdy wrote: The font tables themselves contain only ASCII characters I presume. No. The lookup tables contain sequences of numeric glyph ids (16 bit integers in TrueType and OpenType). Which are also not the code point values, and not the character names or glyph names. And numeric glyph IDs are still ASCII aren't they? I was just noting that the glyph tables themselves don't *use* the actual codepoints of the characters getting ligated (while they *refer* to them). Let's say that: - the LAMED character is cmap'ped (by its code point value in a cmap for Unicode, or by its code position in a cmap for another legacy 8-bit encoding) to the glyph id 1012, - and the ALEF character is cmapped to the glyph id 1001 (the values of glyph ids are not important, not even their relative order or differences, they don't need to obey any standard), - and the ALEF-LAMED ligature is in glyph id 1540 (the ALEF-LAMED character of the UCS may also be cmapped separately, but this is not a requirement) Then the lookup to perform the ligature will contain: (1012, 1001) -> (1540). No! See Behdad's post -- it is clearly said that the lookup will still be in logical order (1001, 1012) -> (1540) and not in visual order as you say. See? This is what I meant in the other mail by your suggesting that the tables contain the characters in visual order and not in logical order, to which you replied (without much real explanation I'm afraid): "No ! I've not imagined that. You incorrectly reinterpret imaginatively another incorrect imaginative reinterpretation, made by someone else, of what I wrote, which did not even suggest that." Glyph id's are presented and scanned in the lookup table, in sequences preordered in visual order by the text layout/shaping engine. Nope -- they are placed in the lookup table in *logical* order. IIUC the entire sequence of glyphs is only reordered from RTL at the very end. 
Peter or Behdad, can you corroborate this? Hmmm... this is not very clear then in the OpenType specification. Well it does not matter the which order is physically used in the stored table as long as it is consistant. But this confirms that the OpenType rendering algorithm, the way it is presented in the OpenType specification, is completely wrong: the Bidi algorithm is definitely not the first step needed before performing glyph substitutions. However the Bidi algorithm really needs to reorder the glyphs at least relatively, for correct application of GPOS (glyph positionining). As a consequence, the font to use will be completely known (all cmap'pings will have been applied already, and no glyph substitution can accur across distinct fonts that have independant glyph ids). As such the PUA agreement implied by the PUA font would have been asserted. Nothing forbids then to use the font as THE reliable source of information about which PUAs are RTL and which ones are LTR. The computing order of features should not then be: - BiDi algorithm for reordering grapheme clusters - font search and font fallback (using cmap) - GSUB (lookups of ligatures or discretionary glyph variants) - GPOS but really: - font lookup and font fallback (using cmap) - GSUB (lookups of ligatures or discretionary glyph variants) - BiDi algorithm for reordering glyphs representing the grapheme clusters or ligatured grapheme clusters - GPOS The BiDi algorithm absolutely does not have to be changed. This time there's absolutely no PUA with unknown directionality if the font defines the RTL property for these PUA (using the normative LTR only as a default when the font does not specify it)
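
The logical-order reading of the ALEF/LAMED example in this message can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any real shaping engine is written: the glyph ids 1001, 1012 and 1540 come from the example, while the Hebrew code points U+05D0 and U+05DC, the dictionary-based `CMAP` and `LIGATURES` tables, and all function names are hypothetical.

```python
# Hypothetical tables for illustration only; glyph ids are those used in the thread.
CMAP = {0x05D0: 1001, 0x05DC: 1012}   # assumed: HEBREW ALEF, HEBREW LAMED
LIGATURES = {(1001, 1012): 1540}      # stored in logical order: ALEF, then LAMED


def map_to_glyphs(text):
    """Map characters to glyph ids through the (assumed) cmap, keeping logical order."""
    return [CMAP[ord(ch)] for ch in text]


def apply_ligatures(glyphs):
    """Replace each adjacent pair found in LIGATURES with its ligature glyph."""
    out = []
    i = 0
    while i < len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if pair in LIGATURES:
            out.append(LIGATURES[pair])
            i += 2
        else:
            out.append(glyphs[i])
            i += 1
    return out


def to_visual_order(glyphs):
    """Reverse a run that is entirely right-to-left; done only as the last step."""
    return list(reversed(glyphs))


def shape(text):
    """cmap lookup, then substitution in logical order, then reordering."""
    return to_visual_order(apply_ligatures(map_to_glyphs(text)))
```

With these assumed tables, ALEF followed by LAMED collapses to the single glyph 1540, whereas the same two characters typed in the opposite order are left alone, because the lookup key is matched before any reversal takes place.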
Re: RTL PUA?
2011/8/22 Shriramana Sharma samj...@gmail.com: True -- so if someone wanted a PUA script to be handled properly in sorting etc one would have to prepare collation tables which would obviously go *outside* the font.

Collation tables can already be tailored very easily with existing technologies. And anyway this has nothing to do with the directionality of characters, or their rendering, on which they absolutely do not depend. Tailored collations already have a working standard and syntax in the CLDR project or ICU and in a few other libraries (notably in CPAN for Perl).
Re: RTL PUA?
On 22 August 2011, at 12:36 PM, William_J_G Overington wrote:

On Monday 22 August 2011, John H. Jenkins jenk...@apple.com wrote: Forgive my asking, but this reference to the description section of the Macintosh Roman section of a TrueType font has me puzzled, because I don't know what you're talking about. What table contains this string?

When I use FontCreator, made by High-Logic, http://www.high-logic.com is the webspace: with a font file open, I can select Format from the menu bar and then select Naming... from the drop down menu. That leads to a dialogue panel. From that dialogue panel one may select, for an ordinary, basic Unicode font, either of two platforms, namely Macintosh Roman and Microsoft Unicode BMP only. Having selected a platform, one may view the text content of various fields for that platform, such as font family name and copyright notice, version string and postscript name. There is then a button that is labelled Advanced... that, if clicked, opens another dialogue panel with various other text fields, including Font Designer and Description, which are the two that I often use. Now, when the text values in the fields are stored in the font file, the values for the Macintosh Roman platform are stored in plain text and the values for the Microsoft Unicode BMP only platform are stored in some encoded format. So, if one opens a TrueType font file in WordPad and one searches for an item of plain text that is in one of the fields of the font, then the text that is in the Macintosh platform can be found, yet the text that is in the Microsoft Unicode BMP only platform cannot be found.

So, I thought that if a manufacturer of a wordprocessing application or a desktop publishing application decided to make a special researcher's edition of the software, then that software could, when a font is selected, first scan the font for a PUA.RTL string and, if one is found, override the left-to-right nature of the identified characters to be a right-to-left nature, just while that font is selected. Whether such a software package ever becomes available is something that only time will tell, yet it seems to me that it is a method that could be used without needing any changes by any committee.

Ah. You're referring to an entry in the 'name' table, then. The intention of the 'name' table is to provide localizable strings for the UI. Using it to store data of any sort for the rendering engine would be very, very inappropriate. In general, one should not be using a text editor to examine the contents of a TrueType font. It would be like using a text editor to examine the contents of an application. Even if you see some plain text, you really don't have any sense for how it's actually being used. You may want to bone up on the structure of TrueType/OpenType fonts.

= John H. Jenkins 井作恆 jenk...@apple.com
RE: RTL PUA?
There is more to displaying characters than LTR versus RTL, and there is more to handling characters than just displaying them. This point continues to be lost on several people responding to this thread. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
RE: RTL PUA?
Philippe Verdy verdy underscore p at wanadoo dot fr wrote: Depending on how you count, there are already two to four fonts that support Ewellic in the PUA. There are probably many more that support Tengwar or Cirth or Klingon. First, these fonts can work fine with the default LTR directionality. So there's no need for additional data for them. Second, even if they were RTL, the needed info for each of these fonts, embedded in them would be extremely small, reduced to just specifying the range of RTL characters they need to contain. This isn't my point. Multiple fonts can exist for PUA scripts and the user should not have to be constrained to using just the one font which happens to contain property information, because someone decided properties should be stored in the font. So I don't see that as a problem. Those fonts do exist and are used exactly because there was no problem for rendering them with texts encoded in logical order (the same as the visual order). Not my point. It's still strange that we can have several fonts for esoteric fonts that have been used effectively by very few people, when there are centuries of traditions, and many interested users (but spread in very small communities worldwide) that cannot use computer technologies to render their favorite scripts, or that want to teach them, or make books and other publications to expose them, as an important humane cultural heritage, even if this was only to translate them or transcribe them in a more modern script. One person added Ewellic to his shareware font as an experiment, and I paid another person to do a font for me. Sorry if this was culturally insensitive. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
RE: RTL PUA?
Shriramana Sharma samjnaa at gmail dot com wrote: Right, so if you embed that table in an OT font, the information is not available to a system that uses a font technology other than OT. I don't understand why you would say so -- assuming we are all talking about TrueType fonts, AAT just uses some tables, OT others and Graphite still others. They are all just tables appended to the TrueType font data. Any software that is able to read TT font data can also read the tables. So what's the problem? OK, so it's obvious by now I'm not a font guy. But I still maintain that there's more to proper handling of Unicode characters, PUA or otherwise, than whether their directionality is LTR or Arabic-RTL or non-Arabic-RTL or what have you. That's why all those other properties exist. And I maintain that PUA users need a place to store those other properties, and that the font doesn't seem like the right place for non-display properties. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
Re: RTL PUA?
2011/8/22 William_J_G Overington wjgo_10...@btinternet.com: Having selected a platform, one may view the text content of various fields for that platform, such as font family name and copyright notice, version string and postscript name. There is then a button that is labelled Advanced... that, if clicked, opens another dialogue panel with various other text fields, including Font Designer and Description, which are the two that I often use. Now, when the text values in the fields are stored in the font file, the values for the Macintosh Roman platform are stored in plain text and the values for the Microsoft Unicode BMP only platform are stored in some encoded format.

Note "some encoded format". The strings are encoded using the encoding specified in the platform selectors. The strings for the Macintosh Roman platform will be encoded using MacRoman. The strings for the MS Unicode BMP platform will be encoded with the BMP part of UTF-16 (without support for surrogates). The strings for the Unicode platform will also use UTF-16 (big-endian).

So, if one opens a TrueType font file in WordPad and one searches for an item of plain text that is in one of the fields of the font, then the text that is in the Macintosh platform can be found:

It just happens that you are opening the TrueType font as if it were plain text encoded with Windows-1252, or some other 8-bit encoding based on ASCII. You are also searching for ASCII characters that are encoded identically in Windows-1252 and in MacRoman, so you find a match.

yet the text that is in the Microsoft Unicode BMP only platform cannot be found.

Because you would have to insert null bytes in your search strings to find an exact match in a UTF-16 encoded string. Without these nulls, you'll get no match. What you are doing is a search in a text loaded after assuming the wrong encoding.

TrueType fonts are binary containers that can mix several encodings for their plain-text elements, but that also embed a lot of non-text data. The search fails even if your text editor is capable of loading Unicode-encoded texts: loading the file as UTF-16 fails, because the TTF container as a whole cannot meet the conformance requirements for correctly encoded UTF-16 text; only fragments of it can. By contrast, there's no conformance problem if you try to read the file as if it were Windows-1252 or ISO-8859-1...
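
The effect described above can be reproduced with a short Python sketch. It does not parse a real font: the two byte strings merely simulate a MacRoman-encoded and a UTF-16BE-encoded name string embedded in binary data, and the marker text "PUA.RTL" (taken from the suggestion earlier in the thread) and all function names are assumptions made for illustration.

```python
# Hypothetical marker string; "PUA.RTL" is the tag floated earlier in the thread.
MARKER = "PUA.RTL"

# Simulated fragments of a font file: arbitrary binary bytes around each string.
mac_record = b"\x00\x01\xff" + MARKER.encode("mac_roman") + b"\x00\x10"
win_record = b"\x00\x03\xff" + MARKER.encode("utf-16-be") + b"\x00\x10"


def naive_search(blob, text):
    """Search the way an editor does when it treats the file as 8-bit text."""
    return text.encode("cp1252") in blob


def utf16_search(blob, text):
    """Search for the same text with the interleaved null bytes of UTF-16BE."""
    return text.encode("utf-16-be") in blob


if __name__ == "__main__":
    print("8-bit search, Macintosh record:", naive_search(mac_record, MARKER))
    print("8-bit search, Microsoft record:", naive_search(win_record, MARKER))
    print("UTF-16 search, Microsoft record:", utf16_search(win_record, MARKER))
```

The 8-bit search succeeds on the simulated Macintosh record only because the marker is pure ASCII, which MacRoman and Windows-1252 encode identically; the simulated Microsoft record matches only once the search bytes carry the null bytes themselves.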
ALM (was: Re: RTL PUA?)
On 8/21/2011 3:31 PM, Richard Wordingham wrote: I expect ARABIC LANGUAGE MARK would not go down well - has it already been proposed and rejected?. ARABIC *LETTER* MARK, not *LANGUAGE* mark. (And suggested to just be renamed to AL MARK.) Proposed? Yes. Discussed? Yes. Rejected? No. The last UTC meeting took a consensus to issue a public review issue on the proposed ALM and ELM (embedding level mark) characters. So there will be further discussion and chance for input. Nothing has been decided yet. --Ken
Re: RTL PUA?
On Mon, 22 Aug 2011 07:51:22 -0700 Doug Ewell d...@ewellic.org wrote: Some PUA properties, like glyph shapes and maybe directionality, can be stored in a font. Others, like numeric values and casing, might not or cannot. An interchangeable format needs to be agreed upon for the properties in the latter category. I suggest that the obvious format is that used for capturing the UCD in XML. Only the characters in which you are interested need be specified. One very important property for several scripts is the script to which a character belongs. One reason for associating properties with a font is that text that is to be displayed is at that point tentatively associated with a font. Another is that in a multi-font document, a PUA character could have multiple implicit properties dependent on the font it appears in. Richard.
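
As a rough sketch of what such an interchange file might look like in use, the Python below reads a tiny XML fragment modelled on the UCD-in-XML format and builds a per-code-point property table. The sample entries are invented, the `load_overrides` helper is hypothetical, and the namespace and the attribute names (`cp`, `first-cp`, `last-cp`, `gc`, `bc`, `sc`) are given as I understand them from UAX #42 and should be checked against that document.

```python
import xml.etree.ElementTree as ET

# Namespace as I understand it from UAX #42 (an assumption to verify).
UCD_NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"

# Invented sample: only the PUA characters of interest are listed.
SAMPLE = """<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
  <repertoire>
    <char cp="E000" gc="Lo" bc="R" sc="Zzzz"/>
    <char first-cp="E100" last-cp="E1FF" gc="Lo" bc="AL" sc="Arab"/>
  </repertoire>
</ucd>"""

RANGE_KEYS = ("cp", "first-cp", "last-cp")


def load_overrides(xml_text):
    """Return {code point: {property: value}} for every listed character."""
    overrides = {}
    root = ET.fromstring(xml_text)
    for char in root.iter(UCD_NS + "char"):
        attrs = char.attrib
        if "cp" in attrs:
            first = last = int(attrs["cp"], 16)
        else:
            first = int(attrs["first-cp"], 16)
            last = int(attrs["last-cp"], 16)
        props = {k: v for k, v in attrs.items() if k not in RANGE_KEYS}
        for cp in range(first, last + 1):
            overrides[cp] = props
    return overrides
```

Characters that are not listed simply keep their default properties, which matches the remark that only the characters of interest need be specified.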
RE: RTL PUA?
Richard Wordingham richard dot wordingham at ntlworld dot com wrote: One reason for associating properties with a font is that text that is to be displayed is at that point tentatively associated with a font. I thought John said fonts dealt with glyph IDs, not characters per se. Another is that in a multi-font document, a PUA character could have multiple implicit properties dependent on the font it appears in. Normal, assigned characters don't change their Unicode properties depending on font. I don't see why PUA characters would be different. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
Re: RTL PUA?
On Sat, Aug 20, 2011 at 7:08 AM, Shriramana Sharma samj...@gmail.com wrote: On 08/20/2011 01:57 PM, Martin Hosken wrote:

D49 states that all properties of PUA characters are overridable by a higher protocol. But in 'normal' implementations, there are no higher level protocols to override the properties and so they use the defaults in the Unicode Database. So while in *theory* it's possible to override these values, nobody does. (This happens to also be the case with other tailoring algorithms in Unicode). Adding the configuration that tailoring requires is usually prohibitive and so it just doesn't get done.

Good point -- Michael should note this. Somebody remarked that Apple Mac OS's rendering engine already supports an extended OT table which would signal that the glyphs in a PUA font are RTL. If other rendering engines don't support it, again it is not the fault of the standard.

Is there a specification for that OT table? Are you implementing this in anything?

Read a previous post by John Jenkins. He's the one who said they have a prop table in Apple's implementation of OT (or is it their own AAT) that enables one to do this.

Is this correct? that Apple solves the problem of RTL PUA user requirements? See John Jenkins latest mail that says:

[Begin Quote] To be honest, I don't know if using the 'prop' table to override directionality for glyphs still works. A quick-and-dirty test on Lion suggests that it doesn't, so I may have spoken too quickly. This is not a part of the functionality of AAT which gets much exercise, so it's entirely possible that it was lost at some point without anyone noticing. In any event, my apologies for raising any false hopes. [End Quote]

Hope a new proposal or a UTN from UC will make things clear, and RTL community benefits.

N. Ganesan

On 21 August 2011, at 10:48 AM, Jonathan Kew wrote: On 21 Aug 2011, at 17:21, Behdad Esfahbod wrote: On 08/21/11 16:44, Shriramana Sharma wrote:

BTW can John Jenkins show us a few entries from the prop table of some font supporting the custom Apple PUA characters, especially the RTL and GC=No ones?

Like this? https://developer.apple.com/fonts/ttrefman/RM06/Chap6prop.html

However, note that this documentation is very old, and does not make it clear whether there is any support for overriding directionality in current Mac OS X software.

Yes, it's very old, largely because we haven't done anything with the structure of the 'prop' table for a long, long time. Still, anything referring to QuickDraw GX is obviously overdue for an update. To be honest, I don't know if using the 'prop' table to override directionality for glyphs still works. A quick-and-dirty test on Lion suggests that it doesn't, so I may have spoken too quickly. This is not a part of the functionality of AAT which gets much exercise, so it's entirely possible that it was lost at some point without anyone noticing. In any event, my apologies for raising any false hopes.

= 井作恆 John H. Jenkins jenk...@apple.com

If the application doesn't do this and allows Graphite to break the text into runs, then Graphite can treat PUA characters as having BC other than L? </myunderstanding>

Yes that understanding is correct.

Great! Could you then place some sample characters from your Scheherezade font in the PUA and render them RTL and show to us then Michael would be convinced.

-- Shriramana Sharma
Re: RTL PUA?
On 08/23/2011 03:29 AM, N. Ganesan wrote: Hope a new proposal or a UTN from UC will make things clear, and RTL community benefits. Dear Ganesan, I wonder if you have actually understood all the issues here. As usual you have done your copy-paste from somebody else's post. Please say something if you have something to actually contribute instead of just saying I support Oriya OM I support PUA RTL or such. If you support PUA RTL, and since you are so interested in Grantha, you should do a proposal for regions in the PUA to be allocated proper IndicMatraCategory properties so that today we can put Grantha in the PUA and get it rendered properly by existing rendering engines. -- Shriramana Sharma
Re: RTL PUA?
2011/8/19 Michael Everson ever...@evertype.com: There is plenty of space. There would be no difficulty in assigning some rows to a RTL PUA. Mucking about with the directionality of the existing PUA would be extremely unwise.

Conceivably certain closed user-groups could be using closed-distribution rendering engines which would support bidi and glyph reordering or such for PUA codepoints.

Not everyone is a programmer and can devise a rendering engine. But lots of people can make fonts that could support a RTL conscript or some private Arabic characters.

Hmmm... Given the current standard in OpenType, and the fact that OpenType fonts cannot reorder glyphs to support the BiDi algorithm and correctly handle features like ligatures, I have serious doubts about the feasibility of an OpenType font capable of supporting an RTL conscript or some private Arabic characters that will work with existing OpenType engines, simply because there's absolutely nothing to describe such properties. This would be possible only if the engine can not only use the existing OpenType fonts, but also include some supplementary character properties tables for the PUA assignments used in that font, or if these custom properties can be integrated in extension tables added to the OpenType fonts, notably: directionality and mirroring, but also the combining classes, some decomposition mappings, and probably also fallback mappings. There would also be the need to represent a finite state machine to recognize grapheme cluster boundaries, at least, and to list the feature names in which the substitution and positioning rules apply for recognized sequences of PUA characters (or their mapped glyphs).

What this means is that, in practice, PUAs are only usable in fonts for characters with strong LTR directionality, excluding all reordering and mirroring. Those conscripts will then have to be represented in PUAs as if they consisted entirely of strong LTR characters, like the sinograms. It's not impossible to do that, but you have to completely forget the logical encoding order and only use a strict visual order for these PUA-encoded conscripts, and even for unencoded rare Arabic letters/clusters for which you'd want to just use a PUA. The alternative is to not use OpenType features, but to use one of the alternatives: Apple's AAT or SIL's Graphite, which are less restricted than OpenType, or some newer font formats (in this case, you won't need any newer PUA ranges with strong RTL properties, you can just use the existing assignments).

-- Philippe.
Re: RTL PUA?
On Sun, Aug 21, 2011 at 12:21:28AM +, Doug Ewell wrote: The more I think of it, the more I like the idea of reassigning the default BC of Plane 16 to 'R'. What would the arguments against this be? I found a font (Asana Math) installed on my system that occupies U+10fddf..U+10fffd. P. -- Petr Tomasek http://www.etf.cuni.cz/~tomasek Jabber: but...@jabbim.cz EA 355:001 DU DU DU DU EA 355:002 TU TU TU TU EA 355:003 NU NU NU NU NU NU NU EA 355:004 NA NA NA NA NA
Re: RTL PUA?
2011/8/20 Ken Whistler k...@sybase.com: There are 131,068 private use code points in the standard. That is all there ever will be.

I also fully agree (sorry, then, to Michael Everson, who supports such new RTL PUA assignments). All that can be done is to fix the software, notably the font formats, where you'll be able to define the necessary overrides for directionality and mirroring mappings (for RTL conscripts), and other reordering properties that may be needed to support Indic conscripts (such as prepended letters). Adding new RTL PUAs would in any case require modifying renderers/layout engines to support them. These same engines can just as well be modified to support the external character property tables needed to override the existing PUAs, so that they can be rendered correctly. Maybe it was the desire of the OpenType designers not to use any such overrides, but this was only intended for normal non-PUA characters. A revised OpenType specification could perfectly well integrate some new extension table, and assert, as a font validation constraint, that these custom properties stored in fonts will be valid and usable ONLY for PUA characters.
Re: RTL PUA?
On 21 Aug 2011, at 02:44, Doug Ewell wrote: Would that really be a better default? I thought the main RTL needs for the PUA would be for unencoded scripts, not for even more Arabic letters. Could easily be for work on new Arabic-script orthographies which use new letters. Or for similar scripts that treat numbers as Arabic does. (How many more are there anyway?) No one knows. :-) Michael Everson * http://www.evertype.com/
RE: RTL PUA?
Several RTL scripts do not require shaping nor ligatures.

Jony

-Original Message- From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Philippe Verdy Sent: Sunday, August 21, 2011 10:29 AM To: Michael Everson Cc: unicore UnicoRe Discussion; Unicode Discussion List Subject: Re: RTL PUA?
RE: RTL PUA?
From: unicore-boun...@unicode.org [mailto:unicore-boun...@unicode.org] On Behalf Of Michael Everson Yeah OK maybe simply base+diacritic stuff or even ligatures would be easy to do via simple substitution rules in tables, but how about glyph reordering? No problem unless you are using Uniscribe. Which of these are you saying? - That mark positioning and simple substitution rules involving PUA characters is not a problem unless you're using Uniscribe - That glyph re-ordering of PUA characters is not a problem unless you're using Uniscribe (Unless we have a bug I haven't encountered, the first is incorrect. The second suggests that you've missed Sharma's point entirely.) Indic scripts involving reordering and split-positioning vowel signs can't be handled by placing them in the PUA. There are other ways of handling such clusters. Oh? You must mean something like ignoring Unicode. If not, please clarify. Peter
Re: RTL PUA?
On Sun, 21 Aug 2011 01:44:02 + Doug Ewell d...@ewellic.org wrote:

The more I think of it, the more I like the idea of reassigning the default BC of Plane 16 to 'R'. What would the arguments against this be?

BC of 'AL'?

Would that really be a better default? I thought the main RTL needs for the PUA would be for unencoded scripts, not for even more Arabic letters. (How many more are there anyway?)

Not necessarily better, I'm just suggesting that both need to be supported. However, we need to look at use cases.

(1) Unencoded Arabic script letters with joining behaviour, for use with any application.
  (a) We need the character to have AL, R or ON for it to be included in BiDi runs. If we use ON we may need RLM when the character is at the edge of a run, and even then, its behaviour may be no better than a character with a BC of R.
  (b) It may get left out of script runs. There were problems on Windows with the Tamil ligature k.SS not rendering, despite font support, when the character U+0BB7 TAMIL LETTER SSA was new. And that's in a left-to-right script with a character in the appropriate block!

(2) Complete right-to-left script. I'm presuming the difference between AL and R is then a matter of what right-to-left script the potential users chiefly also use.
  (a) As a practical implementation, the distinction between AL and R would matter if the script has modern use. Otherwise, any of ON, AL and R would do, though one might face the annoyance of having to start chunks of text with RLM. If a script with modern use should be encoded using a BC of R, then I believe ON would also do as a stop-gap until the script is encoded. How fiendish is BiDi-sensitive transliteration?
  (b) For experimentation, I believe the difference between AL, R and ON would matter little, even though it would be irritating to have to use RLM.
  (c) Complex script support is patchy - one might be restricted to applications that allow the font to provide full complex script support.

The big issue in all this, though, is (i) how to update the rendering system with a new set of values for Unicode properties, including script, and (ii) the scope of such an update. (The distinction between the PUA and the rest is that it makes sense for PUA properties to change as freely as fonts.) This, incidentally, is analogous to locales reflecting code page selections. There is also, though less pressing, the issue of tailoring collations. (The worst issue is that there are distinct, canonically inequivalent characters of type Lo comparing equal - I've seen it for Canadian Aboriginal Syllabics for Windows XP and for Thai in Ubuntu 10.04 - surely that's not the normal British collation of such characters.)

One minor problem with (i) *was* that it wasn't clear how one should annotate a copy of UnicodeData.txt to show that it has been modified. The standard XML alternative allows for comments, thereby solving that problem. If Issue (i) can be readily solved at the machine or user level or lower, then the default properties of the PUA become irrelevant.

Richard.
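
The annoyance of having to start chunks of text with RLM when PUA characters are treated as ON can be seen even in the simplest part of the Bidi algorithm, the choice of paragraph direction. The Python sketch below is a deliberately reduced reading of rules P2 and P3 of UAX #9 (first strong character wins, left-to-right otherwise); it ignores embeddings and isolates, says nothing about how neutrals resolve inside a run, and the `overrides` dictionary and function names are hypothetical.

```python
import unicodedata

RLM = "\u200f"  # RIGHT-TO-LEFT MARK, bidi class R


def bidi_class(ch, overrides):
    """Bidi class of ch, preferring a private override when one is given."""
    return overrides.get(ord(ch), unicodedata.bidirectional(ch))


def paragraph_level(text, overrides):
    """Simplified P2-P3: 0 (LTR) or 1 (RTL) from the first strong character."""
    for ch in text:
        bc = bidi_class(ch, overrides)
        if bc == "L":
            return 0
        if bc in ("R", "AL"):
            return 1
    return 0  # no strong character at all: default to left-to-right


# Two PUA characters under two assumed private agreements.
PUA_TEXT = "\ue000\ue001"
AS_NEUTRAL = {0xE000: "ON", 0xE001: "ON"}
AS_RTL = {0xE000: "R", 0xE001: "R"}
```

Under the assumed ON agreement the paragraph made only of PUA characters has no strong character and falls back to left-to-right unless an RLM is put in front of it; under the assumed R agreement no mark is needed.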
RE: RTL PUA?
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Philippe Verdy

Hmmm... Given the current standard in OpenType, and the fact that OpenType fonts cannot reorder glyphs to support the BiDi algorithm and correctly handle features like ligatures...

I agree that OpenType font tables cannot do glyph re-ordering. But you are totally incorrect in saying that OpenType cannot handle ligatures.

What this means is that, in practice, PUA are only usable in fonts for characters with strong LTR directionality, excluding all reordering and mirroring.

In the OpenType specification, the only data related to glyph mirroring that a rendering engine is assumed to have is the bidi mirroring data from TUS 5.1. (See http://www.microsoft.com/typography/otspec/TTOCHAP1.htm#ltrrtl.) All other glyph mirroring is to be handled using glyph substitution data in OpenType Layout tables in fonts.

Peter
Re: RTL PUA?
I think as soon as we start talking about this many scenarios, we are no longer talking about what the *default* bidi class of the PUA (or some part of it) should be. Instead, we are talking about being able to specify private customizations, so that one can have 'AL' runs and 'ON' runs and so forth. There really isn't any way the UTC is going to approve changing one part of the PUA to be default 'AL', another part 'R', another part 'ON', etc. Asmus just said that merely assigning one plane to be different from the others should be a non-starter. For this discussion, I really don't find it very interesting that existing technologies A, B, and C don't currently provide a way to override the default PUA properties. Through most of the 1990s, most existing applications and technologies didn't support Unicode at all, or very small parts of it, and the solution generally was to update them so that they would. The same should be true here. I would suggest that installing a modified copy of UnicodeData.txt seems like a rather clumsy solution; if text files are involved, I'd suggest leaving UnicodeData.txt alone and creating some sort of overrides file. -- Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell -Original Message- From: Richard Wordingham Sent: Sunday, August 21, 2011 9:48 To: unicode@unicode.org Subject: Re: RTL PUA? On Sun, 21 Aug 2011 01:44:02 + Doug Ewell d...@ewellic.org wrote: The more I think of it, the more I like the idea of reassigning the default BC of Plane 16 to 'R'. What would the arguments against this be? BC of 'AL'? Would that really be a better default? I thought the main RTL needs for the PUA would be for unencoded scripts, not for even more Arabic letters. (How many more are there anyway?) Not necessarily better, I'm just suggesting that both need to be supported. However, we need to look at use cases. 
(1) Unencoded Arabic script letters with joining behaviour, for use with any application. (a) We need the character to have AL, R or ON for it to be included in BiDi runs. If we use ON we may need RLM when the character is at the edge of a run, and even then, its behaviour may be no better than a character with a BC of R. (b) It may get left out of script runs. There were problems on Windows with the Tamil ligature k.SS not rendering, despite font support, when the character U+0BB7 TAMIL LETTER SSA was new. And that's in a left-to right script with a character in the appropriate block! (2) Complete right-to-left script. I'm presuming the difference between AL and R is then a matter of what right-to-left script the potential users chiefly also use. (a) As a practical implementation, the distinction between AL and R would matter if the script has modern use. Otherwise, any of ON, AL and R would do, though one might face the annoyance of having to start chunks of text with RLM. If a script with modern use should be encoded using a BC of R, then I believe ON would also do as a stop-gap until the script is encoded. How fiendish is BiDi-sensitive transliteration? (b) For experimentation, I believe the difference between AL, R and ON would matter little, even though it would be irritiating to have to use RLM. (c) Complex script support is patchy - one might be restricted to applications that allow the font to provide full complex script support. The big issue in all this, though, is (i) how to update the rendering system with a new set of values for Unicode properties, including script, and (ii) the scope of such an update. (The distinction between the PUA and the rest is that it makes sense for PUA properties to change as freely as fonts.) This, incidentally, is analogous to locales reflecting code page selections. There is also, though less pressing, the issue of tailoring collations. 
(The worst issue is that there are distinct, canonically inequivalent characters of type Lo comparing equal - I've seen it for Canadian Aboriginal Syllabics for Windows XP and for Thai in Ubuntu 10.04 - surely that's not the normal British collation of such characters.) One minor problem with (i) *was* that it wasn't clear how one should annotate a copy of UnicodeData.txt to show that it has been modified. The standard XML alternative allows for comments, thereby solving that problem. If Issue (i) can be readily solved at the machine or user level or lower, then the default properties of the PUA become irrelevant. Richard.
Re: RTL PUA?
Jonathan Rosenne wrote: People do all kinds of fancy things. I guess old manuscripts contain many ligatures... Not in Hebrew. The only common ligature is the aleph_lamed, a post-classical import from Judaeo-Arabic. JH -- Tiro Typeworks www.tiro.com Gulf Islands, BC t...@tiro.com The criminologist's definition of 'public order crimes' comes perilously close to the historian's description of 'working-class leisure-time activity.' - Sidney Harring, _Policing a Class Society_
Re: RTL PUA?
2011/8/21 Peter Constable peter...@microsoft.com: From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Philippe Verdy Hmmm Given the current standard in OpenType, and the fact that OpenType fonts cannot reorder glyphs to support the BiDi algorithm and correctly handle features like ligatures... I agree that OpenType font tables cannot do glyph re-ordering. But it is totally incorrect to say that it cannot handle ligatures. I meant recognizing and generating ligatures in the context where re-ordering has been performed externally by the renderer. Ligatures can only be recognized in OpenType provided that the layout engine has performed the reordering itself, because OpenType fonts won't recognize ligatures with glyphs in arbitrary order or interspersed with other unrelated characters coming from an unreordered glyph sequence. What this means is that, in practice, PUAs are only usable in fonts for characters with strong LTR directionality, excluding all reordering and mirroring. In the OpenType specification, the only data related to glyph mirroring that a rendering engine is assumed to have is the bidi mirroring data from TUS 5.1. (See http://www.microsoft.com/typography/otspec/TTOCHAP1.htm#ltrrtl.) All other glyph mirroring is to be handled using glyph substitution data in OpenType Layout tables in fonts. Exactly, but mirroring data for remapping glyphs will not be part of that font.
Glyph mirroring substitution data in the substitution rules of OpenType fonts does not work, because it cannot resolve the ambiguity of the expected direction: the context length is limited. Otherwise the number of contextual pairs to recognize would explode combinatorially, making such an implementation impractical within decent table sizes in fonts. That remains true even with class-based substitution, because the necessary character-to-class mappings would also require large mapping tables, including for many characters that are not even mapped in the font and for which the font was never designed. Mirroring behavior is then best handled in the layout engine, which has a more global and centralized view of the properties of the whole UCS. Here, we just want to complement this view of character properties by allowing a set of character properties to be specified for PUA characters only, expecting that the layout engine will handle all the other character properties for non-PUA characters, using the standard data of the UCD...
Re: RTL PUA?
On 08/21/2011 01:09 PM, John Hudson wrote: Jonathan Rosenne wrote: People do all kinds of fancy things. I guess old manuscripts contain many ligatures... Not in Hebrew. The only common ligature is the aleph_lamed, a post-classical import from Judaeo-Arabic. Closest you might have to ligatures is idiosyncratic letters-getting-joined-together by rapid writing, etc. There are some examples in Ada Yardeni's book. But they're not really ligatures; at best _maybe_ they're calligraphic variants (tho mostly they're quite the opposite of calligraphic). Alef-Lamed did get a fair amount of use as a true ligature, though. ~mark
RE: RTL PUA?
From: ver...@gmail.com [mailto:ver...@gmail.com] On Behalf Of Philippe Verdy I agree that OpenType font tables cannot do glyph re-ordering. But it is totally incorrect to say that it cannot handle ligatures. I meant recognizing and generating ligatures in the context where re-ordering has been performed externally by the renderer. That statement isn't adequate: re-ordering may itself produce contexts in which ligatures will occur. That can happen, for instance, in displaying Indic scripts. Ligatures can only be recognized in OpenType provided that the layout engine has performed the reordering itself, because OpenType fonts won't recognize ligatures with glyphs in arbitrary order or interspersed with other unrelated characters coming from an unreordered glyph sequence. I'm not sure what it means to create a ligature of glyphs in arbitrary order. If you mean a rule to substitute [g1 g2] with [g3] won't apply if the sequence processed by the OpenType Layout lookup processor is [g2 g1], then that's true: if the behaviour of the script is such that glyph re-ordering is appropriate, then a rendering engine for OpenType should do that reordering, and substitution lookups in OpenType fonts should be written to assume that that reordering has taken place. What this means is that, in practice, PUAs are only usable in fonts for characters with strong LTR directionality, excluding all reordering and mirroring. In the OpenType specification, the only data related to glyph mirroring that a rendering engine is assumed to have is the bidi mirroring data from TUS 5.1. (See http://www.microsoft.com/typography/otspec/TTOCHAP1.htm#ltrrtl.) All other glyph mirroring is to be handled using glyph substitution data in OpenType Layout tables in fonts. Exactly, but mirroring data for remapping glyphs will not be part of that font. Um... Why not?
If the mirroring isn't reflected in http://www.unicode.org/Public/5.1.0/ucd/BidiMirroring.txt, then it must be handled by glyph substitution in the font as a normal GSUB operation. Peter
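The point about substitution order can be illustrated with a small sketch. It is a toy model only, not how any real OpenType engine processes GSUB lookups: the glyph names g1, g2 and g3 are taken from the example above, and `apply_ligatures` is a hypothetical helper written just for this illustration.

```python
def apply_ligatures(glyphs, rules):
    """Apply ligature rules to a glyph sequence, scanning left to right.

    `rules` maps a tuple of input glyph names to one ligature glyph name.
    Hypothetical helper: a toy stand-in for a GSUB ligature lookup.
    """
    # Try longer input sequences first so the longest match wins.
    ordered = sorted(rules.items(), key=lambda rule: -len(rule[0]))
    out = []
    i = 0
    while i < len(glyphs):
        for seq, ligature in ordered:
            if tuple(glyphs[i:i + len(seq)]) == seq:
                out.append(ligature)
                i += len(seq)
                break
        else:
            # No rule matched at this position: copy the glyph unchanged.
            out.append(glyphs[i])
            i += 1
    return out


rules = {("g1", "g2"): "g3"}
in_order = apply_ligatures(["g1", "g2"], rules)  # the rule matches
swapped = apply_ligatures(["g2", "g1"], rules)   # the rule does not match
```

The rule fires only when the glyphs reach the lookup in the order the font author assumed, which is why any reordering has to be done by the rendering engine before substitution lookups run.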
Re: RTL PUA?
On Sun, Aug 21, 2011 at 10:09:22AM -0700, John Hudson wrote: Jonathan Rosenne wrote: People do all kinds of fancy things. I guess old manuscripts contain many ligatures... Not in Hebrew. The only common ligature is the aleph_lamed, a post-classical import from Judaeo-Arabic. JH Not true. See: Collete Sirat. Hebrew Manuscripts of the Middle Ages. Cambridge University Press 2002, fig. 114 (p. 176) or fig. 127 (p. 189) or fig. 134 (p. 193). -- Petr Tomasek http://www.etf.cuni.cz/~tomasek Jabber: but...@jabbim.cz EA 355:001 DU DU DU DU EA 355:002 TU TU TU TU EA 355:003 NU NU NU NU NU NU NU EA 355:004 NA NA NA NA NA
Re: RTL PUA?
2011/8/21 Peter Constable peter...@microsoft.com: In the OpenType specification, the only data related to glyph mirroring that a rendering engine is assumed to have is the bidi mirroring data from TUS 5.1. (See http://www.microsoft.com/typography/otspec/TTOCHAP1.htm#ltrrtl.) All other glyph mirroring is to be handled using glyph substitution data in OpenType Layout tables in fonts.
In addition, this specification highly depends on two things:
- the layout engine fully knows the properties of all characters in order to implement BiDi reordering as well as BiDi mirroring
- the layout engine fully knows the necessary mappings for the OMPL table (this assumes that it always implements the latest version of the UCD)
This is not the case because:
- an OpenType layout engine will always implement a specific version of the UCD. Standard properties defined in the UCD will never concern unassigned characters that will be assigned in a later version. As well, it will not provide any normative property for the PUA. All it can then do is apply default properties for unassigned (still unknown) characters, as well as for all PUAs.
- as such it will never be able to assert which runs of text containing PUAs or unassigned characters are in RTL order or LTR order.
- if it uses the default LTR order, it will not be able to find any mirroring mapping in the OMPL, because the OMPL lookup table will only be searched for runs that have been identified as RTL
- if it uses the default RTL order assumed from some blocks, the OMPL will still not work with unknown characters/code points (the OMPL only contains a list of pairs of known (assigned) non-PUA characters), so character-level mirroring will not work as expected.
- in addition, if it cannot know whether a run of reordered characters is LTR or RTL, then after mapping them to the glyph id's from the cmap (where it exists in a font for the unknown non-PUA character or the PUA character), it won't know which of the ltrm or rtlm tables to use (if it incorrectly assumes the default LTR order, which is the default for PUA, it will only look up in the ltrm table, not in the rtlm table). Mirroring will then not work if the LTR or RTL guess was wrong.
The only way to change this would be for the OpenType layout engine to allow overriding its default properties for unassigned or PUA characters. For the case of BiDi reordering, this would require the support of an additional lookup table in the OpenType font, containing overrides for the BiDi character class assigned to characters. Of course, this lookup table should NEVER be used if the character is non-PUA and known in the implementation of the UCD by the layout engine. The rule would be:
- if the character is not a PUA and is known in the currently implemented version of the UCD, use the known character property of the UCD (allow no override).
- otherwise, if the character (which is then either a PUA or an unknown non-PUA) is mapped in the font's cmap table, and there's a BiDi lookup table in the OpenType font, and that lookup table provides the property value for that character, use that property
- otherwise use the default property value (indicated in the UCD and Unicode specifications).
A similar rule can be used as well for the character-level mirroring: the standard OMPL will be used if and only if the character is not a PUA and is known in the implemented version of the UCD. Otherwise, an OMPL table in the OpenType font will contain additional character pairs to look up. Such a lookup will however never be performed if the character is in an LTR run (which means that this feature is dependent on the correct implementation of the BiDi override above, which must be implemented first).
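The three-step priority rule above can be sketched as follows. This is only an illustration of the proposal, under stated assumptions: no such BiDi override table exists in the OpenType specification, so `font_bidi_overrides` and `font_cmap` are hypothetical plain dictionaries, and Python's `unicodedata` module stands in for "the version of the UCD implemented by the layout engine".

```python
import unicodedata


def is_pua(cp):
    """True for code points in the BMP PUA or in Planes 15 and 16."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)


def resolve_bidi_class(cp, font_cmap, font_bidi_overrides):
    """Return the Bidi_Class to use for code point `cp`.

    font_cmap and font_bidi_overrides are hypothetical tables keyed by
    code point; they model the proposed font-level data, not a real API.
    """
    ch = chr(cp)
    # "Known" here means: assigned in the UCD version bundled with Python.
    known = unicodedata.category(ch) != "Cn"
    if known and not is_pua(cp):
        # Step 1: standard character, the font may not override it.
        return unicodedata.bidirectional(ch)
    if cp in font_cmap and cp in font_bidi_overrides:
        # Step 2: PUA or unknown character with a font-supplied value.
        return font_bidi_overrides[cp]
    # Step 3: default value. Simplification: the real UCD defaults for
    # unassigned code points depend on the block; "L" is assumed here.
    return unicodedata.bidirectional(ch) or "L"


font_cmap = {0xE000: 42}                   # code point -> glyph id
font_bidi_overrides = {0xE000: "R", 0x0041: "R"}

pua_mapped = resolve_bidi_class(0xE000, font_cmap, font_bidi_overrides)
latin_a = resolve_bidi_class(0x0041, font_cmap, font_bidi_overrides)
pua_unmapped = resolve_bidi_class(0xE001, font_cmap, font_bidi_overrides)
```

In this sketch the attempted override for U+0041 is ignored, which is the "allow no override" guarantee for characters the engine already knows.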
Only then can the existing ltrm and rtlm lookup tables in OpenType be used like today, because the OpenType layout engine knows reliably which one to use. This allows standard glyph-level mirroring to be specified (between pairs of glyph-id's). Also the existing ltra and rtla lookup tables will be workable to provide lists of alternate mirrored glyphs (but only for advanced applications that allow selecting alternate variants). It may be that this first requires the support of additional variation sequences (using variation selectors), which are unknown in the implemented version of the UCD, using an additional lookup table working under the same rule as above, in order to allow sequences of PUA+VSn (which will never be part of the UCD, but may be needed under the PUA convention agreement that the font provides). One difficulty in this scheme is that all those properties in OpenType were never meant to be overridable in specific fonts. This means that they were assumed to be consistent across all fonts. The difficulty can come from the behavior of font substitutions. I don't think this is critical because this also means that we change of PUA agreement in this case: the encoded PUA text is then dependent on the PUA font used to
Re: RTL PUA?
Petr Tomasek wrote: Not in Hebrew. The only common ligature is the aleph_lamed, a post-classical import from Judaeo-Arabic. Not true. See: Collete Sirat. Hebrew Manuscripts of the Middle Ages. Cambridge University Press 2002, fig. 114 (p. 176) or fig. 127 (p. 189) or fig. 134 (p. 193). I wouldn't classify any of those examples as 'common'. I also wouldn't classify all examples of touching letters -- of which many occur in rapidly written text -- as ligatures. Aleph+lamed on the other hand is a regularly occurring distinct formation in whole classes of manuscripts (and persisting in typography). I have a good collection of books on Hebrew palaeography, and while there are many examples of Hebrew letters being very tightly spaced there are relatively few instances of what I would consider ligatures, i.e. formations in which the ductus or spacing of the specific sequences of letters is modified to facilitate connection. JH -- Tiro Typeworks www.tiro.com Gulf Islands, BC t...@tiro.com The criminologist's definition of 'public order crimes' comes perilously close to the historian's description of 'working-class leisure-time activity.' - Sidney Harring, _Policing a Class Society_
Re: RTL PUA?
2011/8/21 Peter Constable peter...@microsoft.com: Exactly, but mirroring data for remapping glyphs will not be part of that font. Um... Why not? If the mirroring isn't reflected in http://www.unicode.org/Public/5.1.0/ucd/BidiMirroring.txt, then it must be handled by glyph substitution in the font as a normal GSUB operation. A GSUB operation will only be used if it is specified in the correct feature table. The problem here is which feature to use: rtlm or ltrm? It's impossible to know, because it first depends on the layout engine KNOWING exactly whether the run of text is RTL or LTR. Without font-level support for the BiDi properties of PUAs (or unassigned characters), the layout engine will make the wrong guess from the default property value. And then it won't find the expected GSUB operation, because it won't match it in the correct feature subtable.
RE: RTL PUA?
From: ver...@gmail.com [mailto:ver...@gmail.com] On Behalf Of Philippe Verdy In the OpenType specification In addition, this specification highly depends on two things: - the layout engine fully knows the properties of all characters in order to implement BiDi reordering as well as BiDi mirroring Not true: mirroring depends on the resolved directionality, not the Unicode character properties. - the layout engine fully knows the necessary mappings for the OMPL table (this assumes that it always implements the latest version of the UCD) No. The OMPL is fixed at TUS 5.1. Peter
RE: RTL PUA?
From: ver...@gmail.com [mailto:ver...@gmail.com] On Behalf Of Philippe Verdy A GSUB operation will only be used if it is specified in the correct feature table. The problem here is which feature to use: rtlm or ltrm? It's impossible to know, because it first depends on the layout engine KNOWING exactly whether the run of text is RTL or LTR. The layout engine already _has_ to know the bidi level of a run regardless. Without font-level support for the BiDi properties of PUAs (or unassigned characters), I'm trying to tell you that, wrt mirroring, that's already defined in the OpenType spec. the layout engine will make the wrong guess from the default property value. And then it won't find the expected GSUB operation, because it won't match it in the correct feature subtable. As I explained in an earlier message, the layout engine doesn't use the default property value but the resolved bidi level. Btw, in the past few weeks, you've written several posts in which you make assertions about how rendering implementations work and, in some cases, why more is needed. And then I or others have to spend a bunch of time writing responses so that you get the correct understanding and, more importantly, so that others don't get misled. It would be a lot easier if you just asked, How is this done? Peter
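The selection step described above, choosing the mirroring features from the resolved bidi level rather than from a character property, can be sketched as below. The sketch assumes the Unicode Bidi Algorithm convention that even levels are left-to-right and odd levels are right-to-left; `mirroring_features` is a hypothetical name, not a function of any rendering engine. It says nothing about how the level itself is obtained for PUA characters, which is the point under dispute in this thread.

```python
def mirroring_features(resolved_level):
    """Pick the OpenType mirroring feature tags for a run.

    resolved_level is the run's resolved bidi embedding level:
    even levels are left-to-right runs, odd levels right-to-left.
    """
    if resolved_level < 0:
        raise ValueError("bidi levels are non-negative")
    if resolved_level % 2 == 0:
        return ("ltra", "ltrm")   # alternates and mirrored forms, LTR runs
    return ("rtla", "rtlm")       # alternates and mirrored forms, RTL runs


ltr_run = mirroring_features(0)       # paragraph-level LTR text
rtl_run = mirroring_features(1)       # RTL run inside it
nested_ltr = mirroring_features(2)    # LTR run nested inside the RTL run
```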
Re: RTL PUA?
2011/8/21 Peter Constable peter...@microsoft.com: From: ver...@gmail.com [mailto:ver...@gmail.com] On Behalf Of Philippe Verdy A GSUB operation will only be used if it is specified in the correct feature table. The problem here is which feature to use: rtlm or ltrm? It's impossible to know, because it first depends on the layout engine KNOWING exactly whether the run of text is RTL or LTR. The layout engine already _has_ to know the bidi level of a run regardless. Without font-level support for the BiDi properties of PUAs (or unassigned characters), I'm trying to tell you that, wrt mirroring, that's already defined in the OpenType spec. the layout engine will make the wrong guess from the default property value. And then it won't find the expected GSUB operation, because it won't match it in the correct feature subtable. As I explained in an earlier message, the layout engine doesn't use the default property value but the resolved bidi level. Once again, you refuse to understand my arguments. What I'm saying is that OpenType CANNOT resolve the bidi level of PUAs (except where we use additional BiDi controls, which remains a hack, because it adds unnecessary invisible markup around the encoded texts, and complicates the use of strings and substrings). You can turn the problem as you want, but PUAs (as well as unknown characters) still have default properties that, in the end, will get used in the absence of a more precise definition (i.e. an explicit override) of the actual BiDi property needed for the character. Btw, in the past few weeks, you've written several posts in which you make assertions about how rendering implementations work and, in some cases, why more is needed. And then I or others have to spend a bunch of time writing responses so that you get the correct understanding and, more importantly, so that others don't get misled. It would be a lot easier if you just asked, How is this done? Ok, you've replied, but not completely.
And at least on this point, Michael Everson is also right when he says that PUAs do not properly handle RTL scripts only because of their default BiDi property value. But I don't support his idea of encoding new PUAs, when in fact we can effectively provide the additional character properties needed, for example in fonts, without changing the default property of the PUA (I don't support that at all, and probably neither do you) and without allocating more (unneeded) PUA block(s) for RTL scripts (and also without hacking on top of another existing set of RTL assigned characters). I did not post any assertion about how OpenType could be used; I just wanted to explain that with the current specifications, it cannot *currently* resolve the problem (and Michael Everson certainly fully agrees with that, but he can reply as well if he thinks that I misinterpret his last few messages). We really need a reliable way to transport a PUA agreement in such a way that it can be understood by a computer. An encoded font can transport this information reliably, which at least must include some necessary character property values, and it offers a smooth way for transitions during the whole encoding process of new scripts (notably during the experimentation), as well as after that, for its adoption for more general use (before a large majority of users can use updated implementations of their text renderers, which will automatically provide those properties for newly encoded characters and scripts). This is simply because it's MUCH easier to upgrade a font (especially a PUA font which is not part of the core fonts of the operating system) than to upgrade a rendering engine (bound to the OS, in the case of Microsoft APIs and libraries in Windows).
An extensible set of properties, managed with a good rule of priorities to avoid hacks or non-compliant implementations, can certainly accelerate the development and adoption rate by many years, can increase the number of experiments possible, and can help avoid errors during the encoding process for new characters and scripts. It could reduce this delay from about 10 years (during which even if the script or characters are encoded, they will not be available or reliably usable), to just a few months (even anticipating the final encoding in the UCS, by a reliable way to represent it as PUAs, managed with the help of a PUA font, and after the UCD encoding, with a font that provides the upgrade path for older implementations of the layout engine that only know an older UCD version). I am completely convinced that we don't need more PUAs due to a continuing lack of support in existing software. But software can still be updated to provide the support with the help of transitional subtables in fonts (that can easily be ignored by newer engines that won't require such extension tables), for integrating the additional character properties. Philippe.
Re: RTL PUA?
For once, I am in strong agreement with something Philippe had to say: We really need a reliable way to transport a PUA agreement in such a way that it can be understood by a computer. I don't necessarily agree that fonts, or (especially) any particular font technology, are the one and only way to accomplish this, because there's more to character handling than display. Maybe some sort of open format could be devised that could be used as a plug-in to a variety of existing components. -- Doug Ewell • d...@ewellic.org
Re: RTL PUA?
On Sun, 21 Aug 2011 11:00:26 -0600 Doug Ewell d...@ewellic.org wrote: I think as soon as we start talking about this many scenarios, we are no longer talking about what the *default* bidi class of the PUA (or some part of it) should be. Instead, we are talking about being able to specify private customizations, so that one can have 'AL' runs and 'ON' runs and so forth. I was exploring the consequences to see if there was a one size fits all solution. Someone (you?) suggested ON as a default, and I like it. I think it would also work fairly well for practical CJK applications as well - the only problems are that LRM and RLM would occasionally be needed, and the subtle differences between AL and R would be lost. I expect ARABIC LANGUAGE MARK would not go down well - has it already been proposed and rejected?. Through most of the 1990s, most existing applications and technologies didn't support Unicode at all, or very small parts of it, and the solution generally was to update them so that they would. The same should be true here. Agreed. I also noted that changes would be of limited assistance for extending existing supported scripts. I would suggest that installing a modified copy of UnicodeData.txt seems like a rather clumsy solution; if text files are involved, I'd suggest leaving UnicodeData.txt alone and creating some sort of overrides file. While partial overrides are cleaner, that appears to be the way to fix Pango, albeit via recompilation. According to the comments, its BiDi settings are derived from the file automatically. Also, one needs a method of updating the properties of codepoints as they become assigned and properties change. There are also advantages to trying out proposed changes. Richard.
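The "overrides file" idea discussed above, as opposed to a modified copy of UnicodeData.txt, might look something like the sketch below. Everything about the file format is an assumption made for illustration (the syntax is loosely modelled on UCD data files, with `#` comments and `first..last ; value` lines); no such format is defined by Unicode, Pango or any other component named in this thread. The standard data is left untouched and consulted only when no override applies.

```python
import unicodedata

# Hypothetical overrides file; the syntax is an assumption, not a standard.
OVERRIDES_TEXT = """\
# Private agreement: Bidi_Class overrides for PUA code points
E000..E0FF ; R   # an unencoded right-to-left script
E100..E17F ; AL  # additional Arabic-style letters
F0000      ; ON
"""


def parse_overrides(text):
    """Parse override lines into a list of (first, last, bidi_class)."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        span, value = (part.strip() for part in line.split(";"))
        first, _, last = span.partition("..")
        ranges.append((int(first, 16), int(last or first, 16), value))
    return ranges


def bidi_class(ch, overrides):
    """Use an override when one applies, else the standard UCD value."""
    cp = ord(ch)
    for first, last, value in overrides:
        if first <= cp <= last:
            return value
    return unicodedata.bidirectional(ch)


overrides = parse_overrides(OVERRIDES_TEXT)
```

A real design would presumably also refuse overrides outside the PUA; this sketch does not enforce that.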
Re: RTL PUA?
2011/8/21 Doug Ewell d...@ewellic.org: For once, I am in strong agreement with something Philippe had to say: We really need a reliable way to transport a PUA agreement in such a way that it can be understood by a computer. I don't necessarily agree that fonts, or (especially) any particular font technology, are the one and only way to accomplish this, because there's more to character handling than display. Maybe some sort of open format could be devised that could be used as a plug-in to a variety of existing components. Yes, but without display support, at least, all the other needs will never be addressed, because you won't have encoded text to work with. So don't even dream, for example, about performing plain-text search if you don't have encoded texts to search in! Collation is then a secondary target. Proper display is an immediate need (that even comes before the development of easy input methods, or later developments of spell checkers, content indexers, semantic analyzers, and localization of software to use a given script through its UI). For proper display of PUAs, all that is needed is a minimum set of character properties. I have argued, against what Peter Constable thinks, that OpenType cannot handle RTL characters with PUAs, because it has absolutely no source of information to know whether a run of text is RTL or LTR when implementing the BiDi algorithm. OK, the mirroring property is probably not essential (because most mirrored characters are today only punctuation marks, which already cover a very wide range). If needed, additional PUA punctuation marks may be added, and even coded in two mirrored code positions, even if they are not automatically mirrored according to their context: for such rare cases, using BiDi format controls around them, or other equivalent CSS embedding styles in HTML, and similar techniques, will be enough.
But for most of the RTL text using PUAs in long runs or mixed within other sequences of standard RTL characters (for example in the middle of words), format controls are clearly not the solution (this does not work reliably in HTML, for example, if you have to split words into separate spans, and inserting those controls in the middle of words is really a nightmare). In addition it completely defeats the plain-text searchability and editability of encoded texts. This will slow down the production of encoded texts so much that, in fact, almost no work will be done with those PUAs. As a consequence, most texts will wait indefinitely for some encoding effort. The need will become even more urgent now that the UTC and WG2 will spend most of their time discussing scripts that are rarely used, where the cultural knowledge will be difficult to find. If we don't have an easy way to experiment with their encodings, at least with PUAs, for extended periods (because there will be the need for a long research period, with conflicting experiments), those scripts will remain unencoded in the UCS for a very long time. And in fact I doubt that even the WG2 or the UTC will have the resources to provide all this effort without committing many critical errors that will be a plague for the long-term future. We absolutely need a transition mechanism, and PUAs can be part of this transition. For the same reason, the possibility of supporting external character properties, for characters that are not encoded or are encoded in separate efforts via PUAs, and that will later be encoded with low levels of implementation and deployment for many years, would certainly help keep the needed resources (at UTC and WG2) at a low level, where most of the experiments will be performed independently without depending on the release of a putative version of the UCS finally accepting to encode the script.
But even in this case, for historic scripts, the encoding effort will be hard to finalize: it is highly probable that those scripts will be encoded progressively, with a starting minimum subset about which most people will agree, and many other characters remaining that need longer experimentation or research. Those scripts will then need to support, for a long time, a mix of standard assignments and PUAs at the same time, for distinct small communities that will need to share and discuss their agreement. The current problem is that there is absolutely no transition mechanism in the UCS encoding process: a character gets fully encoded with most of its essential properties becoming normative, some of them impossible to change later (even if there was an error or an unexpected caveat that the interested communities did not have any chance to experiment with before they were finally approved by the UTC and WG2). Unicode should not interfere with what users will want to do with PUAs. After all, PUAs were made specifically for that. If users need to assign their own property values to PUAs, they must be able to do that. And these properties must find a way to be representable in the current technology
Re: RTL PUA?
On 8/21/2011 3:31 PM, Richard Wordingham wrote: On Sun, 21 Aug 2011 11:00:26 -0600 Doug Ewell d...@ewellic.org wrote: I think as soon as we start talking about this many scenarios, we are no longer talking about what the *default* bidi class of the PUA (or some part of it) should be. Instead, we are talking about being able to specify private customizations, so that one can have 'AL' runs and 'ON' runs and so forth. I was exploring the consequences to see if there was a one size fits all solution. Someone (you?) suggested ON as a default, and I like it. I think it would also work fairly well for practical CJK applications as well - the only problems are that LRM and RLM would occasionally be needed, and the subtle differences between AL and R would be lost. I expect ARABIC LANGUAGE MARK would not go down well - has it already been proposed and rejected? If your implementation supported the directional overrides, it would be possible to use these to lay out any RTL text in a portable manner. Just enclose any RTL run with RLO and PDF (pop directional formatting). No impact on any existing implementation, no impact on the standard. Those who produce rendering engines that do not support these overrides today could be leaned on to upgrade their implementations - that change would benefit users of non-PUA RTL languages as well (because sometimes, the bidi-algorithm can fail, such as for part numbers, and being able to use RLO is a simple way to stabilize such problematic text). Treating PUA characters as ON is very problematic - their display would become context sensitive in unintended ways. No users of CJK characters would think of using LRM characters, but if text is inserted or viewed in RTL context, it could behave randomly. In contrast, always supplying a RLO override for RTL text (containing PUA characters) would be a simple thing to remember and to get right. A./
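The RLO ... PDF approach suggested above amounts to bracketing each right-to-left PUA run with two format characters. A minimal sketch follows; `wrap_rtl` and the sample PUA code points are hypothetical, chosen only for illustration, while U+202E and U+202C are the actual RLO and PDF characters.

```python
import unicodedata

RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def wrap_rtl(run):
    """Enclose a run in RLO ... PDF so it is laid out right-to-left."""
    return f"{RLO}{run}{PDF}"


# Three arbitrary PUA code points standing in for a word in some
# privately agreed right-to-left script (an assumption for this example).
pua_word = "\uE000\uE001\uE002"
wrapped = wrap_rtl(pua_word)
control_names = [unicodedata.name(c) for c in (RLO, PDF)]
```

The wrapped string is two characters longer than the original, so the controls become part of the stored text; any process that searches or compares such text has to take them into account.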
Re: RTL PUA?
I suggested 'R' for Plane 16, not 'ON'. What's a LANGUAGE MARK? -- Doug Ewell • d...@ewellic.org Sent via BlackBerry by ATT -Original Message- From: Richard Wordingham richard.wording...@ntlworld.com Sender: unicode-bou...@unicode.org Date: Sun, 21 Aug 2011 23:31:58 To: unicode@unicode.org Subject: Re: RTL PUA? On Sun, 21 Aug 2011 11:00:26 -0600 Doug Ewell d...@ewellic.org wrote: I think as soon as we start talking about this many scenarios, we are no longer talking about what the *default* bidi class of the PUA (or some part of it) should be. Instead, we are talking about being able to specify private customizations, so that one can have 'AL' runs and 'ON' runs and so forth. I was exploring the consequences to see if there was a one size fits all solution. Someone (you?) suggested ON as a default, and I like it. I think it would also work fairly well for practical CJK applications as well - the only problems are that LRM and RLM would occasionally be needed, and the subtle differences between AL and R would be lost. I expect ARABIC LANGUAGE MARK would not go down well - has it already been proposed and rejected?. Through most of the 1990s, most existing applications and technologies didn't support Unicode at all, or very small parts of it, and the solution generally was to update them so that they would. The same should be true here. Agreed. I also noted that changes would be of limited assistance for extending existing supported scripts. I would suggest that installing a modified copy of UnicodeData.txt seems like a rather clumsy solution; if text files are involved, I'd suggest leaving UnicodeData.txt alone and creating some sort of overrides file. While partial overrides are cleaner, that appears to be the way to fix Pango, albeit via recompilation. According to the comments, its BiDi settings are derived from the file automatically. Also, one needs a method of updating the properties of codepoints as they become assigned and properties change. 
There are also advantages to trying out proposed changes. Richard.
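The "overrides file" idea from the exchange above (leave UnicodeData.txt alone and layer private assignments on top) might look roughly like the following Python sketch. Everything specific here is an assumption made for illustration: the file syntax, the sample ranges, and the function names are hypothetical, and this is not how Pango or any other engine is known to work.

```python
import unicodedata

# Hypothetical overrides file: "first..last ; bidi class", '#' starts a comment.
SAMPLE_OVERRIDES = """\
# private agreement: these PUA ranges hold RTL scripts
E000..E0FF ; R
F0000..F00FF ; AL
"""


def parse_overrides(text):
    """Return a list of (first, last, bidi_class) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        span, cls = (part.strip() for part in line.split(";"))
        first, _, last = span.partition("..")
        # A single code point ("E123 ; R") is treated as a one-element range.
        ranges.append((int(first, 16), int(last or first, 16), cls))
    return ranges


def bidi_class(ch, overrides):
    """Consult the private overrides first, then the stock character database."""
    cp = ord(ch)
    for first, last, cls in overrides:
        if first <= cp <= last:
            return cls
    return unicodedata.bidirectional(ch)


overrides = parse_overrides(SAMPLE_OVERRIDES)
print(bidi_class("\uE000", overrides))      # R  (overridden)
print(bidi_class("\U000F0000", overrides))  # AL (overridden)
print(bidi_class("A", overrides))           # L  (stock database)
```

Because the stock data is untouched, properties of newly assigned code points arrive with the next database update, and only the private ranges need maintaining.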
Re: RTL PUA?
On 22 Aug 2011, at 00:37, Asmus Freytag wrote:

If your implementation supported the directional overrides, it would be possible to use these to lay out any RTL text in a portable manner. Just enclose any RTL run with RLO and PDF (pop directional formatting). No impact on any existing implementation, no impact on the standard.

Useful for RTL'ing the Phaistos Disc text or even Latin for the Jabberwocky text. Not so desirable for nonce or novel Arabic (or other RTL script) characters intended to be used within RTL text strings.

Those who produce rendering engines that do not support these overrides today could be leaned on to upgrade their implementations - that change would benefit users of non-PUA RTL languages as well (because sometimes, the bidi-algorithm can fail, such as for part numbers, and being able to use RLO is a simple way to stabilize such problematic text).

The problem is that existing PUA characters are all strong L.

Treating PUA characters as ON is very problematic - their display would become context sensitive in unintended ways. No users of CJK characters would think of using LRM characters, but if text is inserted or viewed in RTL context, it could behave randomly.

Easy to fix: Add RTL PUA characters.

In contrast, always supplying a RLO override for RTL text (containing PUA characters) would be a simple thing to remember and to get right.

Not, I think, practical and certainly not putting RTL and LTR users on the same level in terms of PUA usage.

Michael Everson * http://www.evertype.com/
Re: RTL PUA?
On Sun, 21 Aug 2011 23:55:46 + Doug Ewell d...@ewellic.org wrote:

What's a LANGUAGE MARK?

There are *three* strong directionalities - 'L' left-to-right, 'AL' right-to-left as in Arabic, 'R' right-to-left (as in Hebrew, I suspect). 'AL' and 'R' have different effects on certain characters next to digits - it's the mind-numbing part of the BiDi algorithm. With one, a $ sign after a string of European (or is it Arabic?) digits appears on the left, and with the other it appears on the right. I can't remember whether 'higher-level protocols' have an effect on this logic.

LRM has a BC of L, RLM has a BC of R, but no invisible character has a BC of AL. That's why I tentatively raised the notion of ARABIC LANGUAGE MARK. Incidentally, an RLO gives characters with a temporary BC of R, not AL.

Richard.
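The bidi classes (BC) mentioned in the message above can be inspected with Python's standard `unicodedata` module. This is a small illustrative sketch; the choice of sample characters is an assumption made for the example, and the values reported depend on the Unicode version bundled with the Python build in use.

```python
import unicodedata

# Sample characters: one per strong class, plus the invisible marks
# and one Private Use code point for comparison.
samples = {
    "LATIN CAPITAL LETTER A": "A",
    "HEBREW LETTER ALEF": "\u05D0",
    "ARABIC LETTER ALEF": "\u0627",
    "LEFT-TO-RIGHT MARK": "\u200E",
    "RIGHT-TO-LEFT MARK": "\u200F",
    "PRIVATE USE U+E000": "\uE000",
}

for label, ch in samples.items():
    # unicodedata.bidirectional() returns the Bidi_Class short name.
    print(f"U+{ord(ch):04X} {label}: {unicodedata.bidirectional(ch)}")
```

The letters should report L, R and AL respectively, and the two marks L and R, matching the description of the three strong directionalities.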
Re: RTL PUA?
On Sun, 21 Aug 2011 16:37:34 -0700 Asmus Freytag asm...@ix.netcom.com wrote:

Treating PUA characters as ON is very problematic - their display would become context sensitive in unintended ways. No users of CJK characters would think of using LRM characters, but if text is inserted or viewed in RTL context, it could behave randomly.

I think a problem would be immediately obvious. Also, the CJK PUA characters would usually be guarded by non-PUA CJK characters.

In contrast, always supplying a RLO override for RTL text (containing PUA characters) would be a simple thing to remember and to get right.

So long as you remembered to pop before digits. This could easily go wrong if the text were amended. For example, if two paragraphs were merged, one could easily delete a PDF, and then digits at the bottom of the second paragraph, quite possibly off-screen at the time, would suddenly flip.

Richard.
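The editing hazard described above, a PDF lost when paragraphs are merged, is the kind of thing a simple check could flag. The following Python sketch counts embedding and override initiators that are never closed within a paragraph; the function name `unclosed_overrides` and the sample strings are hypothetical, and the check does not implement the bidi algorithm itself.

```python
# Explicit embedding/override initiators: LRE, RLE, LRO, RLO.
OPENERS = {"\u202A", "\u202B", "\u202D", "\u202E"}
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def unclosed_overrides(paragraph: str) -> int:
    """Count initiators with no matching PDF before the paragraph ends."""
    depth = 0
    for ch in paragraph:
        if ch in OPENERS:
            depth += 1
        elif ch == PDF and depth > 0:
            depth -= 1
    return depth


# A PUA run correctly wrapped in RLO ... PDF, followed by digits.
intact = "\u202E\uE000\uE001\u202C Part 123"
# The same text after an edit that accidentally deleted the PDF.
damaged = "\u202E\uE000\uE001 Part 123"

print(unclosed_overrides(intact))   # 0
print(unclosed_overrides(damaged))  # 1
```

A non-zero result means the override stays in effect to the end of the paragraph, so trailing digits fall under it.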