Re: ligature usage - WAS: How do we find out what assigned code points aren't normally used in text?
Well, it's not that complicated. Ligatures in German must not happen at compound break points, while they can be applied to ordinary break points. Consider the word `Dorfladen' (village shop). Using `=' to indicate a compound break point and `-' for normal ones, the proper break points are `Dorf=la-den', which means no `fl' ligature. Note that `Fladen' means `cow dung', so having a ligature there is really bad.

But "Dorfladen" is not ambiguous. Asmus was referring to ambiguous cases created by the way compound words are spelled in German. For those, some user interaction is necessary, and it's my view that there are unobtrusive ways of interacting with the user about this.

(But then it needs to be acknowledged that ambiguous cases probably exist or can be constructed in a lot of languages. And the frequency of such ambiguity occurring in actual German text isn't that high, even more so if one takes into account the orthographic recommendation to use an explicit hyphen in ambiguous cases. But of course these cases, if they occur, need to be handled nevertheless.)

Stephan
Re: ligature usage - WAS: How do we find out what assigned code points aren't normally used in text?
>> Certain layout processes, in certain cases, in certain languages, simply can't be fully automated.
>
> And interestingly, there is a crucial difference between ligatures and hyphenation in this regard: While a conservative processor could simply omit hyphenation in ambiguous cases (potentially leading to suboptimal linebreaking though), a decision ought to be made for ligatures if one uses a font requiring them. But then, although getting ligatures wrong in this case is categorically somehow "worse" than too-wide inter-word spacing, who knows which visual effect actually has more adverse effect on the reading process ...

Well, it's not that complicated. Ligatures in German must not happen at compound break points, while they can be applied to ordinary break points. Consider the word `Dorfladen' (village shop). Using `=' to indicate a compound break point and `-' for normal ones, the proper break points are `Dorf=la-den', which means no `fl' ligature. Note that `Fladen' means `cow dung', so having a ligature there is really bad.

On the other hand, consider `Löffel' (spoon). In spite of the hyphenation `Löf-fel', a ligature looks good.

Werner
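One way the distinction Werner describes can be recorded in plain text is with U+200C ZERO WIDTH NON-JOINER, which the Unicode Standard provides for inhibiting ligation. The following is a minimal Python sketch of that idea, not anyone's actual implementation; the compound-boundary dictionary is a hypothetical stand-in for real German morphological analysis.

    # Insert U+200C ZERO WIDTH NON-JOINER at a compound boundary so that a
    # renderer will not form a ligature across it.  COMPOUND_BOUNDARY is a
    # made-up placeholder for real morphological data.
    ZWNJ = "\u200C"

    COMPOUND_BOUNDARY = {
        "Dorfladen": 4,   # Dorf=la-den: the f-l pair straddles the "=" boundary
        "Löffel": None,   # Löf-fel is an ordinary break, so the ff ligature stays
    }

    def suppress_compound_ligatures(word):
        """Return `word` with ZWNJ inserted at its compound boundary, if any."""
        pos = COMPOUND_BOUNDARY.get(word)
        if pos is None:
            return word
        return word[:pos] + ZWNJ + word[pos:]

    print(repr(suppress_compound_ligatures("Dorfladen")))  # 'Dorf\u200claden'
    print(repr(suppress_compound_ligatures("Löffel")))     # 'Löffel' (unchanged)

Whether a renderer actually drops the fl ligature then depends on the font and shaping engine honouring ZWNJ, but that is the mechanism the standard documents for controlling ligature formation.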
Re: ligature usage - WAS: How do we find out what assigned code points aren't normally used in text?
From my background I never perceived a need, but I guess I (and most people??) wouldn't really mind the tradition coming back (in Germany) if things are designed well (which is the job of the font designer) and for the user everything is handled automatically in the background by the available technology ...

Which cannot happen for German, as it is one of the languages where the same letter pair may or may not have a ligature based on the *meaning* of the word - something that you can't automate.

You are absolutely right! Certain layout processes, in certain cases, in certain languages, simply can't be fully automated.

*Actually*, the emphasis here is on the word "fully". Writing a (language-specific) tool (or word-processor plugin) for semi-automated processing would be so easy - something that walks you through all cases of ambiguous hyphenation and ligatures (if the font so requires). An unobtrusive way of doing this would be for the word processor to simply put a purple squiggly line under each word needing closer inspection, for right-click fixing. I'm really wondering why such tools are not employed or - if they are - why I haven't heard of them ...

Stephan
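As a rough illustration of the semi-automated pass Stephan sketches (purely a sketch; the ambiguity list and the "interaction" below are placeholders, not a real word-processor feature), such a checker could be as small as this Python fragment:

    # Flag words whose ligature/hyphenation treatment is ambiguous and needs a
    # human decision - the places a word processor would mark with a squiggly
    # underline.  AMBIGUOUS is a toy placeholder dictionary.
    import re

    AMBIGUOUS = {
        "Wachstube": ["Wach-stube (guard room, st ligature allowed)",
                      "Wachs-tube (tube of wax, no st ligature)"],
    }

    def flag_ambiguous_words(text):
        """Yield (word, offset, possible readings) for words needing review."""
        for m in re.finditer(r"\w+", text):
            readings = AMBIGUOUS.get(m.group())
            if readings:
                yield m.group(), m.start(), readings

    for word, offset, readings in flag_ambiguous_words("Die Wachstube war leer."):
        print(word, "at offset", offset, "- choose one of:", readings)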
Re: ligature usage - WAS: How do we find out what assigned code points aren't normally used in text?
From my background I never perceived a need, but I guess I (and most people??) wouldn't really mind the tradition coming back (in Germany) if things are designed well (which is the job of the font designer) and for the user everything is handled automatically in the background by the available technology ...

Which cannot happen for German, as it is one of the languages where the same letter pair may or may not have a ligature based on the *meaning* of the word - something that you can't automate.

You are absolutely right!

We had famous discussions on this list on this subject. Take an "st" ligature. There are two meanings for the German word "Wachstube", only one allows the st ligature. A human would have to decide when the ligature is appropriate. (Incidentally, the same goes for hyphenation for this word: one meaning allows a hyphen after the "s", the other does not.)

Certain layout processes, in certain cases, in certain languages, simply can't be fully automated.

And interestingly, there is a crucial difference between ligatures and hyphenation in this regard: While a conservative processor could simply omit hyphenation in ambiguous cases (potentially leading to suboptimal linebreaking though), a decision ought to be made for ligatures if one uses a font requiring them. But then, although getting ligatures wrong in this case is categorically somehow "worse" than too-wide inter-word spacing, who knows which visual effect actually has the more adverse effect on the reading process ...

There are two ways of generalizing from a situation where a locale tends to preferably use fonts without (and not necessitating) ligatures. If fonts with ligatures are introduced ...

(1) [generalizing: "we're not using ligatures"] ... the community is going to find it distracting because it is not used to the ligatures, plus there may be inherent problems with this for the respective locale anyway.

(2) [generalizing: "presently used fonts don't use ligatures"] ... the community won't find it distracting because good fonts will do ligatures well (while the great majority of laymen might neither notice nor care ...).

This theoretical ambiguity in generalizing simply arises from the fact that "not using ligatures" is equivalent to "not using fonts having/necessitating ligatures" in Germany. Lots of the English-language discussion of ligatures I've seen tacitly assumes that "good" typesetting with "good" fonts "should" use ligatures in certain cases, and I just disagree with this assumption. Well, forgive me if maybe I'm just getting the wrong impression, being a layman on this matter.

Stephan
RE: ligature usage - WAS: How do we find out what assigned code points aren't normally used in text?
There are certainly monospaced fonts that support Arabic. For instance, the Windows fonts Courier New and Simplified Arabic Fixed support Arabic. Devanagari is a different matter.

Peter

-----Original Message-----
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Richard Wordingham
Sent: Sunday, September 11, 2011 5:19 PM
To: Unicode Discussion
Subject: Re: ligature usage - WAS: How do we find out what assigned code points aren't normally used in text?

On Sun, 11 Sep 2011 23:14:04 +0200 Kent Karlsson wrote:

> Den 2011-09-11 18:53, skrev "Peter Constable" :
>> Hence, in a monospaced font, FB01 certainly should look different from <0066, 0069>, regardless of whether ligature glyphs are used in either case.
>
> If "monospace" is interpreted that rigidly, then it is much better *not* to have any glyph at all for FB01 (and other characters like it) in a "monospace" font.

Aesthetically you're correct, but U+FB01 and U+00E6 LATIN SMALL LETTER AE both have the ID_Start property, and the latter is definitely allowed in C identifiers. While U+00E6 is much more secure as a character, it too tends to be quite ugly in monospaced fonts. (Courier can be quite useful for setting off text as computer code, especially variable and function names.)

Incidentally, are there working definitions of monospace for Arabic and Devanagari?

Richard.
Re: ligature usage - WAS: How do we find out what assigned code points aren't normally used in text?
On Sun, 11 Sep 2011 23:14:04 +0200 Kent Karlsson wrote:

> Den 2011-09-11 18:53, skrev "Peter Constable" :
>> Hence, in a monospaced font, FB01 certainly should look different from <0066, 0069>, regardless of whether ligature glyphs are used in either case.
>
> If "monospace" is interpreted that rigidly, then it is much better *not* to have any glyph at all for FB01 (and other characters like it) in a "monospace" font.

Aesthetically you're correct, but U+FB01 and U+00E6 LATIN SMALL LETTER AE both have the ID_Start property, and the latter is definitely allowed in C identifiers. While U+00E6 is much more secure as a character, it too tends to be quite ugly in monospaced fonts. (Courier can be quite useful for setting off text as computer code, especially variable and function names.)

Incidentally, are there working definitions of monospace for Arabic and Devanagari?

Richard.
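A short Python snippet (a sketch, nothing more) makes Richard's contrast concrete: both characters are lowercase letters, but only U+FB01 carries a compatibility decomposition, and acceptability as an identifier start can simply be queried rather than asserted, since it depends on the XID_Start data of the Unicode version in use.

    # Compare U+FB01 and U+00E6: name, general category, compatibility
    # decomposition, and whether Python (whose identifiers are based on
    # XID_Start with NFKC normalization) accepts each as an identifier.
    import unicodedata

    for ch in ("\uFB01", "\u00E6"):
        print("U+%04X %s" % (ord(ch), unicodedata.name(ch)),
              "  category:", unicodedata.category(ch),
              "  decomposition:", unicodedata.decomposition(ch) or "(none)",
              "  isidentifier():", ch.isidentifier())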
Don't be evil - Unicode Inc President
The subject was before "Re: Mail filtering, and Tulasi - (was) Re: Everson's Ahom proposal" Changed it to "Don't be evil - Unicode Inc President" "Shall Mark Davis continue to encoding any letter/symbol used in scripture like Koran?". The "unicode hot f**k" portal link/passage portion is omitted from the email ~mark (Mark E. Shoulson) cited and used as reference to compose the reply (appended herewith). It was Everson/Magda who omitted that portion before the email was delivered to unicode forum. So ~mark did not read the original email: http://www.mail-archive.com/unicode@unicode.org/msg28840.html ~mark am I correct? Fyi, Magda recently moved that "unicode hot f**k" portal to "under cover" state. > You're really out of line making cracks about how much money > Mark Davis is or should be making; it's (a) not your business and > (b) not relevant to the discussion at hand. To experiment one shall post a message to a worldwide Islamic forum. In the message s(he) shall mention facts like: Mark Davis, President of Unicode Inc, has supervised encoding some letters/symbols used in "Koran". google upper-deck executives orchestrate google policies. google advertises in "hot f**k" portal for revenue. CNN and other news say google profited from add in "prescription drug abuse". > Might as well get this out into the open. Portion of Mark Davis (Unicode Inc President) income come from "hot f**k" portal revenue and "prescription drug abuse". So shall Mark Davis continue to encoding any letter/symbol used in scripture like "Koran"? Google add on hot f**k portal: http://techstack.com/forum/apache/70011-hi-im-16-hot-f*u*c*k-me-night-free.html http://www.xred2.com/Amazing_hot_amateur_on_Fucking_Machines___Hardcore_sex_video Prescription drug abuse: http://articles.cnn.com/2011-06-22/us/google.drug.ads_1_prescription-drug-abuse-google-advertising-internet-search-giant-google/2?_s=PM:US http://www.ktbs.com/news/28409624/detail.html Who is Magda Danish? Administrator of unicode @ unicode.org forum http://www.jigsaw.com/scid11549476/magda_danish.xhtml?ver=5 http://unicode.org/consortium/directors.html http://unicode.org/consortium/img/magda.jpg > ~mark > (NOT employed by any large corporation, not making money off > advertisements, etc. That good enough?) Shall be good enough if not from add in hot f**k portal / drug abuse as well as other low-moral conscious-deficient activity. Is this response good enough? Tulasi From: Mark E. Shoulson Date: Wed, Jun 29, 2011 at 6:29 PM Subject: Re: Mail filtering, and Tulasi - (was) Re: Everson's Ahom proposal To: tulasi Cc: Unicode Discussion On 06/29/2011 02:58 PM, tulasi wrote: > > This unicode @ googlegroups is a property of Google Inc. > Do you know that 97% of google revenue comes from advertisement? > http://gigaom.com/2009/07/17/where-does-google-get-97-of-its-revenue/ > > It seems Mark Davis is upper-deck executive at Google Inc. > So part of his living comes from the revenue that Google Inc earns from > advertisements through protocol like unicode @ googlegroups > > My suggestion to Mark:- > Ask Google Inc to clean-up all such protocol and take pay-cut - > fyi academia in California has been living with at least 15% pay-cut for a > while. > Shall Mark do so it shall elevate Unicode Inc moral/consciousness. > I hope so! > Look, this keeps going on and you really should stop it. First of all, saying that Google gets 97% of its income from advertising is like saying that doctors get 97% of their income from patients. 
That's the business they're in, were you expecting something else? You seem to have some kind of axe to grind against Unicode for being a commercial entity. Might as well get this out into the open. Standards committees generally are part of the commercial sector, because businesses are a big part of what will use and be affected by the standards. I'm not sure how you propose to do what needs to be done to create a standard like this without business involvement. There are occasional complaints and resentment (including from me) about perceived overstrong influence of this or that company in some of the decision-making, but that's to be expected, and I don't think anyone (except you) really believes that some companies are out there maliciously trying to tweak Unicode in order somehow to make them money unfairly. You're really out of line making cracks about how much money Mark Davis is or should be making; it's (a) not your business and (b) not relevant to the discussion at hand. So, out with it: what is it that you suspect Unicode Inc and Google are doing that is so underhanded and unfair that you need to make these insinuations? Let's at least set that to rest so you can talk about the *actual* business of Unicode without such distractions. ~mark (NOT employed by any large corporation, not making money off advertisements, etc. That good enough?)
Re: ligature usage - WAS: How do we find out what assigned code points aren't normally used in text?
Den 2011-09-11 18:53, skrev "Peter Constable" :

> There's no requirement that the width of glyphs in a monospaced font be 1 em. I would agree, though, that if a monospaced font forms a ligature of a pair like <0066, 0069>, then it should be twice the width (not necessarily 2 em) of single-character glyphs.

That's fine (assuming the ligature is well designed, in the case of a monospace font connecting the bar of the f to the top serif of the i and only that).

> In a monospace font, nothing prevents the glyph for FB01 being a ligature, and some monospaced fonts do have a ligature glyph for that character.

Fine too. But see below.

> Of course, in a monospaced font, the glyph for that character should be the same width as all other glyphs. So if it's not a ligature, then the "f" and "i" elements still need to be narrower than the glyphs for 0066 and 0069.
>
> Hence, in a monospaced font, FB01 certainly should look different from <0066, 0069>, regardless of whether ligature glyphs are used in either case.

If "monospace" is interpreted that rigidly, then it is much better *not* to have any glyph at all for FB01 (and other characters like it) in a "monospace" font.

/Kent K

> Peter
>
> -----Original Message-----
> From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Philippe Verdy
> Sent: Saturday, September 10, 2011 10:33 PM
> To: Michael Everson
> Cc: unicode Unicode Discussion
> Subject: Re: ligature usage - WAS: How do we find out what assigned code points aren't normally used in text?
>
> 2011/9/11 Michael Everson :
>> On 11 Sep 2011, at 00:23, Richard Wordingham wrote:
>>
>>> A font need not support such ligation, but a glyph for U+FB01 must ligate the letters - otherwise it's not U+FB01!
>>
>> Not in monowidth, it doesn't.
>
> I also agree, a monospaced font can perfectly show the dot and ligate the letters, using a "double-width" (2 em) ligature without any problem, or simply not map it at all, or choose to just map a composite glyph made of the 1 em-width glyphs assigned to the two letters f and (dotted) i without showing any visible ligation between those glyphs (this being consistent with monospaced fonts that remove all ligations, variable advances and kernings between letters).
>
> You could as well have a font design in which all pairs of Latin letters are joined, including in a monospaced font, in which case you should not see any difference between FB01 and the pair of Basic Latin letters. Joining letters is fully independent of the fact that the upper part of letter f may or may not interact graphically with the presence of a dot. If the style of letter glyphs does not cause any interaction, there's no reason to remove the dot over i or j in the "ligature" or joined letters.
>
> You should not be limited by the common style used in modern Times-like fonts (notably in italic styles, where the letter f overhangs the nearby letters). Other font styles also exist that do not require adjustment to remove the dot, or merge it with a graphic feature of the preceding letter f which is specific to some fonts.
>
> As the pair of letters f and (dotted) i is perfectly valid in Turkish, there's absolutely no reason why the fi ligature would be invalid in Turkish. But given that this character is just provided for compatibility with legacy encodings, I would still not recommend it for Turkish or for any other language, including English. This FB01 character is not necessary to any orthography and if possible, should be replaced by the pair of Basic Latin letters (and in fact I don't see any reason why a font would not choose to do this everywhere).
>
> -- Philippe.
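Whether a given monospaced font takes Kent's route (no glyph for FB01 at all) or Peter's (a same-width glyph) is easy to inspect. Here is a minimal sketch using fontTools; the font path is a placeholder assumption.

    # Does this monospaced font map U+FB01, and if so, does the glyph share
    # the single advance width used by everything else?  "SomeMono.ttf" is a
    # hypothetical path.
    from fontTools.ttLib import TTFont

    font = TTFont("SomeMono.ttf")
    cmap = font["cmap"].getBestCmap()   # code point -> glyph name
    metrics = font["hmtx"].metrics      # glyph name -> (advance width, lsb)

    # In a strictly monospaced font every visible glyph shares one advance.
    advances = {adv for adv, _lsb in metrics.values() if adv > 0}
    print("distinct nonzero advance widths:", sorted(advances))

    glyph = cmap.get(0xFB01)
    if glyph is None:
        print("U+FB01 is not mapped (the option Kent prefers)")
    else:
        print("U+FB01 -> %s, advance %d" % (glyph, metrics[glyph][0]))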
Re: ligature usage - WAS: How do we find out what assigned code points aren't normally used in text?
An old acquaintance of mine, many years ago, pointed out two cases in Dutch: a hunter of kiwi birds, kiwijager, cannot use the customary ij ligature. And as for parsing ambiguities, he observed that there were three different ways of understanding the word "kwartslagen", depending on whether it was read "kwart-slagen", "kwarts-lagen", or "kwart-sla-gen".

Peter Ingerman

On 2011-09-11 00:42, Asmus Freytag wrote:

On 9/9/2011 8:12 PM, Stephan Stiller wrote:

Dear Martin,

Thanks for alerting me to the issue of causal direction of aesthetic preference - it's been on my mind, but your reply helps me sort out some details. When I first encountered text (outside of the German language locale) with ample use of ligatures in modern printed text, I definitely found the ligatures a bit distracting, but partly just because I wasn't used to them. I also perceived them as a solution to what (in Germany) appeared to me to be a real non-issue.

Put simply, there is a conflict between full flexibility for font designs and the burden imposed by sophisticated ligatures and kerning tables.

From my background I never perceived a need, but I guess I (and most people??) wouldn't really mind the tradition coming back (in Germany) if things are designed well (which is the job of the font designer) and for the user everything is handled automatically in the background by the available technology ...

Which cannot happen for German, as it is one of the languages where the same letter pair may or may not have a ligature based on the *meaning* of the word - something that you can't automate.

We had famous discussions on this list on this subject. Take an "st" ligature. There are two meanings for the German word "Wachstube", only one allows the st ligature. A human would have to decide when the ligature is appropriate. (Incidentally, the same goes for hyphenation for this word, one meaning allows a hyphen after the "s" the other does not).

Certain layout processes, in certain cases, in certain languages, simply can't be fully automated.

A./

Stephan
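The "kwartslagen" case is exactly the segmentation ambiguity that dictionary lookup alone cannot resolve. A toy Python sketch (with a made-up lexicon fragment, purely for illustration) shows all three readings falling out of the same spelling:

    # Enumerate every way of splitting a word into entries of a toy lexicon.
    # LEXICON is an invented fragment, not real Dutch dictionary data.
    LEXICON = {"kwart", "kwarts", "slagen", "lagen", "sla", "gen"}

    def segmentations(word, lexicon=LEXICON):
        """Return every split of `word` into lexicon entries."""
        if not word:
            return [[]]
        results = []
        for i in range(1, len(word) + 1):
            head = word[:i]
            if head in lexicon:
                for rest in segmentations(word[i:], lexicon):
                    results.append([head] + rest)
        return results

    for parts in segmentations("kwartslagen"):
        print("-".join(parts))
    # Prints, in some order: kwart-sla-gen, kwart-slagen, kwarts-lagen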
Re: ligature usage - WAS: How do we find out what assigned code points aren't normally used in text?
On 9/9/2011 8:12 PM, Stephan Stiller wrote:

Dear Martin,

Thanks for alerting me to the issue of causal direction of aesthetic preference - it's been on my mind, but your reply helps me sort out some details. When I first encountered text (outside of the German language locale) with ample use of ligatures in modern printed text, I definitely found the ligatures a bit distracting, but partly just because I wasn't used to them. I also perceived them as a solution to what (in Germany) appeared to me to be a real non-issue.

Put simply, there is a conflict between full flexibility for font designs and the burden imposed by sophisticated ligatures and kerning tables.

From my background I never perceived a need, but I guess I (and most people??) wouldn't really mind the tradition coming back (in Germany) if things are designed well (which is the job of the font designer) and for the user everything is handled automatically in the background by the available technology ...

Which cannot happen for German, as it is one of the languages where the same letter pair may or may not have a ligature based on the *meaning* of the word - something that you can't automate.

We had famous discussions on this list on this subject. Take an "st" ligature. There are two meanings for the German word "Wachstube", only one allows the st ligature. A human would have to decide when the ligature is appropriate. (Incidentally, the same goes for hyphenation for this word, one meaning allows a hyphen after the "s" the other does not).

Certain layout processes, in certain cases, in certain languages, simply can't be fully automated.

A./

Stephan
RE: ligature usage - WAS: How do we find out what assigned code points aren't normally used in text?
There's no requirement that the width of glyphs in a monospaced font be 1 em. I would agree, though, that if a monospaced font forms a ligature of a pair like <0066, 0069>, then it should be twice the width (not necessarily 2 em) of single-character glyphs.

In a monospace font, nothing prevents the glyph for FB01 being a ligature, and some monospaced fonts do have a ligature glyph for that character. Of course, in a monospaced font, the glyph for that character should be the same width as all other glyphs. So if it's not a ligature, then the "f" and "i" elements still need to be narrower than the glyphs for 0066 and 0069.

Hence, in a monospaced font, FB01 certainly should look different from <0066, 0069>, regardless of whether ligature glyphs are used in either case.

Peter

-----Original Message-----
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Philippe Verdy
Sent: Saturday, September 10, 2011 10:33 PM
To: Michael Everson
Cc: unicode Unicode Discussion
Subject: Re: ligature usage - WAS: How do we find out what assigned code points aren't normally used in text?

2011/9/11 Michael Everson :
> On 11 Sep 2011, at 00:23, Richard Wordingham wrote:
>
>> A font need not support such ligation, but a glyph for U+FB01 must ligate the letters - otherwise it's not U+FB01!
>
> Not in monowidth, it doesn't.

I also agree, a monospaced font can perfectly show the dot and ligate the letters, using a "double-width" (2 em) ligature without any problem, or simply not map it at all, or choose to just map a composite glyph made of the 1 em-width glyphs assigned to the two letters f and (dotted) i without showing any visible ligation between those glyphs (this being consistent with monospaced fonts that remove all ligations, variable advances and kernings between letters).

You could as well have a font design in which all pairs of Latin letters are joined, including in a monospaced font, in which case you should not see any difference between FB01 and the pair of Basic Latin letters. Joining letters is fully independent of the fact that the upper part of letter f may or may not interact graphically with the presence of a dot. If the style of letter glyphs does not cause any interaction, there's no reason to remove the dot over i or j in the "ligature" or joined letters.

You should not be limited by the common style used in modern Times-like fonts (notably in italic styles, where the letter f overhangs the nearby letters). Other font styles also exist that do not require adjustment to remove the dot, or merge it with a graphic feature of the preceding letter f which is specific to some fonts.

As the pair of letters f and (dotted) i is perfectly valid in Turkish, there's absolutely no reason why the fi ligature would be invalid in Turkish. But given that this character is just provided for compatibility with legacy encodings, I would still not recommend it for Turkish or for any other language, including English. This FB01 character is not necessary to any orthography and if possible, should be replaced by the pair of Basic Latin letters (and in fact I don't see any reason why a font would not choose to do this everywhere).

-- Philippe.
UAX #14 (UCA): Derived primary weight ranges
I think the UCA forgets to specify which primary weights are valid when inferred from the default rules used in the current DUCET.

# Derived weight ranges:        FB40..FBFF
# [Hani] core primaries:        FB40..FB41 (2)
    U+4E00..U+9FFF              FB40..FB41 (2)
    U+F900..U+FAFF              FB41 (1)
# [Hani] extended primaries:    FB80..FB9D (30)
    U+3400..U+4DBF              FB80 (1)
    U+2..U+E                    FB84..FB9D (29)
# Other primaries:              FBC0..FBE1 (34)
    U+..U+E                     FBC0..FBDD (30)
    U+F..U+10                   FBDE..FBE1 (4)
# Trailing weights:             FC00.. (1024)

This clearly shows that the currently assigned ranges of primary weights are far larger than needed:

- Sinograms can fully be assigned a first primary weight within a set of only 32 values, instead of the 128 assigned.

- This leaves enough room to separate the primary weights used by PUA blocks (both in the BMP and in planes 15 and 16), which requires just 1 primary weight for the PUAs in the BMP and 4 primary weights for the last two planes (if some other future PUA ranges are assigned, for example for RTL PUAs, we could imagine that this count of 5 weights would be extended to

- All other primaries will never be assigned to anything outside planes 0 to 14, and only for unassigned code points (whose primary weight value should probably lie between the first derived primary weights for sinograms and those for the PUA), so they'll never need more than 30 primary weights.

Couldn't we remap these default bases for derived primary weights like this, and keep more space for the rest:

# Derived weight ranges:        FBB0..FBFF (80)
# [Hani] core primaries:        FBB0..FBB1 (2)
    U+4E00..U+9FFF              FBB0 (1)   (using base=U+2000 for the 2nd primary weight)
    U+F900..U+FAFF              FBB1 (1)   (using base=U+A000 for the 2nd primary weight)
# [Hani] extended primaries:    FBB2..FBCF (30)
    U+3400..U+4DBF              FBB2 (1)   (using base=U+2000 for the 2nd primary weight)
    reserved                    FBB3 (1)
    U+2..U+E                    FBB4..FBCF (26)   (using base=U+n or U+n8000 for the 2nd primary weight)
# Other non-PUA primaries:      FBD0..FBEF (32)
    U+..U+E                     FBD0..FBED (30)   (using base=U+n or U+n8000 for the 2nd primary weight)
    reserved                    FBEE..FBEF (2)
# PUA primaries:                FBF0..FBFF (16)
    U+D800..U+DFFF              FBF0 (1)   (using base=U+n8000 for the 2nd primary weight)
    reserved                    FBF1..FBFB (11)
    U+F..U+10                   FBFC..FBFF (4)   (using base=U+n or U+n8000 for the 2nd primary weight)
# Trailing weights:             FC00.. (1024)

This scheme completely frees the range FB40..FBAF, while reducing the gaps currently left, which will never have any use.

(In this scheme, I have no opinion on which range is best for code points assigned to non-characters, but they could all map to FBFF, used here for PUA, with the second primary weight at the end of the encoding space 8000.. moved to 4000..BFFF, so that the second primary weight for non-characters goes easily into C000..)

This way, we would keep ranges available for future large non-sinographic scripts (pictographic, non-Han ideographic) that would probably use only derived weights, or for a refined DUCET containing more precise levels or gaps facilitating some derived collation tables (for example in CLDR). And all PUAs would clearly sort within dedicated ranges of primary weights, with a guarantee that they all sort at the end, after all scripts.

-- Philippe.
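For reference, the derived primaries in the first table follow the implicit-weight computation of UTS #10: AAAA = base + (CP >> 15) and BBBB = (CP & 0x7FFF) | 0x8000, with base FB40, FB80 or FBC0 depending on the code point. A minimal Python sketch, with a deliberately simplified block classification, reproduces the figures quoted above:

    # UCA implicit (derived) primary weights.  The range test is simplified;
    # a real implementation consults the Unified_Ideograph property and the
    # exact CJK extension blocks.
    def implicit_primary_weights(cp):
        """Return the derived primary weight pair (AAAA, BBBB) for code point cp."""
        if 0x4E00 <= cp <= 0x9FFF or 0xF900 <= cp <= 0xFAFF:
            base = 0xFB40        # [Hani] core primaries
        elif 0x3400 <= cp <= 0x4DBF or 0x20000 <= cp <= 0x2A6DF:
            base = 0xFB80        # [Hani] extended primaries (simplified ranges)
        else:
            base = 0xFBC0        # any other code point
        return base + (cp >> 15), (cp & 0x7FFF) | 0x8000

    for cp in (0x4E00, 0x9FFF, 0xF900, 0x3400, 0x20000, 0xE000, 0x10FFFD):
        aaaa, bbbb = implicit_primary_weights(cp)
        print("U+%04X -> [%04X, %04X]" % (cp, aaaa, bbbb))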