Re: [9fans] Woes of New Language Support
On Tue, 28 Jul 2009 07:52:14 -0700 John Floren wrote:
> On Tue, Jul 28, 2009 at 7:11 AM, Ethan Grammatikidis wrote:
> > On Tue, 28 Jul 2009 11:39:46 +0100 Charles Forsyth wrote:
> > >
> > > > the unicode proposal says that matches depend on (re, locale, input).
> > > > not just (re, input). i would think that is not acceptable.
> > >
> > > it's not just the unicode people. shell file name matching takes locale
> > > into account which often makes it case-independent (even with
> > > case-dependent file systems). i hate them all.
> >
> > You've got me wondering why anyone would want case-sensitive filename
> > matching. I don't understand what could be worth the regular irritation
> > I experience at having to get the case exactly right.
>
> This is not VMS! This is Plan 9. There are rules.

*grin* I needed a laugh today, thanks.

-- 
Ethan Grammatikidis

Those who are slower at parsing information must necessarily
be faster at problem-solving.
Re: [9fans] Woes of New Language Support
On Tue, Jul 28, 2009 at 7:11 AM, Ethan Grammatikidis wrote:
> On Tue, 28 Jul 2009 11:39:46 +0100 Charles Forsyth wrote:
> >
> > > the unicode proposal says that matches depend on (re, locale, input).
> > > not just (re, input). i would think that is not acceptable.
> >
> > it's not just the unicode people. shell file name matching takes locale
> > into account which often makes it case-independent (even with
> > case-dependent file systems). i hate them all.
>
> You've got me wondering why anyone would want case-sensitive filename
> matching. I don't understand what could be worth the regular irritation
> I experience at having to get the case exactly right.

This is not VMS! This is Plan 9. There are rules.


John
-- 
"I've tried programming Ruby on Rails, following TechCrunch in my RSS
reader, and drinking absinthe. It doesn't work. I'm going back to C,
Hunter S. Thompson, and cheap whiskey." -- Ted Dziuba
Re: [9fans] Woes of New Language Support
On Tue, 28 Jul 2009 11:39:46 +0100 Charles Forsyth wrote:
> > the unicode proposal says that matches depend on (re, locale, input).
> > not just (re, input). i would think that is not acceptable.
>
> it's not just the unicode people. shell file name matching takes locale
> into account which often makes it case-independent (even with
> case-dependent file systems). i hate them all.

You've got me wondering why anyone would want case-sensitive filename
matching. I don't understand what could be worth the regular irritation
I experience at having to get the case exactly right.

-- 
Ethan Grammatikidis

Those who are slower at parsing information must necessarily
be faster at problem-solving.
Re: [9fans] Woes of New Language Support
> the unicode proposal says that matches depend on (re, locale, input).
> not just (re, input). i would think that is not acceptable.

it's not just the unicode people. shell file name matching takes locale
into account which often makes it case-independent (even with
case-dependent file systems). i hate them all.
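The locale complaint above can be made concrete with Unicode's case-folding tables. A small Python sketch (not from the original thread) showing that even locale-independent full case folding already merges visually distinct codepoints, which is exactly what makes "case-independent" filename matching surprising:

```python
# Unicode full case folding makes "matching" non-obvious even before
# locale enters the picture: distinct codepoints fold together.
kelvin = "\u212a"            # KELVIN SIGN, which looks like 'K'
folded_kelvin = kelvin.casefold()    # folds to plain 'k'

eszett = "stra\u00dfe"       # German 'straße'
folded_eszett = eszett.casefold()    # 'ß' folds to 'ss'
```

With these tables in play, a "case-independent" glob for `*k*` would match a filename containing the Kelvin sign, which is the kind of (re, locale, input) dependence being complained about.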
Re: [9fans] Woes of New Language Support
On Sun Jul 26 14:40:56 EDT 2009, knapj...@gmail.com wrote:
> If I'm reading you right, you're saying it might be easier if
> everything were encoded as combining (or maybe more aptly
> non-combining) codes, regardless of language?
>
> So, we might encode 'Waffles' as w+upper a f f l e s and let the
> renderer (if there is one) handle the presentation of the case shift
> and the potential ligature, but things like grep get noticeably easier
> with no overlap of ő and o+umlaut.
>
> Again, oversimplified, with no real understanding on my part of the
> depth or breadth of the problem space.

you understand.  except, i was taking the opposite position.  if you
did for english what is done for indic languages, then if you typed
'this is a sentence.' the 't' would be capitalized as soon as you typed
the '.'.  there's no hint that this rule needs to be applied; the
renderer would just have to know it.

in ak's example a certain combination of codepoints yields a specific
'letter'.  (i hope i have that right.)  the renderer is just supposed
to know this.  so for consistency, and to reduce the need for
complicated language-specific rules (how do we know that the text
represented is actually from the language we think it is?), i would
force the producer to declare the combinations.

btw, the search problem is not at all solved by standardizing (or is
that standardising?) the combiners problem.  consider the following
bits of unicode fun:

	; grep 'zero width' /lib/unicode
	200b	zero width space
	200c	zero width non-joiner
	200d	zero width joiner
	feff	zero width no-break space

i'm sure that someone more conversant in unicode could point out other
points of real difficulty.  how do you tell unicode from uni\ufeffcode?
not only is that an annoyance, but it could be a pretty interesting
security problem.  and what a gift for spammers!

- erik
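The uni\ufeffcode example above can be demonstrated directly. A Python sketch (not part of the original mail) showing a naive search missing the string, plus one possible workaround of stripping format-category characters before comparing:

```python
import unicodedata

plain = "unicode"
sneaky = "uni\ufeffcode"     # U+FEFF zero width no-break space hidden inside

# the two strings render alike in many fonts but compare unequal,
# so a byte-wise search for "unicode" misses the sneaky form
found = "unicode" in sneaky          # False

# one workaround: drop characters in Unicode general category Cf
# (format characters), which covers the zero-width runs listed above
cleaned = "".join(c for c in sneaky if unicodedata.category(c) != "Cf")
```

This is exactly the spam-filter evasion worry: a match that depends on invisible codepoints surviving into the comparison.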
Re: [9fans] Woes of New Language Support
If I'm reading you right, you're saying it might be easier if
everything were encoded as combining (or maybe more aptly
non-combining) codes, regardless of language?

So, we might encode 'Waffles' as w+upper a f f l e s and let the
renderer (if there is one) handle the presentation of the case shift
and the potential ligature, but things like grep get noticeably easier
with no overlap of ő and o+umlaut.

Again, oversimplified, with no real understanding on my part of the
depth or breadth of the problem space.

If this is the case, could it be handled by pushing everything into a
subset of unicode rather than using the unallocated space to create a
superset?

-J

On 7/26/09, erik quanstrom wrote:
> > to be fair to the unicode people, this decoupling of glyphs and codepoints
> > is (i think) the most straightforward way to implement some languages like
> > arabic, where the glyphs for characters depend on their position within a
> > word.  that is, a letter at the beginning of a word looks different from
> > what it would look like if it was in the middle.
>
> my opinion (not that i'm entitled to one here) is
> that the unicode guys screwed up.  unicode is not
> consistent.  explain why there are two code points for sigma.
> 	03c3	greek small letter sigma
> 	03c2	greek small letter final sigma
> why does german get ä, ö, ü?  if you want to take
> this further, why are there capital forms of latin letters?
> can't that also be inferred by the font?
>
> what's called a ligature in one language is a character
> in another.  i see no consistency.  it seems like the
> unicode committee had a problem with too much
> knowledge of the specific problems and few actual
> unifying (sorry) concepts.
>
> i think it would make much more sense to put this logic
> in editors.  this would also allow the freedom to use a
> capital, ligature, or final form in the wrong place,
> like say studlyCaps.  i can't imagine english is the only
> language in the world that gets abused.
>
> - erik

-- 
Sent from my mobile device
Re: [9fans] Woes of New Language Support
On Sun, Jul 26, 2009 at 09:48:23AM -0400, erik quanstrom wrote:
> > to be fair to the unicode people, this decoupling of glyphs and codepoints
> > is (i think) the most straightforward way to implement some languages like
> > arabic, where the glyphs for characters depend on their position within a
> > word.  that is, a letter at the beginning of a word looks different from
> > what it would look like if it was in the middle.
>
> my opinion (not that i'm entitled to one here) is
> that the unicode guys screwed up.

Oh and how.  Let's not forget punching a huge hole in the code point
namespace to appease the tortured encoding that is UTF-16.

I similarly may not be entitled to have an opinion on Unicode's
handling of linguistics, but their handling of the abstract codepoint
namespace and failure to keep encodings entirely separate is laughable.

--nwf;
Re: [9fans] Woes of New Language Support
> the real problem isn't in viewing them however, but comes when you
> start searching for them: it's easy to search for ë (e-umlaut) for
> example, but what if it's described as e+"U+0308 COMBINING DIAERESIS"?
> the answer is the UTS#18 Regular Expressions technical standard which
> probably contributes at least half of the slowness of gnu grep
> discussed in another thread.  http://www.unicode.org/reports/tr18/

iirc, gnu grep calls malloc for each character of utf-8 input.  awesome.

at a minimum, it would be good to add support to tcs to translate to
canonical form utf.  this would make the searching problem much easier.

the unicode proposal says that matches depend on (re, locale, input),
not just (re, input).  i would think that is not acceptable.

- erik
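The canonical-form translation suggested above for tcs corresponds to Unicode canonical composition (NFC). A Python sketch of the idea (illustrative, not the proposed tcs change itself):

```python
import unicodedata

decomposed = "e\u0308"        # 'e' + COMBINING DIAERESIS
precomposed = "\u00eb"        # LATIN SMALL LETTER E WITH DIAERESIS

# the two spellings of ë are different codepoint sequences...
different = decomposed != precomposed

# ...but NFC (canonical composition) maps both to the same form,
# after which a plain byte-wise grep finds either spelling
normalized = unicodedata.normalize("NFC", decomposed)
```

Normalizing the input stream once, up front, is what lets the search tool stay dumb: (re, input) instead of (re, locale, input).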
Re: [9fans] Woes of New Language Support
On Sun Jul 26 10:14:51 EDT 2009, tlaro...@polynum.com wrote:
> On Sun, Jul 26, 2009 at 09:48:23AM -0400, erik quanstrom wrote:
> >
> > my opinion (not that i'm entitled to one here) is
> > that the unicode guys screwed up.  unicode is not
> > consistent.  explain why there are two code points for sigma.
> > 	03c3	greek small letter sigma
> > 	03c2	greek small letter final sigma
>
> They are distinct in ancient greek at least. The glyph is not the same
> whether the letter is inside or at the end of a word. (At the beginning,
> in ancient greek, there were indeed no blanks between words, just a
> stream of chars...)
>
> Or perhaps did I misunderstand what you wrote.

yes they are.  but we're arguing in the odd, odd world of codepoints.
code points quite pointedly have no canonical glyph.  this is why
unicode often does not distinguish final forms and other ligatures.
it bothers me that the exception seems to be for western languages.

all the glyphs that one needs for most western languages are already
there.  such strange ligatures as there are, like ffl, are just not
important enough to bother with (u+fb03 for those following along at
home).

- erik
Re: [9fans] Woes of New Language Support
On Sun, Jul 26, 2009 at 09:48:23AM -0400, erik quanstrom wrote:
>
> my opinion (not that i'm entitled to one here) is
> that the unicode guys screwed up.  unicode is not
> consistent.  explain why there are two code points for sigma.
> 	03c3	greek small letter sigma
> 	03c2	greek small letter final sigma

They are distinct in ancient greek at least. The glyph is not the same
whether the letter is inside or at the end of a word. (At the beginning,
in ancient greek, there were indeed no blanks between words, just a
stream of chars...)

Or perhaps did I misunderstand what you wrote.

Cheers,
-- 
Thierry Laronde (Alceste)
http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C
Re: [9fans] Woes of New Language Support
> to be fair to the unicode people, this decoupling of glyphs and codepoints
> is (i think) the most straightforward way to implement some languages like
> arabic, where the glyphs for characters depend on their position within a
> word.  that is, a letter at the beginning of a word looks different from
> what it would look like if it was in the middle.

my opinion (not that i'm entitled to one here) is
that the unicode guys screwed up.  unicode is not
consistent.  explain why there are two code points for sigma.
	03c3	greek small letter sigma
	03c2	greek small letter final sigma
why does german get ä, ö, ü?  if you want to take
this further, why are there capital forms of latin letters?
can't that also be inferred by the font?

what's called a ligature in one language is a character
in another.  i see no consistency.  it seems like the
unicode committee had a problem with too much
knowledge of the specific problems and few actual
unifying (sorry) concepts.

i think it would make much more sense to put this logic
in editors.  this would also allow the freedom to use a
capital, ligature, or final form in the wrong place,
like say studlyCaps.  i can't imagine english is the only
language in the world that gets abused.

- erik
Re: [9fans] Woes of New Language Support
Please disregard the question, "kbmap perhaps?" in my last post. I quickly realised that kbmap is only for inputs, while I'm discussing plain old output from every other source. partying too much ak
Re: [9fans] Woes of New Language Support
> what is the total number of stealth characters like nsa?
> if it's not too unreasonable, it might be good enough to steal part of
> the operating system or application reserved areas.

Any consonant should be able to become a half-consonant, but only when
followed by another consonant. In the TTF method, character type
checking falls out easily.

I'm still up for your suggestion, which, if I understand it correctly,
is to take up parts of the unspecified unicode ranges and dedicate them
to half-consonants? You would then have to do this for Bengali, Telugu,
Tamil, Gujarati, Gurumukhi (I think), and perhaps a couple of others.

It's the fastest implementation, but has a couple of setbacks: (a) it
is not homogeneous across all Plan 9 distributions, and (b) it diverges
from general Unicode standards, and thus the problem of reading texts
is still present, as everyone else is still using the
consonant+virama+consonant sequence as opposed to following our
self-defined code maps. One can deal with (a) if dedicated enough to
language support for a billion or so people, but (b) is pretty serious
and still presents us with the same full stop as before.

If there were some way to map unicode sequences to our self-defined
codes, then that could work in this methodology. kbmap perhaps?

Best,
ak
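The "map unicode sequences to our self-defined codes" idea can be sketched as a translation table from conjunct sequences to Private Use Area slots. Everything in this Python sketch is invented for illustration: the PUA base, the single na+virama+sa entry, and the function name.

```python
# hypothetical: map consonant+virama+consonant clusters to Private Use
# Area codepoints so a simple renderer can draw one glyph per rune.
PUA_BASE = 0xE000

# illustrative table with one entry: na + virama + sa -> one PUA slot;
# a real table would enumerate the clusters for each script
conjuncts = {"\u0928\u094d\u0938": chr(PUA_BASE)}

def to_private(text):
    # a fuller table would need to replace longer sequences first
    for seq, pua in conjuncts.items():
        text = text.replace(seq, pua)
    return text
```

This also makes the interchange problem concrete: text run through such a mapping is no longer standard Unicode, which is exactly setback (b) above.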
Re: [9fans] Woes of New Language Support
erik quanstrom wrote:
> yes.  this is a problem.  unfortunately the unicode guys
> took the position that codepoint is divorced from glyphs.
>
> unfortunately, this case isn't as bad as it gets.  e.g. archaic
> cyrillic letters have transliterations like ^^A in unicode.  would
> three hats on an A be illegal?  i don't see what would prevent it.
> and therefore one needs to implement some sort of character
> layout engine to render unicode.  that's pretty bogus.

to be fair to the unicode people, this decoupling of glyphs and
codepoints is (i think) the most straightforward way to implement some
languages like arabic, where the glyphs for characters depend on their
position within a word. that is, a letter at the beginning of a word
looks different from what it would look like if it was in the middle.

salman
Re: [9fans] Woes of New Language Support
diacritics (combining characters) are a real mess in Unicode. with so
much space in the format why did they have to go this route, i wonder?

erik mentioned cyrillic. i did have an old church slavonic bible text
i was attempting to display correctly on Plan 9 sometime in 2003-4.
top is x11 with correctly (i presume) combined characters, below is
the Plan 9 rendering:

http://mirtchovski.com/screenshots/x-p9-diacritics.jpg

there's a pattern there, as you can see: the combining char always
follows the char it's combined with, so you can try simply not
advancing forward as a first draft of implementing char combinations
in Plan 9. there doesn't seem to be a default list of "combining"
characters in UTF so you'll have to pick up all glyphs described as
"combining" and check for them when you input. fun and slow :)

the real problem isn't in viewing them however, but comes when you
start searching for them: it's easy to search for ë (e-umlaut) for
example, but what if it's described as e+"U+0308 COMBINING DIAERESIS"?
the answer is the UTS#18 Regular Expressions technical standard, which
probably contributes at least half of the slowness of gnu grep
discussed in another thread.

http://www.unicode.org/reports/tr18/
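The "pick up all glyphs described as combining" step is a lookup in the Unicode character database rather than a hand-built list. A Python sketch of the detection side only (not the Plan 9 rendering itself):

```python
import unicodedata

text = "e\u0308"   # 'e' followed by U+0308 COMBINING DIAERESIS

# unicodedata.combining() returns a nonzero combining class for
# combining marks and 0 for everything else, so a renderer can use it
# to decide not to advance the pen position after drawing the glyph
marks = [c for c in text if unicodedata.combining(c)]
```

Since the mark always follows its base character, as noted above, a first-draft renderer can draw each character and simply skip the advance whenever this test fires.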
Re: [9fans] Woes of New Language Support
> However, in the class of languages for which I am trying to
> provide support, certain characters are meant to be produced
> by an ordered combination of other characters.  For example,
> the general sequence in Devanagari script (and this extends
> to the other scripts as well) is that
> consonant+virama+consonant produces
> half-consonant+consonant, where the half-consonant has no
> other unicode specification.  As a concrete case in
> Devanagari, na virama sa (viz., \u0928\u094d\u0938) should
> produce the nsa character (this sequence can be seen in any
> unicode representation of the word "Sanskrit" in Devanagari
> script).
>
> It seems to me that TTF font specifications (i.e., those I
> converted to subfonts using Federico's ttf2subf) include
> these sequence definitions, which are then processed by each
> application providing support for the fonts.  Plan 9
> subfonts are much too simple for this.

yes.  this is a problem.  unfortunately the unicode guys
took the position that codepoint is divorced from glyphs.

unfortunately, this case isn't as bad as it gets.  e.g. archaic
cyrillic letters have transliterations like ^^A in unicode.  would
three hats on an A be illegal?  i don't see what would prevent it.
and therefore one needs to implement some sort of character
layout engine to render unicode.  that's pretty bogus.

what is the total number of stealth characters like nsa?
if it's not too unreasonable, it might be good enough to steal part of
the operating system or application reserved areas.

i hope my ignorance of the particular script in question isn't leading
to silly suggestions!

- erik
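The na+virama+sa example can be checked against the Unicode character database. This Python snippet (illustrative only) confirms that the cluster is three ordinary codepoints and that canonical composition does not collapse it into a single precomposed character, so the conjunct must be formed by the renderer:

```python
import unicodedata

seq = "\u0928\u094d\u0938"   # na + virama + sa, as in 'Sanskrit'

# the three codepoints, by their database names
names = [unicodedata.name(c) for c in seq]

# NFC leaves the sequence unchanged: unlike ë, there is no single
# 'nsa' codepoint for composition to produce
composed = unicodedata.normalize("NFC", seq)
```

This is the asymmetry complained about earlier in the thread: European precomposed forms like ë exist as single codepoints, while Indic conjuncts exist only as sequences plus renderer knowledge.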