Re: [9fans] Woes of New Language Support
On Tue, 28 Jul 2009 07:52:14 -0700 John Floren wrote:
> On Tue, Jul 28, 2009 at 7:11 AM, Ethan Grammatikidis wrote:
> > On Tue, 28 Jul 2009 11:39:46 +0100 Charles Forsyth wrote:
> > >
> > > > the unicode proposal says that matches depend on (re, locale, input).
> > > > not just (re, input). i would think that is not acceptable.
> > >
> > > it's not just the unicode people. shell file name matching takes locale
> > > into account which often makes it case-independent (even with
> > > case-dependent file systems). i hate them all.
> >
> > You've got me wondering why anyone would want case-sensitive filename
> > matching. I don't understand what could be worth the regular irritation
> > I experience at having to get the case exactly right.
>
> This is not VMS! This is Plan 9. There are rules.

*grin* I needed a laugh today, thanks.

-- 
Ethan Grammatikidis

Those who are slower at parsing information must necessarily
be faster at problem-solving.
Re: [9fans] Woes of New Language Support
On Tue, Jul 28, 2009 at 7:11 AM, Ethan Grammatikidis wrote:
> On Tue, 28 Jul 2009 11:39:46 +0100 Charles Forsyth wrote:
> >
> > > the unicode proposal says that matches depend on (re, locale, input).
> > > not just (re, input). i would think that is not acceptable.
> >
> > it's not just the unicode people. shell file name matching takes locale
> > into account which often makes it case-independent (even with
> > case-dependent file systems). i hate them all.
>
> You've got me wondering why anyone would want case-sensitive filename
> matching. I don't understand what could be worth the regular irritation
> I experience at having to get the case exactly right.

This is not VMS! This is Plan 9. There are rules.


John
-- 
"I've tried programming Ruby on Rails, following TechCrunch in my RSS
reader, and drinking absinthe. It doesn't work. I'm going back to C,
Hunter S. Thompson, and cheap whiskey." -- Ted Dziuba
Re: [9fans] Woes of New Language Support
On Tue, 28 Jul 2009 11:39:46 +0100 Charles Forsyth wrote:
> > the unicode proposal says that matches depend on (re, locale, input).
> > not just (re, input). i would think that is not acceptable.
>
> it's not just the unicode people. shell file name matching takes locale
> into account which often makes it case-independent (even with
> case-dependent file systems). i hate them all.

You've got me wondering why anyone would want case-sensitive filename
matching. I don't understand what could be worth the regular irritation
I experience at having to get the case exactly right.

-- 
Ethan Grammatikidis

Those who are slower at parsing information must necessarily
be faster at problem-solving.
Re: [9fans] Woes of New Language Support
> the unicode proposal says that matches depend on (re, locale, input).
> not just (re, input). i would think that is not acceptable.

it's not just the unicode people. shell file name matching takes locale
into account which often makes it case-independent (even with
case-dependent file systems). i hate them all.
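The locale complaint above can be made concrete with Unicode's case-folding tables. A small Python sketch (not from the original thread) showing that even locale-independent full case folding already merges visually distinct codepoints, which is exactly what makes "case-independent" filename matching surprising:

```python
# Unicode full case folding makes "matching" non-obvious even before
# locale enters the picture: distinct codepoints fold together.
kelvin = "\u212a"            # KELVIN SIGN, which looks like 'K'
folded_kelvin = kelvin.casefold()    # folds to plain 'k'

eszett = "stra\u00dfe"       # German 'straße'
folded_eszett = eszett.casefold()    # 'ß' folds to 'ss'
```

With these tables in play, a "case-independent" glob for `*k*` would match a filename containing the Kelvin sign, which is the kind of (re, locale, input) dependence being complained about.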
Re: [9fans] Woes of New Language Support
On Sun Jul 26 14:40:56 EDT 2009, knapj...@gmail.com wrote:
> If I'm reading you right, you're saying it might be easier if
> everything were encoded as combining (or maybe more aptly
> non-combining) codes, regardless of language?
>
> So, we might encode 'Waffles' as w+upper a f f l e s and let the
> renderer (if there is one) handle the presentation of the case shift
> and the potential ligature, but things like grep get noticeably easier
> with no overlap of ő and o+umlaut.
>
> Again, oversimplified, with no real understanding on my part of the
> depth or breadth of the problem space.

you understand.  except, i was taking the opposite position.  if you
did for english what is done for indic languages, then if you typed
'this is a sentence.' the 't' would be capitalized as soon as you typed
the '.'.  there's no hint that this rule needs to be applied; the
renderer would just have to know it.

in ak's example a certain combination of codepoints yields a specific
'letter'.  (i hope i have that right.)  the renderer is just supposed
to know this.  so for consistency, and to reduce the need for
complicated language-specific rules (how do we know that the text
represented is actually from the language we think it is?), i would
force the producer to declare the combinations.

btw, the search problem is not at all solved by standardizing (or is
that standardising?) the combiners problem.  consider the following
bits of unicode fun:

	; grep 'zero width' /lib/unicode
	200b	zero width space
	200c	zero width non-joiner
	200d	zero width joiner
	feff	zero width no-break space

i'm sure that someone more conversant in unicode could point out other
points of real difficulty.  how do you tell unicode from uni\ufeffcode?
not only is that an annoyance, but it could be a pretty interesting
security problem.  and what a gift for spammers!

- erik
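The uni\ufeffcode example above can be demonstrated directly. A Python sketch (not part of the original mail) showing a naive search missing the string, plus one possible workaround of stripping format-category characters before comparing:

```python
import unicodedata

plain = "unicode"
sneaky = "uni\ufeffcode"     # U+FEFF zero width no-break space hidden inside

# the two strings render alike in many fonts but compare unequal,
# so a byte-wise search for "unicode" misses the sneaky form
found = "unicode" in sneaky          # False

# one workaround: drop characters in Unicode general category Cf
# (format characters), which covers the zero-width runs listed above
cleaned = "".join(c for c in sneaky if unicodedata.category(c) != "Cf")
```

This is exactly the spam-filter evasion worry: a match that depends on invisible codepoints surviving into the comparison.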
Re: [9fans] Woes of New Language Support
If I'm reading you right, you're saying it might be easier if
everything were encoded as combining (or maybe more aptly
non-combining) codes, regardless of language?

So, we might encode 'Waffles' as w+upper a f f l e s and let the
renderer (if there is one) handle the presentation of the case shift
and the potential ligature, but things like grep get noticeably easier
with no overlap of ő and o+umlaut.

Again, oversimplified, with no real understanding on my part of the
depth or breadth of the problem space.

If this is the case, could it be handled by pushing everything into a
subset of unicode rather than using the unallocated space to create a
superset?

-J

On 7/26/09, erik quanstrom wrote:
> > to be fair to the unicode people, this decoupling of glyphs and codepoints
> > is (i think) the most straightforward way to implement some languages like
> > arabic, where the glyphs for characters depend on their position within a
> > word.  that is, a letter at the beginning of a word looks different from
> > what it would look like if it was in the middle.
>
> my opinion (not that i'm entitled to one here) is
> that the unicode guys screwed up.  unicode is not
> consistent.  explain why there are two code points for sigma.
> 	03c3	greek small letter sigma
> 	03c2	greek small letter final sigma
> why does german get ä, ö, ü?  if you want to take
> this further, why are there capital forms of latin letters?
> can't that also be inferred by the font?
>
> what's called a ligature in one language is a character
> in another.  i see no consistency.  it seems like the
> unicode committee had a problem with too much
> knowledge of the specific problems and few actual
> unifying (sorry) concepts.
>
> i think it would make much more sense to put this logic
> in editors.  this would also allow the freedom to use a
> capital, ligature, or final form in the wrong place,
> like say studlyCaps.  i can't imagine english is the only
> language in the world that gets abused.
>
> - erik

-- 
Sent from my mobile device
Re: [9fans] Woes of New Language Support
On Sun, Jul 26, 2009 at 09:48:23AM -0400, erik quanstrom wrote:
> > to be fair to the unicode people, this decoupling of glyphs and codepoints
> > is (i think) the most straightforward way to implement some languages like
> > arabic, where the glyphs for characters depend on their position within a
> > word.  that is, a letter at the beginning of a word looks different from
> > what it would look like if it was in the middle.
>
> my opinion (not that i'm entitled to one here) is
> that the unicode guys screwed up.

Oh and how.  Let's not forget punching a huge hole in the code point
namespace to appease the tortured encoding that is UTF-16.

I similarly may not be entitled to have an opinion on Unicode's
handling of linguistics, but their handling of the abstract codepoint
namespace and failure to keep encodings entirely separate is laughable.

--nwf;
Re: [9fans] Woes of New Language Support
> the real problem isn't in viewing them however, but comes when you
> start searching for them: it's easy to search for ë (e-umlaut) for
> example, but what if it's described as e+"U+0308 COMBINING DIAERESIS"?
> the answer is the UTS#18 Regular Expressions technical standard which
> probably contributes at least half of the slowness of gnu grep
> discussed in another thread.  http://www.unicode.org/reports/tr18/

iirc, gnu grep calls malloc for each character of utf-8 input.  awesome.

at a minimum, it would be good to add support to tcs to translate to
canonical form utf.  this would make the searching problem much easier.

the unicode proposal says that matches depend on (re, locale, input),
not just (re, input).  i would think that is not acceptable.

- erik
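The canonical-form translation suggested above for tcs corresponds to Unicode canonical composition (NFC). A Python sketch of the idea (illustrative, not the proposed tcs change itself):

```python
import unicodedata

decomposed = "e\u0308"        # 'e' + COMBINING DIAERESIS
precomposed = "\u00eb"        # LATIN SMALL LETTER E WITH DIAERESIS

# the two spellings of ë are different codepoint sequences...
different = decomposed != precomposed

# ...but NFC (canonical composition) maps both to the same form,
# after which a plain byte-wise grep finds either spelling
normalized = unicodedata.normalize("NFC", decomposed)
```

Normalizing the input stream once, up front, is what lets the search tool stay dumb: (re, input) instead of (re, locale, input).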
Re: [9fans] Woes of New Language Support
On Sun Jul 26 10:14:51 EDT 2009, tlaro...@polynum.com wrote:
> On Sun, Jul 26, 2009 at 09:48:23AM -0400, erik quanstrom wrote:
> >
> > my opinion (not that i'm entitled to one here) is
> > that the unicode guys screwed up.  unicode is not
> > consistent.  explain why there are two code points for sigma.
> > 	03c3	greek small letter sigma
> > 	03c2	greek small letter final sigma
>
> They are distinct in ancient greek at least. The glyph is not the same
> whether the letter is inside or at the end of a word. (At the beginning,
> in ancient greek, there were indeed no blanks between words, just a
> stream of chars...)
>
> Or perhaps did I misunderstand what you wrote.

yes they are.  but we're arguing in the odd, odd world of codepoints.
code points quite pointedly have no canonical glyph.  this is why
unicode often does not distinguish final forms and other ligatures.
it bothers me that the exception seems to be for western languages.

all the glyphs that one needs for most western languages are already
there.  such strange ligatures as there are, like ffl, are just not
important enough to bother with (u+fb03 for those following along at
home).

- erik
Re: [9fans] Woes of New Language Support
On Sun, Jul 26, 2009 at 09:48:23AM -0400, erik quanstrom wrote:
>
> my opinion (not that i'm entitled to one here) is
> that the unicode guys screwed up.  unicode is not
> consistent.  explain why there are two code points for sigma.
> 	03c3	greek small letter sigma
> 	03c2	greek small letter final sigma

They are distinct in ancient greek at least. The glyph is not the same
whether the letter is inside or at the end of a word. (At the beginning,
in ancient greek, there were indeed no blanks between words, just a
stream of chars...)

Or perhaps did I misunderstand what you wrote.

Cheers,
-- 
Thierry Laronde (Alceste)
http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C
Re: [9fans] Woes of New Language Support
> to be fair to the unicode people, this decoupling of glyphs and codepoints
> is (i think) the most straightforward way to implement some languages like
> arabic, where the glyphs for characters depend on their position within a
> word.  that is, a letter at the beginning of a word looks different from
> what it would look like if it was in the middle.

my opinion (not that i'm entitled to one here) is
that the unicode guys screwed up.  unicode is not
consistent.  explain why there are two code points for sigma.
	03c3	greek small letter sigma
	03c2	greek small letter final sigma
why does german get ä, ö, ü?  if you want to take
this further, why are there capital forms of latin letters?
can't that also be inferred by the font?

what's called a ligature in one language is a character
in another.  i see no consistency.  it seems like the
unicode committee had a problem with too much
knowledge of the specific problems and few actual
unifying (sorry) concepts.

i think it would make much more sense to put this logic
in editors.  this would also allow the freedom to use a
capital, ligature, or final form in the wrong place,
like say studlyCaps.  i can't imagine english is the only
language in the world that gets abused.

- erik
Re: [9fans] Woes of New Language Support
Please disregard the question, "kbmap perhaps?" in my last post. I quickly realised that kbmap is only for inputs, while I'm discussing plain old output from every other source. partying too much ak
Re: [9fans] Woes of New Language Support
> what is the total number of stealth characters like nsa?
> if it's not too unreasonable, it might be good enough to steal part of
> the operating system or application reserved areas.

Any consonant should be able to become a half-consonant, but only when
followed by another consonant. In the TTF method, character type
checking falls out easily.

I'm still up for your suggestion, which, if I understand it correctly,
is to take up parts of the unspecified unicode ranges and dedicate them
to half-consonants? You would then have to do this for Bengali, Telugu,
Tamil, Gujarati, Gurumukhi (I think), and perhaps a couple of others.

It's the fastest implementation, but has a couple of setbacks: (a) it
is not homogeneous across all Plan 9 distributions, and (b) it diverges
from general Unicode standards, and thus the problem of reading texts
is still present, as everyone else is still using the
consonant+virama+consonant sequence as opposed to following our
self-defined code maps. One can deal with (a) if dedicated enough to
language support for a billion or so people, but (b) is pretty serious
and still presents us with the same full stop as before.

If there were some way to map unicode sequences to our self-defined
codes, then that could work in this methodology. kbmap perhaps?

Best,
ak
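The "map unicode sequences to our self-defined codes" idea can be sketched as a translation table from conjunct sequences to Private Use Area slots. Everything in this Python sketch is invented for illustration: the PUA base, the single na+virama+sa entry, and the function name.

```python
# hypothetical: map consonant+virama+consonant clusters to Private Use
# Area codepoints so a simple renderer can draw one glyph per rune.
PUA_BASE = 0xE000

# illustrative table with one entry: na + virama + sa -> one PUA slot;
# a real table would enumerate the clusters for each script
conjuncts = {"\u0928\u094d\u0938": chr(PUA_BASE)}

def to_private(text):
    # a fuller table would need to replace longer sequences first
    for seq, pua in conjuncts.items():
        text = text.replace(seq, pua)
    return text
```

This also makes the interchange problem concrete: text run through such a mapping is no longer standard Unicode, which is exactly setback (b) above.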
Re: [9fans] Woes of New Language Support
erik quanstrom wrote:
> yes.  this is a problem.  unfortunately the unicode guys
> took the position that codepoint is divorced from glyphs.
>
> unfortunately, this case isn't as bad as it gets.  e.g. archaic
> cyrillic letters have transliterations like ^^A in unicode.  would
> three hats on an A be illegal?  i don't see what would prevent it.
> and therefore one needs to implement some sort of character
> layout engine to render unicode.  that's pretty bogus.

to be fair to the unicode people, this decoupling of glyphs and
codepoints is (i think) the most straightforward way to implement some
languages like arabic, where the glyphs for characters depend on their
position within a word. that is, a letter at the beginning of a word
looks different from what it would look like if it was in the middle.

salman
Re: [9fans] Woes of New Language Support
diacritics (combining characters) are a real mess in Unicode. with so
much space in the format why did they have to go this route, i wonder?

erik mentioned cyrillic. i did have an old church slavonic bible text
i was attempting to display correctly on Plan 9 sometime in 2003-4.
top is x11 with correctly (i presume) combined characters, below is
the Plan 9 rendering:

http://mirtchovski.com/screenshots/x-p9-diacritics.jpg

there's a pattern there, as you can see: the combining char always
follows the char it's combined with, so you can try simply not
advancing forward as a first draft of implementing char combinations
in Plan 9. there doesn't seem to be a default list of "combining"
characters in UTF so you'll have to pick up all glyphs described as
"combining" and check for them when you input. fun and slow :)

the real problem isn't in viewing them however, but comes when you
start searching for them: it's easy to search for ë (e-umlaut) for
example, but what if it's described as e+"U+0308 COMBINING DIAERESIS"?
the answer is the UTS#18 Regular Expressions technical standard, which
probably contributes at least half of the slowness of gnu grep
discussed in another thread.

http://www.unicode.org/reports/tr18/
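The "pick up all glyphs described as combining" step is a lookup in the Unicode character database rather than a hand-built list. A Python sketch of the detection side only (not the Plan 9 rendering itself):

```python
import unicodedata

text = "e\u0308"   # 'e' followed by U+0308 COMBINING DIAERESIS

# unicodedata.combining() returns a nonzero combining class for
# combining marks and 0 for everything else, so a renderer can use it
# to decide not to advance the pen position after drawing the glyph
marks = [c for c in text if unicodedata.combining(c)]
```

Since the mark always follows its base character, as noted above, a first-draft renderer can draw each character and simply skip the advance whenever this test fires.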
Re: [9fans] Woes of New Language Support
> However, in the class of languages for which I am trying to
> provide support, certain characters are meant to be produced
> by an ordered combination of other characters.  For example,
> the general sequence in Devanagari script (and this extends
> to the other scripts as well) is that
> consonant+virama+consonant produces
> half-consonant+consonant, where the half-consonant has no
> other unicode specification.  As a concrete case in
> Devanagari, na virama sa (viz., \u0928\u094d\u0938) should
> produce the nsa character (this sequence can be seen in any
> unicode representation of the word "Sanskrit" in Devanagari
> script).
>
> It seems to me that TTF font specifications (i.e., those I
> converted to subfonts using Federico's ttf2subf) include
> these sequence definitions, which are then processed by each
> application providing support for the fonts.  Plan 9
> subfonts are much too simple for this.

yes.  this is a problem.  unfortunately the unicode guys
took the position that codepoint is divorced from glyphs.

unfortunately, this case isn't as bad as it gets.  e.g. archaic
cyrillic letters have transliterations like ^^A in unicode.  would
three hats on an A be illegal?  i don't see what would prevent it.
and therefore one needs to implement some sort of character
layout engine to render unicode.  that's pretty bogus.

what is the total number of stealth characters like nsa?
if it's not too unreasonable, it might be good enough to steal part of
the operating system or application reserved areas.

i hope my ignorance of the particular script in question isn't leading
to silly suggestions!

- erik
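The na+virama+sa example can be checked against the Unicode character database. This Python snippet (illustrative only) confirms that the cluster is three ordinary codepoints and that canonical composition does not collapse it into a single precomposed character, so the conjunct must be formed by the renderer:

```python
import unicodedata

seq = "\u0928\u094d\u0938"   # na + virama + sa, as in 'Sanskrit'

# the three codepoints, by their database names
names = [unicodedata.name(c) for c in seq]

# NFC leaves the sequence unchanged: unlike ë, there is no single
# 'nsa' codepoint for composition to produce
composed = unicodedata.normalize("NFC", seq)
```

This is the asymmetry complained about earlier in the thread: European precomposed forms like ë exist as single codepoints, while Indic conjuncts exist only as sequences plus renderer knowledge.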