Re: Transcoding Tamil in the presence of markup (was Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))

2003-12-09 Thread Mark Davis
I agree strongly. Reordering of glyphs doesn't affect the ability to maintain
styles. Every reasonable package has to retain the mappings back and forth
between character and glyph to maintain styles and to map highlighting/mouse
clicks/etc. The only issue is for combinations. That is, the character => glyph
mappings can be arbitrary combinations of the following:

reordering: easy to retain style
1:1 mapping: easy to retain style
1:n mapping: also easy to retain style
n:1 mapping: this is the place where it gets tricky.

Any time an n:1 mapping is involved, maintaining styles is difficult. For
example, if a ligature is used for "fi", then you have some choices: (a)
disallow the ligature; (b) color
it all one or the other color, (c) if (and that's a big if) your font allows for
the production of an fi ligature with two adjacent 'fitting' pieces, essentially
contextual forms instead of a ligature, then you can do both the ligature and
the color.
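Mark's taxonomy can be sketched in a few lines (an illustrative Python fragment, not from the original mail; the function name and the data shapes are invented for the example):

```python
def glyph_styles(char_styles, clusters):
    """Propagate per-character styles to glyphs.
    char_styles: one style per character.
    clusters: list of (char_indices, glyph_count) pairs describing the
    character->glyph mapping for each cluster."""
    out = []
    for char_indices, glyph_count in clusters:
        styles = {char_styles[i] for i in char_indices}
        if len(styles) == 1:
            # 1:1, 1:n, or a reordered run with a uniform style: easy.
            out.extend([styles.pop()] * glyph_count)
        else:
            # n:1 (e.g. an "fi" ligature with 'f' red and 'i' blue):
            # no single correct answer; fall back to the first style.
            out.extend([char_styles[char_indices[0]]] * glyph_count)
    return out

# "fi" ligated to one glyph, both letters red: unambiguous.
print(glyph_styles(["red", "red"], [((0, 1), 1)]))   # ['red']
# 'f' red, 'i' blue, ligated: the tricky n:1 case.
print(glyph_styles(["red", "blue"], [((0, 1), 1)]))  # ['red']
```

The fallback in the n:1 branch corresponds to option (b) above; options (a) and (c) would live in the font or shaping engine, not here.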

Mark
__
http://www.macchiato.com

- Original Message - 
From: "Peter Constable" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tue, 2003 Dec 09 00:30
Subject: RE: Transcoding Tamil in the presence of markup (was Re: Coloured
diacritics (Was: Transcoding Tamil in the presence of markup))


> From: [EMAIL PROTECTED] on behalf of Kenneth Whistler
>
> >> Unicode doesn't prevent styling, of course. But having 'logical' order
> >> instead of 'visual' makes it a hard task for the application and the
> >> renderer.
> >> This is witnessed by the thin-spread support for this.
> >
> >Yes...
>
> Ken conceded the claim too readily. Glyph re-ordering due to a logical
> encoding order that is different from visual order may mean that certain
> types of styling (of the re-ordered character) may not be supported in
> some implementations, but it does *not* mean that this is, in general, a
> hard task. Style information is applied to characters, and as long as
> there is a 1:m association between characters and glyphs and there is a
> path to transform the styling information to match the character/glyph
> transformations, styling is in principle possible. (There's a constraint
> that styling might not be possible if the styling differences require
> different fonts but the glyph transformations that occur require rule
> contexts to span such a style boundary.)
>
> (Expecting one component from a precomposed character to be styled
> differently from the rest, however, would be somewhat hard.)
>
> In particular, for reordering this is easy to demonstrate by considering
> a hypothetical complex-script rendering implementation in which
> processing is divided into two stages: character re-ordering, and glyph
> transformation. In the first stage, all that happens is that a string is
> mapped to a temporary string used internally only, in which characters
> are reordered into visual order. (Surroundrant characters with no
> decomposition would be mapped into multiple internal-use-only virtual
> characters.) Thus, a styled string such as ke would transform in the
> first stage to ek. There is nothing hard in such processing.
>
>
> (Of course, whether it is harder to get people to implement support for
> one thing rather than another is an entirely different question.)
>
>
>
>
> Peter Constable
>
>




Re: Transcoding Tamil in the presence of markup (was Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))

2003-12-09 Thread Jungshik Shin
On Mon, 8 Dec 2003, Peter Jacobi wrote:

> It would be most interesting, if someone can point out a wordprocessor
> or even a rendering library (shouldn't Pango be the solution to
> everything?),
> which enables styling of individual Tamil letters.

  I think Pango's attributed string
( http://developer.gnome.org/doc/API/2.0/pango/pango-Text-Attributes.html )
can be used for this.  I believe that other layout/rendering libraries
such as Uniscribe, ATSUI and the rendering/layout part of ICU
have similar data type/APIs.

  Jungshik



Re: Transcoding Tamil in the presence of markup (was Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))

2003-12-09 Thread Peter Kirk
On 08/12/2003 16:17, Kenneth Whistler wrote:

...

> > Having an 'invisible consonant' to call for rendering of the vowel sign
> > in isolation (and without the dotted circle), would also help the limited
> > number of cases where the styled single character is needed - but in
> > a rather hackish way.
>
> That is what the SPACE as base character is for. If some renderers
> insist on rendering such combinations with a dotted circle glyph,
> that is an issue in the renderer -- it is not a defect in the
> encoding standard for not having a way to represent the vowel
> sign in isolation.

SPACE is unsuitable for this function for at least two good reasons:

1) because of its word and line breaking characteristics;

2) because in a case like this no extra spacing is required. The vowel 
sign is a spacing character in itself, although a combining mark. SPACE 
is expected to add its own spacing. In the absence of clearly defined 
rules to the contrary, renderers will render this combination of SPACE 
with a Tamil vowel with an extra space which is not wanted. (As for 
which side of the vowel the space will appear, that is anyone's guess!)

This is yet another example to add to a number that I have identified 
showing that the reuse of SPACE and NBSP as carriers for diacritics is 
an undesirable overloading of character semantics. I propose again a new 
base character for carrying combining marks, with no glyph and a width 
just as wide as that required to display the combining marks. The 
mechanism already defined for using SPACE and NBSP for this should be 
deprecated, although not abolished.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




RE: Transcoding Tamil in the presence of markup (was Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))

2003-12-09 Thread Peter Constable
From: [EMAIL PROTECTED] on behalf of Kenneth Whistler

>> Unicode doesn't prevent styling, of course. But having 'logical' order
>> instead of 'visual' makes it a hard task for the application and the
>> renderer.
>> This is witnessed by the thin-spread support for this.
>
>Yes...
 
Ken conceded the claim too readily. Glyph re-ordering due to a logical encoding order 
that is different from visual order may mean that certain types of styling (of the 
re-ordered character) may not be supported in some implementations, but it does *not* 
mean that this is, in general, a hard task. Style information is applied to 
characters, and as long as there is a 1:m association between characters and glyphs 
and there is a path to transform the styling information to match the character/glyph 
transformations, styling is in principle possible. (There's a constraint that styling 
might not be possible if the styling differences require different fonts but the glyph 
transformations that occur require rule contexts to span such a style boundary.)
 
(Expecting one component from a precomposed character to be styled differently from 
the rest, however, would be somewhat hard.)
 
In particular, for reordering this is easy to demonstrate by considering a 
hypothetical complex-script rendering implementation in which processing is divided 
into two stages: character re-ordering, and glyph transformation. In the first stage, 
all that happens is that a string is mapped to a temporary string used internally 
only, in which characters are reordered into visual order. (Surroundrant characters 
with no decomposition would be mapped into multiple internal-use-only virtual 
characters.) Thus, a styled string such as ke would transform in the first stage to ek. There is nothing hard in such processing.
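The two-stage model described above can be sketched as follows (illustrative Python only, not a real engine; stage two, the glyph mapping, is omitted, and the reordering rule is reduced to the simple left-side Tamil vowel case):

```python
# Left-side (reordrant) Tamil vowel signs: E, EE, AI.
REORDRANT = {"\u0BC6", "\u0BC7", "\u0BC8"}

def to_visual(styled):
    """styled: list of (char, style) pairs in logical order. Move each
    reordrant vowel sign in front of the base consonant that precedes
    it; its style travels with it."""
    out = list(styled)
    for i in range(1, len(out)):
        if out[i][0] in REORDRANT:
            out[i - 1], out[i] = out[i], out[i - 1]
    return out

# <TAMIL LA, red> <VOWEL SIGN E, blue> -> <VOWEL SIGN E, blue> <LA, red>
src = [("\u0BB2", "red"), ("\u0BC6", "blue")]
print(to_visual(src) == [("\u0BC6", "blue"), ("\u0BB2", "red")])  # True
```

Because each (char, style) pair is moved as a unit, nothing about the styling is lost in the reordering stage, which is Constable's point.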
 
 
(Of course, whether it is harder to get people to implement support for one thing 
rather than another is an entirely different question.)
 
 
 
 
Peter Constable



Re: Fwd: Re: Transcoding Tamil in the presence of markup

2003-12-08 Thread Mark E. Shoulson
On 12/08/03 14:16, Peter Constable wrote:

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
   

On Behalf
 

Of Mark E. Shoulson
   

(and now I contradict myself with a counterexample.  In
http://omega.enstb.org/yannis/pdf/biblical-hebrew94.pdf, Yannis
Haralambous notes--correctly--that when typesetting the Hebrew Bible,
letters that are written small hang from the top line and have
normal-sized vowels below them (and the vowels are below the baseline
   

of
 

the normal text))
   

It might be possible to develop technologies that allowed correct
positioning in particular cases where different sizes were involved, but
I think we still have some more basic problems to solve, like finishing
getting implementations that offer basic support for all of the scripts
in Unicode 4.0.
 

Yes.  And this business of small letters and normal vowels and whatnot 
isn't "plain text" anyway, so it should be the problem of the larger 
rendering system and probably not even just the font.  Certainly not 
Unicode's problem.

~mark




RE: Transcoding Tamil in the presence of markup (was Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))

2003-12-08 Thread Kenneth Whistler
Peter Jacobi said:

> Unicode doesn't prevent styling, of course. But having 'logical' order
> instead of 'visual' makes it a hard task for the application and the
> renderer.
> This is witnessed by the thin-spread support for this.

Yes, but having visual order instead of logical order makes
*other* tasks difficult for the application. There is a
tradeoff here.

The Brahmi-derived script which got a grandfathered visual order 
into Unicode is Thai (and Lao), because of TIS 620-2533. That
definitely makes some aspects of coexistence with legacy data
easier for Thai in Unicode. But it also meant pulling in all
the complications for searching and sorting, among other things.
The Unicode Collation Algorithm (and every other implementation
of sorting) has to be special-cased for Thai, as a result, in
order to get expected ordering results. And we live with *that*
complication forever.
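The Thai special-casing Ken mentions can be sketched like this (an illustrative Python fragment; real collators perform this swap as a preprocessing step inside a much larger weighting pipeline):

```python
# Thai prevowels U+0E40..U+0E44 are stored before their consonant
# (visual order), so a collator must swap each such pair back to
# logical order before computing sort weights.
PREVOWELS = {chr(c) for c in range(0x0E40, 0x0E45)}

def logical_order_for_collation(s):
    chars = list(s)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in PREVOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # the swapped pair is done
        else:
            i += 1
    return "".join(chars)

# U+0E40 SARA E stored before U+0E01 KO KAI; weighting sees KO KAI first.
print(logical_order_for_collation("\u0e40\u0e01") == "\u0e01\u0e40")  # True
```

Logical-order storage, as in Unicode Tamil, makes this step unnecessary, which is the "pro" side of the tradeoff.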

> 
> 'Logical order' makes a lot of sense for heavily conjunct forming, 
> 2-D compositing
> scripts. It is not such a perfect match for Tamil, which is 
> essentially 1-D and
> has a well-defined visual order of characters.

It doesn't stack as much as some other Brahmi-derived scripts,
but it still has some substantial ligating behavior (cf. -u, -uu),
and this is going to cause significant problems for attempting
to do systematic markup of particular syllabic pieces (consonants
as opposed to vowels, for example), *regardless* of the logical
order issue.
 
> But excuse my lamenting, I'm not
> into utopian and ill-advised projects of re-doing all from scratch. 

Noted.

I concur with the Peters here. Markup of text is essentially
outside the scope of Unicode. The problem of how to do
various kinds of orthographic and/or linguistic markup and
highlighting in complex scripts while not arbitrarily breaking
the rendering itself is an issue for negotiation between
the markup protocols and the text rendering engines and fonts.

In an earlier post:

> So, to promote Unicode usage, in a community, which partly sees
> ISCII unification as a conspiracy against the Dravidian languages,
> it would be very helpful to demonstrate, that everything that can
> be done with the legacy encodings, can also be done using Unicode.

This is a bit off down the garden path, though. As the discussion
in this thread has made clear, we are talking about behavior
above the level of the representation of the plain text content.

First of all, it is quite evident that the same plain text
content as represented in TSCII can also be represented in
Unicode. That is the sufficiency test that has to be applied
to the Unicode Standard.

Against that, one balances the following pros and cons:

A. Con. It is more difficult to get browsers using Unicode to
take HTML span markup (color or whatever) of Tamil consonants
to render as expected when dealing with left-side (reordrant)
Tamil vowels or the two-part vowels. Because TSCII uses
visual order, such behavior is much more straightforward in
these particular cases.

B. Pro. It is much easier to get collaters to behave correctly
for Tamil data when dealing with left-side or two-part vowels,
because they are stored in logical order and do not add
complications on top of the already difficult issues of
syllable weighting for Tamil or other languages using Indic
scripts.

> Having an 'invisible consonant' to call for rendering of the vowel sign
> in isolation (and without the dotted circle), would also help the limited
> number of cases where the styled single character is needed - but in
> a rather hackish way.

That is what the SPACE as base character is for. If some renderers
insist on rendering such combinations with a dotted circle glyph,
that is an issue in the renderer -- it is not a defect in the
encoding standard for not having a way to represent the vowel
sign in isolation.

--Ken





RE: Transcoding Tamil in the presence of markup (was Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))

2003-12-08 Thread Peter Jacobi
Dear Peter Constable, Peter Kirk, All,

"Peter Constable" <[EMAIL PROTECTED]> wrote:
> SIL's Graphite definitely *will* permit exactly what you want to do 
> (assuming the font is properly designed). [...]

Thanks for this clarification. Having tried SIL WorldPad with a Tamil
Graphite font, and not getting the desired results, I was under the
impression that Graphite doesn't handle it. It seems I have to do some
programming against the Graphite lib to get a better understanding.

Peter Kirk <[EMAIL PROTECTED]> wrote:
> I thought this had been made clear. This is not a matter for Unicode as 
> Unicode does not define character styles. It is not a matter for legacy 
> encodings either unless they define character styles. It is a matter for 
> higher level protocols. You need to address your comments to those who 
> define them.

"Peter Constable" <[EMAIL PROTECTED]> wrote:
> There is nothing about encoding of Tamil in Unicode that prevents
> styling of individual characters. This is not a Unicode problem, as (at
> least some) others have said.

Unicode doesn't prevent styling, of course. But having 'logical' order
instead of 'visual' makes it a hard task for the application and the
renderer.
This is witnessed by the thin-spread support for this.

'Logical order' makes a lot of sense for heavily conjunct-forming, 2-D
compositing scripts. It is not such a perfect match for Tamil, which is
essentially 1-D and has a well-defined visual order of characters. But
excuse my lamenting, I'm not into utopian and ill-advised projects of
re-doing all from scratch.

Having an 'invisible consonant' to call for rendering of the vowel sign
in isolation (and without the dotted circle), would also help the limited
number of cases where the styled single character is needed - but in
a rather hackish way. 

Peter Kirk <[EMAIL PROTECTED]> wrote:
> Unicode in fact makes deliberate provision for inserting markup within 
> combining character sequences by not forbidding defective combining 
> sequences. If particular markup protocols refuse to use this provision, 
> that is their problem.

Agreed. I already conceded that this was a most valuable lesson I learned
from this discussion.

Best Regards,
Peter Jacobi


-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Neu: Preissenkung für MMS und FreeMMS! http://www.gmx.net





RE: Transcoding Tamil in the presence of markup (was Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))

2003-12-08 Thread Philippe Verdy
Peter Jacobi
> To re-iterate - in the original post, the string in question did
> consist of side by side characters, not ligated in any font known
> to me. And the legacy Tamil enocings have for obvious reasons no
> problem to style any single character.

This specific case is not the one of "side-by-side" characters, but
the case of _one_ character which is combining around a base
letter as two separate glyphs. These individual glyphs are those
that were used on typewriters, instead of the single abstract
letter. So on typewriters, you could color them individually, even
if they were denoting a single letter. There was then no key on the
typewriter keyboard to enter this character, and these glyphs had
the actual status of coded characters. Even the base letter in the
middle of the two combining glyphs will then have its own style
or color: this is the exhibited coloring feature in Tamil, where
the base letter is colored, but not its combining character that
follows it. On typewriters, there were simply no combining characters
but plain characters representing each part of the glyph and that's
why it worked.


__
<< ella for Spam Control >> has removed Spam messages and set aside
Newsletters for me
You can use it too - and it's FREE!  http://www.ellaforspam.com

Re: Transcoding Tamil in the presence of markup (was Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))

2003-12-08 Thread Peter Kirk
On 08/12/2003 10:16, Peter Jacobi wrote:

...
> So, to promote Unicode usage, in a community, which partly sees
> ISCII unification as a conspiracy against the Dravidian languages,
> it would be very helpful to demonstrate, that everything that can
> be done with the legacy encodings, can also be done using Unicode.

I thought this had been made clear. This is not a matter for Unicode as 
Unicode does not define character styles. It is not a matter for legacy 
encodings either unless they define character styles. It is a matter for 
higher level protocols. You need to address your comments to those who 
define them.

Unicode in fact makes deliberate provision for inserting markup within 
combining character sequences by not forbidding defective combining 
sequences. If particular markup protocols refuse to use this provision, 
that is their problem.

> The most useful answers so far, were the assertions by Jungshik, Bruno
> and others, that the markup in
>  http://www.jodelpeter.de/i18n/tamil/markup-uc.htm
> should be considered correct and, in an ideal user agent,
> render like the TSCII encoded
>  http://www.jodelpeter.de/i18n/tamil/markup-tscii.htm

These may be useful to you but anything about markup is irrelevant to 
Unicode.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




RE: Fwd: Re: Transcoding Tamil in the presence of markup

2003-12-08 Thread Peter Constable
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf
> Of Mark E. Shoulson


> I also agree, but I point out that the sufficiently perverse could come
> up with some pretty tough examples.  Applying color is a pretty benign
> style, but what if I wanted a boldface circumflex on a normal letter?

Shaping of diacritics where such styling differences (face, size or
weight) occur is problematic to implement due to reasons related to
design and also reasons related purely to the technologies involved:

Diacritic positioning can be accomplished in a few different ways: 

- substitution of alternate glyphs that have metrics that result in
different positioning of the outline, to correspond with different
metrics of a base glyph

- kerning rules (in this context adjust the glyph by x units
horizontally and y units vertically)

- attachment anchor points (adjust the glyph so that point i in the
outline aligns with point j in the outline of the base glyph)

Whichever approach is used in a font implementation, the implementation
will have assumed equal point sizes (remember, the outline has no point
size) and equal face and weight/style characteristics. If positioning by
attachment point is used, it may be possible to produce somewhat
tolerable results when the base and diacritic are different sizes, or
one is (say) bold while the other is italic; but even then ideal results
should not be expected, and might not even be clearly definable. (What's
the correct positioning of a non-italic circumflex over an italic o?)
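The attachment-anchor approach described above reduces to a simple translation (illustrative Python; all coordinates are invented font-unit values, not taken from any real font):

```python
def attach(base_anchor, mark_anchor):
    """Return the (dx, dy) translation to apply to the mark glyph so
    that its anchor coincides with the base glyph's anchor."""
    bx, by = base_anchor
    mx, my = mark_anchor
    return (bx - mx, by - my)

# Base 'o' carries a top anchor at (250, 560); the circumflex's own
# anchor sits at (240, 0). Both values assume the same point size and
# face -- which is exactly why a 10-point circumflex on an 8-point
# base, or a bold mark on a regular base, breaks the model.
print(attach((250, 560), (240, 0)))  # (10, 560)
```

Nothing in this arithmetic knows about size or weight; those assumptions are baked into the anchor coordinates when the font is built.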


Then there's the issue of the font technologies involved: whether we're
talking about OpenType, AAT, Graphite, Pango or whatever, and no matter
which approach to positioning described above is used, this kind of
positioning is accomplished by rules within a given font (not typeface,
but font) that describe how particular glyphs within that font should
behave in relation to one another. There is absolutely no way to say
that glyph x in font A should behave in a particular way when combined
with glyph y in some different font B. In principle, these rules can
still apply when there is a change in colour or point size, but as soon
as you change between different font files (change in face or weight),
no glyph processing is possible.


> Or even more obnoxious, a 10-point circumflex on an 8-point letter?
> These could be tricky to compute.

[then in a subsequent message]

> (and now I contradict myself with a counterexample.  In
> http://omega.enstb.org/yannis/pdf/biblical-hebrew94.pdf, Yannis
> Haralambous notes--correctly--that when typesetting the Hebrew Bible,
> letters that are written small hang from the top line and have
> normal-sized vowels below them (and the vowels are below the baseline of
> the normal text))

It might be possible to develop technologies that allowed correct
positioning in particular cases where different sizes were involved, but
I think we still have some more basic problems to solve, like finishing
getting implementations that offer basic support for all of the scripts
in Unicode 4.0.



Peter
 
Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division




RE: Transcoding Tamil in the presence of markup

2003-12-08 Thread Peter Constable
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf
> Of Peter Jacobi


> As it was possible to style individual characters in legacy encodings
> (heck, it was possible using a mechanical Tamil typewriter!), what is to
> be done in migration to Unicode?

There is nothing about encoding of Tamil in Unicode that prevents
styling of individual characters. This is not a Unicode problem, as (at
least some) others have said.

 
> So, I'm still wondering whether Unicode and HTML4 will consider
>   லா
> valid and it is the task of the user agent to make the best out of it.

There is nothing in Unicode that says this is invalid. Unicode's scope
is plain text; it has nothing to say about HTML.



Peter
 
Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division




Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Jungshik Shin

On Sun, 7 Dec 2003 [EMAIL PROTECTED] wrote:

> Jungshik Shin scripsit:
>
> >   Absolutely. The multi-level representability of Korean script
> > demonstrates its 'advanced' status as a script (invented only 5.5
> > centuries ago, it  must have been able to build upon more than 2,000
> > year's history of writing system), but at the same time, has been a

> OT question: is Korean script to some degree the product of stimulus
> diffusion from Indic script of any sort?  By "stimulus diffusion" I
> mean the reinvention of a cultural concept (in this case, alphabetic
> writing) as a result of hearing that some other culture has the concept,
>  but without any details.

  It's certain that to the inventors of the Korean script (King
Sejong and scholars in his court), Indic scripts and the Phagspa script
were well known (Mongolian was one of the languages taught at the nat'l
foreign language school at the time, and King Sejong was interested in
translating Buddhist books in Sanskrit).  There are several theories about
the 'origin' of the Korean script (i.e. what script was meant by '古篆'
(the old seal script) mentioned as the basis of the Korean script in the
book explaining the principles of the script).  Some believe that it's a
completely independent invention. Others think it's influenced by other
scripts known at the time (Indic and Phagspa among others have been
frequently mentioned since the late 15th century). Still others think
that it's based on a yet-unknown ancient script.  Tibetan script and even
Syriac script and Hebrew script have also come up. BTW, King Sejong and
his scholars were also familiar with the long tradition of Chinese
phonetics and published books on Korean and Chinese phonetics.

  Jungshik



Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread jcowan
Jungshik Shin scripsit:

>   Absolutely. The multi-level representability of Korean script
> demonstrates its 'advanced' status as a script (invented only 5.5
> centuries ago, it must have been able to build upon more than 2,000
> years' history of writing systems), but at the same time, has been a
> continuous source of "trouble" because it's hard to agree on which level
> to use.

OT question: is Korean script to some degree the product of stimulus diffusion
from Indic script of any sort?  By "stimulus diffusion" I mean the reinvention
of a cultural concept (in this case, alphabetic writing) as a result of hearing
that some other culture has the concept, but without any details.

-- 
John Cowan  [EMAIL PROTECTED]  www.ccil.org/~cowan  www.reutershealth.com
"If I have not seen as far as others, it is because giants were standing
on my shoulders."
--Hal Abelson



Re: Fwd: Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Peter Jacobi
Hi Martin, All,

[On IANA registration]
> 
> Why don't you do that, or get your Tamil contacts to do it?
> It needs a bit of insistence (repeated checking/reminders
> to the mailing list) and some patience, but otherwise is
> quite easy, and would help a lot. [...]

This is of course, one of the desired results of the entire
exercise. But I'm still fiddling with some details.

Regards,
Peter Jacobi








Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Peter Jacobi
Hi John,

Thank you for doing the tests:

> I have uploaded a valid page to
> 
> 


But I assume you are starting from the
wrong assumptions. You wrote:

> In your TSCII version you write
> §Ä¡
> 
> is that not equivalent to Unicode
> 
> ெலா

But TSCII 
§Ä¡
is Unicode
லொ
or, alternatively, but deprecated:
லொ

This change in codepoint order (from 'visual'
to 'logical') is the root of transcoding difficulties.

Depending on how your font and renderer handle
isolated vowel signs, you may be able to fake 
something along the lines of:
 ெலா

But this is an abuse of the Unicode encoding model for
Indic. 

> For Windows browsers I find I have to specify a Unicode font (in this 
> case Arial Unicode MS) in order for pages to display properly without 
> the user fiddling with his browser preferences. 

That may give you the better display.  Not being an owner of MS Office
I cannot legally test Arial Unicode MS. All Unicode fonts I tested
(Latha, Code 2000 and Avarangal) didn't give a good display
with your solution, as the vowel sign out of its true position will
render an additional 'mark' (dotted circle).

Regards,
Peter Jacobi






Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Peter Jacobi
> At 2:43 pm +0100 7/12/03, Peter Jacobi wrote:
> 
> >Then you consider
> >   லொ
> >to be valid input, which ideally should render as intended?
> 
> I have uploaded a valid page to
> 
> 
> 
> where you should see the lo properly displayed (in the second case). 
> As to the TSCII stuff I have simply followed your encodings, which 
> seem to give different glyphs, but maybe the first font in my list 
> (MylaiTSC) is encoded differently -- so much for unregistered legacy 
> encodings.
> 
> >Then you consider
> >   லொ
> >to be valid input, which ideally should render as intended?
> 
> In your TSCII version you write
> §Ä¡
> 
> is that not equivalent to Unicode
> 
> ெலா
> 
> >From a processing point of view, it is somewhat challenging, as you 
> >may have to parse through lots of markup, until you know what to do 
> >with the 0BB2.
> 
> That seems fairly easy.  I must be missing the point.
> 
> >As I've understood from other posts, the font support for
> >all this is theoretically available, but not often done in practice.
> 
> For Windows browsers I find I have to specify a Unicode font (in this 
> case Arial Unicode MS) in order for pages to display properly without 
> the user fiddling with his browser preferences.  As I said I have 
> WinNT 4.0 so maybe this has changed now.  The Mac browsers (Safari, 
> OmniWeb) require no font to be specified and will display the correct 
> characters no matter what the user's defaults.  I have nothing to do 
> with Mozilla.
> 
> JD
> 
> 
> 
> 






RE: Transcoding Tamil in the presence of markup

2003-12-07 Thread Philippe Verdy
> As an example, the vowel pairs a/ya, o/yo, u/yu, and so on
> are distinguished by changing from one small stroke to two
> small strokes. A Web page for children or foreigners may
> want to color these strokes separately. With the current
> encoding(s) in Unicode this is not possible, but I'm sure
> somebody has designed an encoding where this would be possible.

For these vowel pairs, this is not impossible to do,
but one must remember that ya, yo, yu are in fact compound 
letters (even if they are composed in the johab set of jamos that 
was used in Unicode) and are safely decomposable in Hangul as 
separate vowels, even if they are not canonically decomposable 
in Unicode.

So you could safely decompose, when creating the document,
these compound vowels, so that they can be each assigned a 
distinct style for instructing renderers.

It's just a shame that these compound letters were not given 
explicit canonical decompositions in Unicode (so that they would 
not occur with documents in NFD and NFKD forms, but could still 
be compressed with johab compound jamos and then as LV and LVT 
syllables in NFC and NFKC forms).
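The decomposition Verdy suggests can be sketched with the genuinely compositional jungseong (illustrative Python; only three well-known johab compositions are listed, the full table is larger, and whether a renderer should do this internally is exactly the point under discussion):

```python
# Compound Hangul jungseong -> component jamos. These compositions are
# standard johab structure, though Unicode assigns them no canonical
# decomposition.
COMPOUND_V = {
    "\u116A": "\u1169\u1161",  # WA  = O + A
    "\u116B": "\u1169\u1162",  # WAE = O + AE
    "\u1170": "\u116E\u1165",  # WEO = U + EO
}

def split_vowel(jamo):
    """Split a compound vowel so each part can carry its own style;
    other characters pass through unchanged."""
    return COMPOUND_V.get(jamo, jamo)

print(split_vowel("\u116A") == "\u1169\u1161")  # True: O then A, stylable apart
```

After styling, a renderer could recompose the pieces (back toward NFC and LV/LVT syllables) whenever a whole cluster ends up sharing one style, as the paragraph above describes.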

As a rendering process such as a browser does not need to output 
characters when rendering Hangul texts, I think they can safely 
add these decompositions internally and recompose to NFC form to 
optimize the final rendering in fonts, when the letters in the
same syllabic cluster share the same style; if this is not the 
case, then it's up to the browser to split syllables and render 
them using more basic Hangul jamos (but then the browser needs 
to know the way multiple jamos are composed, i.e.  above 
, then  on the left of  unless  is horizontal 
(in which case  is above , and then letters in  aligned 
horizontally if they are vertical, such as SSANG* CHOSEONG's), and 
same thing for  (this includes the vowels pair YE which is 
composed horizontally, as Y is vertical). The Hangul script is so 
logical that even complex clusters are easy to compose and read, 
and even to transcode to ASCII or to sort.



Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Peter Kirk
On 07/12/2003 04:57, Michael Everson wrote:

At 13:18 +0100 2003-12-07, Peter Jacobi wrote:

This is the core problem. Legacy Tamil encodings and use see  the 
vowel signs as individual characters. Switching to ISCII or Unicode 
will break this.


Oh well. Sometimes data can't be migrated reversibly. Think of the 
thousand years that Unicode *will* be used, rather than the little bit 
of data generated in the last twenty years.
If problems like this in one language cannot be fixed to users' 
satisfaction, you might find that Unicode is used for that language 
not for a thousand years, nor even for one.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread John Hudson
At 09:47 AM 12/7/2003, Martin Duerst wrote:

The basic problem is that one has to draw the line somewhere.
Sometimes, one would for example like to color the dot on an 'i'.
In Unicode, it may theoretically be possible (with a dotless 'i'
and a 'dot above' or some such), but it wouldn't be a real 'i'
anymore.
Unless you decompose and colour in a buffered layer, preserving the 
original character string.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
Theory set out to produce texts that could not be processed successfully
by the commonsensical assumptions that ordinary language puts into play.
There are texts of theory that resist meaning so powerfully ... that the
very process of failing to comprehend the text is part of what it has to offer
- Lentricchia & Mclaughlin, _Critical terms for literary study_



Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread John Delacour
At 2:43 pm +0100 7/12/03, Peter Jacobi wrote:

Then you consider
  லொ
to be valid input, which ideally should render as intended?
I have uploaded a valid page to



where you should see the lo properly displayed (in the second case). 
As to the TSCII stuff I have simply followed your encodings, which 
seem to give different glyphs, but maybe the first font in my list 
(MylaiTSC) is encoded differently -- so much for unregistered legacy 
encodings.

Then you consider
  லொ
to be valid input, which ideally should render as intended?
In your TSCII version you write
§Ä¡
is that not equivalent to Unicode

ெலா

From a processing point of view, it is somewhat challenging, as you 
may have to parse through lots of markup until you know what to do 
with the 0BB2.
That seems fairly easy.  I must be missing the point.

As I've understood from other posts, the font support for
all this is theoretically available, but not often done in practice.
For Windows browsers I find I have to specify a Unicode font (in this 
case Arial Unicode MS) in order for pages to display properly without 
the user fiddling with his browser preferences.  As I said I have 
WinNT 4.0 so maybe this has changed now.  The Mac browsers (Safari, 
OmniWeb) require no font to be specified and will display the correct 
characters no matter what the user's defaults.  I have nothing to do 
with Mozilla.

JD






Re: Fwd: Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Martin Duerst
Hello Peter,

At 13:25 03/12/07 +0100, Peter Jacobi wrote:
Dear Doug, All,

> BTW, your "Unicode test page" is marked:
>   content="text/html; charset=ISO-8859-1">
This is of course redundant as this is the HTTP default.
Well, the HTTP spec unfortunately still says so, but the
HTML spec (and we are dealing with HTML here) disagrees,
and so does practice (if you look farther than just
Western Europe).

The heading 'Unicode' means the logical content, not the
encoding. The Tamil content is given as hex NCRs.
That's perfectly okay, of course.


> while your TSCII test page is marked "x-user-defined".

As the legacy Tamil charsets are not IANA registered, Tamil
users typically have a TSCII font set up for the display
of "x-user-defined" pages.
Why don't you do that, or get your Tamil contacts to do it?
It needs a bit of insistence (repeated checking/reminders
to the mailing list) and some patience, but otherwise is
quite easy, and would help a lot. And you have the experience
to describe how this relates to Unicode.
Regards,   Martin.



Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Martin Duerst
At 23:16 03/12/07 +0900, Jungshik Shin wrote:

On Sun, 7 Dec 2003, Peter Jacobi wrote:

> So, I'm still wondering whether Unicode and HTML4 will consider
>   லா
> valid and it is the task of the user agent to make the best out of it.
  I think this is valid.
I agree. It is the task of the user agent to make the best out of it,
and different user agents may currently do different things with it.
Because this is related to rendering and styling, it seems to make
sense that this is clarified in the CSS spec (either 2.1 or 3.0).

A more interesting case has to do with
W3 CHARMOD in which NFC is required/recommended (it's not yet complete
and W3C I18N-WG has been discussing it).  Consider the following case.
  ல&#x0BC7;
 &#x0BBE;
Because  is equivalent to U+0BCB, we couldn't use
the above if NFC is required even though in legacy TSCII encoding,
it's possible.
Yes, this is a bad idea. But there is Web technology that can do
this (see below).
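The equivalence Jungshik mentions is canonical, so NFC actively fuses the split vowel parts. A minimal sketch with Python's unicodedata (any conformant normalizer gives the same result):

```python
import unicodedata

# TAMIL LETTER LA + VOWEL SIGN EE (U+0BC7) + VOWEL SIGN AA (U+0BBE)
split = "\u0BB2\u0BC7\u0BBE"

# NFC composes the two-part vowel into TAMIL VOWEL SIGN OO (U+0BCB),
# erasing the boundary that separate styling would need to attach to.
assert unicodedata.normalize("NFC", split) == "\u0BB2\u0BCB"
```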
The basic problem is that one has to draw the line somewhere.
Sometimes, one would for example like to color the dot on an 'i'.
In Unicode, it may theoretically be possible (with a dotless 'i'
and a 'dot above' or some such), but it wouldn't be a real 'i'
anymore.
And there is of course a slippery slope. For example, consider
the crossbar on a 't'. You can't color that, in any encoding.
But a font designer may want to do that, for some instructional
material, or may want to color all serifs in a font,...
Similar examples exist in almost any other script. For most
intents and purposes, most people are okay with what they
can and can't do, but occasionally, we come close to the
dividing line, and some of us are quite surprised. But somehow,
we have to agree on what's a character and what's only a glyph,
and we have to agree which combinations are canonically equivalent.

The same is true of Korean syllables(see below) as
Philippe pointed out.
  각
Yes. Korean is particularly difficult because it is the most
logical, well-designed script in the world. It has more
clearly identifiable hierarchical levels than any other
script. It is very difficult to agree on which level
characters should be.
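Those levels show up directly in normalization: a precomposed syllable and its jamo sequence are canonically equivalent, as a quick check with Python's unicodedata illustrates (a sketch; any conformant normalizer agrees):

```python
import unicodedata

syllable = "\uAC01"  # HANGUL SYLLABLE GAG, the example above

# NFD exposes the jamo level: lead KIYEOK + vowel A + trailing KIYEOK
jamos = unicodedata.normalize("NFD", syllable)
assert jamos == "\u1100\u1161\u11A8"

# NFC goes back up to the syllable level
assert unicodedata.normalize("NFC", jamos) == syllable
```

But neither level reaches the individual strokes that distinguish a from ya, which is why stroke-level coloring is out of reach of the character encoding.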
As an example, the vowel pairs a/ya, o/yo, u/yu, and so on
are distinguished by changing from one small stroke to two
small strokes. A Web page for children or foreigners may
want to color these strokes separately. With the current
encoding(s) in Unicode this is not possible, but I'm sure
somebody has designed an encoding where this would be possible.
So while this does not solve Peter's immediate problem,
starting to change Unicode to color characters, glyphs,
or character parts would be an extremely slippery slope.
Working on better font technology seems to be much better
suited to do the job. And such technology actually is
already around. It's part of SVG. Chris Lilley had a
very nice example once, but it got lost in a HD crash.
Chris, any chance of getting a new example?
SVG (http://www.w3.org/Graphics/SVG/ http://www.w3.org/TR/SVG11/)
is the XML-based vector graphics format for the Web.
Here is more or less how it works (as far as I understand it):
In SVG Fonts (http://www.w3.org/TR/SVG11/fonts.html),
SVG itself is used to describe glyph shapes. This means
that all kinds of graphic features, including of course
coloring, but also animation,... are available.
But of course you don't want colors to be fixed.
So glyphs in a font, or parts of glyphs, also allow
the 'class' attribute. So you can mark glyphs or glyph
components with things such as class='accent' or
class='crossbar', and so on. The rendering of pieces
in this class can then be controlled from a CSS
stylesheet.  (I hope I got the details right.)
Regards,Martin.



Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Martin Duerst
At 23:34 03/12/07 +0900, Jungshik Shin wrote:

On Sun, 7 Dec 2003, Peter Jacobi wrote:

> There is some mixup of lang and encoding tagging, which I didn't fully
> understand.
   When lang is not explicitly specified, Mozilla resorts to 'inferring'
'langGroup' ('script (group)' would have been a better term) from
the page encoding. Because UTF-8 is script-neutral, it's important to
specify 'lang' explicitly. Your page is in ISO-8859-1, so without
lang specified it's assumed to be in the 'x-western' langGroup (well, Latin
script). Anyway, this behavior changed slightly recently in the Windows
version (I forget whether I committed that patch before or after 1.4)
and each Unicode block is assigned the default 'script'. The way fonts
are picked up by the Xft version of Mozilla makes it harder to do the
equivalent on Linux.
I know that font selection/composition is a terribly difficult
business, and hard work, so improving things takes time.
Starting out with certain assumptions about fonts for certain
encodings is clearly very helpful for speed. But I think that
not (correctly) rendering a character that is obviously in
one script and not in another is a bad idea.
Years ago, I developed a very flexible system that was able to
start out with the user-selected font but would use another
font if the first font wasn't able to do the job. The basic
architecture was in many ways very simple, but it took quite
some time to get it right. Once I had this basic architecture,
all kinds of neat things became very easy. For details, see
the paper from the 7th Unicode Conference at:
http://www.ifi.unizh.ch/groups/mml/people/mduerst/papers/PS/FontComposition.ps.gz

Regards,Martin.



Re: Fwd: Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Mark E. Shoulson
On 12/07/03 08:55, Peter Jacobi wrote:

Hi Mark, All,

 

I also agree, but I point out that the sufficiently perverse could come 
up with some pretty tough examples.  Applying color is a pretty benign 
style, but what if I wanted a boldface circumflex on a normal letter?  
Or even more obnoxious, a 10-point circumflex on an 8-point letter?  
These could be tricky to compute.
   

Please have a look at the examples. This isn't a parallel to
accents. The Tamil vowels and consonants in question are
clearly distinct side by side. They could have been styled using
a mechanical typewriter by double-striking, underlining or 
switching to the second color. Individually.

Yes, I agree.  The discussion had moved farther afield, though, to the 
general case of styling combining sequences, so I was exploring other 
combining sequences that are or could be pathological.

if you ask the system to do bizarre things, it's your own fault (while 
applying color is not quite so bizarre).
   

Emphasizing a single letter isn't bizarre. It is often used in educational
material.
No, but changing font size between a letter and a diacritic is.

(and now I contradict myself with a counterexample.  In 
http://omega.enstb.org/yannis/pdf/biblical-hebrew94.pdf, Yannis 
Haralambous notes--correctly--that when typesetting the Hebrew Bible, 
letters that are written small hang from the top line and have 
normal-sized vowels below them (and the vowels are below the baseline of 
the normal text))

~mark





Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Peter Jacobi
Hi Doug, All,

"Doug Ewell" <[EMAIL PROTECTED]>:
> This is browser behavior, not word processor behavior, and certainly not
> an inherent defect in the Unicode logical-order model.  Display engines
> need to do a better job of applying style to individual reordrant
> glyphs, that's all.

I've already agreed to the second sentence. But further tests
of the first sentence vote for pessimism:

Lowly Wordpad and Open Source Abiword won't be
able to style individual Tamil Unicode characters -
but you can use them in the old-style hackish way for
TSCII and then you can of course style individual 
characters.

Then I downloaded WorldPad from SIL and specially hinted
"Code2000 Tamil Graphite" font for this program. But this
still doesn't work for styling individual Tamil characters.

Other (hopefully positive) test results most welcome!

Regards,
Peter Jacobi

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Neu: Preissenkung für MMS und FreeMMS! http://www.gmx.net





Re: Fwd: Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Mark E. Shoulson
On 12/07/03 07:25, Peter Jacobi wrote:

[..]  Display engines
need to do a better job of applying style to individual reordrant
glyphs, that's all.
   

I fully agree with this. Do you know of any display engine that is capable
of this?
I also agree, but I point out that the sufficiently perverse could come 
up with some pretty tough examples.  Applying color is a pretty benign 
style, but what if I wanted a boldface circumflex on a normal letter?  
Or even more obnoxious, a 10-point circumflex on an 8-point letter?  
These could be tricky to compute.

Of course, "Garbage In, Garbage Out" is probably a good answer to this: 
if you ask the system to do bizarre things, it's your own fault (while 
applying color is not quite so bizarre).

~mark




Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Jungshik Shin

On Sun, 7 Dec 2003, Peter Jacobi wrote:

Hi,

> > [..] Anyway, he
> > should have used 'lang' tag to help browsers pick up fonts. In two
> > pages above, simply adding 'lang="ta"' to  would suffice.
>
> But I assume (and tested), that language tagging doesn't help
> Mozilla in rendering the 'styled' example.

  Sure, it doesn't (for the reason I explained in  another message.)
My point was not that it'd help Mozilla render the style example but
that language tagging in general is a good idea.

> And, unfortunately, language tagging the TSCII version interferes with
> the font hack to correctly display TSCII pages. If you want to able to see
> TSCII and Unicode Tamil pages, Mozilla's font setup must associate a
> Tamil Unicode font with 'Tamil'  and a Tamil TSCII font with 'User Defined'.

  This is not an unreasonable requirement, is it? How could the
browser know what's meant by 'user defined'?

> There is some mixup of lang and encoding tagging, which I didn't fully
> understand.

   When lang is not explicitly specified, Mozilla resorts to 'inferring'
'langGroup' ('script (group)' would have been a better term) from
the page encoding. Because UTF-8 is script-neutral, it's important to
specify 'lang' explicitly. Your page is in ISO-8859-1, so without
lang specified it's assumed to be in the 'x-western' langGroup (well, Latin
script). Anyway, this behavior changed slightly recently in the Windows
version (I forget whether I committed that patch before or after 1.4)
and each Unicode block is assigned the default 'script'. The way fonts
are picked up by the Xft version of Mozilla makes it harder to do the
equivalent on Linux.


> >   You're right. Anyway, this is an interesting challenge to
> > layout/rendering engines.
>
> Then you consider
>   லொ
> to be valid input, which ideally should render as intended?

  Yes, I do.  As I wrote in another message, this thread should
be interesting to people on W3C I18N-WG. You may consider moving (or
crossposting) the thread to I18N WG's public mailing list.

  Jungshik




Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Jungshik Shin

On Sun, 7 Dec 2003, Peter Jacobi wrote:

> > In Unicode U+0BBE, U+0BC6 and U+0BCA are all dependent vowel signs
>
> Yes, but just this fact doesn't meet users' expectations. It is
> inherited from the ISCII unification.

  I don't think there's any rule that dependent vowel signs cannot be
styled independently from consonants preceding them.

> But the core problem is not on the theoretical, but on the practical side.
>
> As it was possible to style individual characters in legacy encodings
> (heck, it was possible using a mechanical Tamil typewriter!), what is to
> be done in migration to Unicode?
>
> So, I'm still wondering whether Unicode and HTML4 will consider
>   லா
> valid and it is the task of the user agent to make the best out of it.

  I think this is valid. A more interesting case has to do with
W3 CHARMOD in which NFC is required/recommended (it's not yet complete
and W3C I18N-WG has been discussing it).  Consider the following case.

  ல&#x0BC7;
 &#x0BBE;

Because  is equivalent to U+0BCB, we couldn't use
the above if NFC is required even though in legacy TSCII encoding,
it's possible. The same is true of Korean syllables(see below) as
Philippe pointed out.

  각


> > In Mozilla you may be completely breaking the font lookups by separately
> > formatting the different parts of a conjunct.
>
> As I've understood, Mozilla (i.e. Jungshik Shin) internally transcodes
> to TSCII before display. Or is this only done on Linux?

  It's only done on Linux and on Win 9x/ME. On Win 2k/XP, it relies on
TextOut (the exact name of the Win32 API escapes me at the moment)
'implicitly invoking the Uniscribe API' (on our behalf). Mozilla needs to
invoke Uniscribe APIs directly on Windows, as is done by MS IE (and Pango
APIs on Linux, ATSUI on Mac OS X).  Either way, what's currently happening
is that enclosing U+0BB2 with  breaks it apart from U+0BBE, so that
the 'context' is lost and both U+0BB2 and U+0BBE are rendered separately
(so that reordering doesn't happen).

  Jungshik



Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Peter Jacobi
Hi Jungshik, All,

> [..] Anyway, he
> should have used 'lang' tag to help browsers pick up fonts. In two
> pages above, simply adding 'lang="ta"' to  would suffice.

But I assume (and tested), that language tagging doesn't help
Mozilla in rendering the 'styled' example. 

And, unfortunately, language tagging the TSCII version interferes with
the font hack to correctly display TSCII pages. If you want to able to see
TSCII and Unicode Tamil pages, Mozilla's font setup must associate a
Tamil Unicode font with 'Tamil' and a Tamil TSCII font with 'User Defined'.

There is some mixup of lang and encoding tagging, which I didn't fully
understand.

> > This is browser behavior, not word processor behavior, and certainly not
> > an inherent defect in the Unicode logical-order model.  Display engines
> > need to do a better job of applying style to individual reordrant
> > glyphs, that's all.
> 
>   You're right. Anyway, this is an interesting challenge to
> layout/rendering engines. 

Then you consider 
  லொ
to be valid input, which ideally should render as intended?

From a processing point of view, it is somewhat challenging,
as you may have to parse through lots of markup, until
you know what to do with the 0BB2. 

As I've understood from other posts, the font support for
all this is theoretically available, but not often done in practice.

Regards,
Peter Jacobi






Fwd: Re: Fwd: Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Peter Jacobi
Hi Mark, All,

> I also agree, but I point out that the sufficiently perverse could come 
> up with some pretty tough examples.  Applying color is a pretty benign 
> style, but what if I wanted a boldface circumflex on a normal letter?  
> Or even more obnoxious, a 10-point circumflex on an 8-point letter?  
> These could be tricky to compute.

Please have a look at the examples. This isn't a parallel to
accents. The Tamil vowels and consonants in question are
clearly distinct side by side. They could have been styled using
a mechanical typewriter by double-striking, underlining or 
switching to the second color. Individually.

> if you ask the system to do bizarre things, it's your own fault (while 
> applying color is not quite so bizarre).

Emphasizing a single letter isn't bizarre. It is often used in educational
material.

Please consider the analogy of 'th' being considered an unbreakable
conjunct in English, and any attempts to style only the 't' or
only the 'h' failing with differing results.

Regards,
Peter Jacobi






Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Jungshik Shin
On Sat, 6 Dec 2003, Doug Ewell wrote:

> Peter Jacobi  wrote:
>
> > Some tests: In Mozilla 1.4.1 the characters fall apart and in IE5.5
> > the style expands to the entire orthographic syllable.
> > Unicode test page: http://www.jodelpeter.de/i18n/tamil/markup-uc.htm
> > TSCII test page: http://www.jodelpeter.de/i18n/tamil/markup-tscii.htm
>
> BTW, your "Unicode test page" is marked:
>
>   content="text/html; charset=ISO-8859-1">

  Peter uses NCRs, so that doesn't matter (although I prefer to
tag the page as 'UTF-8' even in that case). Anyway, he
should have used the 'lang' attribute to help browsers pick fonts. In the two
pages above, simply adding 'lang="ta"' to  would suffice.
In xref-uc.htm, if you want a fine-grained control, he can just globally
replace '&#' with '&#'.


> while your TSCII test page is marked "x-user-defined".  I'm not sure
> what either of those declarations accomplishes.

   TSCII is not recognized by most browsers (it's not registered with
IANA)[1]. 'x-user-defined' means that to view the page one has
to configure one's browser to use a Tamil 'custom encoded' [2] font
(in TSCII/TAM? encoding) when rendering an 'x-user-defined' page.
Most browsers have an option to set fonts for 'x-user-defined'. It's
certainly better than tagging it as 'iso-8859-1' or 'windows-1252'.

> > After seeing this effect at its source, it's now clear why you can't
> > style individual Tamil characters in a word processor, when using
> > Unicode (whereas you can do so, in legacy encodings).
>
> This is browser behavior, not word processor behavior, and certainly not
> an inherent defect in the Unicode logical-order model.  Display engines
> need to do a better job of applying style to individual reordrant
> glyphs, that's all.

  You're right. Anyway, this is an interesting challenge to
layout/rendering engines. In the case of Korean Hangul (as Philippe wrote),
it's even more so because unlike Indic scripts[3], it has multiple
canonically equivalent (and not-canonically-equivalent in Unicode sense
but nonetheless 'equivalent' in a certain sense) representations.

   Jungshik

[1]  http://bugzilla.mozilla.org/show_bug.cgi?id=186463

[2] 'Custom' (or 'hack') encoded : Windows-1252, Symbol or MacRoman Cmap
is used to store Tamil glyphs (or other glyphs for other Indic scripts).
Needless to say, we want to leave these fonts behind and move on.

[3] As is well known, there are a few letters for which there are two
   canonically equivalent representations in Indic scripts.



RE: Transcoding Tamil in the presence of markup

2003-12-07 Thread Michael Everson
At 13:18 +0100 2003-12-07, Peter Jacobi wrote:

This is the core problem. Legacy Tamil encodings, and their use, treat the 
vowel signs as individual characters. Switching to ISCII or Unicode 
will break this.
Oh well. Sometimes data can't be migrated reversibly. Think of the 
thousand years that Unicode *will* be used, rather than the little 
bit of data generated in the last twenty years.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Peter Jacobi
Hi Christopher, All,

> In Unicode U+0BBE, U+0BC6 and U+0BCA are all dependent vowel signs

Yes, but just this fact doesn't meet users' expectations. It is
inherited from the ISCII unification.

> Since in some fonts a base character + combining vowel mark
> might be displayed by a single ligature glyph, it makes sense to apply the
> formatting of a base character to any dependent combining characters as
> well.

In Tamil most vowels never form ligatures. (O.K., the exact value of 'most'
has changed over time and was lower in the past, but it was never less
than 7/11, to my best knowledge, and is now 9/11.)

But the core problem is not on the theoretical, but on the practical side.

As it was possible to style individual characters in legacy encodings
(heck, it was possible using a mechanical Tamil typewriter!), what is to
be done in migration to Unicode?

So, I'm still wondering whether Unicode and HTML4 will consider
  லா
valid and it is the task of the user agent to make the best out of it.

> In Mozilla you may be completely breaking the font lookups by separately
> formatting the different parts of a conjunct.

As I've understood, Mozilla (i.e. Jungshik Shin) internally transcodes
to TSCII before display. Or is this only done on Linux?

Regards,
Peter Jacobi






Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Peter Jacobi
Hi John,

> It's not clear to me what you are saying. I have viewed markup-uc.htm 
> with Safari and OmniWeb on the Mac and with IE 6.0.2800.11061S and 
> Opera 7 on a machine running Windows NT4 and there is no major 
> problem with the styled display which is precisely as you specify in 
> the source. 

Thank you for your tests. I've retested with IE6 and Opera7 on W2K:

IE6: The complete syllable comes out blue
Opera: The characters come out in the wrong order (both
styled and unstyled!)

Can you please tell me which fonts you are using? And have you
added complex script support to NT4?

> Two things might just conceivably help other browsers 
> along -- to use decimal rather than hexadecimal entities, and to 
> declare utf-8 as the character set  [..]

I think this is only good for version 4 browsers, and they will break 
less subtly.

Regards,
Peter Jacobi






Fwd: Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Peter Jacobi
Dear Doug, All,

> BTW, your "Unicode test page" is marked:
>   content="text/html; charset=ISO-8859-1">

This is of course redundant as this is the HTTP default.
The heading 'Unicode' means the logical content, not the
encoding. The Tamil content is given as hex NCRs.

> while your TSCII test page is marked "x-user-defined".  

As the legacy Tamil charsets are not IANA registered, Tamil
users typically have a TSCII font set up for the display
of "x-user-defined" pages.

> I'm not sure
> what either of those declarations accomplishes.

I hope I could clarify this.

> [..]  Display engines
> need to do a better job of applying style to individual reordrant
> glyphs, that's all.

I fully agree with this. Do you know of any display engine that is capable
of this?

> 
> > It's hard to promote Unicode, when things that have worked in the
> > past, stop working.
> 
> This is alarmist and unnecessary.

This is born out of sheer frustration. I was arguing for weeks on some
mailing lists that programmatic conversion to Unicode is easy and no
features are lost.
Now a very simple point I forgot to think about has hit me.

Regards,
Peter Jacobi






RE: Transcoding Tamil in the presence of markup

2003-12-07 Thread Peter Jacobi
Hi Maurice, All,

> If you are trying to stylise one glyph of a multiglyph character, you are
> hung. The smallest unit in Unicode (and the standards dependent upon it)
> is
> the character (not the glyph). You can change the 'feature' of individual
> glyphs using Graphite (also from SIL), but OpenType only has on/off
> settings
> for feature (and seldom accessible).

This is the core problem. Legacy Tamil encodings, and their use,
treat the vowel signs as individual characters. Switching to 
ISCII or Unicode will break this.

And most Tamil vowel signs appear as very distinct characters;
otherwise a Tamil typewriter would not have been possible.

Regards,
Peter Jacobi






Re: Transcoding Tamil in the presence of markup

2003-12-06 Thread John Delacour
It's not clear to me what you are saying. I have viewed markup-uc.htm 
with Safari and OmniWeb on the Mac and with IE 6.0.2800.11061S and 
Opera 7 on a machine running Windows NT4 and there is no major 
problem with the styled display which is precisely as you specify in 
the source.  Two things might just conceivably help other browsers 
along -- to use decimal rather than hexadecimal entities, and to 
declare utf-8 as the character set -- but this won't make a scrap of 
difference to a browser such as IE 5.2.3 for Mac (and many others) 
because it simply can't deal with anything that won't convert to 
standard legacy character sets.

I have come across several instances of Win IE 6 not displaying 
Unicode characters as it should, or not at all, and there are probably 
many such instances outside my experience -- whether due to the OS or 
to MSIE -- but a good up-to-date browser will not misbehave.

JD



At 7:39 pm +0100 6/12/03, Peter Jacobi wrote:

In Unicode:
 lA லா
 le லெ
 lo லொ
It is easy to see that a simple n:m mapping cannot make this conversion.
It is not that easy to judge whether this is the desired conversion at all,
and what the receiving software should do with it.
Some tests: In Mozilla 1.4.1 the characters fall apart and in IE5.5 the
style expands to the entire orthographic syllable.
Unicode test page: http://www.jodelpeter.de/i18n/tamil/markup-uc.htm
TSCII test page: http://www.jodelpeter.de/i18n/tamil/markup-tscii.htm
After seeing this effect at its source, it's now clear why you can't style
individual Tamil characters in a word processor, when using Unicode (whereas
you can do so, in legacy encodings).
It's hard to promote Unicode, when things that have worked in the past,
stop working.
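The quoted 'lo' example also shows the encoding model at work: Unicode stores the syllable in logical order (consonant before vowel sign), and the two-part vowel sign O is canonically equivalent to the pair E + AA, which a normalizer makes visible (a sketch assuming Python's unicodedata):

```python
import unicodedata

lo = "\u0BB2\u0BCA"  # TAMIL LETTER LA + VOWEL SIGN O, logical order

# The two-part vowel sign O (U+0BCA) canonically decomposes into
# VOWEL SIGN E (U+0BC6) + VOWEL SIGN AA (U+0BBE); on screen the E part
# is drawn to the left of the consonant, i.e. in visual order.
assert unicodedata.normalize("NFD", lo) == "\u0BB2\u0BC6\u0BBE"
assert unicodedata.normalize("NFC", "\u0BB2\u0BC6\u0BBE") == lo
```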




RE: Transcoding Tamil in the presence of markup

2003-12-06 Thread Philippe Verdy
Christopher John Fynn writes:
> In Unicode U+0BBE, U+0BC6 and U+0BCA are all dependent vowel signs
> IE is probably  treating a base character and any dependent
> vowels as a single
> unit. Since in  some fonts a base character + combining vowel
> mark might be
> displayed by a single ligature glyph, it makes sense to apply the
> formatting of
> a base character to any dependent combining characters as well.
>
> In Mozilla you may be completely breaking the font lookups by separately
> formatting the different parts of a conjunct.
>
> In legacy glyph based Tamil encodings there was a simple one-to-one
> correspondence between characters and glyphs, so it is straightforward to apply
> different formatting to different characters.

Still this is an interesting problem: some texts for example want to
exhibit some diacritics added to a base letter with a distinct color,
notably in linguistic texts related to grammar or orthography.

So for example you could want to exhibit the difference between the two
French words "désert" and "dessert" by coloring the accent of the first
word or the second s of the second; or even more accurately between
"bailler" (concéder un bail: to grant a lease) and "bâiller" (ouvrir en
grand: to open wide), where the presence or absence of the circumflex on
the letter 'a' is necessary to reflect the difference in both meaning and
pronunciation.

However, this is not a problem of Unicode itself, but of the rich-text
format used to add style to a given text. In Unicode (and even in HTML
and SGML), a letter 'a' followed by a circumflex is canonically equivalent
to the composed letter 'a' with a circumflex. However, if you add tags
between a base letter and its diacritics, you create separate texts and
you then have a defective combining sequence in the second string
starting with the circumflex.

For Unicode, this circumflex will logically attempt to create a
combining sequence with its previous HTML or SGML or XML tag. This
will break many parsers that use the Unicode rules when handling files
encoded with a Unicode encoding scheme like UTF-8.

Creating text that relies on this HTML "feature" is very hazardous, as
the interpretation and rendering of defective combining sequences is
implementation-specific: an application may render the diacritic on a
dotted-circle base glyph, display it on an empty base glyph, attach the
defective combining sequence to the previous combining sequence, or
simply be unable to render the sequence at all (the previous combining
sequence may not be accessible in the current rendering context).
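Whether a text fragment begins with such a defective combining sequence
can be tested mechanically. A small Python sketch (the function name is
illustrative; the test approximates "combining mark" by general
category):

```python
import unicodedata

def is_defective(s: str) -> bool:
    # A combining character sequence is "defective" when it begins with
    # a combining mark (general category Mn, Mc, or Me) rather than a
    # base character.
    return bool(s) and unicodedata.category(s[0]) in ("Mn", "Mc", "Me")

# Splitting 'a' + COMBINING CIRCUMFLEX across two markup runs leaves
# the second run defective:
assert not is_defective("a\u0302")
assert is_defective("\u0302")
```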

If one really wants to add style to diacritics only, it is not in
Unicode that a solution should be sought, but in the styling or tagging
language itself (though defining such a style rule would be extremely
tricky, and adding it with intermediate tags does not conform to the
W3C recommendation of separating text from styles). So that is an
interesting question to submit to the W3C for its CSS specification; I
think Unicode itself will not let you define anything else.

For now you can use a conforming solution that consists of HTML code
like this (here rendering the circumflex above 'a' in red):

a<span style="color:red">&#32;&#x302;</span>

or better with a style sheet:


<style type="text/css"> .red { color: red } </style>
...
a<span class="red">&#32;&#x302;</span>

This code does not contain any defective sequence, and it treats the
diacritic as a separate graphic unit (which it really is, if you need a
style that detaches it from the regular text).






Re: Transcoding Tamil in the presence of markup

2003-12-06 Thread Christopher John Fynn
In Unicode U+0BBE, U+0BC6 and U+0BCA are all dependent vowel signs

IE is probably treating a base character and any dependent vowels as a
single unit. Since in some fonts a base character + combining vowel mark
might be displayed by a single ligature glyph, it makes sense to apply the
formatting of a base character to any dependent combining characters as well.

In Mozilla you may be completely breaking the font lookups by separately
formatting the different parts of a conjunct.

In legacy glyph-based Tamil encodings there was a simple one-to-one
correspondence between characters and glyphs, so it is straightforward to
apply different formatting to different characters.
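Using the TSCII byte values quoted in Peter Jacobi's message below, the
reordering a converter must perform can be sketched as follows. This is
a minimal illustration covering only the consonant LA with the vowel
signs AA, E, and O; the function name and structure are hypothetical,
not from any real converter:

```python
def tscii_to_unicode(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0xA7:   # left-side part of E/O, stored BEFORE the consonant
            # only the consonant LA (0xC4) is handled in this sketch
            assert i + 1 < len(data) and data[i + 1] == 0xC4
            out.append("\u0BB2")           # consonant first, in logical order
            if i + 2 < len(data) and data[i + 2] == 0xA1:
                out.append("\u0BCA")       # e-part + aa-part -> vowel sign O
                i += 3
            else:
                out.append("\u0BC6")       # vowel sign E
                i += 2
        elif b == 0xC4:                    # consonant LA
            out.append("\u0BB2")
            if i + 1 < len(data) and data[i + 1] == 0xA1:
                out.append("\u0BBE")       # right-side vowel sign AA
                i += 2
            else:
                i += 1
        else:
            i += 1                         # other bytes ignored in this sketch
    return "".join(out)

# lo: visual order \xA7 \xC4 \xA1 becomes logical order U+0BB2 U+0BCA
assert tscii_to_unicode(b"\xA7\xC4\xA1") == "\u0BB2\u0BCA"
```

This is exactly why a byte-for-byte n:m mapping fails once markup can
land between the visually ordered pieces.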

--
Christopher J. Fynn



- Original Message - 
From: "Peter Jacobi" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, December 06, 2003 6:39 PM
Subject: Transcoding Tamil in the presence of markup


> Dear All,
>
> I am attempting transcoding Tamil text (in legacy 8-bit encodings, which
> are in visual glyph order, being heirs of the Tamil typewriter) into Unicode
> (which uses the 'logical' order invented by ISCII):
> http://www.jodelpeter.de/i18n/tamil/xref-uc.htm
>
> When I thought my converter was ready, I had a severe collision
> with reality when I tried it on some web pages.
>
> The problem: in the legacy encoding you can style individual characters,
> which not only breaks my simple converter, but which may have no
> good equivalent in Unicode anyway. See this example:
> (all legacy encoded Tamil is shown using C-style escape, Unicode Tamil as
> NCR)
>
> Converting unstyled text
> from TSCII
>  lA \xC4\xA1
>  le \xA7\xC4
>  lo \xA7\xC4\xA1
> to Unicode
>  lA லா
>  le லெ
>  lo லொ
>
> Now the consonant l should get a distinct color:
> In TSCII:
>  lA \xC4\xA1
>  le \xA7\xC4
>  lo \xA7\xC4\xA1
>
> In Unicode:
>  lA லா
>  le லெ
>  lo லொ
>
> It is easy to see that a simple n:m mapping cannot make this conversion.
> It is not that easy to judge whether this is the desired conversion at all,
> or what the receiving software should do with it.

> Some tests: In Mozilla 1.4.1 the characters fall apart and in IE5.5 the
> style expands to the entire orthographic syllable.
> Unicode test page: http://www.jodelpeter.de/i18n/tamil/markup-uc.htm
> TSCII test page: http://www.jodelpeter.de/i18n/tamil/markup-tscii.htm
>
> After seeing this effect at its source, it's now clear why you can't style
> individual
> Tamil characters in a word processor, when using Unicode (whereas
> you can do so, in legacy encodings).
>
> It's hard to promote Unicode, when things that have worked in the past,
> stop working.
>
> Any insights?
>
> Regards,
> Peter Jacobi
>
>
>
>
> -- 
>
>
>




Re: Transcoding Tamil in the presence of markup

2003-12-06 Thread Doug Ewell
Peter Jacobi  wrote:

> Some tests: In Mozilla 1.4.1 the characters fall apart and in IE5.5
> the style expands to the entire orthographic syllable.
> Unicode test page: http://www.jodelpeter.de/i18n/tamil/markup-uc.htm
> TSCII test page: http://www.jodelpeter.de/i18n/tamil/markup-tscii.htm

BTW, your "Unicode test page" is marked:



while your TSCII test page is marked "x-user-defined".  I'm not sure
what either of those declarations accomplishes.

> After seeing this effect at its source, it's now clear why you can't
> style individual Tamil characters in a word processor, when using
> Unicode (whereas you can do so, in legacy encodings).

This is browser behavior, not word processor behavior, and certainly not
an inherent defect in the Unicode logical-order model.  Display engines
need to do a better job of applying style to individual reordrant
glyphs, that's all.

> It's hard to promote Unicode, when things that have worked in the
> past, stop working.

This is alarmist and unnecessary.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/