RE: Roundtripping in Unicode

2004-12-15 Thread D. Starner
"Arcane Jill" writes:

> The obvious solution is for all Unix machines everywhere to be using the same 
> locale - and it 
> had better be UTF-8. But an instantaneous global switch-over is never going 
> to happen, so we see 
> this gradual switch-over ... and it is during this transition phase that 
> Lars's problem 
> manifests. 

The only solution is (a) to use ASCII or (b) to make the switch over as quick 
and clean as possible. Anyone who wants to create new files in UTF-8 and leave 
their old files in the old encoding is asking for trouble. There's no magic
bullet, and complaining here as much as you want won't help. If you're a
system administrator, explain that to the people using your system, and
treat stupid responses just like you would any LART-worthy response.






RE: Nicest UTF

2004-12-13 Thread D. Starner
> Some won't convert any and will just start using UTF-8 
> for new ones. And this should be allowed. 

Why should it be allowed? You can't mix items with
different unlabeled encodings willy-nilly. All you're going
to get, all you can expect to get is a mess.





RE: Nicest UTF

2004-12-11 Thread D. Starner
"Lars Kristan" writes:
 
> > A system administrator (because he has access to all files).
> My my, you are assuming all files are in the same encoding. And what about
> all the references to the files in scripts? In configuration files? Soft
> links? If you want to break things, this is definitely the way to do it.

Was it ever really wise to use non-ASCII file names in scripts and configuration
files? It's not very hard to convert soft links at the same time. Nor, really,
should it be too hard to figure out the encodings; /home/foo/.bashrc probably
tells you, as well as simple logic.

Even if you can't do a system-wide change, it's easy enough to change the
system files, and post a message about switching to UTF-8, and offering to
assist any users with the change.






Re: Nicest UTF

2004-12-11 Thread D. Starner
"Marcin 'Qrczak' Kowalczyk" writes:
> But demanding that each program which searches strings checks for 
> combining classes is I'm afraid too much. 

How is it any different from a case-insensitive search?
 
> >> Does "\n" followed by a combining code point start a new line? 
> > 
> > The Standard says no, that's a defective combining sequence. 
> 
> Is there *any* program which behaves this way? 

I misstated that; it's a new line followed by a defective combining sequence.
 
> It doesn't matter that accented backslashes don't occur practice. I do 
> care for unambiguous, consistent and simple rules. 

So do I; and the only unambiguous, consistent and simple rule that won't
give users hell is that "ba" never matches "bä". Any programs for end-users
must follow that rule.
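
To make the rule concrete, here is a minimal Python sketch (the function
name is made up): normalize both strings, then reject any hit whose last
character is extended by a trailing combining mark, so a search for "ba"
never lands inside a decomposed "bä".

    import unicodedata

    def grapheme_safe_find(needle: str, hay: str) -> int:
        """Find needle in hay without matching inside a combining sequence."""
        hay = unicodedata.normalize("NFC", hay)
        needle = unicodedata.normalize("NFC", needle)
        i = hay.find(needle)
        while i != -1:
            end = i + len(needle)
            # Accept the hit only if it is not continued by a combining mark.
            if end == len(hay) or not unicodedata.combining(hay[end]):
                return i
            i = hay.find(needle, i + 1)
        return -1

    grapheme_safe_find("ba", "ba\u0308")  # -1: NFC folds 'a'+diaeresis to 'ä'
    "ba" in "ba\u0308"                    # True: the naive code point match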
 
> My current implementation doesn't support filenames which can't be 
> encoded in the current default encoding. 

The right thing to do, IMO, would be to support filenames as byte strings,
and let the programmer convert them back and forth between character strings,
knowing that it won't roundtrip.
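
A sketch of that in Python 3, which exposes POSIX filenames as byte
strings; the decoded form is for display only and, as said above, does not
round-trip (Python's surrogateescape error handler is the usual workaround
when round-tripping is required):

    # A filename that is valid as bytes but not as UTF-8.
    raw = b"caf\xe9.txt"                           # Latin-1 'café.txt'
    shown = raw.decode("utf-8", errors="replace")  # 'caf\ufffd.txt'
    shown.encode("utf-8") == raw                   # False: the trip is lossy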

> If the 
> program assumed that an accented slash is not a directory separator, 
> I expect possible security holes (the program thinks that a string 
> doesn't include slashes, but from the OS point of view it does). 

If the program assumes that an accented slash is not a directory separator,
then it's wrong. Any way you go is going to require sensitivity.

> > The rules you are offering are only simple and unambiguous to the 
> > programmer; they appear completely random to the end user. 
> 
> And yours are the opposite :-) 

Programmers get to spend a lot of time dealing with the "random"
requirements of users, not the other way around.
 





Re: Nicest UTF

2004-12-10 Thread D. Starner
John Cowan writes:

> You are reading the XML Recommendation incorrectly.  It is not defined
> in terms of codepoints (8-bit, 16-bit, or 32-bit) but in terms of
> characters.  XML processors are required to process UTF-8 and UTF-16,
> and may process other character encodings or not.  But the internal
> model is that of characters.  Thus surrogate code points are not
> allowed.

Okay, I'm confused. Does ≮ open a tag? Does it matter if it's composed or 
decomposed?





Re: Nicest UTF

2004-12-10 Thread D. Starner
"Marcin 'Qrczak' Kowalczyk" writes:

> "D. Starner" writes: 
>
> > This implies that every programmer needs an indepth knowledge of 
> > Unicode to handle simple strings. 
> 
> There is no way to avoid that. 

Then there's no way that we're ever going to get reliable Unicode
support. 
 
> If the runtime automatically performed NFC on input, then a part of a 
> program which is supposed to pass a string unmodified would sometimes 
> modify it. Similarly with NFD.

No. By the same logic you used above, I can expect the programmer to
understand their tools, and if they need to pass strings unmodified,
they shouldn't load them using methods that normalize the string.
 
> You can't expect each and every program which compares strings to 
> perform normalization (e.g. Linux kernel with filenames). 

As has been pointed out here, Posix filenames are not character strings; 
they are byte strings. They quite likely aren't even valid UTF-8 strings.

> > So S should _sometimes_ match an accented S? Again, I feel extended misery 
> > of explaining to people why things aren't working right coming on. 
> 
> Well, otherwise things get ambiguous, similarly to these XML issues. 

Sometimes things get ambiguous if one day ŝ is matched by s and one
day ŝ isn't? That's absolutely wrong behavior; the program must serve
the user, not the programmer. 's' cannot, should not, must not match 'ŝ';
and if it must, then it absolutely always must match 'ŝ', and some way
to make a regex that matches s but not ŝ must be designed. It doesn't
matter what problems exist in the world of programming; that is the
entirely reasonable expectation of the end user.

> Does "\n" followed by a combining code point start a new line? 

The Standard says no, that's a defective combining sequence.
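
A defective combining sequence is one that begins with a combining mark;
it is easy to detect from the combining class of the first code point. A
small Python check, with a made-up name:

    import unicodedata

    def starts_defective(s: str) -> bool:
        """True if s begins with a combining mark."""
        return bool(s) and unicodedata.combining(s[0]) != 0

    starts_defective("\u0301x")   # True: lone COMBINING ACUTE ACCENT
    starts_defective("x\u0301")   # False: the mark has a base character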

> Does 
> a double quote followed by a combining code point start a string 
> literal? 

That would depend on your language. I'd prefer no, but it's obvious
many have made other choices.

> Does a slash followed by a combining code point separate 
> subdirectory names?

In Unix, yes; that's because filenames in Unix are byte streams with
the byte 0x2F acting as a path separator.
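
To illustrate at the byte level (a minimal Python sketch): the 0x2F byte
separates components no matter what follows it, so a combining mark after
a slash simply lands at the start of the next component.

    path = "dir/\u0301file".encode("utf-8")  # slash, then COMBINING ACUTE
    path.split(b"/")                         # [b'dir', b'\xcc\x81file']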
 
> It's hard enough to convince them that a 
> character is not the same as a byte. 

That contradicts your above statement that every programmer needs an
in-depth knowledge of Unicode.

> In case I want to circumvent security or deliberately cause a piece of 
> software to misbehave. Robustness require unambiguous and simple rules. 

The rules you are offering are only simple and unambiguous to the programmer;
they appear completely random to the end user. To have ≮ sometimes start a
tag means that a user can't look at the XML and tell whether something opens
a tag or is just text. You might be able to expect all programmers, but you
can't expect all end users to.





Re: Nicest UTF

2004-12-08 Thread D. Starner
"Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> writes:
> If it's a broken character reference, then what about Á (769 is
> the code for combining acute if I'm not mistaken)?

Please start adding spaces to your entity references or 
something, because those of us reading this through a web interface
are getting very confused.





Re: Nicest UTF

2004-12-08 Thread D. Starner
"Marcin 'Qrczak' Kowalczyk" writes:
> String equality in a programming language should not treat composed
> and decomposed forms as equal. Not this level of abstraction.

This implies that every programmer needs an in-depth knowledge of Unicode
to handle simple strings. The concept makes me want to replace Unicode;
spending the rest of my life explaining to programmers, and people who use
their programs, why a search for "Römische Elegien" isn't finding the book
is not my idea of happiness.
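
The failure mode is easy to reproduce in Python: a title stored in
decomposed form never compares equal to a composed search string until
somebody normalizes both sides.

    import unicodedata

    composed = "R\u00f6mische Elegien"     # 'ö' as one code point
    decomposed = "Ro\u0308mische Elegien"  # 'o' + COMBINING DIAERESIS
    composed == decomposed                                # False
    unicodedata.normalize("NFC", decomposed) == composed  # True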

> IMHO splitting into graphemes is the job of a rendering engine, not of
> a function which extracts a part of a string which matches a regex.

So S should _sometimes_ match an accented S? Again, I feel extended misery
of explaining to people why things aren't working right coming on.

> They are supposed to be equivalent when they are actual characters.
> What if they are numeric character references? Should "≮"
> (7 characters) represent a valid plain-text character or be a broken
> opening tag?

Which 7 characters? My email "client" turned them into the actual characters.
But I think it's fairly obvious that XML added entities in part so you
could include '<'s and other characters without them getting interpreted as
part of the text of the document. Similarly, a combining character entity
following an actual < should be the start of a tag. 

>Note that if it's a valid plain-text character, it's impossible
>to represent isolated combining code points in XML, 

No more than it's impossible to represent '<' in the text.

> I expect breakage of XML-based protocols if implementations are
> actually changed to conform to these rules (I bet they don't now).

Really? In what cases are you storing isolated combining code points
in XML as text? I can think of hypothetical cases, but most real-world
use isn't going to be affected. If I were designing such an XML protocol,
I'd probably store it as a decimal number anyway; XML is designed to
be human-readable, and an isolated combining character that randomly 
combines with other characters that it's not logically associated with 
when displayed isn't particularly human readable.

> Implementing an API which works in terms of graphemes over an API
> which works in terms of code points is more sane than the converse,
> which suggests that the core API should use code points if both APIs
> are sometimes needed at all.

Implementing an API which works in terms of lists over an API which works
in terms of pointers is more sane than the converse, which suggests that the
core API should use pointers if both APIs are sometimes needed at all.

> While I'm not obsessed with efficiency, it would be nice if changing
> the API would not slow down string processing too much.

Who knows how much it would slow down string processing? If I get around
to writing the test code, I'll try and see how much it slows stuff down,
but right now we don't know.






Re: Nicest UTF

2004-12-08 Thread D. Starner
"Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> writes:
> "D. Starner" <[EMAIL PROTECTED]> writes:
>
> > You could hide combining characters, which would be extremely useful if we 
> > were just using Latin 
> > and Cyrillic scripts.
> 
> It would need a separate API for examining the contents of a combining
> character. You can't avoid the sequence of code points completely.

Not a separate API; a function that takes a character and returns an array
of integers.
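
In Python terms, something like this sketch (the name is made up), rather
than a parallel string API:

    def codepoints(grapheme: str) -> list[int]:
        """Examine the contents of one combining character sequence."""
        return [ord(c) for c in grapheme]

    codepoints("a\u0308")  # [97, 776]: base 'a' plus COMBINING DIAERESIS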

> It would yield to surprising semantics: for example if you concatenate
> a string with N+1 possible positions of an iterator with a string with
> M+1 positions, you don't necessarily get a string with N+M+1 positions
> because there can be combining characters at the border.

The semantics there are surprising, but that's true no matter what you
do. An NFC string + an NFC string may not be NFC; the resulting text
doesn't have N+M graphemes. Unless you're explicitly adding a combining
character, a combining character should never start a string. This could 
be fixed several ways, including by inserting a dummy character to hold 
the combining character, and "normalizing" the string by removing the dummy 
characters. That would, for the most part, only hurt pathological cases.
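
The NFC point is easy to demonstrate in Python: both operands below are
normalized on their own, yet their concatenation is not, and it contains
one grapheme rather than two.

    import unicodedata

    a, b = "e", "\u0301"                   # each string is NFC by itself
    s = a + b
    unicodedata.normalize("NFC", s) == s   # False: NFC composes it to 'é'
    len(unicodedata.normalize("NFC", s))   # 1, not len(a) + len(b)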

> It would impose complexity in cases where it's not needed. Most of the
> time you don't care which code points are combining and which are not,
> for example when you compose a text file from many pieces (constants
> and parts filled by users) or when parsing (if a string is specified
> as ending with a double quote, then programs will in general treat a
> double quote followed by a combining character as an end marker).

If you do so with a language that includes <, you violate the Unicode
standard, because < followed by U+0338 COMBINING LONG SOLIDUS OVERLAY and ≮
are canonically equivalent. You've either got to decompose first or look at
combining sequences as a whole instead of looking at code points.

Has anyone considered this while defining a language? How about the official
standards bodies? Searching for XML in the archives is a bit unhelpful, and
UTR #20 doesn't mention the issue. Your solution is just fine if you're
considering the issue on the bit level, but it strikes me as the wrong answer,
and I would think that it would be surprising to a user who didn't understand
Unicode, especially in the ≮ case. A warning either way would be nice.

I'll see if I have time after finals to pound out a basic API that implements
this, in Ada or Lisp or something. It's not going to be the most efficient
thing, but I doubt it's going to be a big difference for most programs, and if
you want C, you know where to find it.






Re: Nicest UTF

2004-12-06 Thread D. Starner
(Sorry for sending this twice, Marcin.)

"Marcin 'Qrczak' Kowalczyk" writes: 
> UTF-8 is poorly suitable for internal processing of strings in a 
> modern programming language (i.e. one which doesn't already have a 
> pile of legacy functions working of bytes, but which can be designed 
> to make Unicode convenient at all). It's because code points have 
> variable lengths in bytes, so extracting individual characters is 
> almost meaningless (unless you care only about the ASCII subset, and 
> sequences of all other characters are treated as non-interpreted bags 
> of bytes). You can't even have a correct equivalent of C isspace(). 
 
That's assuming that the programming language is similar to C and Ada. 
If you're talking about a language that hides the structure of strings 
and has no problem with variable length data, then it wouldn't matter 
what the internal processing of the string looks like. You'd need to 
use iterators and discourage the use of arbitrary indexing, but arbitrary 
indexing is rarely important. 
 
You could hide combining characters, which would be extremely useful if 
we were just using Latin and Cyrillic scripts. You'd have to be flexible, 
since it would be natural to step through a Hebrew or Arabic string as if the 
vowels were written inline, and people might want to look at the combining 
characters (which would be incredibly rare if your language already 
provided most standard Unicode functions.) 
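
A rough Python sketch of such a combining-character-aware iterator,
clustering each base character with its trailing combining marks (an
approximation of graphemes that ignores Hangul jamo and the other extended
cluster rules):

    import unicodedata

    def clusters(s):
        """Yield base-plus-combining-marks clusters from s."""
        buf = ""
        for ch in s:
            if buf and not unicodedata.combining(ch):
                yield buf   # a new base character starts a new cluster
                buf = ""
            buf += ch
        if buf:
            yield buf

    list(clusters("ba\u0308r"))  # ['b', 'a\u0308', 'r']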
 





Re: Unicode for words?

2004-12-05 Thread D. Starner
"Philippe Verdy" <[EMAIL PROTECTED]> writes:

> > Drop the part of the sentence before "then". A protocol could delete "the", 
> > "an", etc. right
> > now. In fact, I suspect several library systems do drop "the", etc. right 
> > now. Not that this
> > makes it a good idea, but that's a lousy argument.
> 
> If such a library does this, only based on the presence of the encoded words, 
> without wondering 
> in which language the text is written, that kind of processing text will be 
> seriously 
> inefficient or inaccurate when processing other languages than English for 
> which you will have 
> built such a library.

Many libraries have large numbers of books in English, French, German,
Spanish, Italian, and various non-Latin languages. Blanket stripping of a, an,
the, and la from the start of a title might very well be a good 90% heuristic
for removing non-sorting words from the start of titles. (German being the odd
man out, since you can't blanket remove a starting die.)

> For plain-text (which is what Unicode deals about), even the "an", "the", 
> "is" words (and so 
> on...) are equally important as other parts of the text. 

No. It all depends on what you want to do with the text.

Besides which, the point is that it doesn't matter whether or not words are
encoded as codepoints; these processes can work just the same.





Re: Unicode for words?

2004-12-05 Thread D. Starner
"Philippe Verdy" writes:

> Suppose that Unicode encodes the common English words "the", "an", "is", 
> etc... then a protocol 
> could decide that these words are not important and will filter them. 

Drop the part of the sentence before "then". A protocol could delete "the", 
"an", etc. right
now. In fact, I suspect several library systems do drop "the", etc. right now. 
Not that this
makes it a good idea, but that's a lousy argument.





Re: Unicode for words?

2004-12-05 Thread D. Starner
"Tim Finney" <[EMAIL PROTECTED]> writes:
> This would reduce the
> bandwidth necessary to send text.

Would it really? Ignoring all the other details (being limited
to English, for one), would words that might take up to six bytes
in UTF-8 really compete with the normal encoding, with most words
taking less than that? And that's for uncompressed text; if space
was really such a concern, you'd be compressing the text, so you
need to compare bzip2 or gzip or whatever the new compression is
on UTF-8 to this encoding, which would even it up quite a bit, if
past results mean anything. 
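
A back-of-the-envelope way to run that comparison in Python (the file
name is hypothetical): compress the UTF-8 text and weigh the sizes before
crediting the word encoding with any savings.

    import bz2

    data = open("sample-utf8.txt", "rb").read()  # any UTF-8 corpus
    len(data), len(bz2.compress(data))           # raw vs. compressed size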

Storing a table of several million words to convert text from the keyboard
to this encoding is going to eat up a lot of space, and many places where
smaller text sizes would be important wouldn't want to include 8 MB of data
and a CPU powerful enough to quickly compress and decompress this format.





RE: My Querry

2004-11-23 Thread D. Starner
"Mike Ayers" <[EMAIL PROTECTED]> writes:

> > What is wrong? That UTF-8 (born FSS-UTF) was designed to be 
> > compatible with C language strings?'
> 
>   Yes.  A character encoding can be compatible with ASCII or C
> language strings, but not both, as those two were not compatible to begin
> with.  

That doesn't contradict the statement. UTF-FSS was designed for Plan 9
to be used in C for Unicode strings. Whether or not it's "compatible"
with such strings is really beside the point.





Re: Opinions on this Java URL?

2004-11-14 Thread D. Starner
"Philippe Verdy" writes:

> Nulls are legal Unicode characters, also for use in plain text and since 
> ever in ASCII, and all ISO 8-bit charset standards. Why do you want that a 
> legal Unicode string containing NULL (U+0000) *characters* become illegal 
> when converted to C strings? 

Why do you need a nul? They're not exactly legal characters in plain text;
I know of no program that would do anything constructive with them in
plain text. A file with arbitrary control characters in it is generally
not a plain text file; an escape code certainly has no fixed meaning, and
where it does have meaning it does things, like underlining and highlighting,
that aren't exactly plain text.
 
> A null *CHARACTER* is valid in C string, because C does not mandate the 
> string encoding (which varies according to locale conventions at run-time). 

That's specious. The string encoding in C since time immemorial has generally
been a variety of ASCII or EBCDIC, both of which make the null character
the null byte.
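
The point is easy to demonstrate from a higher-level language: a string
containing U+0000 is legal Unicode, but it cannot survive the trip through
a NUL-terminated C API. A Python sketch (CPython rejects the string before
the C library ever sees it):

    import os

    s = "a\u0000b"       # legal Unicode string containing U+0000
    s.encode("utf-8")    # b'a\x00b': the 0x00 byte would end a C string
    try:
        os.stat(s)
    except ValueError as e:
        print(e)         # "embedded null ...": refused at the C boundary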

> Using pure UTF-8 in C strings would not be conforming to either Unicode or C 
> conventions because it will illegitimately restrict the legal embedding of 
> U+ in strings... 

That's nothing new; C has restricted the embedding of U+0000 in strings since
the very first compiler. ASCII is no different from UTF-8 here.

I've never seen code to make strings in C that hold nulls; I've never seen
anybody use that as a reason that Java or any other language was better than
C. The fact that you can't put NUL in a C string is both true and seemingly
moot. Java's solution for emitting it to a C string is creative and probably
useful for the situation, but should never have been written to disk.






Re: not font designers?

2004-11-08 Thread D. Starner
Michael Everson writes:
> I can't say that I care a fig any more whether you or Ms 
> Keown or Mr Snyder are happy with Phoenician. It is right to encode 
> it. It is wrong to consider it a font variant of Hebrew. 

That's sheer hubris. It's a classification scheme; if there are
reasonable people who would unify and reasonable people who would
separate, then there is no right and wrong, there's only a choice
to be made, which can be neither completely correct nor completely
wrong.





Re: not font designers?

2004-11-04 Thread D. Starner
"E. Keown" writes:
> Supposedly this list has >600 people. 
> 
> Just of curiosity, how many of you are NOT font 
> designers? 
> 
> And are any of your corpus linguists, text database 
> people, or maybe database designers? 

None of the above. I used to be a CompSci student, but
now the only connection I have to Unicode is as a sometimes
guru for Project Gutenberg's Distributed Proofreaders.





Re: UTF-8 stress test file?

2004-10-12 Thread D. Starner
"Philippe Verdy" <[EMAIL PROTECTED]> writes:
> Examples of bad assumptions that a reader could make: 
> 
> - [quote](...) Experience so far suggests 
> that most first-time authors of UTF-8 decoders find at least one 
> serious problem in their decoder by using this file.[/quote] 
> 
> This suggests to the reader that if its browser or editor does not display 
> the contained test text as indicated, there's a problem in that application. 

If you're a reader, not an "author[...] of [a] UTF-8 decoder", then I don't
see where that statement gives you cause to assume anything. It is indeed
a bad assumption on the part of the reader.

> So who's puzzling here? Not me! It's the content of the text itself. 

Funny; I've never been puzzled by the text of the document. It's obviously
designed to test the edge cases and the failure cases of a UTF-8 decoder.





Re: Stoefwritung in "Absent Voices"

2004-10-01 Thread D. Starner
> >every mark written was obviously 
> >phonetically distinct from any other, and size and vertical location 
> >were equally important. 
> 
> Eh? What sort of notion is this? 

"every mark" may be overstating it, but she had long tables of stuff
distinguished that every modern edition I've seen merges. Size and 
vertical location were apparently important to communicate matters of 
voice. 

> And what is it that "stoef" is supposed to mean? It's not in Clark 
> Hall & Merrit's dictionary, anyway. 

That may be my fault; I wasn't sure whether it was oe or ae. 
 
> You borrowed the book from a library? I hope so. 

Yes.
 
> I don't think I'd put this one into my wishlist, from your description. 

I wish I knew more about manuscript writing; I was hoping someone could
tell me if the foundation was firm or not.




Stoefwritung in "Absent Voices"

2004-10-01 Thread D. Starner
Has anyone read "Absent Voices", by Rochell Altman? Taking her description
of stoefwritung, it seems that Unicode needs a large block tentatively
set aside for Anglo-Saxon writing, as every mark written was obviously
phonetically distinct from any other, and size and vertical location
were equally important. After all, it was a universal writing system
clearly superior to the IPA. (For one thing, everyone can read Anglo-Saxon,
but IPA takes learning.) Also notable is her off-handed dismissal
of modern universal writing systems and universal languages: "In the 
computerized world of the late-twentieth century, the UNICODE Consortium
was trying to create a 'universal' computer character set." Another
quote is "Learning to speak a foreign language was as simple as learning
to read your native tongue with stoefwritung."

It amazes me that a book subtitled "The Story of Writing Systems in the
West" spends so much time on Anglo-Saxon, and that a book that claims
that a writing system is a universal system is about "the West", never
going east of Babylon and rarely east of Calais. 

For all my mocking, I must admit I've barely glanced through the book, and
it looks like there might actually be a wealth of real information about
Anglo-Saxon writing in there. I'm curious if anyone else has seen this book
and has comments.




Re: Saudi-Arabian Copyright sign

2004-09-22 Thread D. Starner
"Gerd Schumacher" writes:
> I think, it would make sense to have a tiny database of composable 
> characters, which are actually used, namely in orthography, and in 
> dictionaries like the Yorouba letters with dot below, the - 35, if I 
> remember well - unencoded Lithuanian composites, the underline below vowels, 
> marking long stressed syllables in German dictionaries, etc. 

Why dictionaries? Dictionaries have such a wide and varied usage of letters,
that I would put them in the category with linguistics and mathematics,
which have a virtually unbounded set of combined characters. Likewise
"actually used" is a bad name.

I'm not sure where you're getting tiny from. Even the list of letters
used in Native American languages is going to be pretty hefty. If it
were short, they would have just encoded them. If it starts to run 
into the thousands, it won't be very useful. 




Re: Saudi-Arabian Copyright sign

2004-09-20 Thread D. Starner
Eric Muller writes:
 
> it is hard to 
> avoid human work essentially proportional to the number of base+mark 
> *combinations* you claim to support. [...]
> 
> I have no problem with people taking those chances or deciding their 
> fonts are ok, or whatever. But I have a real problem if somebody else 
> claims that *I* must take those chances, or that *I* must do an amount 
> of work that is not justified by my commercial goals, or that *my* fonts 
> are broken if I decided to not support some combination, or that *my* 
> fonts are ok even if the result is below my standards. 

What's your point? All this is exactly the same whether it's encoded
as one character or just left as a heh followed by a combining circle. 
I didn't claim it would come magically out of nowhere.

People don't care about your commercial goals or your standards; they 
care about their goals and their standards. Your fonts are ok if they 
meet their needs and standards (and sometimes it's not how well the 
font dances, it's finding a font that will dance at all); your fonts 
are broken (for their purposes) if you don't support their combinations.
Frankly, just supporting the characters, whether or not everything is
perfectly aligned and maximally aesthetically pleasing, can be of great
help to someone who would otherwise be adding the accents in pen.




RE: Saudi-Arabian Copyright sign

2004-09-19 Thread D. Starner
Asmus Freytag writes:
> Given 
> the nature of the symbol in question, I would personally see no reason to 
> object 
> to encoding it - especially given the current and projected lack of 
> availability 
> of other alternatives. 

It's a simple combining character. Even if you can't do arbitrary circles
around characters, you can take one character sequence and map it to the
glyph in a font. Systems that can't do even that need to be fixed. 




Re: Unicode & Shorthand?

2004-09-19 Thread D. Starner
Christopher Fynn writes:
> One trouble is that OpenType shaping engines apply shaping features in a 
> script specific manner. 

Then OpenType is broken in this respect. Unicode constantly tells people
that they will have to use better technology instead of Unicode adding
something.

> If someone wants to make an OpenType font for a complex shorthand script 
> then he must have a user community in mind since this is not a trivial 
> task. 

Because someone wants to make a font doesn't mean they will. And because
someone does make a font, doesn't mean there's a user community for it;
the world is littered with 'better' mousetraps that no one wanted.




RE: Saudi-Arabian Copyright sign

2004-09-19 Thread D. Starner
Jorg Knappen writes:
> On Sun, 19 Sep 2004, Jon Hanna wrote: 
> > Looks like {U+062D, U+20DD} 
> 
> Yes, it does look like that. But it forms a separate entity, just like its 
> precedents COPYRIGHT SIGN or SOUND RECORDING COPYRIGHT SIGN or REGISTERED. 

And why aren't those precedents wrong? There's an endless stream of things
like these; I personally don't see any reason why we should encode each of
them separately. Especially for an Arabic symbol, since its users are probably
running systems with the sophistication to combine U+062D and U+20DD already.





Re: Unicode & Shorthand?

2004-09-18 Thread D. Starner
Christopher Fynn <[EMAIL PROTECTED]> writes:
> Shorthand symbols are of course printed in books on shorthand :-)

But as images, not text. There's likely to be arrows, showing the
directions, and any changes to glyph form are likely to be errors.

> Stenotype and similar machines also produce shorthand symbols which you 
> might want to store in data files for transcribing.

Do stenotype machines produce shorthand symbols? What I've seen on
TV seems to produce Latin letters, and the keyboard image found through
Google had Latin letters on it.

In any case, that's possibly a valid case but it would be nice if the 
people who had such data were actually saying they were interested. 
 
> Different shorthand systems seem to work differently - some appear to be 
> more or less phonetic, others seem to have symbols for frequent words.

That's another problem. The wikipedia article on Gregg Shorthand lists
six different versions, and that's just Gregg shorthand. 

> This came up because someone says they want to make an OpenType font for 
> Gregg shorthand symbols -  which made me wonder which script block you'd 
> map the glyphs to as the symbols often don't correspond directly to 
> Latin characters.

I'd map them to a private use area. IPA should be used for IPA, not any
random phonetic transcription that may or may not match the way the IPA
breaks down speech.

There may be cause to encode shorthand, but how many people really want
to store text in shorthand? And which shorthands?




Re: Unicode & Shorthand?

2004-09-18 Thread D. Starner
Christopher Fynn wrote:
> Is there any plan to include sets of shorthand (Pitman, Gregg etc.) 
> symbols in Unicode? Or are they something which is specifically excluded? 

They're a form of handwriting, which is generally excluded. Why do
they need to be encoded in a computer? General practice, at least,
is to transcribe them into standard writing first. 




Re: Combining across markup? (Was: RE: sign for anti-neutrino - greek nu with diacritical line above workaround?)

2004-08-10 Thread D. Starner
Peter Kirk writes:
> That one is easy: this is the closing tag followed by a combining 
> solidus. The difficult case is if the parser encounters a not greater 
> than symbol. The parser will need to know to decompose such characters 
> first, but then a good parser would always need to do that. 

So all existing XML emitters should be changed, to make sure that not
less than symbols and not greater than symbols are escaped? If I were
writing an XML document with math content, and added a not less than
symbol, I would be sorely surprised to find it starting a tag. Being
a Unicode geek, I could figure it out, but I bet many mathematicians
wouldn't. Letting not less than symbols open tags would be a big
mistake.




Re: Combining across markup? (Was: RE: sign for anti-neutrino - greek nu with diacritical line above workaround?)

2004-08-10 Thread D. Starner
Philipp Reichmuth <[EMAIL PROTECTED]> writes:

> Jon Hanna schrieb:
> > The W3C Character Model does not, or will not since it's not yet a
> > Recommendation, allow text nodes or attribute values to begin with defective
> > combining character sequences.
> 
> What am I supposed do when I need a black a with a red macron?  Or for a 
> less obscure example, an Arabic text with the letters correctly ligated, 
> in black, and the vowel marks in another colour, such as in practically 
> *any* printed edition of the Koran?

Use PDF files, or images. The W3C Character Model is not the end all and be
all of Unicode text processing, but there's a certain limit to what HTML will
do. 





Re: Looking for transcription or transliteration standards latin->arabic

2004-07-09 Thread D. Starner
Michael Everson writes:

> I don't agree that Dvorak is "the English name" 
> for the composer. But I don't agree that "façade" 
> is correctly spelled in English without the ç 
> either. 

The Society for Pure English disagreed:

"We still borrow as freely as ever; but half the benefit of this 
borrowing is lost to us, owing to our modern and pedantic attempts 
to preserve the foreign sounds and shapes of imported words, which 
make their current use unnecessarily difficult. Owing to our false 
taste in this matter many words which have been long naturalized 
in the language are being now put back into their foreign forms, 
and our speech is being thus gradually impoverished. This process 
of de-assimilation generally begins with the restoration of foreign 
accents to such words as have them in French; thus ‘role’ is now 
written ‘rôle’; ‘debris’, ‘débris’; ‘detour’, ‘détour’; ‘depot’, 
‘dépôt’; and the old words long established in our language, 
‘levee’, ‘naivety’, now appear as ‘levée’, and ‘naïveté’."





RE: Looking for transcription or transliteration standards latin->arabic

2004-07-08 Thread D. Starner
> transliteration is no longer needed or useful. Transliteration 
> is a one-to-one mapping between scripts, and the reader needs to be familiar 
> with both scripts and the transliteration rules to make sense of it. 

That's not true. Looking at Wright's Historical German Grammar, I 
see "Goth. baírand, OHG. bërant=Skr. bháranti." It would be illegible
to me, and probably to many Germanists, if it were written in three
scripts instead of one. Using foreign scripts is rarely of help to
the casual reader, especially in the frequent cases where it's not
important that the reader understand the details of the transliteration scheme.




Re: Bharathi Lipi

2004-07-01 Thread D. Starner
> We, K. Kasturi & G. Kasturi have devised a script which is common 
> to the 12 principal languages of India. After a comparitive study 
> of the alphabet of the languages, we now present a common font for 
> them. We call it the "Bharathi Lipi". Details of the font, the 
> methodology employed in devising it, the incorporation of the font 
> into a standard ASCII keyboard, and sample sentences in each language 
> presented in Bharathi font are given in our website at:

This is a subject of some interest to many of us. For example, Doug Ewell
has created Ewellic, primarily for English. My favorite Indian script is
Nikhilipi. However, as Unicode will be
in use for an indefinite period of time, quite possibly centuries, we'd like
to avoid filling it with characters that nobody will use. As such, we prefer
to wait until Bharathi Lipi has seen actual use, like books printed in the
script, before encoding it. Until that time, the Private Use Areas have been
created in part for this purpose, so you can use Bharathi Lipi before it
gains widespread usage.




Re: letter names for Old Hungarian Runes

2004-06-19 Thread D. Starner
"Doug Ewell" <[EMAIL PROTECTED]> writes:

> The good thing is that character names are not prescriptive.  What a
> character ends up being called does not influence or restrict its
> potential usage.

Why do you think that's true? People use characters all the time based
on their names. The fact that Unicode doesn't change them after encoding
makes it all the more important to name them right in the first place.
 
> > (Actually, by some mistake the latter seems to be absent from the
> > present proposal. I hope these data are a convincing argument, that
> > it is actually necessary to include it. A .gif of the letter form
> > (as.gif) is accessible on the page
> > http://fang.fa.gau.hu/~heves/abc/abc.html )
> 
> AS was listed as a ligature in the earlier proposal, N1686. 

That would be incorrect if modern users consider it a letter.
 






RE: Bantu click letters

2004-06-10 Thread D. Starner
"Mike Ayers" <[EMAIL PROTECTED]> writes:

> > >  I'm not
> > > even sure you can trust a commissioned font to be 
> > installable on the operating
> > > systems of the next few decades.
> 
>   Font support has only improved with time.  What causes you to
> foresee a sharp reversal?

I don't expect a reversal; but if I commissioned a Type-1 font 15 years
ago, I'd have a hard time installing it on a lot of computers nowadays.
Just because OpenType is common now, doesn't mean that everyone will
support it in 20 years.






Re: Bantu click letters

2004-06-10 Thread D. Starner
> But Gutenberg may not care: they mostly (now exclusively?) publish texts
> in the public domain.

We publish anything previously published we can get permission on, but since 
we can't afford to pay for anything, we're primarily public domain. In any
case, we have decades of the Reports of the Bureau of American Ethnology
plus many more public domain works of linguistics, so we really don't need to 
ask for more text.

(This is really getting off topic, though.)





Re: Bantu click letters

2004-06-10 Thread D. Starner
> Simply because some images appear in some
> documents does not mean that they automatically should be represented as encoded
> characters.

These aren't images. They're clearly letters; they occur in running texts and represent
the sounds of a spoken language. If I were transcribing them, I wouldn't encode them 
as pictures; I would encode them as PUA elements or XML elements (which are usually
easier to use and more reliable than the PUA). I don't think any transcriber would
treat them as images (maybe display them as images, but that's purely presentational.)

I'll admit that it's a bit sketchy encoding these characters based on one article by
one author. But I think it important to remember that more and more text is available
online, even stuff that might never get reprinted in hardcopy, and that needs Unicode.





Re: Bantu click letters

2004-06-10 Thread D. Starner
John Cowan <[EMAIL PROTECTED]> writes:

> We must be talking past one another somehow, but I don't understand how.
> To represent the text as originally written, I need a digital representation
> for each of the characters in it.  Since all I want to do is reprint
> the book -- I don't need to use the unusual characters in interchange --
> the PUA and a commissioned font seem just perfect to me.

But that doesn't work if you're reprinting to XML or HTML, where you can't
rely upon a commissioned font being installed and correctly used. I'm not
even sure you can trust a commissioned font to be installable on the operating
systems of the next few decades.






Re: Bantu click letters

2004-06-10 Thread D. Starner
> > Due to the latest US
> > copyright extensions, it will take us a couple decades, but we'll want
> > to transcribe this article.
> 
> In 2050.  I wouldn't worry about it.

It's 95 years from publication, so it's 2022. In any case, it's entirely likely
that some commercial organization will license these and start digitially transcribing
old linguistics documents for sale to libraries. And I hardly see how the issues will
change in the next 18 years.






RE: Bantu click letters

2004-06-10 Thread D. Starner
"Peter Constable" <[EMAIL PROTECTED]> writes:

> If
> the small n with left loop is not accepted, it will be because it was a
> proposal that never gained currency and has no user community.

There's at least a small user community; those people who are actively
transcribing old works, like Project Gutenberg. Due to the latest US
copyright extensions, it will take us a couple decades, but we'll want
to transcribe this article.






Re: Revised Phoenician proposal

2004-06-06 Thread D. Starner
> (As a rhetorical device,) I have to say that I'm puzzled by this. All 
> I've seemed to hear from Semiticists is that Phoenician is not a 
> separate script. How, then, can these same Semiticists be the major 
> users of something that doesn't exist?

There's a big difference between Phoenician not being a separate script
from those already encoded in Unicode, and it not existing. It certainly
exists as a script variant, like Fraktur.

In that sense, treating Phoenician as a script variant of Hebrew is a big
win for many of the users of the script, since they would have a hard time
deciphering the bizarre (to them) script variant but have no problem reading
texts originally written in it in different fonts. 





Re: Revised Phoenician proposal

2004-06-06 Thread D. Starner
> Scholars of Semitic languages do *not* have a monopoly on the heritage 
> of ancient writing systems. There are other people in the world besides 
> them (a few), 

"The heritage of ancient writing systems." All of a sudden these letters
are incredibly important (despite the fact you could take every class
some major universities offer and not hear word one about them), and suddenly
all these people who don't know anything about Phoenician have a huge vested
interest in the matter.

Let's be honest; the only people who matter in the least when discussing
a script is the people who actually use it. And all evidence presented here
indicates that scholars of Semitic languages--that is, the people who can
actually read the stuff written in the script--are, not surprisingly, the
majority users of Phoenician. 

> and some of them wish to use Phoenician letters distinctly 
> from Square Hebrew, and their desires and needs are *EVERY* *BIT* as 
> important as those of your precious Semiticists.  

No, they aren't. The people who use the script are the most important
concern.

> since the scholars in question demonstrably do 
> NOT need a single encoding: they've been managing okay without one for 
> quite some time. 

Like everyone else in the world. By that reasoning, we shouldn't have
bothered with Unicode. 

> there will not be a unique encoding in use by Semitic scholars for a 
> *long* time, whether or not Phoenician is ever encoded).

Just like there isn't for Russian. Does that mean it's suddenly all right to
separate the Russian p and the Serbian p?






Re: Definition of Script etc.

2004-05-30 Thread D. Starner
Christopher Fynn <[EMAIL PROTECTED]> writes

> D. Starner wrote:
> 
> >So are we going to encode the Japanese, Fraktur and Farsi scripts?
> >Users of those scripts have been told they can just use a different
> >font.
> >  
> >
> No - and no one is seriously proposing that these are  scripts in the 
> sense used in iso10646.

I've heard Japanese so proposed repeatedly. I've also heard, and agree
with, the arguments that IPA is a script in the sense used in iso10646.
It's just not as simple as saying that every script that is seriously
proposed should be accepted.





Re: Definition of "Script" etc. (was: Re: Phoenician & Kharoṣṭhī proposals)

2004-05-30 Thread D. Starner
Christopher Fynn <[EMAIL PROTECTED]> writes:

> Telling people who propose a script  that they  can "just use a 
> different font "  could very easily contradict this stated goal.

So are we going to encode the Japanese, Fraktur and Farsi scripts?
Users of those scripts have been told they can just use a different
font.





RE: PH technical issues (was RE: Why Fraktur is irrelevant

2004-05-28 Thread D. Starner
"Peter Constable" <[EMAIL PROTECTED]> writes:

> Alternate scenario (desireable):
> 
> The editor receives submissions as described above. Because Phoenician
> script and Hebrew script are encoded distinctly, there is never any
> concern as to how text provided to reviewers will appear. She saves many
> hours of work both in preparing submissions for reviewers and in final
> typesetting. Embarrassing errors and the need to publish corrigenda are
> significantly reduced.
> 
> 
> Now tell me that's an unrealistic or trivial scenario.

“The unification of these alphabets into a single Old Italic script 
requires language-specific fonts because the glyphs most commonly 
used may differ somewhat depending on the language being 
represented.” — The Unicode Standard, “Old Italic”, page 336.

“For actual use, it might be advisable to use a separate font for each
Runic system.” — Ibid, “Runic”, page 342.

I’d say it’s an unrealistic scenario.





RE: PH technical issues (was RE: Why Fraktur is irrelevant

2004-05-28 Thread D. Starner
"Peter Constable" <[EMAIL PROTECTED]> writes:

> But, as we have already seen, only *some* scholars, and the overall user
> community includes many people other than paleography scholars.

Shouldn’t the encoding be geared towards those who use it the most?
So far, all the people who actually use this script on a day to day basis
who have actually spoken up have been in favor of unification. (I may
be mistaken; it’s been a long thread.) 






RE: PH technical issues (was RE: Why Fraktur is irrelevant

2004-05-27 Thread D. Starner
"Peter Constable" <[EMAIL PROTECTED]> writes:

> > From: D. Starner [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, May 27, 2004 5:16 PM
> 
> [David replied to me off-list, but as there's nothing particularly
> private or controversial, I'm taking the liberty to respond on list, as
> it seems relevant for the thread.]

My fault. 
 
> > * A comparable discussion could appear involving Fraktur and Latin
> characters
> > and Chao and Chang.
> 
> I agree, but only somewhat. I think those situations are probably not as
> representative of the casual-, non-specialist-user scenario, and that in
> that case Sally and Latisha are probably more likely to be paying close
> attention to the fonts being used. Even for the non-specialist
> situation, in a Fraktur/Antigua case (the Chao vs Chang is definitely
> out at least for *non-Asian* non-specialists), Sally is telling Latisha,
> "Make sure it shows up with those dark, old-English-looking characters",

That was the point of Chao vs. Chang. Surely there's some group of students
that might need to display Fraktur characters in a school report on writing 
who aren't readily familiar with the normal Latin script.

> > * Sally probably won't have a Phoenician font, so this fails
> > no matter what Unicode decides.
> 
> Well, if Phoenician is to be encoded in the 05xx block, you're right. If
> it's encoded separately and platform or word-processor vendors bundle
> fonts that provide coverage for various ranges of Unicode, then she very
> well may have a Phoenician font. 

I'm not familiar with fonts for most of the Plane 1 characters, except for
Code2001. I imagine there are commercial fonts for many Plane 1 scripts, but
I doubt they'll show up in Windows or MacOS in the near future.
 
> > * If they did use a Phoencian font, they could still be surprised
> mid-presentation
> > when they discover the school's computers don't have a Phoenician font
> installed.
> 
> Certainly true; but that is an independent cause that could just as well
> be used to argue for not encoding any new script. You might as well say
> because the school's computer didn't have Arabic fonts there was no
> reason to encode Arabic. So, I think it's not a relevant
> counter-argument.

The school's computer quite possibly doesn't have Arabic, and that it is a
good reason not to encode Arabic _for Sally and Latisha's sake._





Re: PH technical issues (was RE: Why Fraktur is irrelevant

2004-05-27 Thread D. Starner
> Is nobody willing to give
> acknowledgement to the problems presented?

Scholars often need to separate text by the particular
script the text was written in, often down to the
very scribe. That's done by storing it in some sort
of tagged format, and having your search system
let you select based on the script--trivial in most
database systems. Phoenician and Hebrew are just a bit
broader than most distinctions.





Re: New Public Review Issue posted

2004-05-26 Thread D. Starner
"Mark Davis" <[EMAIL PROTECTED]> writes:

> Why modifier letters -- those are not really
> superscripts. Waw?
 
Last time I went looking for Modifier Letter Small N,
I decided it was encoded as U+207F, SUPERSCRIPT LATIN SMALL
LETTER N. If it's not, pretty much every variant of n has
been encoded as a modifier letter, except for the basic small
letter.




Re: Response to Everson Phoenician and why June 7?

2004-05-25 Thread D. Starner
"Mark E. Shoulson" <[EMAIL PROTECTED]> writes:
> Yeah, I've wondered about this.  I've said it before: if you put my back 
> to the wall, I really don't think I could defend the disunification of 
> U+0041 LATIN CAPITAL LETTER A and U+0410 CYRILLIC CAPITAL LETTER A.  But 
> that's why they don't put me on the UTC.

The simplest answer is source separation. Moreover, there have been at least
a dozen Cyrillic character sets, and to the best of my knowledge, every one of
them disunified Latin and Cyrillic, including the most commonly used ones, so
the desires of the people who write Russian are clear. The decision on how to
encode Cyrillic was made before Unicode was even a dream, and Unicode had no
option but to follow.




Re: Response to Everson Phoenician and why June 7?

2004-05-24 Thread D. Starner
[EMAIL PROTECTED] (James Kass) writes:

> And we use language tagging in plain text how?

I seem to remember the Japanese asking that. And I seem to remember
Unicode encoding the Plane 14 tags for that. And I seem to remember
people saying that if you want language tagging, you shouldn't
be using plain text. 






Re: Response to Everson Phoenician and why June 7?

2004-05-24 Thread D. Starner
[EMAIL PROTECTED] (James Kass) writes:

> Guessing's not their job.  It's up to a sophisticated search
> engine to find what users seek.  Some of us have tried to
> dispel some of these fears by pointing out possible solutions.

The exact same search engine can search among Fraktur and
Roman scripts, too. Unicode shouldn't add to the complexity
of systems, except where necessary.

> the idea that "complex scripts"
> couldn't even be *displayed* didn't stop them from being
> encoded as complex scripts in the standard.

That's because that's what they were, and that's how they
needed to be encoded for proper handling.

> Can a Sanskrit scholar find Sanskrit text on-line if the search
> string uses Devanagari characters and the on-line text is in
> a different script? 

Probably not; and I don't see that feature being added in the
near future. You couldn't help the Sanskrit scholar without
hurting more important groups, but the Phoenician scholar is
the most important group using Phoenician.

> > Because plain-text distinction 
> > of script variant text in the same language is just 
> > about the least important thing in their work?
>
> Because they've never had the ability to do this in the past?

But they have. They could have printed in a Phoenician font,
but they chose modern Hebrew fonts, just like the Middle English scholar
uses modern English fonts.

> Because it's there?  If Sir Edmund Hillary (hope the name's spelled
> right) had awaited some kind of an epiphany revealing a better 
> reason, would he have ever made it to the top?

Klingon is there too. So is Ewellic. Neither would cause any problem with
the standard, or have anything debatable about structure or encoding.
If we're going to start encoding stuff because it's there, maybe we should
start with stuff that doesn't get in other people's way?




Re: PH as font variant of Hebrew (was RE: Response to Everson Phoenician and why June 7?

2004-05-24 Thread D. Starner
> - for the non-Semiticist interested in PH but not Hebrew, searching for
> PH data in a sea of Hebrew data (if they are unified) is all but
> impossible.

But that's true for every two uses of a script. I can't search for German or 
Irish in a sea of English data, or Japanese in a sea of Chinese. I guess
considering the close relation of the two, I should say I can't search for
Norwegian Nynorsk in a sea of Bokmål.




Re: Bangla

2004-05-24 Thread D. Starner
Philippe Verdy <[EMAIL PROTECTED]> writes:

> May be there's OCR working with Hangul basic Jamos (written linerarily,
> instead
> of with syllabic squares).

I question heavily the rest of this email, considering that it took less
than a minute to hit Google and find there are several commercial Korean
OCR programs that handle syllabic squares (anything else would be worthless).
One of them claims 99% accuracy and compatibility with Windows 3.1/95, so
it's not a new thing, either.




Re: Fraktur yet again (was: Re: Response to Everson Phoenician and why June 7?)

2004-05-23 Thread D. Starner
> I absolutely DO disagree with the premise that lots of people would use
> a separate Fraktur encoding. 

I would use it when transcribing works that mix Fraktur and 
Latin constantly, or when there's only a quote or a couple letters in Fraktur. 
Sure, a lot of people would transcribe their texts into Latin, but I
think it's been established that that doesn't mean a script shouldn't be
encoded, nor that people wouldn't use it.

(Not that I actually encourage encoding Fraktur, but modern systems seem to
lack the ability to switch between Fraktur and Roman fonts the way you switch
between Roman and italic fonts. HTML doesn't even include a generic Fraktur
font-type.)




Re: Bangla

2004-05-23 Thread D. Starner
> How bangla ocr can be developed using current unicode?

OCR doesn't depend on the character encoding. Like any other OCR,
you need to develop a glyph collection for the OCR to translate to, and
then map that glyph collection to underlying characters, in whatever
character encoding is used.
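
As a purely hypothetical sketch of that last mapping step (the glyph-class
names are invented; the codepoints are Unicode Bangla), in Python:

    # The recognizer emits glyph classes; a separate table maps each
    # class to the underlying character sequence.
    GLYPH_TO_CHARS = {
        "glyph_ka": "\u0995",               # BENGALI LETTER KA
        "glyph_kka": "\u0995\u09CD\u0995",  # kka conjunct: ka + virama + ka
    }

    def glyphs_to_text(glyphs):
        return "".join(GLYPH_TO_CHARS[g] for g in glyphs)

    print(glyphs_to_text(["glyph_kka", "glyph_ka"]))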

Perhaps if you rephrased your question, someone could provide a more
helpful answer.





RE: interleaved ordering (was RE: Phoenician)

2004-05-13 Thread D. Starner
> > If the input is in
> > multiple (Indic) scripts, and let's assume that the audience
> > (which may be a single person just asking for an sorted list
> > of his/her files) can read the Indic scripts used, it may be
> > helpful to interleave. (But I will not push this.)
> 
>   Now let's asume that person can't read all the scripts.  Then they
> get lots of unintelligible garbage in their sort.  This, and the upside is
> "may be helpful".  Which side did you say you're making the case for?

Garbage in, garbage out. If you didn't want unintelligible garbage in the
output, you shouldn't have put it in the input, and no sort procedure is
going to remove it. The user that can't read all the scripts is not an 
interesting person here, because it doesn't really matter to them if the
garbage is interfiled or at the end.

What's the actual usage pattern for multi-lingual sorts? Possibly the most 
common case, IMO, is a collection of Serbian or Tibetan or Sanskrit or Hebrew 
data in mixed scripts; the most convenient thing to do there is to interfile.
Another common case is computer directory listings in English & some other 
language, which should probably be separate; but that's Latin, which is out
of the scope of this discussion. Again, a Serbian user would probably like
Latin and Cyrillic interfiled, and someone working on paleo-Hebrew or Sanskrit
would probably like their characters interfiled. I've never seen a multi-script
index; is there any real legacy behavior here, besides computer programs which
were forced to do something?




Re: Coptic/Greek (Re: Phoenician)

2004-05-12 Thread D. Starner
"Doug Ewell" <[EMAIL PROTECTED]> writes:

> Peter Kirk  wrote:
> 
> > Because each such case has to be judged on its individual merits,
> > according to proper justification and user requirements. There can be
> > no hard rules like "always split" or "always join".
> 
> Nobody, neither Michael nor anyone else, ever advocates such a rule.

But that such a rule exists is exactly what Patrick implied when he asked
how you can support both the Hebrew/Phoenician unification and the
Coptic/Greek unification.




Re: Hooks and Curls and Bars, oh my (was: New contribution)

2004-05-06 Thread D. Starner
"Ernest Cline" <[EMAIL PROTECTED]> writes:

> I wasn't aware that there was an ARABIC SMALL LETTER D to add
> a curl to, 

There wasn't a Devanagari question mark to make a glottal stop
out of, but the Latin glottal stop was added to Devanagari anyway.

> Still, even with potential glyph unifications of distinct characters,
> if Phoenician is unifiable with Hebrew, one should be able
> to come up with a system for Phoenician that would incorporate
> the richness of the Hebrew point and cantilation marks.  I'm not saying
> that it can't be done; I don't know enough about the scripts to say that.
> I am saying that unless it can be done, unification would be a mistake.

The points are just accents; add them to the Phoenician characters at about
the same locations. When descenders get in the way, either move the accent
or transform it, à la LATIN SMALL LETTER D WITH CARON. If anyone really
cares, they could be added without problem, whether or not Phoenician is
unified with Hebrew.





Re: Dorsey's Turned C-cedilla

2004-05-06 Thread D. Starner
"Anto'nio Martins-Tuva'lkin" writes:

> On 2004.05.05, 12:10, D. Starner <[EMAIL PROTECTED]> wrote:
> 
> > He uses a turned c -- that is, an open o. But he also uses a turned
> > c-cedilla.
> 
> I'd guess that typographically it is indeed a turned c-cedilla, a hack
> for an open o with "something" above -- maybe an acute or whatever he
> used to differentiate usual vowels...

It's not a vowel, and he didn't seem to have a problem with acutes. Perhaps
it's inaccurate to call it an open-o: c and ç are ʃ and θ respectively, and
ɔ and this new character are the medial/sonant-surd forms of their unturned
versions. But turned-c is LATIN SMALL LETTER OPEN O, IMO.




Re: Just if and where is the then?

2004-05-06 Thread D. Starner
"African Oracle" <[EMAIL PROTECTED]> writes:

> By the time we get to know one another very well, people will notice some
> humour in my style of writing. Like I pulled Peter's leg, I was pulling
> yours with that question. Smiling with you.

Would you mind cutting it out? I get enough mail on this mailing list
already, without someone trolling on it.




Re: Hooks and Curls and Bars, oh my (was: New contribution)

2004-05-06 Thread D. Starner
"Ernest Cline" <[EMAIL PROTECTED]> writes:

> In dubious hopes of ending this argument, let me offer up the following
> thought experiment.  Normal Latin script, Gaelic, and Fraktur while they
> have all diverged to a certain extent, have not diverged to the point
> where additions made to one of them is unimplementable on the other.
> To wit, altho the various hooked, curled, and barred letters added to
> the normal Latin script to accommodate other languages could be
> implemented in Gaelic or Fraktur.  LATIN SMALL LETTER D WITH CURL
> would look peculiar in Gaelic, but it looks peculiar in normal Latin too,
> and it would be distinctively recognizable as such to anyone who
> knew both Gaelic and normal Latin.

I don't buy it. If you are creative enough, of course you can add new
letters to any script. I could add LATIN SMALL LETTER D WITH CURL to
Arabic, too. On the flip side, there are many Latin letters that don't
fit with all Latin fonts. When U-breve was written in Sütterlin, it
was effectively unified with U. When the IPA tap is written in my personal
handwriting, it's unified with r. Many fonts write i and g like dotless i
and script g. As it is, many Latin letters require a specialized font to
be distinguished from other Latin letters.




Re: New contribution

2004-05-05 Thread D. Starner
Dean Snyder <[EMAIL PROTECTED]> writes:
> Mark E. Shoulson wrote at 12:11 AM on Wednesday, May 5, 2004:
> 
> >The Samaritan newsletter A-B is available both in Square Hebrew and in 
> >Samaritan-script editions.
> 
> Which, by the way, is an argument AGAINST encoding Canaanite/Phoenician
> separately from Hebrew AND encoding Samaritan separately from Hebrew.

It's hardly unheard of to publish documents in multiple scripts. As mentioned
here, Project Rastko publishes all its Serbian texts in Latin and Cyrillic,
and has plans to do the analogous thing for Ottoman Turkish and Moldavian
and other languages like that.

For another example, the Klingon Institute publishes its newsletter online 
in both Klingon script and Latin transliteration. (-: (-: (-:




Dorsey's Turned C-cedilla

2004-05-05 Thread D. Starner
I'm compiling a proposal for the characters used by Dorsey in the BAE,
and I came across a problem. He uses a turned c -- that is, an open o.
But he also uses a turned c-cedilla. Should it be encoded as a new character,
a turned c-cedilla? Or should a turned combining cedilla be encoded, or
is U+0312 just that? (If it were my language, I wouldn't be happy with 
U+0312, but I doubt anyone is attached enough to Dorsey's orthography to
care about the difference.)

While I'm at it, I'll probably have a rough PDF in the next 24 hours,
but no webspace. Will someone do me the courtesy of hosting it?




Re: New contribution

2004-05-05 Thread D. Starner
> Vietnamese in Sütterlin ought to be an interesting challenge, because
> (as those of you who can read Sütterlin know) the 'u' has a breve over
> it to distinguish it from an 'n', and in Vietnamese the letter 'a' can
> have a real breve over it, but 'u' cannot.

That wouldn't be a problem once you knew the script. Esperanto has a u
and u-breve which became indistinguishable in Sütterlin, which was occasionally
used to write Esperanto by German Esperantists.





Re: New Contribution: In support of Phoenician from a user

2004-05-04 Thread D. Starner
Peter Kirk writes:
> Resending this and several other messages which I sent about 24 hours 
> ago, I thought before the server was supposed to have been switched off, 
> but which don't seem to have appeared on the system.

They appeared on my system. I can tell you, there's nothing I enjoy more
than reading verbatim repeated arguments in this debate rather than merely
rehashed ones.




Re: New contribution

2004-05-04 Thread D. Starner
> A possible question to ask which is blatantly leading would be:
> 
>  Would you have any objections if your bibliographic database
>  application suddenly began displaying all of your Hebrew
>  book titles using the palaeo-Hebrew script rather than
>  the modern Hebrew script and the only way to correct
>  the problem would be to procure and install a new font?

Again, change Hebrew to Latin and palaeo-Hebrew to Fraktur and see 
how many objections you get. Again, no, you can't use archaic forms
of letters in many situations, but that doesn't mean they aren't
unified with the modern forms of letters. No one would have to procure
and install a new font, because Arial/Helvetica/FreeSans/misc-fixed
have the modern form of Hebrew and will always have the modern form
of Hebrew and all other scripts that have a modern form.

I mean, maybe you're right and Phoenician has glyph forms too far from
Hebrew's to be useful, and it's connected with Syriac and Greek as
much as Hebrew, but this argument just doesn't fly.




Re: New contribution

2004-05-03 Thread D. Starner
> the 
> argument that despite how complex Square Hebrew has become with it 
> signs and diacritics and stretched letters and alef-lamed ligatures 
> and Yiddish ligatures

The Latin alphabet has 23 letters, IIRC. The Latin alphabet as
encoded in Unicode has hundreds of letters, including many caseless
letters and diacritics of all sizes and shapes and Fraktur ligatures,
but it's still unified with the alphabet that Virgil used.

> If you people, after all of this discussion, can think that it is 
> possible to print a newspaper article in Hebrew language or Yiddish 
> in Phoenician letters, 

1) Of course it is. Even if it is encoded as a separate script, we
can always transliterate the text.

2) Of course it isn't. Newspapers appear in an incredibly limited
variety of fonts. You wouldn't sell a newspaper in English in a 
Fraktur or Gaelic font, or probably even in a sans-serif font.

> then all I can say is that understanding of 
> the fundamentals of script identity is at an all-time low. I'm really 
> surprised.

This looks to me like a textbook example where two scripts should
be unified, and none of the things you appeal to seem to be a 
factor in any other script unification or disunification. I 
understand that sometimes you have to go beyond the textbook, but
I'm surprised by the people who seem to have trouble understanding
why people would argue against it. 

> You can map one-to-one to Hebrew? So what? You can map 
> one-to-one to Syriac and Greek, and probably others. 

Greek has 24 characters, with different character properties (such
as being LTR).  No mapping could be bijective. Syriac is cursive. 
Hebrew has the same 22 characters, with the same character properties.




Re: New contribution

2004-05-03 Thread D. Starner
> Phoenician script, on the other hand, is so 
> different that its use renders a ritual scroll 
> unclean. 

And I've got Latin fonts, whose use will render a Bible unclean.
(Might come in handy for Tantric religious works, though.) More
seriously, I imagine some German religious communities were very
strict on the Bible in Fraktur instead of a radical new Roman font.
It still doesn't mean they're separate scripts; it just means that
they are picky about how their religious texts are presented.




Re: Nice to join this forum....

2004-05-03 Thread D. Starner
> Yes, I have looked at the code and in fact used the Microsoft Keyboard
> Layout without any success. One thing I observed is that since the
> characters are not drawn with the accent assigned where they should be, at
> low font sizes they are a disaster. Using Fontlab to design the fonts and
> assigning codes the way they appear on the link only generates two characters
> in the font table.
> 
> I think it will be better if they are drawn out, which I can do, and
> appropriate codes assigned by Unicode.

Unicode will not allocate any more codes for characters that can be made
precomposed, as it would disrupt normalization. Others can better tell
you how to get the job done with what you have.
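
A quick illustration of why, in Python: q plus combining acute has no
precomposed form, and normalization stability requires that it stay that way.

    import unicodedata

    # q + COMBINING ACUTE ACCENT has no precomposed codepoint, so NFC
    # leaves the sequence decomposed. Allocating a precomposed q-acute
    # now would change the NFC of already-normalized text.
    seq = "q\u0301"
    print(unicodedata.normalize("NFC", seq) == seq)  # True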

As for the GB ligature, that might actually get encoded if you can
provide sufficient evidence for it.




Re: Arid Canaanite Wasteland

2004-05-02 Thread D. Starner
> > So now if you think that two scripts that are isomorphic and closely related
> > should be unified, then you're exerting "political pressure"?

> Since no rational basis for the heated objections to the proposal
> seems apparent, "political pressure" appears to be a likely choice.

Excuse me? This is a 22-character script with one-to-one correspondence with a
preëncoded script, that uses the same sounds as that script and even the
same spelling in the major languages that use that script, and which people
who work with the older version generally encode in the newer version and 
print in the modern style. (This differs from the difference between Fraktur
and Roman fonts because there isn't an exact one-to-one correspondence between
the two; the long-s disappeared and the eszett was added.) That sounds like
a font difference to me.

Now, there definitely is a rational basis for the proposal, but if you think
the objections have no rational basis, you need to grow up and pay attention to
what other people are saying instead of dismissing it off-hand as "political
pressure". 








Romanian and Cyrillic

2004-05-02 Thread D. Starner
I posted this message to the message boards of Distributed Proofreaders-Europe
(a joint effort of Project Rastko and Project Gutenberg),
and got this response from one of the site admins.


> nikola wrote:
> Haha, Romanian used Cyrillic up to the 19th century, so sooner
> or later we WILL have Romanian books in Cyrillic here

Nikola, David refers to the Moldavian situation, which is a little
bit different from the situation in the modern Romanian state since its formation.

David, here are some preliminary thoughts:

> Prosfilaes wrote:
> From the Unicode mailing list:
> Quote:
>> Since we're talking about Romanian...
>>
>> Prior to 1991, the Soviet-controlled administration attempted to create
>> a distinct linguistic identity, Moldovian, which as I understand it
>> basically amounted to Romanian written in Cyrillic script. (They tried
>> to introduce some archaic Romanian forms and Russian loans, but
>> apparently none of it stuck.)

I expect a gradual influx of Romanian, Moldavian, Tzintzar and Vlach members
after May 24. I'm in almost daily contact with our friends and collaborators
from Bucharest and Timisoara these days, regarding our Romanian NGO, which is
being registered at the moment, and they'll also serve as the medium of our
future local Moldavian network.

Before their more detailed opinion, I can offer some analogies from similar
cases. A bi- or tri-alphabet situation is not rare in SE European or Eurasian
cultures. In previous centuries we find all combinations of parallel use of
Cyrillic, Glagolitic, Latin, Greek or Arabic scripts among Serbs, Croats,
Romanians, Albanians etc. Religious or ideological affiliations are to blame
for the very recent and oppressive reduction to just one major script, but
even now we have the Serbian case, with Cyrillic as the only standard script
but the Latin script widely used at the daily social level, without prejudice,
even in the core of Serbian culture.

Project Rastko's general policy is more or less to OCR/publish a version in the
original script, but also to provide transliterated versions in other commonly
used scripts. Although we are proponents of having one "official" script, we
publish Serbian works in an additional Latin version so they can be easily read
in Muslim or Croat areas of former Yugoslavia (which share a common language
with Serbian culture).

For Romanian and Moldavian books printed in Cyrillic, I suppose the only logical
solution is to apply Rastko's rules: process them in the original script but
publish in parallel a Latin-script version which modern Romanian readers can read.

Prosfilaes wrote:
> Quote:
>> How relevant is Romanian in Cyrillic script at this point? For instance,
>> what's the likelihood that someone might want to put Romanian-Cyrillic
>> content on the web? Already being done? A reasonable possibility?
>> Extremely unlikely?

It is a reasonable possibility. The phenomenon of script is supranational
and for academic purposes should also be treated as supraconfessional or
supraideological.

Prosfilaes wrote:
> I know DP-EU plans to do it sometime, but do we have stuff that could be
> uploaded tomorrow, or is there something in our plans, or is it something
> that we'll do if and when something clearable comes along (which will be
> hard, as this is strictly post-1945.)

"Tomorrow"? Yes, if it is desperately needed, it could be uploaded in less 48 hours by 
Bucharest 
guys. More realistically speaking, the end of the summer or last quarter should be 
more systematic 
phase for Moldavian case.

Copyright clearability is not an issue, since Rastko's material is mostly by
modern authors who gave non-exclusive rights to publish their works on the
Net for free.

David, please let us know anything new you get about this subject, for it
could be important for several publishing projects our network is preparing.
[We have in our computers perhaps 100 eBooks about Romanian culture, processed
in 2003, waiting to be posted this year.]




Re: Arid Canaanite Wasteland (was: Re: New contribution)

2004-05-02 Thread D. Starner
> My site certainly does not consider Gaelic to be a separate script from Latin.

Did you remove Latg and Latf from the scripts standard? Which is exactly on
point for my message: it is useful to distinguish scripts in many cases where
Unicode may not.





Re: Arid Canaanite Wasteland (was: Re: New contribution)

2004-05-02 Thread D. Starner
> (Note that this site considers "Palaeo" a separate script, this is quite
> clear in the paragraph quoted above.)

And there are sites that consider Gaelic and Fraktur separate scripts,
including one by Michael Everson. Even if we assume knowledge and competence,
we still can't assume they're using the same definition of a separate script
as Unicode does.

> Imagine going back in time ten years or so and approaching the
> user community with the concept of a double-byte character
> encoding system which could be used to store and transfer
> electronic data in a standard fashion.  If they'd responded to
> this notion by indicating that their needs were already being
> well-served by web-Hebrew, would the Unicode project have
> been scrapped?

Yes. How many millions of dollars have gone into defining and implementing 
Unicode? Do you honestly think that Microsoft and IBM and Apple would
have spent all the money they have if their users were well-served by
what you call web-Hebrew?

> Should the proposal proceed as planned, or should we bow our heads 
> to political pressure before burying them in the sands of time?

So now if you think that two scripts that are isomorphic and closely related
should be unified, then you're exerting "political pressure"?




Re: Public Review Issues Updated

2004-05-01 Thread D. Starner
Peter Kirk <[EMAIL PROTECTED]> writes:

> On 30/04/2004 11:21, Michael Everson wrote:
> > At 06:47 -0400 2004-04-30, John Cowan wrote:
> >
> >> Ah, I see the next battle line forming:  Is Fraser a separate script, or
> >> just an oddball application of Latin caps for which we need a few new
> >> ones?
> >
> > It is a separate script.
>
> In your opinion. Or have you consulted with experts on this one, as you
> failed to do on Phoenician? If so, you might be able to cite a body of
> opinion that it is a separate script. There are clearly some opinions
> that it is not.
> [...]
> By the way, I know nothing about Fraser.

Can we keep the personal sniping down? I don't agree with Everson's
authoritarian statements either, but the Unicode list is running at a very
high level here.

I don't see why Fraser is fundamentally different from Cherokee. Both scripts use
shapes from the Latin script, but not with the same range of glyphic variation.
Both are caseless. Unlike Cherokee, Fraser doesn't even use Latin punctuation. 
Fraser may not be terribly inventive, but there's not much similarity to Latin
once you look past the glyphs.




Re: New contribution

2004-04-30 Thread D. Starner
> Hobbyists and lay people. Encyclopedia writers? Overall, much of the same  
> crowd who would be immediately well-served by encoding the "Gardiner" set  
> of Egyptian hieroglyphics.

I consider myself one of the people who would be well-served by the
encoding of Egyptian hieroglyphics. But Dover has a book on writing
hieroglyphics, a book on Egyptian, and a two-volume dictionary, all currently
in print. They've also turned up in a book on the Masons, and I've seen
a dorm floor paint their walls with them.

As for Phoenician, the only place I've ever seen it is in charts of alphabets,
often as not sitting next to a comparison of Fraktur to Roman script. It
could be a font variant of Latin for all those charts matter. I
don't think there's any active popular usage like that for runic or
hieroglyphics.






Re: Public Review Issues Updated

2004-04-30 Thread D. Starner
Kenneth Whistler writes:

> At any rate, since *neither* the capital C-stroke nor the capital turned-T 
> are in Unicode currently, anyone who is thinking about putting together 
> a proposal for the first one based on this Dorsey material might 
> as well include the other character as well, so we don't have to 
> "rediscover" it 6 years from now. 

Distributed Proofreaders is transcribing the whole BAE
Report series for Project Gutenberg right now. There's more than the turned-T;
there's also a turned-K (capital and small) and I've been told of a turned-p.
I brought up the capital c-slash only because of its relevance to the unification
of the small c-slash. I will hopefully be able to produce a proposal for the
whole collection in due time.





Re: Unihan.txt and the four dictionary sorting algorithm

2004-04-20 Thread D. Starner
"Raymond Mercier" <[EMAIL PROTECTED]> writes: 
 
> The problem of the size of Unihan has nothing at all to do with the cost of 
> storage, and everything to do with the functioning of programs that might 
> open and read it. 
 
It's a data file stored as a text file for simplicity; it's not designed
with reading as its primary goal. less and vim work just fine here.
 
> I wish the people who designed this file would accept the need for a more 
> structured and sophisticated approach. Why not, for example, have a basic 
> html file, with html-links to the various sections ? 
 
Because it's a data file, and it's easier to process without all that HTML 
junk to discard.   
 




Re: Fixed Width Spaces (was: Printing and Displaying Dependent Vowels)

2004-04-03 Thread D. Starner
Peter Kirk <[EMAIL PROTECTED]> writes:

> But the good screen reader would still need to distinguish their
> pronunciations. Is there any type of character which could be defined,
> in Unicode, to preserve this distinction, but to be completely hidden in
> display? Perhaps some kind of zero width morpheme break character?

Okay, so you want to distinguish pronunciations, so you propose a character
totally insufficient to do the job, and one that will rarely if ever
get used in practice, so the good screen reader has to solve the problem
anyway?




Re: Re[2]: Fixed Width Spaces (was: Printing and Displaying Dependent Vowels)

2004-04-02 Thread D. Starner
> > It only affects its (visual) aesthetic 
> > quality. 
>  
> That is arguable. An aural user agent could pronounce "1, 2, 3" a bit 
> different from "1, 2, 3" if there is a (say) thin space between the 
> digits in the latter case. It could pronounce it quicker, for example. 
 
And it could read it as "thin space", too. But it's questionable whether any
speech reader is going to try to interpret such ambiguous and rarely
used characters specially. Even if it does, that doesn't make it plain
text; italics and emphasis markup can be interpreted by speech
readers much more usefully, but are clearly not plain text.




Re: Fixed Width Spaces (was: Printing and Displaying Dependent Vowels)

2004-03-31 Thread D. Starner
Peter Kirk wrote:

> Louis   XVI   was
> guillotined   in
> 1793.
>
> Here is what I do want:
>
> Louis XVI     was
> guillotined   in
> 1793.

Louis\ XVI was guillotined in 1793. If you aren't using TeX,
and you're doing this type of justification in small columns,
your program ought to provide a way to do this. This is approaching
italics or small capitals; it's necessary to look right, but it's
not plain text.




RE: What is the principle?

2004-03-31 Thread D. Starner
"Mike Ayers" <[EMAIL PROTECTED]> writes: 
 
>   Support?  ROFL!  Call up one of those companies and tell them that 
> you are having trouble displaying PUA fonts, eastern or otherwise.  I'd like 
> to snoop on that call. 
 
Apple seemed pretty concerned about displaying PUA fonts on Mac OS X
recently on this mailing list. Personally, I doubt I could get Microsoft
to care if Windows or Word was causing my monitor to spin around and
spit out pea soup, but if, say, Xerox was having trouble displaying the
correct spelling of its directors' names and mentioned that they might
have to go to Open Office over this, I'm sure Microsoft would find it
quite important. It has nothing to do with the PUA; it has to do with
who's complaining and how much weight they carry.
  
> > >This is the kind of stuff the UTC refuses to start up by trying 
> > >to provide some subdivision of semantics in the PUA. *That* is 
> > >the principle, by the way, which guides the UTC position on 
> > >the PUA: Use at your own risk, by private agreement. 
>  
>   ..."and quit bothering us about it."  That's gotta be in there 
> somewhere.  If not, I have an amendment to propose. 
 
Why don't we add that note to other blocks? It'd be so much easier 
if we could just tell the people using, say, the Hebrew block that 
we've thrown something together for you, don't bother us if it doesn't 
work. Surely Unicode didn't waste two planes for something that 
no one can practically use. 




Re: What is the principle?

2004-03-30 Thread D. Starner
Rick McGowan <[EMAIL PROTECTED]> writes: 
 
> It was written that way on purpose. 
 
That's a nice solution that I wish more systems had adopted. 
 
> "Unicode" has never written any platform software, so it   
> could hardly have made the PUA "too hard to use".  
 
There are two private use planes. That's more than enough area
to make some of it RTL and some of it combining, and so on for 
the major patterns of properties. If you really need 130,000 
LTR private use characters, you could still change the  
properties, but that's not the common usage pattern. 
 
> For most purposes, 9,999 out of 10,000 users should never have any use   
> whatsoever for the PUA.  
 
There are large sections of the standard for which this is true. I'm not
sure there's a person in the world with an actual need to transmit 
Gothic text.  
 
> It's   
> more to their bottom line to support exotic scripts than to support the   
> PUA, which "nobody needs" anyway.  
 
More importantly, Gothic support comes for free with support for combining 
and astral characters. At its best, support for exotic scripts and characters 
comes for free with support for the basic rules of Unicode. It would be 
nice if PUA support was the same, instead of having to provide APIs to 
change properties for many basic usages. 
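
As a concrete sketch of "for free", in Python: Gothic sits in Plane 1, so
any code path that already handles surrogate pairs handles it unchanged.

    # GOTHIC LETTER AHSA, U+10330, becomes a surrogate pair in UTF-16;
    # nothing Gothic-specific is needed beyond astral-character support.
    ahsa = "\U00010330"
    print(ahsa.encode("utf-16-be").hex())  # prints: d800df30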
 
> If there is a real need for exchanging   
> some bunch of symbols, people should be trying to standardize them, not   
> standardize ways of *not* standardizing them. 
 
For one, there are scripts like Klingon and the Tolkien scripts that get
no respect. Also, considering some of the changes and problems with 
scripts like Mongolian and Khmer, it'd be nice to have a PUA encoding 
with a serious body of texts encoded before committing Unicode to one 
particular encoding. 
 
Is the PUA only for "real needs"? Should play scripts be encoded as
ASCII, or should there be some way to get it right, even if only 
one person in the world is going to use it? 




Re: What is the principle?

2004-03-30 Thread D. Starner
"Dominikus Scherkl (MGW)" writes: 
 
> I would expect any application to allow _all_ properties to 
be 
> defined by the user for each and any PUA charakter. 
> If not so, it's a bug in the application! (at least if it can 
> handle charakters with the same properties elsewhere in the 
Unicode.) 
 
That's a nice theory. But in practice I don't know of a single 
program that allows you to change the properties of Unicode 
characters without a recompile. You'd really need a standard 
format for defining the PUA, because even Unicode geeks would 
get tired of inputting the various properties by hand into every
program. 
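
Purely by way of illustration, the kind of declaration such a format might
carry; the codepoints and field names here are invented, which is exactly
the problem:

    # An invented sketch of a per-application PUA property table; nothing
    # like this is standardized, so every program needs its own.
    PUA_PROPERTIES = {
        0xE000: {"bidi_class": "L",   "combining_class": 0},    # a base letter
        0xE001: {"bidi_class": "NSM", "combining_class": 230},  # an above-base mark
    }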
 
I think Unicode made the PUA too hard to use, deliberately or 
through apathy. If there were some standard way of announcing
the system being used, or even of segregating characters by user,
it might be usable. But in practice, cuneiform PUA usage in 
plain text was broken due to concerns about confusion, and 
Klingon webpages tend to get their characters mixed with random 
junk from other pages. It's almost more reliable to use the 
ASCII or Latin-1 area for your PUA characters than the PUA 
itself--at least there, the programs won't usually switch fonts 
randomly. 
 




Re: A proposed change of name for Latin Small Letter TH with Strikethrough

2004-03-06 Thread D. Starner
Peter Kirk <[EMAIL PROTECTED]> writes:
> but then there could well be a dictionary out 
> there somewhere which uses one of your supposedly equivalent ligatures 
> for the voiced th and another one for the unvoiced th.

So we make decisions based on one _hypothetical_ dictionary? I've
got a grammar here that uses a Fraktur ch as a phonetic symbol;
I've got a book on the sounds of German and English that uses various
small capital letters as phonetic symbols, some of which weren't in
Unicode last time I checked (pre-4.0); I've got a dictionary that
uses Arabic numerals as combining superscripts; and I've got a book
on Cakchiquel that uses Latin letters Tresillo and Quatrillo that I'd
really like to put online. All of these are _real_ books, yet no one
is rushing to make changes solely based on them. (Which is not to slight
the people who actually are working on Tresillo and Quatrillo, but the
holdup is finding other usages.)

The correct behavior is to take a large-print copy of the Handbook of the
IPA and beat the publishers about the head until they start using real
symbols. As a fall-back plan, as that may be illegal in some states,
we might encode a th-ligature or point to t--h. But encoding a
bunch of th ligatures that exist _because_ there's nothing standardized,
and imagining that they might be used somewhere in a way that someone not
obsessing about legitimate font changes might care about, is absurd.





Re: A proposed change of name for Latin Small Letter TH with Strikethrough

2004-03-06 Thread D. Starner
Peter Kirk <[EMAIL PROTECTED]> writes:
> But 
> do we take U and V as different presentation forms of one character? 
> Should one of them be only in the PUA? Surely not.

But there are documents that use u and v, like those referring
to vultures. As far as I can tell, there are no documents that
use both of these ligated th's, so the cases aren't parallel.





RE: A proposed change of name for Latin Small Letter TH with Strikethrough

2004-03-05 Thread D. Starner
"Peter Constable" <[EMAIL PROTECTED]> writes:
> > TH WITH STRIKETHROUGH
> > ITALIC TH LIGATED BY HOOK
> > PLAIN TH LIGATED BY CROSSBAR
> > 
> > three separate glyphic representations of the same character
> > 
> > LEXICOGRAPHIC VOICED TH
> 
> No; three separate character representations of the same orthographic
> function.

Why are they separate characters? They are separate glyphs, but that
doesn't make them separate characters. They're all ligated th's used
for the same purpose, with more glyphic similarity than many of the
other glyphs unified in the Unicode standard.
 
> Me thinks the editorial staff of a given dictionary publisher that needs
> to maintain its conventions across different editions will certainly
> say "yes".

Does said editorial staff also need serif versions of all the Latin characters
encoded so that the convention of using a serif font for the dictionary
is also maintained? If they reset the text in a Gaelic font, or even in
a sans-serif font for a compressed pocket edition, would the differences
between the characters make any difference? I seriously doubt it.




RE: A proposed change of name for Latin Small Letter TH with Strikethrough

2004-03-05 Thread D. Starner
"Peter Constable" <[EMAIL PROTECTED]> writes:
> Unicode encodes characters, not orthographic functions. 

And the character is a th ligature. Why do the fine details
of how that's drawn in various unstandardized phonetic
alphabets make more difference than the huge variations in
the g's used in Latin alphabets? If we worry about variation
selectors here, why not with Runic? In this case, unlike Runic,
it is unimportant which character is used, and I would be 
surprised if anybody really cares about the fine details.




Re: LATIN SMALL LIGATURE CT

2004-03-01 Thread D. Starner
> But can someone explain to me why a ligatures such as ct which CANNOT be 
> accurately decomposed into individual characters (at least, it can't if 
> it's designed PROPERLY) shouldn't be encoded in its own right?  
> Non-decomposability is the special feature of all the ligatures currently 
> included in Alphabetic Presentation Forms.

No, it's not. As has been said several times, the only reason for those
ligatures is because they happened to be in older standards that needed
one to one mappings.

> How about the German double s/eszett (U+017F) a ligature of long s and s 
> which cannot be accurately built up from it's components. 

The eszett, as used in modern texts, is not a ligature. It's a letter in
its own right. (I assume you mean U+00DF and not U+017F, the long S.)
 
> There must be countless historical facsimile editions printed every year 
> which use the st and ct ligature extensively. The production of these 
> items would hugely benefit from having a fixed codepoint for "ct" instead 
> of it wandering all over the PUA depending on what font you're using.

In the first place, most facsimile editions aren't retypeset; they're graphical
copies of the original typeset edition. Furthermore, facsimile editions
are strongly tied to the original font. Most importantly, you don't need
to wander all over the PUA - with modern typesetting systems and good fonts,
you just place a ct there and the software automatically ligatures it for you.
You can use a ZWJ to ask for a ligature and ZWNJ to make sure there isn't one.
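
In string terms, for instance (a Python sketch; whether the ligature
actually appears is still up to the font and the renderer):

    # ZERO WIDTH JOINER (U+200D) requests a ligature; ZERO WIDTH
    # NON-JOINER (U+200C) suppresses one.
    request_ct  = "c\u200dt"  # please ligate c and t
    suppress_ct = "c\u200ct"  # please keep c and t separate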




Re: (SC2WG2.609) New contribution N2705

2004-02-18 Thread D. Starner
> >And the subscript / is over the edge, as far as I am concerned.
> 
> U+208D and U+208E aren't.

Why not? That's like saying that U+212B ANGSTROM SIGN is
justification for adding further canonically equivalent
characters. U+208D and U+208E were, as I understood it,
added solely because some terminal supported them as characters
and Unicode wanted to support that terminal.




Re: Fwd: Re: (SC2WG2.609) New contribution N2705

2004-02-17 Thread D. Starner
> My point is that characters such as 02B0, 02B2, and 02E0 are already used in the 
> same fashion as the newly proposed Indo-European characters. Therefore, it's 
> not clear to me why there should be any objection to the latter. 
 
Because any mathematical alphanumeric character can appear superscripted
in mathematics; does that mean that we need to create superscripted characters 
for all of them? If these are part of a small, closed set, like U+02B0, then  
it's appropriate to encode them; but if just any character can appear  
superscripted or subscripted, then it goes outside Unicode, and needs to be 
dealt with in markup. 




Re: Panther PUA behavior

2004-02-05 Thread D. Starner
> So you're saying PUA is only useful for rich or marked-up text?

In any multi-person case, the codepoints in PUA can't be just moved on
a whim, and many PUA users will have a wide array of fonts, including
some with conflicting PUA coverage. Given those assumptions, PUA use
can't work outside rich or marked-up text.

> Plus I have only one PUA-based font installed on my computer.

Then it works for you. But I doubt Apple is interested in a solution
that has that constraint.
 
> But even there, that
> would seem to me to be a font vendor problem that should not dictate OS
> policy. 

Would you be happy if Apple passed the buck here to font vendors or someone
else? When you're writing a program that uses a file format with a bunch
of not-quite-standard files out there, you have to conform to reality, not
the standard. I'm not sure anything strictly says that junk in the PUA is 
wrong, though it's clearly sub-optimal.

In any case, you're willing to work with a system where you could install
Gentium and a bunch of cuneiform characters become reversed Latin letters.
I'm not sure that's a standard reliable enough for Apple to be interested
in working towards.




Re: Panther PUA behavior

2004-02-03 Thread D. Starner
> Dean and others working with him prefer not to use transliteration. Why 
> should they be forced to? 

They aren't forced to. They are more than welcome to develop their own
operating system or patch an existing one. Or they can wait until
Unicode 5.0 with cuneiform comes out. That doesn't mean that it will
magically work right now.

> Your allegation that it is a "because-I-can" 
> thing may well be totally unfounded. 

Instead of coming up with analogies, can you come up with some reasons
why it would be rationally better to use cuneiform rather than
transliteration in filenames? It's easier to type transliteration,
more reliable moving between systems, and more reliable in the same
system, given that filenames aren't inherently linked to one font.

>  From how I understand what Dean wrote, the issue is a very simple one. 
> What he wanted did work in Jaguar. It doesn't work in Panther. He is 
> unhappy about that.

That's the type of attitude that drives people nuts. Try seeing the 
big picture. From what I understand, the issue is a very simple one. 
Apple broke a feature that was unreliable (by Dean's admission) in 
order to support another feature that's used much more often that 
could be made reliable. If Dean's unhappy with that, I'm sorry, but
I don't think Apple's going to break their operating system to make
his filenames work.




Re: interesting SIL-document

2004-02-03 Thread D. Starner
> So why not beh-ind, ah-ead, beeh-ive etc? Is there a good phonetic 
> reason? 

As a native speaker of English, because it is. I might yell be (breath) hind
or be-f***ing-hind, never beh (breath) ind or beh-f***ing-hind. If I were
emphasizing my words, it might be a (pause) head, but never ah (pause) ead.
Beehive is a compound word, and would be pronounced as such.




Re: Panther PUA behavior

2004-02-03 Thread D. Starner
> Might it not seem rather sensible of them to use the (SIL PUA) codes 
> F20E and F20F, generated by their keyboards and displayed with their 
> fonts? 

In the computing world, there's always a trade-off between expressiveness
and flexibility. They might not find it so sensible if they have
to access their data anywhere else or have anyone else access their
data. 

In any case, the vast majority of people working with cuneiform would use
a transliteration, likely even written on their paper files. To use real
cuneiform is a "because-I-can" thing, to which I am not personally insensible,
but it doesn't get the highest-priority bug fixes.

> You may not think what Dean 
> and his colleagues were doing was very sensible, but it obviously made 
> sense to them, so what was the point of banning it?

The point of banning it, if I understand it right, was that the old way
didn't work right when viewing PUA data under all circumstances, and
the only fix was, as Dean put it, to uninstall fonts and rearrange
codepoints. Enabling the functionality in text editors had the
unexpected side effect of breaking PUA characters in file names. Which
way to go is obvious to me.





Re: Panther PUA behavior

2004-02-02 Thread D. Starner
> I hope Apple re-thinks this, because it makes PUA useless in plain text. 

That's because it is. Without further specification, the PUA is completely
ambiguous.

> end
> users get to control display behavior by re-assigning PUA code points or
> de-installing fonts, whereas they have no control and no visual
> information if the OS just gives up.

You can binary patch your OS to fix this behaviour. That's about as 
reasonable as reencoding your data or removing fonts until the system
pseudo-randomly picks the right font.

> So, for example, in Jaguar I had been using a PUA-based cuneiform font
> for file and folder names, which I found to be very nice and very useful;

Nice and useful? At least in my experience, giving my folders names I can't
write from the keyboard, that can't be displayed in many of the fonts on
my system, is at best an affectation. Using PUA characters for filenames is
unportable and it's a marginal use, even among the uses of PUA characters.
Given the choice between the private use area working right in word processors
and text editors or it working in filenames, I'd pick the first, and not
be really sorry about disrupting the second.





Re: Latin Theta?

2004-01-28 Thread D. Starner
"Mark E. Shoulson" <[EMAIL PROTECTED]> wrote:

If IPA deserves Latin 
versions of ÏÏ, then it needs them for ÎÎ too.
IPA is a mixture of Latin and Greek characters. 
 
discusses why there's no Latin Î or Î; basically all the other characters 
had new Latin uppercase versions in African languages, or were written in
IPA in a way incompatible with Greek. 




Re: best editor in *nix that can do UTF16LE

2004-01-25 Thread D. Starner
> Hello. I didn't find a good editor to work with some UTF16LE text file 
> (starts with FFFE) on *nix.
> 
> The only editor I know that deals with UTF16LE in the *nix world is mozilla
> composer. I didn't test emacs: I'm not used to it.
> 
> Any hint? Thank you.

yudit will work. The bigger question is why are you editing UTF-16LE on
*nix? Recode (using recode or iconv) it to UTF-8 on the way in, and
recode it to UTF-16LE on the way out, and use whatever Unicode editor
you want. 
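
A sketch of the round trip in Python (the filenames are hypothetical);
recode or iconv on the command line amounts to the same thing:

    # Read the UTF-16LE file (the FFFE signature is just the BOM), work
    # in UTF-8, and write UTF-16LE back out at the end.
    with open("in.txt", encoding="utf-16") as f:   # consumes the BOM
        text = f.read()
    with open("work.txt", "w", encoding="utf-8") as f:
        f.write(text)
    with open("out.txt", "w", encoding="utf-16-le") as f:
        f.write("\ufeff" + text)                   # write the BOM back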





Re: Three new Technical Notes posted

2004-01-23 Thread D. Starner
>  #12 UTF-16 for Processing
>   by Markus Scherer

This is incorrect in saying that Ada uses UTF-16. It supports
UCS-2 only. The text of the standard says:

    The predefined type Wide_Character is a character type
    whose values correspond to the 65536 code positions of
    the ISO 10646 Basic Multilingual Plane (BMP). [...] As
    with the other language-defined names for nongraphic
    characters, the names FFFE and FFFF are usable only with
    the attributes (Wide_)Image and (Wide_)Value; they are
    not usable as enumeration literals. All other values of
    Wide_Character are considered graphic characters, and
    have a corresponding character_literal.

which doesn't include surrogate code points. The next 
version of Ada will have 32-bit characters to fully
support Unicode - the text of the proposal is here:



plus lengthy discussion on the issues. 





Re: Detecting encoding in Plain text

2004-01-14 Thread D. Starner
Peter Kirk writes:

> I agree that heuristics should be adjusted for Thai. But problems may
> arise if they have to be adjusted individually, and without regression
> errors, for all 6000+ world languages.
Thai is hard because of the writing system. But most writing systems weren't
encoded pre-Unicode, so if they were typed into a computer, it was with
a Latin (or Cyrillic?) transliteration that probably used spaces and new lines,
and in fact was probably ASCII. 

More cynically, those who use obscure character sets or font encodings have 
trouble viewing them; that is one of the reasons for Unicode. That this tool 
may to some extent be an example of that problem is a simple fact of life, 
and doesn't call for it to be thrown out.
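
A sketch of the cheap first-pass heuristic being discussed, in Python; the
threshold is invented, and is precisely the part that would need
language-specific tuning for cases like Thai:

    def looks_like_separated_text(data: bytes) -> bool:
        # ASCII spaces and newlines are frequent in Latin or Cyrillic
        # transliterations, and can be absent entirely in scripts
        # written without word separators.
        if not data:
            return False
        separators = data.count(b" ") + data.count(b"\n")
        return separators / len(data) > 0.05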



