RE: Difference between EM QUAD and EM SPACE

2000-07-12 Thread Edward Cherlin

At 2:09 AM -0800 7/11/00, Roozbeh Pournader wrote:
On Mon, 10 Jul 2000, Jonathan Coxhead wrote:

 In TeX, the difference is that an EM QUAD (\qquad) and an EN QUAD
  (\quad) provide spaces that are legitimate breakpoints for lines within a
  paragraph; while EM SPACE, EN SPACE (\enspace) and THIN SPACE (\thinspace)
  produce horizontal space that cannot cause a line-break.

Very close, except for the size of the quads.

I don't think so. I remember that in TeX, \quad was an em quad, and
\qquad a double em quad. Would someone look at a good source for that?

--roozbeh

Correct.

Knuth says

The macros \enskip, \quad, and \qquad provide spaces that are 
legitimate breakpoints within a paragraph; \enspace, \thinspace, and 
\negthinspace produce space that cannot cause a break...

\def\enskip{\hskip.5em\relax}
\def\quad{\hskip1em\relax} \def\qquad{\hskip2em\relax}
\def\enspace{\kern.5em }
\def\thinspace{\kern .16667em }...

The TeXBook, p. 352.

Roughly, then

\enskip    ~ en space
\quad      ~ em space
\qquad     ~ 2em space
\enspace   ~ en kern
\thinspace ~ thin kern

Not the most enlightening choice of names, but we have that problem as well.


Edward Cherlin
Generalist
"A knot!" exclaimed Alice. "Oh, do let me help to undo it."
Alice in Wonderland



Re: Han character names?

2000-07-12 Thread Jon Babcock


Thomas Chan wrote:

I was interested in seeing an example of a Han graph that has no
documented pronunciation because I was under the impression that such
a graph doesn't/cannot exist.

   The "beikao" chapter (pp. 1585-1631) of the _Kangxi Zidian_ would be one
   place to start for those unconfirmable that have pronunciations but no
   meanings or having neither.  e.g., 1585.9 (two U+4E36's, one over the
   other, and all that overlaid across the leftmost stroke of U+4E43) and 
   1593.23 (U+5B80 above U+4E43), both of which have no pronunciation/meaning
   information documented in Morohashi or _Hanyu Da Zidian_ either.


Those with the Kodansha reprint of the Biaozhu Dingzheng Kangxi Zidian
(ISBN4-06-121033-5), will find the 'beikao' chapter on pages 3533 -
3602, the last chapter in the book. The first example above is the 9th
character in the chapter.  The second example is the first character
under U+5B80 as a classifier (i.e., Kangxi classifier #40, 'sheltered,
under a roof, thatch') p. 3545.

Thanks for calling attention to the 'beikao' chapter of the Kangxi Zidian.
Very instructive.  I noticed that in the first 600 or 700 entries listed in
this chapter, these two characters are the only examples where
both the pronunciation and meaning are said to be totally
missing. (Both characters seem to have been created at the same time,
AD 841 - 846.) But within this same range of entries, there are at
least four more cases where both the pronunciation and meaning are
said to be 'not yet clear'.  If this rate of discovery holds
throughout the chapter (there seem to be about 4000 - 5000 entries in
all), we would expect to find around 40 characters that either totally
lack a documented pronunciation or whose pronunciation was not
clear, or was questionable, at the time of compilation.  That certainly
proves that such critters exist.

It is also interesting that all the other entries that are not listed
as mistakes for proper characters or as allographs, and which must
number a couple thousand or more, are listed as having pronunciation
but no known meaning. Oh, how rudely out of character for a script
persistently characterized as "ideographic"...

Jon

--
Jon Babcock [EMAIL PROTECTED]













Re: Detecting installed fonts in a browser window [was Re...

2000-07-12 Thread Bob Hallissy

   Due to a bug in Arabic-enabled fonts distributed with IE 5,
   Tahoma, Arabic Traditional, Courier New, etc., the medial form
   of
   U+06CC (ARABIC LETTER FARSI YEH) gets rendered exactly like the
   isolated form.

   Some comments:

   Looks to me like the initial form of U+06CC is also wrong. You
   didn't mention that, but I hope your "font fixer" tool will
   correct it also.

   Even Times New Roman and Arial Unicode have this flaw.

The bug is only a forgotten field in the GPOS table of the fonts

   Isn't it actually in the GSUB table? I don't think these fonts
   have GPOS information.

   Bob Hallissy





Persian developers (was Re: Detecting installed fonts in ...

2000-07-12 Thread Bob Hallissy

   [EMAIL PROTECTED] said:


   That has created a major problem for Persian developers trying
   to maintain a web page. They should check the page for any case
   of the medial form of ARABIC LETTER FARSI YEH, and replace it with
   ARABIC LETTER YEH, because the two look alike in the
   medial form. But that also creates a problem when a user
   uses a local search on the document.

   This raises a question that I've been wondering about:

   It has been my impression that many Persian applications use
   the Arabic YEH code point (Windows character 237, U+064A) for
   the Farsi Yeh, and then depend on the font to have been
   modified to show the final and isolated forms without dots. This, of
   course, would not be considered "correct Unicode", but it was a
   way to adapt Arabic software to Farsi needs. Similar hacks, if
   I may call them that, are typically made with a couple of other
   characters, namely Teh Marbuta (Windows 201, U+0629) and Kaf
   (223, U+0643), to get the correct Farsi shapes.

   With wider Unicode coverage from Microsoft and other vendors
   (albeit with occasional bugs as you have pointed out), these
   hacks are no longer necessary. But there is surely a large body
   of Farsi text already encoded using the hacks. What is the
   general mood of the Persian software industry towards this problem:
   Are they moving rapidly to Unicode or are they staying with the
   old? Is a standard mechanism (e.g., import/export filters)
   being developed for migrating and exchanging the data?

   I'd appreciate any insight you or others on this list have.

   Bob Hallissy




Re: Euro character in ISO

2000-07-12 Thread Roozbeh Pournader



On Tue, 11 Jul 2000, Asmus Freytag wrote:

 The only safe way to encode a Euro in HTML appears to be to use Unicode -
 e.g. by using 8859-1 together with the numeric character reference (NCR) of
 &#x20AC;

&euro; is much safer. Netscape 4 doesn't recognize hexadecimal character
references.

--roozbeh




Re: Detecting installed fonts in a browser window [was Re...

2000-07-12 Thread Roozbeh Pournader



On Wed, 12 Jul 2000, Bob Hallissy wrote:

Looks to me like the initial form of U+06CC is also wrong. You
didn't mention that, but I hope your "font fixer" tool will
correct it also.

You're right. It will also do that. I had forgotten that.

Isn't it actually in the GSUB table? I don't think these fonts
have GPOS information.

Sorry, I meant to write GSUB.
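
A minimal sketch of this kind of check, using the fontTools library (an
assumption on my part; any OpenType table dumper would do), which lists the
features a font's GSUB table declares:

    # Minimal sketch: list the OpenType features declared in a font's GSUB table.
    # "somefont.ttf" is a placeholder path; requires the fontTools library.
    from fontTools.ttLib import TTFont

    font = TTFont("somefont.ttf")
    if "GSUB" in font:
        gsub = font["GSUB"].table
        tags = sorted({rec.FeatureTag for rec in gsub.FeatureList.FeatureRecord})
        # Arabic shaping normally relies on 'init', 'medi', 'fina' (and often 'isol').
        print("GSUB features:", ", ".join(tags))
    else:
        print("No GSUB table at all")
    print("Has GPOS:", "GPOS" in font)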




Re: Persian developers (was Re: Detecting installed fonts in ...

2000-07-12 Thread Roozbeh Pournader



On Wed, 12 Jul 2000, Bob Hallissy wrote:

It has been my impression that many Persian applications use
the Arabic YEH code point (Windows character 237, U+064A) for
the Farsi Yeh, and then depend on the font to have been
modified to show the final and isolated forms without dots. This, of
course, would not be considered "correct Unicode", but it was a
way to adapt Arabic software to Farsi needs. Similar hacks, if
I may call them that, are typically made with a couple of other
characters, namely Teh Marbuta (Windows 201, U+0629) and Kaf
(223, U+0643), to get the correct Farsi shapes.

I've not heard anything about the Teh Marbuta in this regard. But I know
about YEH and KAF being used instead of FARSI YEH and KEHEH. The problem
with YEH is still there when someone uses CP1256, since that does not
have FARSI YEH.

With wider Unicode coverage from Microsoft and other vendors
(albeit with occasional bugs as you have pointed out), these
hacks are no longer necessary. But there is surely a large body
of Farsi text already encoded using the hacks. What is the
general mood of Persian software industry towards this problem:
Are they moving rapidly to Unicode or are they staying with the
old? Is a standard mechanism (e.g., import/export filters)
being developed for migrating and exchanging the data?

The volume seems to be Word documents only. Many people are writing
converters to deal with these. We are also among the converter writers.

Also, few are moving rapidly to Unicode. The Word users want their
WYSIWYG. They only want to edit and print their old docs. So they install
the old fonts on their newer OSes, and things go OK for them.




Euro symbol in HTML (was: Euro character in ISO)

2000-07-12 Thread Otto Stolz

On 2000-07-11 at 23:30 UTC, Asmus Freytag wrote:
 The only safe way to encode a Euro in HTML appears to be to use Unicode -
 e.g. by using 8859-1 together with the numeric character reference (NCR) of
 &#x20AC;

This does, however, not work with Netscape 4.x, as these browsers only
understand decimal NCRs.  Pre-4.7 Netscape browsers do not correctly
interpret NCRs above 255 if an 8-bit encoding (e. g., Latin-1) is used,
in blatant contrast to the standard, cf.
http://www.w3.org/TR/REC-html40/charset.html#h-5.1 (I do not remember
the exact version in which this bug was fixed).

Hence, the only safe way to encode the Euro symbol seems to be:
- Use the &euro; entity, cf. the last line of
  http://www.w3.org/TR/REC-html40/sgml/entities.html#h-24.4.1;
  This will cause Netscape 4.7 to display "EUR" if the Euro glyph
  is not available (at least the version on my Unix box does so).

The following two ways are safe, if the Euro glyph is available in
the fonts specified by the user:
- use UTF-8 together with the decimal NCR "&#8364;";
- use UTF-8 together with the UTF-8 encoding 'E2 82 AC' (in hex).
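
As a quick cross-check of the byte values and references involved, a minimal
Python sketch (illustrative only):

    # Quick check of the representations of U+20AC EURO SIGN discussed above.
    euro = "\u20ac"

    print(euro.encode("utf-8"))        # b'\xe2\x82\xac' -- the 'E2 82 AC' sequence
    print("&#%d;" % ord(euro))         # &#8364;   decimal NCR
    print("&#x%X;" % ord(euro))        # &#x20AC;  hexadecimal NCR (not understood by Netscape 4.x)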

In all cases, do not forget to declare your HTML source as either
HTML 4.0 or HTML 4.01,
cf. http://www.w3.org/TR/REC-html40/struct/global.html#h-7.2.

Cf. my examples:
- http://www.rz.uni-konstanz.de/y2k/test/Euro-Latin-1.htm in Latin-1,
- http://www.rz.uni-konstanz.de/y2k/test/Euro-Latin-9.htm in Latin-9,
- http://www.rz.uni-konstanz.de/y2k/test/Euro-UTF.htm in UTF-8.

Best wishes,
   Otto Stolz



Re: Euro character in ISO

2000-07-12 Thread Michael Everson

At 15:30 -0800 2000-07-11, Asmus Freytag wrote:
At 01:25 PM 7/11/00 -0800, Leon Spencer wrote:
Has ISO addressed the Euro character?

Yes. It's at 0x20AC in ISO/IEC 10646-1.

This is not a standard notation. Please use U+20AC or one of the other
standard notations to refer to UCS code positions.

ME





Re: Han character names?

2000-07-12 Thread Antoine Leca

[EMAIL PROTECTED] wrote:
 
   If they did, would the SIP overflow?
  But that is what Plane 3 is for. MDIP ("More damn
  ideographs plane") ??

 Yes, let us call it the MDIP. What would that be
 in French?

Prise de tête

(That is probably too much Parisian French to gain wide
acceptance in Canada. But I believe people in France
will get it.)

Antoine



Re: Euro character in ISO

2000-07-12 Thread Michael Everson

At 18:19 -0800 2000-07-11, Robert A. Rosenberg wrote:

The problem would go away if the ISO would get their heads out of
their a$$ and drop the C1 junk from the NEW 'TOUCHED UP" 8859s and
put the CP125x codes there.

Excuse me, but that is not appropriate. The ISO/IEC 8859 series is
conformant with ISO/IEC 2022, and protocols which adhere to that standard
should not be compromised by what you suggest.

Then when you said you used 8859-21 you'd get CP-1252 and Windows
would no longer need to lie (or tell the truth by admitting it is
CP-1252).

The problem is that some companies do/did not correctly identify their code
pages. The world can live with Latin-1 and CP-1252. It shouldn't have to
live with CP-1252 being identified as Latin-1.

Michael Everson  **  Everson Gunn Teoranta  **   http://www.egt.ie
15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
Vox +353 1 478 2597 ** Fax +353 1 478 2597 ** Mob +353 86 807 9169
27 Páirc an Fhéithlinn;  Baile an Bhóthair;  Co. Átha Cliath; Éire





Re: Han character names?

2000-07-12 Thread Michael Everson

At 10:23 -0800 2000-07-11, Jon Babcock wrote:

But covering the jiaguwen [J. koukotsumoji] (oracle bone script) is
another story.  First of all, it's a moving target.

Isn't it best treated as a font variant of CJK?

Michael Everson  **  Everson Gunn Teoranta  **   http://www.egt.ie
15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
Vox +353 1 478 2597 ** Fax +353 1 478 2597 ** Mob +353 86 807 9169
27 Páirc an Fhéithlinn;  Baile an Bhóthair;  Co. Átha Cliath; Éire





Re: Euro character in ISO

2000-07-12 Thread Antoine Leca

Robert A. Rosenberg wrote:
 
 At 15:30 -0800 on 07/11/00, Asmus Freytag wrote about Re: Euro
 character in ISO:
 
 There has been an attempt to create a series of 'touched up' 8859
 standards. The problem with these is that you get all the issues of
 character set confusion that abound today with e.g. Windows CP 1252
 mistaken for 8859-1 with a vengeance:
 
 The problem would go away if the ISO would get their heads out of
 their a$$ and drop the C1 junk from the NEW 'TOUCHED UP" 8859s and
 put the CP125x codes there.

Sorry. It may work for CP1252/iso-8859-1 and CP1254/iso-8859-9,
but won't for the others. Since Windows starts with the same letter as
Word --or is the reason that they both come from the same company?
No! I cannot believe that-- there are a couple of requirements
that effectively make the "other" code pages slightly incompatible,
such as the required presence of · at position B5 (because this
is the character Word uses when you ask it to "display" spaces,
and this is hard-coded in the product).


 Then when you said you used 8859-21 you'd get CP-1252 and Windows
 would no longer need to lie (or tell the truth by admitting it is
 CP-1252).

Even if 8859-21 is defined to be exactly the same as some stage of
CP1252, and everyone in the standardization community accepts it
as such, habits are so entrenched, and love for Microsoft
so rare in the Unix world, that you may bet a lot that such a
standard will never gain wide acceptance.

Furthermore, this is completely unnecessary, as nowadays such
a standard exists, and it is called 'charset=windows-1252'...

The real problem is that:
- Windows browsers/MAs did not know that until 1999 (or so it seems);
- Windows HTML tools/MAs are reluctant to add a test for the presence
  of non-Latin-1 characters and then tag the document as either iso-8859-1
  or windows-1252. Apparently they are too lazy (even though they already
  do such a test for ASCII).
Well, I am angry, because nowadays browsers probably do the job correctly.


Antoine



Re: correction (was: Not all Arabics are created equal...)

2000-07-12 Thread Roozbeh Pournader



On Wed, 12 Jul 2000, Gregg Reynolds wrote:

 But in any case, this doesn't change the main point:  Persian may be
 spoken MSD-first, but its written forms are LSD-first.

No. Except when doing addition etc. (just as in English), Persian numbers are
written MSD-first. When I (and any other Persian speaker I know) try to
write something like "I have 12 books", which is "man 12 ketaab daaram" in
Persian, I write it in this fashion:

  M
 AM
NAM
 1  NAM
 12 NAM
   K 12 NAM
  EK 12 NAM
...
   MARAAD BAATEK 12 NAM

This means that Persian is also written MSD-first.

--roozbeh




RE: Euro symbol in HTML (was: Euro character in ISO)

2000-07-12 Thread Alan Wood

 Otto Stolz wrote:
 
  Hence, the only safe way to encode the Euro symbol seems to be:
  - Use the &euro; entity
  This will cause Netscape 4.7 to display "EUR" if the Euro glyph
  is not available (at least the version on my Unix box does so).

  The following two ways are safe, if the Euro glyph is available in
  the fonts specified by the user:
  - use UTF-8 together with the decimal NCR "&#8364;";
  - use UTF-8 together with the UTF-8 encoding 'E2 82 AC' (in hex).

  In all cases, do not forget to declare your HTML source as either
  HTML 4.0 or HTML 4.01,

I can confirm that &euro; and &#8364; also work with Netscape 4.73 under
Windows 95.

However, the euro symbol seems to be the exception.  In my index of HTML 4
named character entities at:

http://www.hclrss.demon.co.uk/demos/ent4_frame.html

Netscape 4.73 does not recognise any of the other named character entities
that correspond to decimal numbers greater than 255.  (With View > Character
Set set to Unicode (UTF-8), and using Arial Unicode MS.)

Alan Wood
(Documentation Writer / Web Master)
Context Limited
(Electronic publishers of UK and EU legal and official documents)
mailto:[EMAIL PROTECTED]
http://www.context.co.uk/
http://www.alanwood.net/ (Unicode, special characters, pesticide names)




ATM light glyphs for Unicode characters?

2000-07-12 Thread Christopher J. Fynn

Anyone know if Adobe's (free) ATM Light
http://www.adobe.com/products/atmlight/main.html
supports display of glyphs for Unicode characters when these are named
according to Adobe's document "Unicode and Glyph Names"?
http://partners.adobe.com/asn/developer/typeforum/unicodegn.html

- Chris

--




Re: Euro character in ISO

2000-07-12 Thread brendan_murray



Robert A. Rosenberg wrote:
Then when you said you used 8859-21 you'd get CP-1252 and Windows
would no longer need to lie (or tell the truth by admitting it is
CP-1252).
And because certain companies had (and still have) bugs in their comms
products, incorrectly identifying CP1252 data as ISO 8859-1, ISO standards
should reject ISO-2022 and populate C1 with graphic characters?

I suppose other inconsiderate incompatibilities, such as the incorrect
encoding of half-pitch kana in ISO-2022-JP, are the fault of ISO too?

Perhaps those companies that have these major bugs in their software, all of
which have been repeatedly pointed out, should fix the problems there. The
rest of the industry bends over backwards to accommodate these corrupt data,
so a little effort on the part of the guilty would help a lot, and might
prevent misguided postings like the above.

B=


Proposal to make the unicode list more transparent! (Sender:

2000-07-12 Thread Doug Ewell

Jens Siebert [EMAIL PROTECTED] wrote:

 However, because of the tremendous amount of mail
 I would like to suggest splitting the list into
 various lists, divided by main topics.

 These could be sorted by „groups of languages“,
 such as CJK(+V) and other groups.
 Another sector could be „technical issues“, such
 as encoding-related mails, statements about
 program-code source samples, etc.!

I cannot speak for the list administrators, but I am on about four
mailing lists, and almost every list gets a request like this from time
to time.  It seems at first glance to be a worthy goal.

The problem is that topics and people naturally stray, and what starts
as a discussion about one Unicode-related topic ends up being about a
totally different one, or even something completely unrelated to
Unicode.  Recently a discussion about how Japanese furigana should be
encoded in Unicode mutated into a discussion about the history of
control codes.  This is called "topic drift," and it is not necessarily
bad, but it is usually difficult to control and would be much more so
if there were separate lists for CJK issues, Arabic issues, font issues,
technology (fonts/browsers/terminals), etc.

There is already a separate list called "unicore" where members discuss
proposals for new characters and scripts and other nuts-and-bolts
issues.  (BTW, how can I join that list?  Is it for Unicode members
only?)

 I put this idea here, because personally I only
 read unicode-list-mails related to CJK and technical
 issues. I believe many of you may face the same
 problem, and would like to receive only certain mails
 related to specialized topics.

The best solution is to scan the "Subject" line of messages and to use
your "delete" button on messages you don't care about.  I know this
sounds flippant every time someone says it, but experience shows it is
really the best way.  We can help by changing the "Subject" line of a
thread to reflect that the underlying topic has changed.

-Doug Ewell
 Fullerton, California



Re: Persian developers (was Re: Detecting installed fonts in ...

2000-07-12 Thread Michael \(michka\) Kaplan

The source for the Windows code pages is http://www.microsoft.com/globaldev !

This one is up at

http://www.microsoft.com/globaldev/reference/sbcs/1256.htm

michka


- Original Message -
From: "Roozbeh Pournader" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Cc: "Unicode List" [EMAIL PROTECTED]
Sent: Wednesday, July 12, 2000 6:53 AM
Subject: Re: Persian developers (was Re: Detecting installed fonts in ...




 On Wed, 12 Jul 2000, Michael (michka) Kaplan wrote:

  One thing they do is use the LATEST cp 1256, which includes the Farsi
  characters, so the hacks are not needed and then they would not have to
  move to Unicode, actually. I ran across several localizers who were
  willing to produce files in three formats:

 Would you please give me a link to the conversion table from the latest
 CP1256? The version I saw on the Unicode web site lacks:

 U+066B ARABIC DECIMAL SEPARATOR
 U+06A9 ARABIC LETTER KEHEH
 U+06C0 ARABIC LETTER HEH WITH YEH ABOVE
 U+06CC ARABIC LETTER FARSI YEH

 which are needed for Persian.

 --roozbeh






Re: Han character names?

2000-07-12 Thread John H. Jenkins

At 4:27 AM -0800 7/12/00, Michael Everson wrote:
At 10:23 -0800 2000-07-11, Jon Babcock wrote:

But covering the jiaguwen [J. koukotsumoji] (oracle bone script) is
another story.  First of all, it's a moving target.

Isn't it best treated as a font variant of CJK?


That's really an open question. We'd need to get a solid survey of
the oracle bone characters and their modern counterparts. One problem
is that a significant percentage of the former aren't identified (or
even identifiable) with modern characters.

--
=
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://www.blueneptune.com/~tseng



Re: C1 controls and terminals (was: Re: Euro character in ISO)

2000-07-12 Thread Frank da Cruz

 Frank da Cruz [EMAIL PROTECTED] wrote:

  ... If you send a code in the 0x80-0x9F range to such a terminal or
  emulator, it properly treats it as a control code.  If it was
  intended as a graphic character ("smart quote" or somesuch) the
  result is a fractured screen, sometimes even a frozen session.
 
 This is the widely reported compatibility problem between UTF-8 and
 terminals.  I know I read somewhere, possibly on Markus Kuhn's Unicode
 page, possibly somewhere else, that ISO 2022 codes exist to switch out
 of "ISO 2022 mode" and into "UTF-8 mode" and to either allow or prevent
 switching back to 2022.  Is there any progress on implementing this so
 terminals and emulators can live with UTF-8?
 
Maybe Markus can clarify.  I would be surprised if there's anything in
ISO 2022 about UTF-8, except that it does provide a way to switch out of
and back into ISO 2022 mode, allowing the use of character sets that do
not comply with ISO 2022 and 4873.  That's what the designating escape
sequences "with standard return" and "without standard return" are for.

But that's not quite the same thing.  There is no good reason why UTF-8
couldn't be used by (say) a VT320 emulator without switching out of the
ISO 2022 regime, except that UTF-8 contains C1 control codes as data.
This was discussed here a while back and "the other Markus" showed how
a C1-safe form of UTF-8 could have been designed:

  http://www.mindspring.com/~markus.scherer/utf-8c1.html

But, as they say, "it's too late now".  Therefore, those of us who want
to make use of UTF-8 within the ISO 2022 regime must reverse the layers.
First decode the UTF-8, then parse for escape sequences.  Of course your
emulator can get into awful trouble that way if the data stream isn't
really UTF-8.  But overall it's not that bad; we can live with it, and
in fact have done it this way in practice in our own emulator.
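
To illustrate the "reverse the layers" idea, here is a minimal sketch (not
the actual emulator code) that decodes UTF-8 first and only then scans the
resulting characters for controls and escape sequences:

    # Sketch of "decode first, then parse": the parser sees characters, so a
    # C1 control only counts as such if it decodes to U+0080..U+009F.
    def feed(raw_bytes):
        text = raw_bytes.decode("utf-8", errors="replace")   # layer 1: UTF-8 decoding
        for ch in text:                                       # layer 2: terminal parsing
            code = ord(ch)
            if code == 0x1B:
                print("ESC - start of an escape sequence")
            elif code < 0x20 or 0x80 <= code <= 0x9F:
                print("control: U+%04X" % code)
            else:
                print("graphic:", ch)

    # 0xC2 0x9B here is the UTF-8 encoding of U+009B (CSI), not a stray C1 byte.
    feed(b"\x1b[1mA\xc2\x9bB")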

- Frank




RE: Proposal to make the unicode list more transparent!

2000-07-12 Thread Mike Newhall

And what about using "on-topic" prefixes? E.g. (CJK), (Indic), (Fonts),
(BIDI), etc. This could be a big help for both manual and automatic
filtering. The actual "dictionary" of prefixes does not need to be formally
defined a priori: its maintenance could be and partially or totally
spontaneous (e.g.: one uses a new prefix and, if it is informative, others
will use it for next messages on the same topic).

_ Marco

This is, I think, a good idea.  If we informally agreed to a syntax, like
"use square brackets for the topic", then people could filter for things
like "[CJK]".  Actually, I suppose there's no reason to restrict it to one
subject; a single message about CJK fonts might use "[CJK][fonts]", so
really this could be almost a keyword list.  Also, I think it has been
good practice in the past to change the subject when there is enough drift,
BUT keep the previous topic for at least the first changed subject line, to
make the transition clear to those only scanning subjects.  So perhaps an
example subject line with all of the above would be:

Subject: [CJK][fonts] Where can I find a good Korean font? (was: Re:
[Arabic][fonts] Where can I find Arabic fonts?)
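
For what it's worth, filtering on such prefixes is easy to script; a
throwaway sketch, purely illustrative:

    # Throwaway sketch: pull "[tag]" keywords out of a Subject line for filtering.
    import re

    subject = "[CJK][fonts] Where can I find a good Korean font?"
    tags = re.findall(r"\[([^\]]+)\]", subject)
    print(tags)              # ['CJK', 'fonts']
    print("CJK" in tags)     # True -> keep the message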


Mike



Re: Persian developers (was Re: Detecting installed fonts in

2000-07-12 Thread Bob Hallissy




   Of these, only U+06A9 exists in the Windows CP1256, as can be
   demonstrated by using MultiByteToWideChar() API or by reading
   ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1256.TXT

   Bob Hallissy

   [As an interesting aside, the WideCharToMultiByte() API maps
   both U+06CC (FARSI YEH) and U+064A (YEH) to Windows character
   code 237 (xED). ]
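
   A quick way to reproduce this kind of check without the Win32 API is to
   round-trip the characters through a cp1256 codec; a minimal Python sketch
   (note that a given codec may implement a later revision of the code page
   than the one under discussion):

    # Round-trip test: which of these characters survive encoding to cp1256 and back?
    # The result depends on which revision of the CP1256 table the codec implements.
    chars = {
        "\u066B": "ARABIC DECIMAL SEPARATOR",
        "\u06A9": "ARABIC LETTER KEHEH",
        "\u06C0": "ARABIC LETTER HEH WITH YEH ABOVE",
        "\u06CC": "ARABIC LETTER FARSI YEH",
    }
    for ch, name in chars.items():
        try:
            byte = ch.encode("cp1256")
            back = byte.decode("cp1256")
            note = "maps to 0x%02X, decodes back as U+%04X" % (byte[0], ord(back))
        except UnicodeEncodeError:
            note = "not in the code page"
        print("U+%04X %-33s %s" % (ord(ch), name, note))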










   On Wed, 12 Jul 2000, Michael (michka) Kaplan wrote:

   One thing they do is use the LATEST cp 1256, which includes
   the Farsi characters, so the hacks are not needed and
   then they would not have to move to Unicode, actually. I ran
   across several localizers who were willing to produce files
   in three formats:

   Would you please give me a link to the conversion table from
   the latest CP1256? The version I saw on the Unicode web site
   lacks:

   U+066B ARABIC DECIMAL SEPARATOR
   U+06A9 ARABIC LETTER KEHEH
   U+06C0 ARABIC LETTER HEH WITH YEH ABOVE
   U+06CC ARABIC LETTER FARSI YEH

   which are needed for Persian.

   --roozbeh








Re: Persian developers (was Re: Detecting installed fonts in ...

2000-07-12 Thread Roozbeh Pournader



On Wed, 12 Jul 2000, Michael (michka) Kaplan wrote:

 http://www.microsoft.com/globaldev/reference/sbcs/1256.htm

That only adds KEHEH. I still lack:

U+066B ARABIC DECIMAL SEPARATOR
U+06C0 ARABIC LETTER HEH WITH YEH ABOVE
U+06CC ARABIC LETTER FARSI YEH

--roozbeh




Re: Persian developers (was Re: Detecting installed fonts in ...

2000-07-12 Thread Roozbeh Pournader



On Wed, 12 Jul 2000, Michael (michka) Kaplan wrote:

 I looked at two of the docs; it looks like they were using U+002C for
 the decimal separator even when they were using Unicode (I do not know
 how common that choice would be).

That's not good for typography. For Persian usages, U+002F (slash) is even
better than that. The slash is usually misused for that purpose when the
charset lacks the Persian decimal separator.

--roozbeh





Re: Persian developers (was Re: Detecting installed fonts in ...

2000-07-12 Thread Michael \(michka\) Kaplan

From: "Roozbeh Pournader" [EMAIL PROTECTED]
  I looked at two of the docs; it looks like they were using U+002C for
  the decimal separator even when they were using Unicode (I do not know
  how common that choice would be).

 That's not good for typography. For Persian usages, U+002F (slash) is even
 better than that. The slash is usually misused for that purpose when the
 charset lacks the Persian decimal separator.

I will forward that on to them (not knowing Farsi, I am only as good as the
localizer behind it all in these cases!).

michka




RE: Proposal to make the unicode list more transparent!

2000-07-12 Thread Frank da Cruz

 This is, I think, a good idea.  If we informally agreed to a syntax, like
 "use square brackets for the topic", then people could filter for things
 like "[CJK]".

This might sound silly, but some people still use ISO 646-based displays,
in which square brackets show up as umlauts, etc.  Parentheses are safer.

Also note that RFC 822 has included a Keywords: header for just this purpose
ever since 1982.

Anyway, all attempts to tame mailing lists generally fail so let's not waste
too much time on this.  After all, the relation of the Subject: line to the
body is only one of our problems.  Others include inappropriate (or non-)
tagging of character sets, Silly-MIME-Enclosure Syndrome, Hideous-Formatting
Syndrome, and Profligate-Quoting Syndrome.  But at least I don't recall
seeing any virus-bearing messages here yet...

- Frank :-]




Re: Persian developers (was Re: Detecting installed fonts in ...

2000-07-12 Thread Michael \(michka\) Kaplan

The ones that they were having trouble with were U+0649 and U+064A. I looked
at two of the docs; it looks like they were using U+002C for the decimal
separator even when they were using Unicode (I do not know how common that
choice would be).

michka


- Original Message -
From: "Roozbeh Pournader" [EMAIL PROTECTED]
To: "Michael (michka) Kaplan" [EMAIL PROTECTED]
Cc: "Unicode List" [EMAIL PROTECTED]
Sent: Wednesday, July 12, 2000 9:00 AM
Subject: Re: Persian developers (was Re: Detecting installed fonts in ...




 On Wed, 12 Jul 2000, Michael (michka) Kaplan wrote:

  http://www.microsoft.com/globaldev/reference/sbcs/1256.htm

 That only adds KEHEH. I still lack:

 U+066B ARABIC DECIMAL SEPARATOR
 U+06C0 ARABIC LETTER HEH WITH YEH ABOVE
 U+06CC ARABIC LETTER FARSI YEH

 --roozbeh






Re: Eudora?

2000-07-12 Thread Pete Resnick

On 7/12/00 at 8:42 AM -0800, Mark Davis wrote:

By the way, does anyone know if Eudora lets you read and write email 
with UTF-8?

The latest version of Mac Eudora lets you read UTF-8. If I can get my 
act together, the next version may let you write. I'm not sure what 
we'll be able to get into Windows for the next version.

pr
-- 
Pete Resnick mailto:[EMAIL PROTECTED]
Eudora Engineering - QUALCOMM Incorporated



Re: Han character names?

2000-07-12 Thread Asmus Freytag

At 12:56 PM 7/11/00 +, [EMAIL PROTECTED] wrote:

  If you bought a copy of the book, you would have known.

I saw 2.0 in the Barnes & Noble book store the other evening,
but they only had one left and it was a struggle to get to it through
the competing crowd...  Of course, they were competing to reach
the latest Harry Potter... and I did flip through 2.0.  It was mostly
useless, a picture book with uninteresting pictures.

Thanks for the endorsement, John. But... 2.0 is pretty out of date. BN is 
apparently more devoted to stocking the most recent Harry Potters than to 
stocking the most recent Unicode Standards. Wonder whether there's a
message there.

Now, if there were an on-line version that could be searched
and had accompanying fonts close at hand instead of those
aggravating PICTS/GIFs/JPEGs scattered about, then it'd
be useful.

If you have access to Win2K, you might try the Unibook character browser on
http://www.unicode.org/unibook

It also works with Win9x and NT4.0. In either case, the trick is to make 
sure the large Asian fonts and Arial Unicode MS are installed. On systems 
prior to Win2K you can get them via the Office 2000 or IE5 language packs 
etc. as described in many earlier postings on this list.

A./



Re: Euro character in ISO

2000-07-12 Thread Robert A. Rosenberg

At 04:27 AM 07/12/2000 -0800, Michael Everson wrote:
At 18:19 -0800 2000-07-11, Robert A. Rosenberg wrote:

 The problem would go away if the ISO would get their heads out of
 their a$$ and drop the C1 junk from the NEW 'TOUCHED UP" 8859s and
 put the CP125x codes there.

Excuse me, but that is not appropriate. The ISO/IEC 8859 series is
conformant with ISO/IEC 2022, and protocols which adhere to that standard
should not be compromised by what you suggest.

 Then when you said you used 8859-21 you'd get CP-1252 and Windows
 would no longer need to lie (or tell the truth by admitting it is
 CP-1252).

The problem is that some companies do/did not correctly identify their code
pages. The world can live with Latin-1 and CP-1252. It shouldn't have to
live with CP-1252 being identified as Latin-1.

Which is what I am saying when I talk about admitting that you are using
CP-1252, not ISO-8859-1, in your MIME/HTML headers, at least in the case
where there are glyphs in the 0x80-0x9F range in use. If a system can claim
US-ASCII when no codes in the 0x80-0xFF range appear and ISO-8859-1 otherwise
(as many MUAs do), it should have the smarts to claim CP-1252 if its scan
finds a 0x80-0x9F glyph.
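
The labelling heuristic I am describing would amount to something like this
(a sketch of the idea only, not anyone's actual MUA code):

    # Sketch of the charset-labelling heuristic described above:
    # pure ASCII -> us-ascii; any byte in 0x80-0x9F -> windows-1252; otherwise iso-8859-1.
    def guess_charset_label(data):
        if all(b < 0x80 for b in data):
            return "us-ascii"
        if any(0x80 <= b <= 0x9F for b in data):
            return "windows-1252"
        return "iso-8859-1"

    print(guess_charset_label(b"plain text"))        # us-ascii
    print(guess_charset_label(b"caf\xe9"))           # iso-8859-1
    print(guess_charset_label(b"price: \x80 100"))   # windows-1252 (0x80 is the euro in CP1252)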






Re: Euro character in ISO

2000-07-12 Thread Robert A. Rosenberg

At 08:56 PM 07/11/2000 -0800, Geoffrey Waigh wrote:
On Tue, 11 Jul 2000, Robert A. Rosenberg wrote:

  At 15:30 -0800 on 07/11/00, Asmus Freytag wrote about Re: Euro
  character in ISO:
 
  There has been an attempt to create a series of 'touched up' 8859
  standards. The problem with these is that you get all the issues of
  character set confusion that abound today with e.g. Windows CP 1252
  mistaken for 8859-1 with a vengeance:
 
  The problem would go away if the ISO would get their heads out of
  their a$$ and drop the C1 junk from the NEW 'TOUCHED UP" 8859s and
  put the CP125x codes there.

Except that would break all the systems that understand that C1 "junk,"
and a number of systems do so because they are adhering to other
ISO standards.  If you are going to force someone to change their
datastreams to something new, they might as well go to some flavour
of Unicode anyways.

Who is going to get broken if I say in my MIME header (or HTML) that my
CHARSET is (for example) ISO-8859-21? You are talking about uses where the
computer is talking to a device and needs the C1 range to tell it what to
do, not to another computer (where it is just passing a text stream). The C1
codes are DEVICE CONTROL and have no purpose (except to occupy slots that
are better used for extra GLYPHS) in EMAIL or HTML transfer. I am NOT
asking for anyone to change their mode of operation - only for ISO-8859-x
codes that are designed for transfer of printable data. UNICODE is not a
viable option, since all we are talking about is the ability to select from
a number of 256-codepoint 8-bit tables, not going over to UTF-8 or UTF-16
(which would require changes to the program code).


Geoffrey
"tilting at terminal emulators, err windmills."




Re: Euro character in ISO

2000-07-12 Thread Frank da Cruz

On Wed, 12 Jul 2000 10:43:59 -0800, Robert A. Rosenberg wrote:
 At 08:56 PM 07/11/2000 -0800, Geoffrey Waigh wrote:
 On Tue, 11 Jul 2000, Robert A. Rosenberg wrote:
   At 15:30 -0800 on 07/11/00, Asmus Freytag wrote:
   There has been an attempt to create a series of 'touched up' 8859
   standards. The problem with these is that you get all the issues of
   character set confusion that abound today with e.g. Windows CP 1252
   mistaken for 8859-1 with a vengeance:
  
   The problem would go away if the ISO would get their heads out of
   their a$$ and drop the C1 junk from the NEW 'TOUCHED UP" 8859s and
   put the CP125x codes there.
 
 Except that would break all the systems that understand that C1 "junk,"
 and a number of systems do so because they are adhering to other
 ISO standards.  If you are going to force someone to change their
 datastreams to something new, they might as well go to some flavour
 of Unicode anyways.
 
 Who is going to get broken if I say on my MIME header (or HTML) that my 
 CHARSET is (example) ISO-8859-21?

We go through this exercise about twice a year.  First, let's recognize
that ISO is not about to revoke Standards 4873 and 2022, so there's not
much point in suggesting it.  Second, think of a terminal that complies
with these standards.  A physical terminal such as a VT320.  I am using it
to access my mail host in text mode, and I'm reading mail with (say) Unix
'mail'.  The terminal does not interpret the MIME headers.  It doesn't
parse HTML.  It implements a very straightforward finite-state automaton
that realizes an ISO 2022-based terminal.  Unix 'mail' sends to my
terminal the bytes of the message, period.

Perhaps you're suggesting the Unix 'mail' should become a translation
agent between the character set of the mail and that of the user's
terminal?  I hope not, since given that practically any character set
anybody can dream up is "MIME-compliant" as long as it's tagged, then
every mail program must know how to convert from every character set in
existence to every other one.  Or is it the mail transfer agent?  Or both?
It's really quite a mess; let's not go out of our way to make it worse.

To understand the implications of using 8-bit character sets that contain
graphic characters in the C1 area FOR INTERCHANGE, imagine trying to do
the same thing to the C0 area.

- Frank




Re: Euro character in ISO

2000-07-12 Thread Frank da Cruz

 On Wed, 12 Jul 2000, Frank da Cruz wrote:
 
  Perhaps you're suggesting the Unix 'mail' should become a translation
  agent between the character set of the mail and that of the user's
  terminal?  I hope not, since given that practically any character set
  anybody can dream up is "MIME-compliant" as long as it's tagged, then
  every mail program must know how to convert from every character set in
  existence to every other one.
 
 Yes, it damn well should. And this is easy, as there is a standard Unix
 function that knows how to do this. (it's called iconv).
 
I'm logged into unix right now:

  $ iconv
  bash: iconv: command not found
  $

How standard can it be?  And what about VMS, VM/CMS, VOS, OS/390, OS/400,
Tandem, and all the others?

How does the mail client know what character set my terminal has?

Anyway, between you and me, there are potentially lots of places where
character-set conversion can occur.  Your mail client, your MTA, my MTA,
my mail client, my Telnet server, my Telnet client, my terminal emulator.
Let's think carefully about this before we have random combinations of
these clients, agents, and servers stepping on each others' toes.

- Frank






Re: Han character names?

2000-07-12 Thread John Cowan

Michael Everson wrote:
 
 At 10:23 -0800 2000-07-11, Jon Babcock wrote:
 
 But covering the jiaguwen [J. koukotsumoji] (oracle bone script) is
 another story.  First of all, it's a moving target.
 
 Isn't it best treated as a font variant of CJK?

Partly so.  But only about 30% of the jiaguwen are unifiable with known
modern hanzi.

-- 

Schlingt dreifach einen Kreis um dies! || John Cowan [EMAIL PROTECTED]
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,   || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.-- Coleridge (tr. Politzer)



Qur'an Arabic eBook port to PalmOS related MISC.

2000-07-12 Thread Akil Fahd

I'm trying to create a bilingual and bi-directional (Arabic and English
Qur'an) e-Book that will be compliant with the Open eBook (OEB) specification.
This is targeted at the PalmOS, but should be renderable in XML and/or
XHTML compliant browsers such as IE 5.0 and Netscape 6.0, or any type of Open
eBook reader.

I already have HTML files of the entire Qur'an in Arabic and English -
though I will have them proofread many times before I distribute the
completed eBook.

The Arabic pages are coded using the win-1256 (Arabic) codepage in the
following manner:

<HTML DIR=RTL>

<head>

<META content="text/html; charset=windows-1256" http-equiv=Content-Type>

<body>

<p align="right">

<font face="Traditional Arabic">

<font size="5pt">

These pages show up fine (correct font and directionality) when using the IE
5.0 browser; however, when I convert them to the PalmOS, the right-to-left
directionality is lost.

In order to convert the HTML pages to the OEB eBook format I'm using the
MobiPocket Publisher (home page
http://www.mobipocket.com/en/HomePage/default.asp), which creates a prc file
from the HTML files.

In order to test the conversion to the PalmOS, I'm using the PalmOS Emulator 
(running a 3.5 Palm OS IIIc rom) with the APOS 2.0 (home page 
http://www.arabicpalm.com/) and Mobipocket Reader software installed.

The above setup is being tested on Windows 98 (Arabic Enabled Edition) and 
Windows 2000 PCs.

The prc files created using this method display the Arabic font on the
emulator's Palm IIIc screen (when using the MobiPocket reader); however, the
correct direction is not enforced.

Please note that Arabic and English text are coded with separate html files.

My questions are as follows:

How can I convert from cp 1256 to Unicode, without doing it character by
character? Is there software that will do this?
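
Any character-set conversion tool can do this wholesale; as one illustration,
a minimal Python sketch (file names are placeholders) that re-encodes a
windows-1256 HTML file as UTF-8:

    # Minimal sketch: convert a windows-1256 encoded file to UTF-8 in one go.
    # File names are placeholders; remember to update the charset in the <META> tag too.
    def convert_1256_to_utf8(src_path, dst_path):
        with open(src_path, "rb") as src:
            text = src.read().decode("windows-1256")
        with open(dst_path, "wb") as dst:
            dst.write(text.encode("utf-8"))

    convert_1256_to_utf8("quran_arabic.html", "quran_arabic_utf8.html")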

Does the eBook spec allow for the nesting of a right-to-left language
(Arabic) inside of a left-to-right language (English) on the same page?

Does anyone know if APOS is Unicode compliant?

Any advice or examples would be greatly appreciated, as I have not found any
examples of how to nest languages (with different text and directionality)
within the Palm doc or prc formats.

Akil


Get Your Private, Free E-mail from MSN Hotmail at http://www.hotmail.com




Re: Euro character in ISO

2000-07-12 Thread Rick McGowan

 There are lots of Unixes:
   http://www.columbia.edu/kermit/unix.html
 How many of them have an iconv function?

rangda 47: man iconv
man: no entry for iconv in the manual.
rangda 48: cat /etc/motd
Welcome to Darwin!
rangda 49: well, hmmm...
zsh: command not found: well,
rangda 50: 



Re: Subset of Unicode to represent Japanese Kanji?

2000-07-12 Thread foster . feng

I am NOT a Unicode expert but I am a Japanese speaker. Here are my 2 cents:

A Japanese document must consist of:

hiragana: less than 100 characters
katakana: less than 100 characters
kanji: basic kanji has 6,879 characters as defined in JIS X 0208-1990
  extended kanji has 6,067 characters as defined in JIS X 0212-1990

The extended kanji are rarely used -- less than 1% of daily newspaper text. The
Microsoft-developed Shift-JIS encoding supports hiragana, katakana, and basic
kanji, but not extended kanji.
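
One way to see the practical effect of this split is simply to test whether a
character can be encoded in Shift-JIS at all; a rough sketch (illustrative
only, using a present-day codec):

    # Rough check: can a character be encoded in Shift-JIS (i.e., is it in the
    # JIS X 0201/0208 repertoire)?  JIS X 0212 "extended kanji" will fail here.
    def in_shift_jis(ch):
        try:
            ch.encode("shift_jis")
            return True
        except UnicodeEncodeError:
            return False

    for ch in ("\u3042",   # HIRAGANA LETTER A
               "\u6F22",   # a basic (JIS X 0208) kanji
               "\u9DD7"):  # the classic example of a kanji missing from JIS X 0208
        print("U+%04X in Shift-JIS: %s" % (ord(ch), in_shift_jis(ch)))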

Technically, a Japanese document can be written in all Roman characters, but
this is not a true Japanese document. It is very difficult to read and it leads
to ambiguity and misunderstanding. It was only used back in the Telex days, when
people had no choice.

Foster Feng
Programmer/analyst
MIS Department
National Instruments





Otto Stolz [EMAIL PROTECTED] wrote on 2000/07/12 07:41:35:




 The Japanese I must support is the Kanji form. [...] I cannot support
 Unicode in its entirety due to memory constraints.

 If I am not mistaken, Kanji are ideographic characters, which would take
the lion's share of memory to implement. Probably, you have to support
kana (hiragana or katakana).

I do not know Japanese, so others may jump in.

Best wishes,
   Otto Stolz







Miscellaneous comments/questions.

2000-07-12 Thread Alex Bochannek

Hi!

I just returned from a lengthy trip through parts of Europe and
thought I'd mention some observations.

In Greece, I noticed that almost all signs used monotonic Greek. I saw
some older road signs and a couple of store signs that used polytonic
Greek, but according to a Greek acquaintance, everybody is very happy
to not have to deal with it anymore. When did the switch actually
happen? He claimed it was only about a decade ago?

What was interesting to see was how the printing of the tonos
varied. For the most part it did look like a steeper acute as
described in Chapter 7.2 of Unicode 3. A number of times, I did see a
variation though which looked more like, e.g., U+03B1 U+0307, but I
suspect that to be just a font style. I also noticed that frequently,
certain characters are written in variants which at first were
completely indecipherable to me. I especially recall the beta
(U+03D0), theta (U+03D1), and maybe pi (U+03D6) as well as the
upper-case upsilon (U+03D2). As someone who learned classical Greek in
school, it added to the problems I already had with the modern
pronunciation of a lot of the letters ;-)

One thing I found very confusing was the mixing of Latin and Greek
script which is very common on billboards. A couple of times I found
myself unable to tell whether a word was spelled in Latin or Greek
since it only used glyphs which both scripts share and hence I could
not derive the proper pronunciation at first. It was interesting to
see some brand name products and proper names transcribed while
sometimes Latin script is used in mid-sentence for foreign words.

A similar issue was very interesting to observe in France and
Germany. The use of the English language in advertisement seems to run
rampant in Germany while almost all ads that include English in France
(mostly tag lines) are followed by an asterisk and the literal French
translation somewhere near the edge of the sign. At first I thought it
was somewhat silly but when I saw how the German language currently is
absorbing English words like a sponge, the footnotes seemed to make
sense.

While in Germany, I bought a children's book that was first published
in 1921 and used a simplified Fraktur. As a native German, I had no
problems reading it, but for my wife who doesn't have German as her
native language, the long-s did throw her off at first. After I
explained the logic behind it, it was a lot easier, but she did make a
good point as to why it isn't used in the "sp" digraph. Maybe Otto can
shed some light on this?

In looking at older Fraktur text, it was very interesting to see how
foreign words are set in an Antiqua font similar to how in English
text foreign words are often in italics (and similar to the use of
Latin script in Greek above.) This brings up a font question I have
been wondering about for a long time: How similar are typesetting
features of fonts across different scripts? It seems that most
European scripts have print and cursive versions (I saw some beautiful
cursive signs in Greece), serifs and mono-spaced fonts, and boldness
and slant seems to be common as well. But what about other scripts? It
seems that all(?) scripts currently represented in Unicode have at
least some typographical tradition albeit only scholarly in some
cases. How many of the features overlap, i.e., how much sense
does it make to define a serif font for CJK scripts? What about
italics in Arabic? Can there be a font family which covers all the
scripts in Unicode and which complies with the local typographic
esthetics? I apologize for the glyph-centric nature of the question
;-)


Two other topics of discussion that came up in recent weeks were very
interesting to me: Time zones and location names. The latter was
something I have been curious about myself for a while. It is true
that in Germany for example, rarely the state (Bundesland) is
indicated when referring to a location. When ambiguity arises,
regional names or other landmarks are used to distinguish, sometimes
to the point of becoming part of the name. Examples: Hamm (Westfalen)
and Frankfurt am Main versus Frankfurt an der Oder. Even more
interesting to me though would be the local name of places and I would
love to find a World Atlas that first indicates every location's name
in the local language and script, then the accepted Latin
transliteration, and finally the name in English (or, say, German, if
published in Germany.) Are the large publishing houses equipped to
produce something like this? Or more importantly, would they use
Unicode for it? What about smaller printers (like for business cards?)

The other issue that was brought up about time zones is fascinating. A
while ago, when I was looking into locale issues, it occurred to me
that there really needs to be a comprehensive database of "cultural
defaults." For extensive localization, you need to know more than just
date format, language, and script (OK, I am oversimplifying the extent
of the locale information.) 

RE: Euro character in ISO

2000-07-12 Thread Chris Wendt

The trick is HTML4.

Since you sent the message in HTML format, the Euro is encoded as a numeric
character reference. Exchange knows how to decode HTML and generate RTF,
depending on what your email client needs.

If you had sent plain text, the Euro would have turned into ?. As is the
case in the plain text part of the multipart message.

This is the case for Outlook Express 5. Older versions of OE treated
Windows-1252 and iso-8859-1 the same.

Here is the source of the message from my Outlook Express Sent Mail folder.
(To see the source, open message and press Ctrl-F3).

From: "Chris Wendt" [EMAIL PROTECTED]
To: "Chris Wendt" [EMAIL PROTECTED]
Subject: Euro test
Date: Wed, 12 Jul 2000 15:17:49 -0700
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="=_NextPart_000_0005_01BFEC14.57202A10"
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.50.4133.2400
X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400

This is a multi-part message in MIME format.

--=_NextPart_000_0005_01BFEC14.57202A10
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

abcdef ? abcdef

--=_NextPart_000_0005_01BFEC14.57202A10
Content-Type: text/html;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META content=3D"text/html; charset=3Diso-8859-1" =
http-equiv=3DContent-Type>
<META content=3D"MSHTML 5.00.3103.1000" name=3DGENERATOR>
<STYLE></STYLE>
</HEAD>
<BODY bgColor=3D#ff>
<DIV><FONT color=3D#008000 face=3DVerdana size=3D2>abcdef &#8364;=20
abcdef</FONT></DIV></BODY></HTML>

--=_NextPart_000_0005_01BFEC14.57202A10--


-Original Message-
From: Leon Spencer [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 12, 2000 2:38 PM
To: Unicode List
Subject: RE: Euro character in ISO


Is Microsoft playing tricks in MS Outlook or IE?
If I send text from Outlook Express to my exchange
account, with charset set to iso-8859-1 but containing
the Trademark symbol ((tm)) in the body, it shows up
okay. The body of the message is in text/html.

Is it possible that MS Outlook's HTML ActiveX control
(which I'm assuming to be the same used for IE) is
defaulting to Cp1252/Windows-1252 when it sees iso-8859-1?

Leon

BTW, The body also contains the Euro!



Re: Subject lines in UTF-8 mssgs? [was: Proposal to make ...]

2000-07-12 Thread Christopher J. Fynn


"Jaap Pranger" [EMAIL PROTECTED] wrote:

 At 16:44 +0200 2000.07.12, [EMAIL PROTECTED] wrote:

 Everybody (beginning by myself!) should probably be more careful 
 in naming subject lines, and renaming them when a reply deviates 
 from the subject.
 
 Marco,
 
 This wil not help very much when you send UTF-8 messages. Your 
 Subject lines in those messages show up completely "garbled", at 
 least in my non-UTF-8-aware email client. OK, that's my problem. 
 But mostly other people's UTF-8 messages show 'neat' Subject headers.  
 What's going on, why this difference? 
 
 Jaap
 
In Outlook Express, under Tools > Options > Send > International Settings
it is possible to specify that only English (? ASCII) is used in headers,
and under Tools > Options > Send > Plain Text Settings and Tools > Options >
Send > HTML Settings it is possible to specify whether or not 8-bit
characters may be used in message headers.

These settings seem to apply whatever encoding is used for the body
of the message.

- Chris




Re: Eudora?

2000-07-12 Thread Piotr Trzcionkowski





  
By the way, does anyone know if Eudora lets you read and
write email with UTF-8?

   The latest version of Mac Eudora lets you read UTF-8. If I
   can get my act together, the next version may let you write. I'm not
   sure what we'll be able to get into Windows for the next version.

Is it the default encoding? What about other IANA encodings?

Is it able to produce a structured text/html or text/xml part
in multipart/alternative messages, or on its own?


RE: Subject lines in UTF-8 mssgs? [was: Proposal to make ...]

2000-07-12 Thread Chris Wendt

From: Christopher J. Fynn [mailto:[EMAIL PROTECTED]]
In Outlook Express, under Tools > Options > Send > International Settings
it is possible to specify that only English (? ASCII) is used in headers

This is relevant when you are running with a non-English OS locale. It will
prevent entering non-usascii characters for day and month names in the reply
header so as to not force you to send in UTF-8 in case you write in a
different script than the OS locale is.

and under Tools > Options > Send > Plain Text Settings and Tools > Options >
Send > HTML Settings it is possible to specify whether or not 8-bit
characters may be used in message headers.

This does not prevent non-US-ASCII characters in the header. It only decides
whether the non-US-ASCII characters will be RFC 1522 encoded or sent as raw
8-bit bytes - each in the chosen encoding.
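
For the curious, the RFC 1522 "encoded-word" form looks like this; a small
sketch using Python's email.header module (my example, not what Outlook
Express actually does internally):

    # Sketch of the RFC 1522/2047 "encoded-word" form used for non-ASCII headers,
    # versus sending the same header as raw 8-bit bytes in a chosen encoding.
    from email.header import Header

    subject = "Gr\u00f6\u00dfe \u20ac"
    print(Header(subject, "utf-8").encode())   # =?utf-8?b?...?=  (encoded-word form)
    print(subject.encode("utf-8"))             # raw 8-bit bytes, undeclared in the header itself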

These settings seem to apply whatever encoding is used for the body 
of the message.

Yes, correct.



RE: correction (was: Not all Arabics are created equal...)

2000-07-12 Thread Gregg Reynolds

Again:  the writing protocol (or algorithm) does not matter.  Look at the
many ways I can write the number four thousand two hundred fifty seven:

The conventional way:
4
42
425
4257

"Backwards":
   7
  57
 257
4257

"Evens first, forwards":
 2
 2 7
42 7
4257

"Odds first, backwards":
  5
4 5
425
4257

"Evens first, forwards, then odds, backwards":
 2
 2 7
 257
4257

etc. etc. etc.  We can run through the same exercise in any language.  The
outcome is always the same.  What counts (no pun intended) is the
mathematical rule of evaluation, which says that the LSD position is ones,
the next over is tens, then hundreds, etc.  In English and most European
languages, the MSD, as defined by the mathematical rule of evaluation, comes
first in reading order, and "first in reading order" in English means to the
left of the other figures.  In Arabic, and Persian, and Urdu, etc., "first
in reading order" means to the right of the remaining figures, and that
means the LSD.  "Reading order" means typographically, on the page, and not
verbally; don't forget that the figures on the page denote numbers, not
words, so pronouncing the words that represent the same number should not be
construed as a reading of the figures, but of their meaning.  So although
Persian written forms are LSD first, the spoken translation is MSD first.

The key point is that a mathematical modeling of written language (which is
what Unicode amounts to) should model the semantics of written forms, and
not the protocols/algorithms of putting ink on paper or emitting sounds into
the air.

I suspect the audience has become thoroughly bored by now, so if you'd like
to continue the conversation maybe we should do so privately.

Sincerely,

Gregg


 -Original Message-
 From: Roozbeh Pournader [mailto:[EMAIL PROTECTED]]
 Sent: Wednesday, July 12, 2000 7:58 AM
 To: Unicode List
 Cc: Unicode List
 Subject: Re: correction (was: Not all Arabics are created equal...)




 On Wed, 12 Jul 2000, Gregg Reynolds wrote:

  But in any case, this doesn't change the main point:  Persian may be
  spoken MSD-first, but its written forms are LSD-first.

 No. Except when doing addition etc. (just as in English), Persian
 numbers are written MSD-first. When I (and any other Persian speaker I
 know) try to write something like "I have 12 books", which is
 "man 12 ketaab daaram" in Persian, I write it in this fashion:

   M
  AM
 NAM
  1  NAM
  12 NAM
K 12 NAM
   EK 12 NAM
 ...
MARAAD BAATEK 12 NAM

 This means that Persian is also written MSD-first.

 --roozbeh






BTW, Anyone working with MS JVM AND Unicode?

2000-07-12 Thread Leon Spencer

BTW, anyone working with the MS JVM AND Unicode?  I'd like to
override the core ByteToChar Unicode classes used by the MS JVM.

Currently, I'm modifying the TrustedClasspath so my modified sun.io
package can be loaded first.

Is there some way to get rid of the MS JVM's ByteToChar classes
altogether?
 
Leon