[no subject]

2016-06-08 Thread David Faulks
Hello,

Just a question here.

The Zodiac sign Capricorn has an alternate Glyph/Symbol (see below):
http://www.capricornzodiacsign.net/capricornsymbol.htm

It is only vaguely similar to the glyph found in the Unicode charts and 
astrological sites, and sometimes astrological software offers a choice between 
the two.

Since every font I have checked on my computer uses a glyph close to the 
Unicode charts (if it has Zodiac symbols at all), I am thinking that it 
might be best to propose this as a separate character.

Is this a good idea? 

Also, Zodiac signs right now have Emoji representations. Would I have to submit 
this as an Emoji rather than a symbol? Would I have to make up a coloured Emoji 
Glyph?

Thanks for any responses.

David Faulks


[no subject]

2015-03-28 Thread verdy_p
[Note: message resent using another domain. Apparently the Unicode mailing 
list rejects as spam all emails posted from Gmail's webmail, even though they 
contain all relevant tracking MIME headers and are regularly signed by Google 
with my proven identity].

2015-03-28 12:30 GMT+01:00 Michael Norton :

 Thanks Doug.  I did not know there exists a representative sample of the 
 world's text. :)
 I do know that 400 years ago there were about 10,000 languages; now there are 
 about 6,500.
 Time flies!  

 Your frequency chart is great. The average char appearance is 2.91%. Only 34% 
 from your list exceed 10% of it.
 Therefore, U+0020 is the elephant in the room (i.e. 15.05% is far above 2.91%).
 In fact, it's almost 50% greater than the next most-appearing character.

 So from the two frequency lists you've given me (my email and yours) we begin 
 to see some patterns emerge.
 Provided prior data and observation, most useful patterns prevail over other 
 more obscure ones
 and present a provocative opportunity for webbers out there...
 
 While this is probably out of context for most of the 700 Unicode members, I 
 can report that it's good news.

A long time ago I learned a word (or is it an acronym? It's not really an 
abbreviation by itself, even if it is pronounceable) used by French 
cryptanalysts (working on simple substitution ciphers): ESARTINULOC (some 
older sources gave ESANTIRULO). It is the list of the basic letters most 
frequently used in French, in decreasing order of frequency (ignoring case 
and diacritic differences). It's also used implicitly by gamers (e.g. when 
playing or composing crosswords, or playing games such as Scrabble(TM), where 
the top letters of the list have lower scoring values, which differ between 
French Scrabble and English Scrabble).
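
As a rough illustration of how such a frequency word can be derived, here is 
a minimal Python sketch. It assumes that "ignoring case and diacritic 
differences" means lowercasing, canonically decomposing, and keeping only 
letters; the function names are made up for this example, and a real count 
would of course need a large corpus rather than a short string.

```python
import unicodedata
from collections import Counter


def base_letters(text: str) -> str:
    """Lowercase, decompose (NFD), and keep only letters.

    Combining marks are not alphabetic, so accents are dropped here.
    """
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if ch.isalpha())


def letters_by_frequency(text: str) -> str:
    """Return the distinct base letters, most frequent first.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(base_letters(text))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return "".join(letter for letter, _ in ranked)


sample = "Été à l'école"
ranking = letters_by_frequency(sample)
```

Letters that have no canonical decomposition (such as the French ligature 
œ) are counted as themselves in this sketch; a language-specific table 
would be needed to expand them.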

That word is slightly different in English, or in the limited global 
counting Doug did (over an extremely limited set of source texts); but of 
course in French the SPACE would also lead the 
list before that word (but that does not enter into account for crosswords or 
Scrabble, even in languages that don't use spaces for word separation).

More accurate statistics may be found using statistics collected by databases 
with plain-text search capabilities (in the structure of their index), provided 
they correctly track the language used and their data covers a large enough 
set of domains (e.g. statistics of plain-text search engines for each 
**localized** edition of Wikipedia, Wiktionary, or Wikisource). If you want 
global statistics it will be more difficult (Wikimedia Commons is 
insufficiently translated, with too strong a presence of English), but what 
you may do is estimate the rate of usage of each main language (or 
macrolanguage) and weight the statistics collected for each language to 
return an estimated global frequency list.

But be careful: each language has its own set of collation rules, such that 
letters that are considered to have the same primary weight in one language 
are distinguished and counted separately in some other language. You may find 
that a source ü or ä had its rate actually computed as UE or AE in German, 
but only as U or A in English or French, and this will not allow you to 
correctly estimate the global frequency rates of U, A and E. A simple linear 
mathematical transform (scalar products of usage rates of languages and 
usage rates of letters per language) would not work: the global usage rate 
of E would be underestimated where it also represents the German umlaut, and 
both U and A would be overestimated...
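
The weighting described above, and the reason it breaks down, can be 
sketched as follows. This is only an illustration under stated assumptions: 
the language shares and letter rates are invented numbers, the function 
names are hypothetical, and the two counting conventions stand in for 
whatever collation rules a given index really applies.

```python
import unicodedata
from collections import Counter


def global_frequency(letter_rates, language_shares):
    """Weighted sum over languages: share[lang] * rate[lang][letter]."""
    total = {}
    for lang, share in language_shares.items():
        for letter, rate in letter_rates[lang].items():
            total[letter] = total.get(letter, 0.0) + share * rate
    return total


# Invented numbers, for illustration only.
shares = {"de": 0.5, "fr": 0.5}
rates = {"de": {"e": 0.2}, "fr": {"e": 0.1}}
estimate = global_frequency(rates, shares)  # e is estimated at about 0.15

# Two conventions for counting the same text (hypothetical helpers).
GERMAN_EXPANSIONS = str.maketrans({"ü": "ue", "ä": "ae", "ö": "oe"})


def count_german_style(text):
    """Count ü, ä, ö as UE, AE, OE."""
    return Counter(ch for ch in text.lower().translate(GERMAN_EXPANSIONS)
                   if ch.isalpha())


def count_base_style(text):
    """Count ü, ä, ö as plain U, A, O (diacritics dropped)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return Counter(ch for ch in decomposed if ch.isalpha())


german = count_german_style("für")  # f, u, e, r
plain = count_base_style("für")     # f, u, r
```

The same word contributes an E under one convention and none under the 
other, so per-language rates computed under different conventions cannot 
simply be added together with weights.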

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Is the Subject field of an e-mail an obvious example of plain text where no higher level protocol application is possible?

2012-07-21 Thread Martin J. Dürst

Hello Karl,

On 2012/07/21 0:41, Karl Pentzlin wrote:

Looking for an example of plain text which is obvious to anybody,
it seems to me that the Subject field of e-mails is a good example.
Common e-mail software lets you enter any text but gives you never
access to any higher-level protocol. Possibly you can select the font
in which the subject line is shown, but this is completely independent
of the font your subject line is shown at the recipient.
Thus, you transfer here plain text, and you can use exactly the
characters which either Unicode provides to you, or which are PUA
characters which you have agreed upon with the recipient before.

In fact, the de-facto-standard regulating the e-mail content (RFC 2822,
dated April 2001 http://www.ietf.org/rfc/rfc2822.txt , afaik)


No. If you go to http://tools.ietf.org/html/rfc2822, you'll see
"Obsoleted by: 5322, Updated by: 5335, 5336".
RFC 5322 is the new version, dated October 2008, but doesn't change much.
RFC 5335 and 5336 are experimental, for encoding the Subject (and a lot 
of other fields) as raw UTF-8 if the email infrastructure supports it. 
There are Standards Track updates for these two, RFC 6531 and 6532.


But what's more important for your question, at least in theory, is 
http://tools.ietf.org/html/rfc2231, which defines a way to add language 
information to header fields such as Subject:. With such information, it 
would stop being plain text.


In practice, RFC 2231 is not well known, and even less used, so except 
for detailed technical discussion, your example should be good enough.


Regards,   Martin.










Is the Subject field of an e-mail an obvious example of plain text where no higher level protocol application is possible?

2012-07-20 Thread Karl Pentzlin
Looking for an example of plain text which is obvious to anybody,
it seems to me that the Subject field of e-mails is a good example.
Common e-mail software lets you enter any text but never gives you
access to any higher-level protocol. Possibly you can select the font
in which the subject line is shown, but this is completely independent
of the font in which your subject line is shown to the recipient.
Thus, you transfer here plain text, and you can use exactly the
characters which either Unicode provides to you, or which are PUA
characters which you have agreed upon with the recipient before.

In fact, the de-facto-standard regulating the e-mail content (RFC 2822,
dated April 2001 http://www.ietf.org/rfc/rfc2822.txt , afaik)
defines the content of the Subject line as unstructured (p.25),
which means that it has to consist of US-ASCII characters, which in
turn can denote other (e.g. Unicode) characters by the application of
MIME protocols. Thus, the result is an unstructured character
sequence.

There is e.g. no possibility to include superscripted/subscripted
characters in a Subject of an e-mail, unless these characters are
in fact included as superscript/subscript characters in Unicode
directly.

Thus, proving the necessity to include a character in the text of a
Subject line of an e-mail, is proving that the character has to be
available as a plain text character. If, additionally, the character
is used outside a closed group (which can be advised to use PUA
characters), then there is a valid argument to include such a
character in Unicode.

Is my assumption correct?

(I think of the SUBSCRIPT SOLIDUS proposed in WG2 N3980.
 It is in fact annoying that you cannot address DIN EN 13501
 requirements in an e-mail subject line written correctly,
 as Unicode, although being an industry standard, until now
 did not listen to an industry request at this special topic.)

- Karl




Re: Is the Subject field of an e-mail an obvious example of plain text where no higher level protocol application is possible?

2012-07-20 Thread Philippe Verdy
The Subject field is subject to special encoding like
Quoted-Printable or Base64 using specific prefixes. This is necessary
because the MIME headers specifying the mail encoding apply only to
the mail body, not to the headers themselves.

For this reason it is not strictly plain text.

Additionally it has specific formatting conventions related to the use
of spaces and continuation lines if needed.

Not all mail reader agents will recognize the Quoted-Printable or
Base64 signatures found in these headers (notably in: subject, from,
to), but most now actually decode them properly, provided that the
prefixes specify a supported charset. UTF-8 is one of those
charsets that will be most frequently recognized, but ISO-8859-1 is
still much more often recognized. For Chinese or Japanese, UTF-8 is
rarely used.

There's no way to specify a font to render the encoded characters.
When the headers contain 8-bit byte values, there's some assumption
that they will be decoded with the encoding found or specified in
the mail body, but this is unreliable.
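
The "specific prefixes" mentioned above are the encoded-word syntax of 
RFC 2047 (=?charset?Q?...?= or =?charset?B?...?=). As a small sketch of the 
round trip, assuming Python's standard email.header module and an invented 
subject line:

```python
from email.header import Header, decode_header, make_header

# Invented example subject containing a non-ASCII letter.
subject = "Prüfung nach DIN EN 13501"

# Encode for transport: the result is pure ASCII, made of one or more
# encoded words carrying the charset name and a Q or B marker.
wire_form = Header(subject, charset="utf-8").encode()

# Decode on the receiving side: split into (bytes, charset) chunks,
# then rebuild a single Unicode string.
chunks = decode_header(wire_form)
round_tripped = str(make_header(chunks))
```

The text that the recipient sees is the same character sequence that the 
sender typed; only the transport form differs, which is the point made in 
the follow-ups about encoding versus plain-text status.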




Re: Is the Subject field of an e-mail an obvious example of plain text where no higher level protocol application is possible?

2012-07-20 Thread Asmus Freytag

On 7/20/2012 8:41 AM, Karl Pentzlin wrote:

Looking for an example of plain text which is obvious to anybody,
it seems to me that the Subject field of e-mails is a good example.


By common convention, certain notational features have been relegated to 
styled text. Super and subscript in mathematical, chemical and other 
notation belongs to that class.


There have been occasional calls to add certain explicit characters, but 
they have been either rejected or met with such chilly response on 
preliminary inquiry that no formal submission was ever made.


Subscript and superscript are essential features of such a notation, but 
most people can live with not having access to the full notation in 
the subject line. (No mathematician expects to be able to place a fully 
built-up equation there, even if his software supports plain text math, 
as defined in UTN#28).


A much stronger case than subject lines are regulatory databases with 
plain-text fields in their records. A German company had approached 
Unicode with the problem that even the in-line formulas for chemical 
compounds needed a few subscript characters beyond digits, in particular 
the Greek letters alpha, beta and gamma (not the whole alphabet).


That request died before being taken up by the committee.

I have no idea how that industry solved their problem, after all, the 
regulatory mandate didn't disappear. However, as it stands, the de-facto 
precedent is to not accommodate such usage by coding characters. The 
situation with DIN EN 13501 seems to be entirely equivalent, in fact I 
find it less likely that a subject line, to be intelligible and specific 
would require the particular character in question than the letters 
needed to be able to write a full chemical formula (in the style of 
C₂H₆O). People just make do, writing C2H6O etc. (check chemical formula 
of alcohol on google, to see what I mean). [Some organic compounds also 
use Greek letters, I don't have an example, not being a chemist.]
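
To make the C₂H₆O versus C2H6O example concrete: the subscript digits that 
Unicode does encode carry compatibility decompositions to ordinary digits, 
so compatibility normalization produces exactly the "make do" spelling. A 
minimal sketch using Python's standard unicodedata module:

```python
import unicodedata

# Formula written with the encoded subscript digits U+2082 and U+2086.
styled = "C\u2082H\u2086O"

# NFKC applies compatibility mappings, turning subscript digits
# into ordinary ASCII digits.
plain = unicodedata.normalize("NFKC", styled)

# Character names, for reference.
names = [unicodedata.name(ch) for ch in styled if not ch.isascii()]
```

This also illustrates the dual representation issue mentioned below: the 
same formula can exist in two spellings that a plain-text search treats as 
different unless it normalizes first.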


If the users for which such near plain text notations are part of 
their daily work were to report that subject lines, database plain 
text fields and other such bottlenecks are causing serious issues, then 
I think Unicode and WG2 should listen carefully. However, this should be 
something that's broadly anchored in those user communities. Let them 
demonstrate that there's a real practical need that outweighs the dual 
representation issue.


A./



Re: Is the Subject field of an e-mail an obvious example of plain text where no higher level protocol application is possible?

2012-07-20 Thread Jukka K. Korpela

2012-07-20 19:52, Philippe Verdy wrote:


The Subject fi[el]d is subject to special encoding like
Quoted-Printable or Base64 using specific prefixes.


This is a matter of character encoding. All plain text inevitably has 
some encoding, and the encoding may vary without changing the plain text 
status. Admittedly, QP and Base64 may be interpreted as being a 
higher-level protocol, but they can be applied to any plain text, and I 
don’t think this changes plain text to non-plain.



Additionally it has specific formatting conventions related to the use
of spaces and continuation lines if needed.


This is a real deviation from plain text principles and applies to 
e-mail message headers in general. As per clause 2.2.3 of RFC 2822, the 
header is logically a single line but may contain CR LF, which will be 
unfolded.
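
As a sketch of the unfolding rule in RFC 2822 clause 2.2.3 (remove any CRLF 
that is immediately followed by white space), assuming the header is 
available as a string with CRLF line ends; the function name is made up 
for this example:

```python
import re


def unfold(header_text: str) -> str:
    """Remove each CRLF that is immediately followed by a space or tab.

    The white space itself is kept, so the logical line reads normally.
    """
    return re.sub(r"\r\n(?=[ \t])", "", header_text)


folded = "Subject: This\r\n is a test"
logical = unfold(folded)
```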


Yucca





RE: Is the Subject field of an e-mail an obvious example of plain text where no higher level protocol application is possible?

2012-07-20 Thread Shawn Steele
A) it can use quoted-printable
B) See RFC 6532/6530 - Now it can be UTF-8 :)
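
A small sketch of the difference between the two options, assuming Python's 
standard email package (its SMTPUTF8 policy follows RFC 6532, its SMTP 
policy falls back to encoded words); the helper name and the sample subject 
are invented for the example:

```python
from email import policy
from email.message import EmailMessage


def build(subject, mail_policy):
    """Serialize a tiny message under the given policy (hypothetical helper)."""
    msg = EmailMessage(policy=mail_policy)
    msg["Subject"] = subject
    msg.set_content("See subject.")
    return msg.as_bytes()


subject = "C\u2082H\u2086O"

# RFC 6532 style: the header carries raw UTF-8 bytes.
raw_utf8 = build(subject, policy.SMTPUTF8)

# Classic style: the header is ASCII-only, using an encoded word.
raw_ascii = build(subject, policy.SMTP)
```

Whether raw UTF-8 headers can actually be sent depends on every server 
along the path supporting the extension, which this sketch does not check.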

-Shawn










Re: Is the Subject field of an e-mail an obvious example of plain text where no higher level protocol application is possible?

2012-07-20 Thread Jukka K. Korpela

2012-07-20 20:19, Asmus Freytag wrote:


On 7/20/2012 8:41 AM, Karl Pentzlin wrote:

Looking for an example of plain text which is obvious to anybody,
it seems to me that the Subject field of e-mails is a good example.


By common convention, certain notational features have been relegated to
styled text. Super and subscript in mathematical, chemical and other
notation belongs to that class.


I’m afraid I don’t quite follow. Superscripts and subscripts can be 
presented using styling or other higher-level protocols, or specialized 
superscript or subscript characters can be used, in many cases. But this 
does not seem to be relevant to the question whether “Subject” fields 
are a good example of plain text.



A much stronger case than subject lines are regulatory databases with
plain-text fields in their records.


It’s part of the database design to decide whether fields are plain 
text, so I don’t quite get the point. Sometimes people would like plain 
text to cover things that do not exist as Unicode characters now, but 
that’s a different topic.



If the users for which such near plain text notations are part of
their daily work were to report that subject lines, database plain
text fields and other such bottlenecks are causing serious issues, then
I think Unicode and WG2 should listen carefully.


Instead of getting into theoretical considerations of “near plain text”, 
I think the question is whether there is sufficient evidence of 
real-life needs for new subscript or superscript characters. In general, 
coding of new characters requires demonstrated *use* of symbols as text 
characters, rather than arguments about *need* to use them. But even the 
need is questionable: e-mail headings are supposed to be short texts 
that tell what the message is about, not complicated formulas. And it’s 
part of database design to decide that you use some fields for some 
purposes and make them plain text fields, instead of (somehow) allowing 
styling inside them.


Yucca






Re: Is the Subject field of an e-mail an obvious example of plain text where no higher level protocol application is possible?

2012-07-20 Thread Asmus Freytag

On 7/20/2012 1:34 PM, Jukka K. Korpela wrote:

2012-07-20 20:19, Asmus Freytag wrote:


On 7/20/2012 8:41 AM, Karl Pentzlin wrote:

Looking for an example of plain text which is obvious to anybody,
it seems to me that the Subject field of e-mails is a good example.


By common convention, certain notational features have been relegated to
styled text. Super and subscript in mathematical, chemical and other
notation belongs to that class.


I’m afraid I don’t quite follow.


Yeah, I think in this case you missed the point of what I was trying to say.

A./



[no subject]

2004-11-27 Thread Flarn
I know that there are some combining characters, and a lot of base 
characters. But, is there any way to use a base character as a 
combining character? Please help me!

- Michael Norton (a.k.a. Flarn)
E-mail address: [EMAIL PROTECTED]



[no subject]

2004-10-13 Thread Magda Danish \(Unicode\)

A new translation has been posted on the Unicode website:
What is Unicode? in Slovenian 
http://www.unicode.org/standard/translations/slovenian.html


---
Magda Danish
Sr. Administrative Director
The Unicode Consortium
650-693-3921
[EMAIL PROTECTED]
 





[no subject]

2004-09-24 Thread unicode-bounce
Date: Thu, 23 Sep 2004 17:14:29 -0700
From: Peter Constable [EMAIL PROTECTED]

Here’s the abstract for one of the presentations at ATypI next week. 
Will this be the every-character-has-a-story repository we’ve always 
wished for?

Decode Unicode!

A typographic database

Johannes Bergerhausen

Friday 1 October | 14:15 – 15:00
Location: A-2 (Archa Hall 2)
Presentation | Theme: Typographic Babylon | Duration: 45 minutes

After the DNA, the ASCII-Code is the most successful code on this 
planet. The Unicode will even be better. Now is the right time to gather 
and explain the meaning, history and correct typographic use of each 
Unicode-Character. Who “invented” the full stop? When did the 
Infinity-Sign come into being? What’s an Ogonek? In an 18-month 
project in the department of Design at the University of Applied 
Sciences in Mainz, Germany, we are collecting images, samples and texts 
about each and every sign in the Code. In the near future, the project 
will be opened for anyone to submit their own material. In his lecture, 
Prof. Bergerhausen will give an introduction to code-history from ASCII 
to Unicode and will present the project that is supported by the German 
Federal Ministry of Education and Research.

Speaker details

Johannes Bergerhausen
http://www.atypi.org/08_Prague/30_program/40_speakers/view_person_html?personid=1130
Professor Fachhochschule Mainz | Germany

Prof. Johannes Bergerhausen, born 1965 in Bonn, Germany, studied Visual 
Communication at the University of Applied Sciences in Düsseldorf. 
From 1993 to 2000, he lived and worked in Paris. First he collaborated 
with the Founders of Grapus, Gérard Paris-Clavel and Pierre Bernard, 
then he founded his own office. In 1998 he was awarded a grant from the 
French Centre National des Arts Plastiques for a typographic research 
project on the ASCII-Code. Lectures in Amiens, Paris, Rotterdam, Warsaw, 
Weimar. He returned to Germany in 2000; since 2002 he has been Professor of 
Typography at the University of Applied Sciences in Mainz. In 2003, 
together with Paris-Clavel, he published the font “LeBuro” at ACME 
Fonts, London.



Back to the subject: Folding algorithm and canonical equivalence

2004-07-19 Thread Peter Kirk
There has been extensive discussion in this thread on the specifics of 
accent and diacritic folding. But no one has answered my point, repeated 
below, that there seems to be a conflict between the folding algorithm 
(rather than the details of specific foldings) and the principle of 
canonical equivalence. Specifically, it seems to breach the principle in 
Unicode Conformance Clause C9:

  Ideally, an implementation would always interpret two
  canonical-equivalent character sequences identically. There are
  practical circumstances under which implementations may reasonably
  distinguish them.

Are the authors of UTR #30 claiming that folding is one of those 
practical circumstances, or is this just an oversight?

Peter Kirk

On 17/07/2004 23:25, Peter Kirk wrote:

I was just reviewing the UTR #30 draft in response to Rick's notice 
about it. And I believe I may have found a point in which the folding 
algorithm as given may violate the principle of canonical equivalence. 
But I would like some clarification from list members before providing 
formal input on this point.

Consider a sequence made up of a base character B and two combining 
marks M1 and M2, in which the combining class of M1 is less than that 
of M2. <B, M1, M2> and <B, M2, M1> are canonically equivalent 
representations of the same sequence, but only the former is in 
canonical order. Suppose that a folding is defined including the 
operation <B, M2> -> X, but no other relevant operations. When this 
folding is applied, according to the folding algorithms defined in 
sections 4.1.1 and 4.1.2 of the UTR #30 draft, in step (a) the 
sequence <B, M2, M1> will be folded to <X, M1> and will not be further 
changed, but the sequence <B, M1, M2> will not be changed at all by 
the folding because the sequence <B, M2> will never be found. (By 
contrast, a folding operation <B, M1> -> Y will be applied to both 
sequences, because the canonical decomposition step converts <B, M2, 
M1> to <B, M1, M2> and the folding operation is re-applied and finds a 
match the second time.) The implication is that folding of two 
canonically equivalent strings gives different (and not canonically 
equivalent) results.
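
The argument can be checked with a small Python sketch. It is a simplified 
stand-in for the draft algorithm, not the UTR #30 text itself: the loop 
"apply the rules, canonically decompose, repeat until nothing changes" is 
an assumption about its shape, and the concrete characters (a, COMBINING 
DOT BELOW with class 220, COMBINING ACUTE ACCENT with class 230) and the 
targets x and y are chosen only for illustration.

```python
import unicodedata

B, M1, M2 = "a", "\u0323", "\u0301"  # combining classes 0, 220, 230


def fold(text, rules):
    """Apply string rules, decompose (NFD), and repeat until stable."""
    while True:
        step = text
        for source, target in rules.items():
            step = step.replace(source, target)
        step = unicodedata.normalize("NFD", step)
        if step == text:
            return text
        text = step


in_order = B + M1 + M2       # canonical order
out_of_order = B + M2 + M1   # canonically equivalent, not in order

rule_m2 = {B + M2: "x"}      # the problematic operation <B, M2> -> X
rule_m1 = {B + M1: "y"}      # the harmless operation <B, M1> -> Y

x_from_unordered = fold(out_of_order, rule_m2)  # rule matches: x + M1
x_from_ordered = fold(in_order, rule_m2)        # rule never matches
y_from_unordered = fold(out_of_order, rule_m1)  # matches after reordering
y_from_ordered = fold(in_order, rule_m1)        # matches immediately
```

Under these assumptions the two equivalent inputs give different results 
for the M2 rule and identical results for the M1 rule.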

This is not a purely theoretical point. The Diacritic Folding as 
specified in 
http://www.unicode.org/reports/tr30/datafiles/DiacriticFolding.txt 
includes operations like 05D1 05BC -> 05D1, i.e. <BET, DAGESH> -> BET, 
but no general rule to delete DAGESH (or any other combining marks; I 
think there needs to be such a rule, and I have already posted a 
formal response saying that). Sequences like <BET, DAGESH, PATAH> are 
very common in Hebrew text, and commonly written in this order which 
is logically correct and preferred by current rendering technologies, 
but the canonical order is in fact <BET, PATAH, DAGESH>; thus both 
sequences will be found in data depending on whether or not it has 
been normalised. The effect of applying Diacritic Folding exactly as 
specified is that <BET, DAGESH, PATAH> is folded to <BET, PATAH>, but 
the canonically equivalent <BET, PATAH, DAGESH> is unchanged. (In fact 
I consider that both should be folded to just BET, but that is not 
what the current data file specifies.)
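
The Hebrew case can be reproduced directly. In this sketch the single 
operation 05D1 05BC -> 05D1 is applied as a plain string replacement, which 
is an assumption about how the data file entry would be used; the 
combining classes come from Python's standard unicodedata module.

```python
import unicodedata

BET, DAGESH, PATAH = "\u05D1", "\u05BC", "\u05B7"

# PATAH has the lower combining class, so it sorts before DAGESH.
classes = (unicodedata.combining(PATAH), unicodedata.combining(DAGESH))

as_typed = BET + DAGESH + PATAH                       # common input order
normalised = unicodedata.normalize("NFD", as_typed)   # BET, PATAH, DAGESH


def apply_rule(text):
    """Apply only the operation <BET, DAGESH> -> BET."""
    return text.replace(BET + DAGESH, BET)


folded_typed = apply_rule(as_typed)         # DAGESH removed
folded_normalised = apply_rule(normalised)  # unchanged: no match
```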

I hope I have not totally misunderstood the folding algorithm here. 
But it seems to me that what is missing in the algorithm is an initial 
step of normalising the data. The introductory text to section 4 seems 
to suggest that this has been avoided because folding may need to 
preserve the distinction between NFC and NFD data - although the 
algorithm as presented does not in fact do this. Since in practice the 
input data is not necessarily in either NFC or NFD and there is no 
easy way to detect which is being used, the only meaningful approach 
is for the user of the folding to specify whether the output of the 
folding should be NFC or NFD.

Of course there might be a real requirement for a folding which, for 
example, removes DAGESH when combined with BET (but not with other 
base characters) irrespective of what other combining marks might 
intervene. But such foldings would need a considerably more powerful 
folding algorithm.
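
One possible shape for such a more powerful folding, given purely as a 
sketch: normalise first, then treat each base character together with the 
whole run of combining marks that follows it, and drop DAGESH anywhere in 
the run after BET. The function name is hypothetical and nothing here is 
taken from the UTR #30 draft.

```python
import unicodedata

BET, GIMEL, DAGESH, PATAH = "\u05D1", "\u05D2", "\u05BC", "\u05B7"


def fold_bet_dagesh(text):
    """Remove DAGESH from the mark run following BET, in any mark order.

    The input is decomposed first so equivalent inputs are treated alike;
    the output is returned in NFD.
    """
    result = []
    after_bet = False
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) == 0:
            # A new base character (starter) begins a new run.
            after_bet = (ch == BET)
            result.append(ch)
        elif after_bet and ch == DAGESH:
            continue  # drop DAGESH belonging to BET
        else:
            result.append(ch)
    return "".join(result)


from_typed = fold_bet_dagesh(BET + DAGESH + PATAH)
from_normalised = fold_bet_dagesh(BET + PATAH + DAGESH)
other_base = fold_bet_dagesh(GIMEL + DAGESH)
```

Because the marks are examined as a run rather than matched as a fixed 
string, both canonically equivalent inputs give the same result, and 
DAGESH on other base characters is left alone.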


--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



Re: Back to the subject: Folding algorithm and canonical equivalence

2004-07-19 Thread Mark Davis
You did point out an oversight; Asmus and I have been working on the issue.

Mark

- Original Message - 
From: Peter Kirk [EMAIL PROTECTED]
To: Unicode List [EMAIL PROTECTED]
Sent: Monday, July 19, 2004 13:21
Subject: Back to the subject: Folding algorithm and canonical equivalence


 There has been extensive discussion in this thread on the specifics of
 accent and diacritic folding. But no one has answered my point, repeated
 below, that there seems to be a conflict between the folding algorithm
 (rather than the details of specific foldings) and the principle of
 canonical equivalence. Specifically, it seems to breach the principle in
 Unicode Conformance Clause C9:

  Ideally, an implementation would always interpret two
  canonical-equivalent character
  sequences identically. There are practical circumstances under which
  implementations
  may reasonably distinguish them.

 Are the authors of UTR #30 claiming that folding is one of those
 practical circumstances, or is this just an oversight?

 Peter Kirk

 On 17/07/2004 23:25, Peter Kirk wrote:

  I was just reviewing the UTR #30 draft in response to Rick's notice
  about it. And I believe I may have found a point in which the folding
  algorithm as given may violate the principle of canonical equivalence.
  But I would like some clarification from list members before providing
  formal input on this point.
 
  Consider a sequence made up of a base character B and two combining
  marks M1 and M2, in which the combining class of M1 is less than that
  of M2. B, M1, M2 and B, M2, M1 are canonically equivalent
  representations of the same sequence, but only the former is in
  canonical order. Suppose that a folding is defined including the
  operation B, M2 - X, but no other relevant operations. When this
  folding is applied, according to the folding algorithms defined in
  sections 4.1.1 and 4.1.2 of the UTR #30 draft, in step (a) the
  sequence B, M2, M1 will be folded to X, M1 and will not be further
  changed, but the sequence B, M1, M2 will not be changed at all by
  the folding because the sequence B, M2 will never be found. (By
  contrast, a folding operation B, M1 - Y will be applied to both
  sequences, because the canonical decomposition step converts B, M2,
  M1 to B, M1, M2 and the folding operation is re-applied and finds a
  match the second time.) The implication is that folding of two
  canonically equivalent strings gives different (and not canonically
  equivalent) results.
 
  This is not a purely theoretical point. The Diacritic Folding as
  specified in
  http://www.unicode.org/reports/tr30/datafiles/DiacriticFolding.txt
  includes operations like 05D1 05BC - 05D1, i.e. BET, DAGESH - BET,
  but no general rule to delete DAGESH (or any other combining marks; I
  think there needs to be such a rule, and I have already posted a
  formal response saying that). Sequences like BET, DAGESH, PATAH are
  very common in Hebrew text, and commonly written in this order which
  is logically correct and preferred by current rendering technologies,
  but the canonical order is in fact BET, PATAH, DAGESH; thus both
  sequences will be found in data depending on whether or not it has
  been normalised. The effect of applying Diacritic Folding exactly as
  specified is that BET, DAGESH, PATAH is folded to BET, PATAH, but
  the canonically equivalent BET, PATAH, DAGESH is unchanged. (In fact
  I consider that both should be folded to just BET, but that is not
  what the current data file specifies.)
 
  I hope I have not totally misunderstood the folding algorithm here.
  But it seems to me that what is missing in the algorithm is an initial
  step of normalising the data. The introductory text to section 4 seems
  to suggest that this has been avoided because folding may need to
  preserve the distinction between NFC and NFD data - although the
  algorithm as presented does not in fact do this. Since in practice the
  input data is not necessarily in either NFC or NFD and there is no
  easy way to detect which is being used, the only meaningful approach
  is for the user of the folding to specify whether the output of the
  folding should be NFC or NFD.
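
  A minimal sketch of that suggestion, in Python: normalise to NFD
  before the first folding pass, re-normalise after each pass, and let
  the caller choose the output form. The name fold_normalised, the
  (source, target) rule representation, and the assumption that the
  rules only ever shorten the text (so the loop terminates) are all
  illustrative choices, not part of UTR #30.

```python
import unicodedata

BET, DAGESH, PATAH = "\u05D1", "\u05BC", "\u05B7"

def fold_normalised(text, rules, form="NFC"):
    """Fold text that is first normalised to NFD; return it in the given form.

    rules is a list of (source, target) string pairs (a hypothetical format).
    The rules are assumed to shrink the text, so the loop terminates.
    """
    current = unicodedata.normalize("NFD", text)
    while True:
        folded = current
        for source, target in rules:
            folded = folded.replace(source, target)
        folded = unicodedata.normalize("NFD", folded)
        if folded == current:
            break
        current = folded
    return unicodedata.normalize(form, current)

typed = BET + DAGESH + PATAH
canonical = BET + PATAH + DAGESH

# The rule from the data file: it now treats both inputs alike.
specific = [(BET + DAGESH, BET)]
# A general rule deleting DAGESH, as proposed in the message above.
general = [(DAGESH, "")]

print(fold_normalised(typed, specific) == fold_normalised(canonical, specific))
print(fold_normalised(typed, general) == fold_normalised(canonical, general))
```

  With the specific rule, neither input is changed, because after
  normalisation DAGESH no longer follows BET directly; the results are
  at least consistent. With the general rule, both fold to BET, PATAH.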
 
  Of course there might be a real requirement for a folding which, for
  example, removes DAGESH when combined with BET (but not with other
  base characters) irrespective of what other combining marks might
  intervene. But such foldings would need a considerably more powerful
  folding algorithm.
 


 -- 
 Peter Kirk
 [EMAIL PROTECTED] (personal)
 [EMAIL PROTECTED] (work)
 http://www.qaya.org/







Re: Back to the subject: Folding algorithm and canonical equivalence

2004-07-19 Thread Asmus Freytag
At 01:56 PM 7/19/2004, Mark Davis wrote:
You did point out an oversight; Asmus and I have been working on the issue.
Mark
As Mark wrote, your point is taken and we've taken that onboard. However, 
we won't try to *edit* text on the list, that's why we are not engaging in 
a long discussion on the details (and we've discovered many interesting 
ones, wait for the next version of the text).
In my replies I tend to focus on issues for which I need more information.

A./
PS: Just one final comment:
Ideally, an implementation would always interpret two 
canonical-equivalent character
sequences identically. There are practical circumstances under which 
implementations
may reasonably distinguish them.
Are the authors of UTR #30 claiming that folding is one of those practical 
circumstances, or is this just an oversight?
As it turns out, and not surprisingly, realizing that ideal for any 
arbitrary type of possible folding rule can get complicated (again, I won't 
go into details right now). There may be situations where an optimization 
would break canonical equivalence in the face of permissible, but unusual, 
if not to say 'nonsensical' input. That's what's meant by 'practical 
circumstances'.

If the ability to 'correctly' handle combining sequences that are a random 
mixture of Khmer and Arabic combining marks were to result in severe 
runtime penalties, would you rather have a 'correct' or a fast implementation?

Nobody argues that sequences that are expected to occur in realistic data, 
including specialized texts, definitely should be handled as expected, even 
where practicalities require some optimizations.

So, we are all agreed.





Re: Back to the subject: Folding algorithm and canonical equivalence

2004-07-19 Thread Peter Kirk
On 19/07/2004 23:23, Asmus Freytag wrote:
At 01:56 PM 7/19/2004, Mark Davis wrote:
You did point out an oversight; Asmus and I have been working on the 
issue.

Mark

As Mark wrote, your point is taken and we've taken that onboard. 
However, we won't try to *edit* text on the list, that's why we are 
not engaging in a long discussion on the details (and we've discovered 
many interesting ones, wait for the next version of the text).
In my replies I tend to focus on issues for which I need more 
information.

Fair enough. I just wondered if I needed to raise this one as a formal 
feedback issue. From what you say here, I assume not.

A./
PS: Just one final comment:
Ideally, an implementation would always interpret two 
canonical-equivalent character
sequences identically. There are practical circumstances under which 
implementations
may reasonably distinguish them.

Are the authors of UTR #30 claiming that folding is one of those 
practical circumstances, or is this just an oversight?

As it turns out, and not surprisingly, realizing that ideal for any 
arbitrary type of possible folding rule can get complicated (again, I 
won't go into details right now). There may be situations where an 
optimization would break canonical equivalence in the face of 
permissible, but unusual, if not to say 'nonsensical' input. That's 
what's meant by 'practical circumstances'.

If the ability to 'correctly' handle combining sequences that are a 
random mixture of Khmer and Arabic combining marks were to result in 
severe runtime penalties, would you rather have a 'correct' or a fast 
implementation?

Again, fair enough. But I would be surprised if this is a real issue 
with the folding algorithm. Indeed I would expect, given that 
decomposition, presumably to NFD, is anyway required after the first 
folding pass, that there would be little or no performance hit in 
normalising the text to be folded to NFD before the first folding pass.

Nobody argues that sequences that are expected to occur in realistic 
data, including specialized texts, definitely should be handled as 
expected, even where practicalities require some optimizations.

Yes, but I did make the point that the issue I brought up is not a 
purely theoretical one, but a very real one for Hebrew with the 
diacritic removal folding as defined.

So, we are all agreed.



--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



Subject lines that have nothing to do with message content

2004-05-10 Thread Rick McGowan
Personally speaking, I would have expected that a recent message on this  
list with the subject line "Katakana_Or_Hiragana" might have something to  
do with Japanese, Hiragana, Katakana, or at least Han, or perhaps even  
Asia. But no... It was about Phoenician.

It would be really helpful if people could use subject lines that have  
something to do with the subject of the message.

It just can't be that difficult for people to pick a reasonable subject  
line. And if you're going to go off-topic in a thread, you might consider  
getting a different subject line -- or at least adding a parenthetical  
about how you're going to go off the thread...

(As usual, this is my personal opinion and doesn't reflect an official  
policy, etc.)

Rick



RE: Subject lines that have nothing to do with message content

2004-05-10 Thread Peter Constable
Of course, if ever there was a subject line that permitted the topic to
wander howsoever far from where it started, the one on this thread is
it. :-)

Peter




(no subject)

2004-03-18 Thread Jon Hanna
Quoting Marion Gunn [EMAIL PROTECTED]:

how to guarantee continuance,
 in the specific context of Irish text computing, of the traditional
 restriction of the Irish diacritic dot (having only one single function in
 Irish) to the consonants to which it belongs?

A spell checker.

-- 
Jon Hanna
http://www.hackcraft.net/
…it has been truly said that hackers have even more words for
equipment failures than Yiddish has for obnoxious people. - jargon.txt



(no subject)

2003-08-17 Thread HILNET
To Unicode.org

In connection with the discussion about hexadecimal characters, one might find of interest my solution to the problem. As background, I developed a code for the unique identification of all recorded knowledge and information and proposed a universal system at a conference in Tokyo in 1967. Since then, my colleagues and I have been waiting for technology to develop to the stage that would make a universal information access system an essential component of a Global Information Infrastructure.

The technology is now here in bandwidth, processing speed and power, and cost of storage. Our alphanumeric code in a structured format has been supplemented with a 64-bit unique identifier for machine interaction also in a structured format. The standard keyboard would be replaced by one with 20 additional special function keys. Sixteen of these keys would have 16 color coded dots representing the hexadecimal coding. When the input is shifted to the universal code, the first two keys entered would automatically represent a Unicode character. The first 16 bits of the 17th bit field would represent the hexadecimal characters. The remaining 64 bits would identify devices, subject terms and phrases, proper names, geographic segments, documents and items in the system. The system is designed to handle both public and private information. 

Howard J. Hilton, Ph.D.


Major Defects in Subject Lines!

2003-06-26 Thread Rick McGowan
Wow... How on earth did the subject line "Major Defect in Combining  
Classes of Tibetan Vowels" turn into a discussion of Biblical Hebrew? At  
least, people, if you're going to transmogrify the discussion, please use a  
subject line such as "Biblical Hebrew" which someone already was wise  
enough to start using on some pieces of this thread.

Thanks,

Rick
(All my own opinions, of course)



Re: Khmer encoding model (had no subject)

2003-03-05 Thread Mijan
Quoting Marco Cimarosti [EMAIL PROTECTED]:

 Mijan wrote:
  [...]
   3. There are no other cases of a Vowel+Virama combination in the
   Unicode encoding model.
   
   Yes, there are. Khmer.
  
  I do not understand Khmer but I see that it does not use the 
  same 'encoding model'. Please look, you will see that you
  were wrong to use Khmer as an example.
 
 What do you mean by not using the same encoding model?
 
 There are actually three Indic scripts that have been encoded with a
 different model: Tibetan (subscript letters are encoded separately, rather
 than as combinations of virama + consonant), and Thai/Lao (reordrant vowel
 marks are encoded in visual order, rather than in phonetic order).
 
 But, AFAIK, this is not the case of Unicode Khmer, which is encoded in the
 same way as the scripts of India.

Thank you for the correction. I said I do not understand Khmer. I had 
understood that scripts not based on ISCII used a different encoding 
model.

Mijan




-
This mail sent through http://www.bangladesh.net 



Khmer encoding model (had no subject)

2003-03-04 Thread Marco Cimarosti
Mijan wrote:
 [...]
  3. There are no other cases of a Vowel+Virama combination in the
  Unicode encoding model.
  
  Yes, there are. Khmer.
 
 I do not understand Khmer but I see that it does not use the 
 same 'encoding model'. Please look, you will see that you
 were wrong to use Khmer as an example.

What do you mean by not using the same encoding model?

There are actually three Indic scripts that have been encoded with a
different model: Tibetan (subscript letters are encoded separately, rather
than as combinations of virama + consonant), and Thai/Lao (reordrant vowel
marks are encoded in visual order, rather than in phonetic order).

But, AFAIK, this is not the case of Unicode Khmer, which is encoded in the
same way as the scripts of India.

_ Marco



(no subject)

2003-03-03 Thread Mijan
Hi,

I read with interest about the japhalaa debate in Bangla, and I have joined you 
to answer this question.

I understand that Unicode is supposed to represent the language, not the way it 
is written.
This is how Bengali is currently described in Unicode, and obviously it seems 
to work well for the most part.
I am convinced that this needs to be extended for cases that cannot be 
represented in Unicode or whose rendering is open to ambiguous interpretation, 
as is the case for ya-phalaa.
Let's consider the ra+virama+ya case. For the most part, ra+virama+ya is 
displayed as ya+reph. This obviously seems to be an 
instance of ambiguous interpretation, because ra+virama+ya could also represent 
ra+ja-phalaa. ya+reph and ra+ja-phalaa are used in different words and have 
different meanings.
From this you see that ja-phalaa is not equivalent to virama+ya and is better 
as a separate letter in Unicode. We always thought of ya-phalaa as separate 
anyway.

Now to your questions on this:

Michael Everson wrote on 02 March 2003 13:22:

 1. The sequence 'Vowel+Virama+Ya...' is illogical to scholars of
 Bengali and indeed Indic languages in general.
 
 I refuted this yesterday by indication that this usage is an 
 innovation.

I think that only scholars of Bengali are in a position to answer that!
 
 2. Such sequences are not semantically equivalent to the intended
 
 ... sentence fragment.

I think Andy meant 'not equivalent to vowels with ya-phalaa'

 3. There are no other cases of a Vowel+Virama combination in the
 Unicode encoding model.
 
 Yes, there are. Khmer.

I do not understand Khmer but I see that it does not use the same 'encoding 
model'. Please look, you will see that you were wrong to use Khmer as an 
example.

 4. Yaphalaa is not equivalent to 'Virama+Ya'
 
 Yes, it is, as I showed yesterday.

No one can show that Virama+Ya is the same as ya-phalaa, because it is not!
Please understand that ya-phalaa is originally an alternative form of 'Sanskrit 
letter Ya'. Nowadays 'Sanskrit letter Ya' is represented as YYA (Ya with nukta) 
in Bengali words. Bengali 'Ya' has a separate meaning and is pronounced 'Ja'.

The origin of ya-phalaa is clear, but the present-day Bengali equivalent letter 
is not. No one can be sure if ya-phalaa is a form of Ya or YYa.  I say that it 
is neither. Nowadays ya-phalaa has a very different purpose. It is used to 
alter the pronunciation of letters that precede it or vowels that come after 
it. 
 
 5. ISCII implementations encode these letters as separate characters
 corresponding to the Devanagari Candra A & E. Unicode should follow 
 the example of these implementations.
 
 No, it shouldn't. Unicode has a method for writing these sequences 
 already and a second method for doing so should not be introduced. 
 Use mapping tables to exchange ISCII and Unicode data.

I have been taught to keep things simple when coding software. If adding 
letters to the Bengali code space achieves this, then it will be better.

I hope that this helps you
Best regards
Mijan

-
This mail sent through http://www.bangladesh.net 



Re: (no subject)

2003-03-03 Thread John Cowan
Mijan scripsit:

 Let's consider the ra+virama+ya case. For the most part, ra+virama+ya is 
 displayed as ya+reph. This obviously seems to be an 
 instance of ambiguous interpretation, because ra+virama+ya could also represent 
 ra+ja-phalaa. ya+reph and ra+ja-phalaa are used in different words and have 
 different meanings.

I'm responding to this message in order to isolate this point.  If correct, then
the current model of YA PHALAA is inadequate.

-- 
Dream projects long deferred        John Cowan [EMAIL PROTECTED]
usually bite the wax tadpole.       http://www.ccil.org/~cowan
        --James Lileks              http://www.reutershealth.com



Re: (no subject)

2003-03-03 Thread Christopher John Fynn
Michael Everson wrote

 At 16:48 -0500 2003-03-03, John Cowan wrote:
 Mijan scripsit:

   Let's consider the ra+virama+ya case. For the most part, ra+virama+ya is
   displayed as ya+reph. This obviously seems to be an
   instance of ambiguous interpretation, because ra+virama+ya could
   also represent
   ra+ja-phalaa. ya+reph and ra+ja-phalaa are used in different words and have
   different meanings.
 
 I'm responding to this message in order to isolate this point.  If 
 correct, then
 the current model of YA PHALAA is inadequate.
 
 ZWJ can be used to produce the required differentiation.

If this is the way the differentiation should be made there should probably be an 
explicit note to that effect in the introduction to the Bengali block .
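
To make the two candidate encodings concrete, the following Python sketch
lists their code points. It assumes that the ZWJ is placed immediately after
RA and before the virama, which is the convention later documented for
Bengali; the thread itself does not say where the ZWJ should go, so treat
that placement, and the helper name code_points, as assumptions.

```python
RA, YA = "\u09B0", "\u09AF"          # BENGALI LETTER RA, BENGALI LETTER YA
VIRAMA, ZWJ = "\u09CD", "\u200D"     # BENGALI SIGN VIRAMA, ZERO WIDTH JOINER

def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# Default sequence: normally rendered as reph over ya.
reph_form = RA + VIRAMA + YA
# Assumed placement of ZWJ to request ra with ya-phalaa instead.
ya_phalaa_form = RA + ZWJ + VIRAMA + YA

print(code_points(reph_form))
print(code_points(ya_phalaa_form))
```

The two strings differ in the stored text, so the distinction survives
independently of any font or rendering engine.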

- Chris 



Re: Key E00 (was: (no subject))

2002-02-06 Thread Michael Everson

At 02:24 -0500 2002-02-06, [EMAIL PROTECTED] wrote:

ISO keyboards have the section-sign (§) key, next to the 1 key 
above the tab key on the left of the keyboards. Some US keyboards 
(for instance the Mac PowerBook G3) don't have this key, but instead 
have the grave key there, while on the ISO keyboard the grave key 
is down next to the z.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: Key E00 (was: (no subject))

2002-02-06 Thread Michael Everson

Apple calls what I have on my desk an ISO extended keyboard. It came 
with my Cube. It has the section key next to the 1, and the grave key 
next to the z. My Powerbook has the grave key next to the 1, and no 
key next to the z.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: Key E00 (was: (no subject))

2002-02-06 Thread DougEwell2

In a message dated 2002-02-06 3:39:14 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

 ISO keyboards have the section-sign (§) key, next to the 1 key 
 above the tab key on the left of the keyboards. Some US keyboards 
 (for instance the Mac PowerBook G3) don't have this key, but instead 
 have the grave key there, while on the ISO keyboard the grave key 
 is down next to the z.

My draft copy of ISO/IEC 9995-3, acquired from:

http://iquebec.ifrance.com/cyberiel/sc35wg1/SC35N0233_9995-3.pdf

shows SECTION SIGN on key C02, level 2 of the common secondary group, and 
GRAVE ACCENT on key C12, level 1 on both the complementary Latin and common 
secondary groups.  (Note that C12 is frequently relocated to B00, down next 
to the 'z' as you indicated.)

In the complementary Latin group, key E00 is ASTERISK (level 1) and PLUS SIGN 
(level 2), while in the common secondary group it is NOT SIGN (level 1) and 
SOFT HYPHEN (level 2).

Which ISO keyboard are you referring to?  I'm not trying to be 
argumentative; I just got done implementing a lot of keyboards, and none of 
them had SECTION SIGN on key E00, so I'm curious.

For those unfamiliar with ISO 9995 terminology, please refer to the above 
document as well as:

http://iquebec.ifrance.com/cyberiel/sc35wg1/SC35N0232_9995-2.pdf

and John Cowan's explanation from yesterday.

-Doug Ewell
 Fullerton, California
 (address will soon change to dewell at adelphia dot net)




(no subject)

2002-02-05 Thread DougEwell2

On the official Web site of the Cherokee Nation (Tahlequah, Oklahoma), there 
is a nice Cherokee keyboard layout that goes with the font they offer:

http://www.cherokee.org/Extras/downloads/font/Keyboard.htm

For key E00, level 1 (i.e. the unshifted grave-accent key), there is a little 
squiggly mark called "Accent".  I can't find any indication of the purpose of 
this character -- what it's supposed to accent -- but it's not encoded in 
Unicode.

Does anyone know what this character is for, or why it wasn't encoded?  I 
read Michael Everson's 1995 proposal for Cherokee (WG2 N1172) and couldn't 
find any mention of it.

-Doug Ewell
 Fullerton, California
 (address will soon change to [EMAIL PROTECTED])




Re: (no subject)

2002-02-05 Thread Mark Leisher


Doug On the official Web site of the Cherokee Nation (Tahlequah,
Doug For key E00, level 1 (i.e. the unshifted grave-accent key), there is
Doug a little squiggly mark called Accent.  I can't find any indication
Doug of the purpose of this character -- what it's supposed to accent --
Doug but it's not encoded in Unicode.

For those of us not in the know, please tell us what the heck key E00, level
1 means.
-
Mark Leisher                 Orthodoxy, of whatever color, seems to
Computing Research Lab       demand a lifeless, imitative style.
New Mexico State University
Box 30001, Dept. 3CRL        -- Politics and the English Language,
Las Cruces, NM  88003           George Orwell




Re: (no subject)

2002-02-05 Thread Michael Everson

At 12:09 -0500 2002-02-05, [EMAIL PROTECTED] wrote:
On the official Web site of the Cherokee Nation (Tahlequah, Oklahoma), there
is a nice Cherokee keyboard layout that goes with the font they offer:

 http://www.cherokee.org/Extras/downloads/font/Keyboard.htm

For key E00, level 1 (i.e. the unshifted grave-accent key), there is a little
squiggly mark called Accent.  I can't find any indication of the purpose of
this character -- what it's supposed to accent -- but it's not encoded in
Unicode.

Does anyone know what this character is for, or why it wasn't encoded?  I
read Michael Everson's 1995 proposal for Cherokee (WG2 N1172) and couldn't
find any mention of it.

I've never seen it anywhere but on that web page, which I found some time ago.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: (no subject)

2002-02-05 Thread Michael Everson

At 10:55 -0700 2002-02-05, Mark Leisher wrote:
 Doug On the official Web site of the Cherokee Nation (Tahlequah,
 Doug For key E00, level 1 (i.e. the unshifted grave-accent key), there is
 Doug a little squiggly mark called Accent.  I can't find any indication
 Doug of the purpose of this character -- what it's supposed to accent --
 Doug but it's not encoded in Unicode.

For those of us not in the know, please tell us what the heck key E00, level
1 means.

It is the section-sign (§) key, next to the 1 key above the tab key 
on the left. Some US keyboards don't have this key, but instead have 
the grave key there.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: (no subject)

2002-02-05 Thread John Cowan

Mark Leisher wrote:

 For those of us not in the know, please tell us what the heck key E00, level
 1 means.

E00 is the leftmost key on the E row, which is the fifth row from
the bottom (the row containing the spacebar is A).  On U.S.-style
keyboards E01 is the 1 key, D01 is Q, C01 is A, B01 is Z.

Level 1 means
that no shift keys are in effect; Level 2 means that Shift is
down, and Level 3 that AltGr (typically the right Alt key
on keyboards that need it) is down.

This naming scheme allows us to talk about particular keys on the
keyboard without regard to what they are used for in one locale
or another.  ISO 9995 is the controlling standard.
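
The naming scheme described above can be sketched in a few lines of Python.
This is only an illustration: the function parse_key and the two lookup
tables are hypothetical names, and the tables contain only the examples
given in this message, not a full ISO 9995 layout.

```python
ROWS = "ABCDE"  # A is the spacebar row; E is the fifth row from the bottom

# U.S.-style examples from the message above (not a complete layout).
US_KEYS = {"E01": "1", "D01": "Q", "C01": "A", "B01": "Z"}

# Which modifier selects each level, as described above.
LEVELS = {1: "no shift keys", 2: "Shift", 3: "AltGr"}

def parse_key(position):
    """Split a position such as 'E00' into (row number from bottom, column)."""
    row, column = position[0], int(position[1:])
    if row not in ROWS:
        raise ValueError(f"unknown row letter: {row!r}")
    return ROWS.index(row) + 1, column

row, column = parse_key("E00")
print(f"E00 is in row {row} from the bottom, column {column}")
print(f"Level 1 means: {LEVELS[1]}")
```

Column 00 sits to the left of column 01, which is why E00 is the key
immediately left of the 1 key on a U.S.-style keyboard.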

-- 
John Cowan [EMAIL PROTECTED]                  http://www.reutershealth.com
I amar prestar aen, han mathon ne nen,        http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith.  --Galadriel, _LOTR:FOTR_





RE: international characters in email subject line

2001-02-13 Thread Brendan Murray/DUB/Lotus


Raghu Kolluru [EMAIL PROTECTED] wrote:
 Do you know of any email client which CAN do this and also display the
from
 alias of the email in the desired charset?

Lotus Notes does this (and has done so for some considerable time),
although it's probably way too large for what you need.

Brendan




international characters in email subject line

2001-02-12 Thread Raghu Kolluru

Greetings!

I would like to send email in international charsets. I am able to send the
body using the desired charset but not the subject line.
Any help would be appreciated.
Thanks.



Re: international characters in email subject line

2001-02-12 Thread Michael \(michka\) Kaplan

What mail program are you using?

Many of them (Exchange, Outlook, etc.) do not support this. Some do not even
support international text in the body.

michka

a new book on internationalization in VB at
http://www.i18nWithVB.com/

- Original Message -
From: "Raghu Kolluru" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Monday, February 12, 2001 2:37 PM
Subject: international characters in email subject line


 Greetings!

 I would like to send email in international charsets. I am able to send
the
 body using the desired charset but not the subject line.
 Any help would be appreciated.
 Thanks.





Re: international characters in email subject line

2001-02-12 Thread Michael \(michka\) Kaplan

Well, like I said Outlook does not support this -- it will only use the
default system code page (b.k.a. CP_ACP) for subject lines and any other
part of the header.

michka

- Original Message -
From: "Raghu Kolluru" [EMAIL PROTECTED]
To: "'Michael (michka) Kaplan'" [EMAIL PROTECTED]; "Unicode List"
[EMAIL PROTECTED]
Sent: Monday, February 12, 2001 3:29 PM
Subject: RE: international characters in email subject line


 I wrote a Java application which sends emails to a relay server (Postfix).
 My email client is Outlook which does support international character
sets.
 I can send/receive non-ascii encoded body but not the subject line.
 Probably this is a question for SMTP newsgroup. Does anyone know public
 email address of such a group?
 Thanks.

 -Original Message-
 From: Michael (michka) Kaplan [mailto:[EMAIL PROTECTED]]
 Sent: Monday, February 12, 2001 3:21 PM
 To: Raghu Kolluru; Unicode List
 Subject: Re: international characters in email subject line


 What mail program are you using?

 Many of them (Exchange, Outlook, etc.) do not support this. Some do not
even
 support international text in the body.

 michka

 a new book on internationalization in VB at
 http://www.i18nWithVB.com/

 - Original Message -
 From: "Raghu Kolluru" [EMAIL PROTECTED]
 To: "Unicode List" [EMAIL PROTECTED]
 Sent: Monday, February 12, 2001 2:37 PM
 Subject: international characters in email subject line


  Greetings!
 
  I would like to send email in international charsets. I am able to send
 the
  body using the desired charset but not the subject line.
  Any help would be appreciated.
  Thanks.
 





RE: international characters in email subject line

2001-02-12 Thread Raghu Kolluru

Michael,
Do you know of any email client which CAN do this and also display the from
alias of the email in the desired charset?
Thanks.

-Original Message-
From: Michael (michka) Kaplan [mailto:[EMAIL PROTECTED]]
Sent: Monday, February 12, 2001 3:31 PM
To: Raghu Kolluru; Unicode List
Subject: Re: international characters in email subject line


Well, like I said Outlook does not support this -- it will only use the
default system code page (b.k.a. CP_ACP) for subject lines and any other
part of the header.

michka

- Original Message -
From: "Raghu Kolluru" [EMAIL PROTECTED]
To: "'Michael (michka) Kaplan'" [EMAIL PROTECTED]; "Unicode List"
[EMAIL PROTECTED]
Sent: Monday, February 12, 2001 3:29 PM
Subject: RE: international characters in email subject line


 I wrote a Java application which sends emails to a relay server (Postfix).
 My email client is Outlook which does support international character
sets.
 I can send/receive non-ascii encoded body but not the subject line.
 Probably this is a question for SMTP newsgroup. Does anyone know public
 email address of such a group?
 Thanks.

 -Original Message-
 From: Michael (michka) Kaplan [mailto:[EMAIL PROTECTED]]
 Sent: Monday, February 12, 2001 3:21 PM
 To: Raghu Kolluru; Unicode List
 Subject: Re: international characters in email subject line


 What mail program are you using?

 Many of them (Exchange, Outlook, etc.) do not support this. Some do not
even
 support international text in the body.

 michka

 a new book on internationalization in VB at
 http://www.i18nWithVB.com/

 - Original Message -
 From: "Raghu Kolluru" [EMAIL PROTECTED]
 To: "Unicode List" [EMAIL PROTECTED]
 Sent: Monday, February 12, 2001 2:37 PM
 Subject: international characters in email subject line


  Greetings!
 
  I would like to send email in international charsets. I am able to send
 the
  body using the desired charset but not the subject line.
  Any help would be appreciated.
  Thanks.
 




Re: international characters in email subject line

2001-02-12 Thread Keld Jørn Simonsen

The email program I am using, mutt, can do this.

Kind regards
keld Simonsen

On Mon, Feb 12, 2001 at 02:55:41PM -0800, Michael (michka) Kaplan wrote:
 What mail program are you using?
 
 Many of them (Exchange, Outlook, etc.) do not support this. Some do not even
 support international text in the body.
 
 michka
 
 a new book on internationalization in VB at
 http://www.i18nWithVB.com/
 
 - Original Message -
 From: "Raghu Kolluru" [EMAIL PROTECTED]
 To: "Unicode List" [EMAIL PROTECTED]
 Sent: Monday, February 12, 2001 2:37 PM
 Subject: international characters in email subject line
 
 
  Greetings!
 
  I would like to send email in international charsets. I am able to send
 the
  body using the desired charset but not the subject line.
  Any help would be appreciated.
  Thanks.
 
 



Re: international characters in email subject line

2001-02-12 Thread Jungshik Shin




On Mon, 12 Feb 2001, Michael (michka) Kaplan wrote:


 From: "Raghu Kolluru" [EMAIL PROTECTED]

  I would like to send email in international charsets. I am able to send
 the
  body using the desired charset but not the subject line.

The question is so vague. If you need to get some help, you've gotta
provide as much information as possible(what mail program under what OS
for what character set).  There are so many possibilities and nobody
would wish to go thru all of them.

 What mail program are you using?

 Many of them (Exchange, Outlook, etc.) do not support this. Some do not even
 support international text in the body.

Mozilla and Netscape 6 support entering the subject header in whatever script
for which input methods are available/installed in the OS (MS-Windows,
MacOS, Unix/X11). In this respect, I18N of Mozilla/Netscape 6 is ahead
of that of MS Outlook. The same is true of the display of subject headers
in scripts which happen not to be supported by the default codepage
(to use MS terminology). BTW, one of the worst MUAs in terms of I18N
(among the widely used) might be Eudora.

BTW, most modern Unix text-based mail programs (e.g. Pine, Mutt)
work fine in this regard as long as you run them under a terminal
that supports input/output of the charset you want to use
(for UTF-8, the newest xterm works well for a pretty large range
of the BMP).

Jungshik Shin




[OT]RE: international characters in email subject line

2001-02-12 Thread Jungshik Shin




On Mon, 12 Feb 2001, Raghu Kolluru wrote:

 I wrote a java application which sends emails to a relay server (Postfix).

When you write your Java application, note that any 8-bit character in the
header is explicitly prohibited (IETF STD 11/RFC 822).  You need to encode them
per IETF RFC 2047 (and RFC 2184, 2231). Some MTAs (mail transport agents)
refuse to accept messages with 8-bit characters in the header, depending on
the configuration. BTW, the header encoding is not just for working around
those MTAs but also for identifying the MIME charset/encoding
used and for allowing multiple MIME charsets/encodings to be mixed
in the header (the latter might be moot when UTF-8 is used exclusively).
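
As an illustration of the RFC 2047 encoding mentioned above, here is a
minimal sketch using Python's standard email package rather than Java. The
sample subject text is made up for the example, and the choice between the
"B" and "Q" encodings is left to the library.

```python
from email.header import Header, decode_header, make_header

# Hypothetical subject containing non-ASCII characters.
subject = "Grüße aus Zürich"

# Header.encode() produces an RFC 2047 encoded-word, which is pure ASCII
# and names the charset so that the receiving client can decode it.
encoded = Header(subject, charset="utf-8").encode()
print(encoded)

# A receiving client reverses the process.
decoded = str(make_header(decode_header(encoded)))
print(decoded)
```

The encoded form is what should appear after "Subject:" on the wire; the raw
8-bit text should not.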



 My email client is Outlook which does support international character sets.
 I can send/receive non-ascii encoded body but not the subject line.
 Probably this is a question for SMTP newsgroup. Does anyone know public
 email address of such a group?

Usenet newsgroup comp.mail.mime  is the best place to ask your question.
(it has the mail-submission address as well, but I don't know it)
BTW, MS OE doesn't support it while Mozilla does support it.

Jungshik Shin

P.S. I'm afraid the Unicode mailing list server strips off too many header
lines of messages. In this case and some other cases (e.g. when people
talk about the safe 'transport' of UTF-8 messages), the 'X-Mailer:' header
would be nice to have.






Re: international characters in email subject line

2001-02-12 Thread Sean O Seaghdha

On 12 Feb 2001, at 15:06, Michael (michka) Kaplan wrote
on the subject "Re: international characters in ema":

 Well, like I said Outlook does not support this -- it will only use the
 default system code page (b.k.a. CP_ACP) for subject lines and any other
 part of the header.

On 12 Feb 2001, at 15:46, Jungshik Shin wrote
on the subject "[OT]RE: international characters in":

 On Mon, 12 Feb 2001, Raghu Kolluru wrote:

  My email client is Outlook which does support international character
  sets. I can send/receive non-ascii encoded body but not the subject line.
[snip]
 BTW, MS OE doesn't support it while Mozilla does support it.

This is simply not true!  I know we all like to bash MS from time to time,
but people really get far too carried away.  I don't know if the above is
true about Outlook (as my installation is stuffed as far as e-mail goes), but
it is NOT TRUE about Outlook Express.  OE encodes the subject line with the
same encoding as the body and often (?) the From header as well.

Whether or not this works for you would probably depend on what OS you are
using and what language features are installed.  It works for me with OE
5.50.4133.2400 on Windows NT 4.0 SP5.

Of course, since my preferred mail program is Pegasus Mail, which can only be
configured for one character set, I can't usually read such headers anyway.

`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~
 Seán Ó Séaghdha   [EMAIL PROTECTED]

Nuair a bhíonn an fíon istigh, bíonn an ciall amuigh.  Seanfhocal.




Re: international characters in email subject line

2001-02-12 Thread Jungshik Shin




On Mon, 12 Feb 2001, Sean O Seaghdha wrote:

 On 12 Feb 2001,   Michael (michka) Kaplan wrote:

  Well, like I said Outlook does not support this -- it will only use the
  default system code page (b.k.a. CP_ACP) for subject lines and any other
  part of the header.

 On 12 Feb 2001,  Jungshik Shin wrote:

  On Mon, 12 Feb 2001, Raghu Kolluru wrote:

   My email client is Outlook, which does support international character
   sets. I can send/receive non-ASCII encoded body but not the subject line.
 [snip]
  BTW, MS OE doesn't support it while Mozilla does support it.

 This is simply not true!  I know we all like to bash MS from time to time,
 but people really get far too carried away.  I don't know if the above is
 true about Outlook (as my installation is stuffed as far as e-mail goes), but
 it is NOT TRUE about Outlook Express.  OE encodes the subject line with the
 same encoding as the body and often (?) the From header as well.

I stand corrected (thank you for correcting me). It's possible to enter
text in any script supported by the IMEs installed on your system in both
the Subject (and other headers) and the body of the message. However, what
I wrote about the display of headers in scripts NOT supported
by the default system code page still stands.  For instance, MS
OE cannot display Korean, Japanese, Chinese, or Russian headers under
English/French/Spanish/Italian/German MS-Windows in _the message *list*
display pane_, which Mozilla can. (MS OE can display those headers for
individual messages, though.)

Not having checked out MS OE for a while, I was a bit confused about what is
possible and what is not. Anyway, my comment and michka's have *nothing*
to do with MS bashing. I was just giving what I believed to be facts,
one of which turned out not to be true.  Please note that Michael
(michka) Kaplan is, I guess, one of the last persons on this list to say
something untrue just to make MS look bad. Of course, by this I'm not
implying by any means that there are some people who would do that on
this list.

Jungshik Shin




Re: international characters in email subject line

2001-02-12 Thread Sean O Seaghdha

On 12 Feb 2001, at 20:40, Jungshik Shin wrote
on the subject "Re: international characters in ema":

 I stand corrected (thank you for correcting me). It's possible to enter
 text in any script supported by the IMEs installed on your system in both
 the Subject (and other headers) and the body of the message. However, what
 I wrote about the display of headers in scripts NOT supported
 by the default system code page still stands.  For instance, MS
 OE cannot display Korean, Japanese, Chinese, or Russian headers under
 English/French/Spanish/Italian/German MS-Windows in _the message *list*
 display pane_, which Mozilla can. (MS OE can display those headers for
 individual messages, though.)

Thank you for your clarification.  MS OE doesn't show any chars outside the
system code page in the message list, only in the preview pane and message
windows.

`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~
 Seán Ó Séaghdha   [EMAIL PROTECTED]

Strength does not last.  (Proverb.)




Re: international characters in email subject line

2001-02-12 Thread Michael (michka) Kaplan

From: "Jungshik Shin" [EMAIL PROTECTED]

 Please, note that  Michael (michka) Kaplan, I guess is, one of
 the last persons on this list to say something not true just to
 make MS look bad.

There are a few program managers in Office and Visual Studio who might
disagree with this statement -- they seem to think I live to bash Microsoft.
They are mistaken, sadly. But no company is above having their boneheaded
decisions called out, something not everyone there understands.

It's nice that you do, though. :-)

 Of course, by this I'm not implying by any means that there
 are some people who would do that on this list.

It's OK, we all know that such people exist; heck, we probably all know who
they are, too. As long as we don't name names, no one can claim to be offended
unless they have felon's guilt or something. :-)

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/




Re: international characters in email subject line

2001-02-12 Thread Sean O Seaghdha

On 12 Feb 2001, at 20:28, Alain LaBonté wrote
on the subject "Re: international characters in ema":

  At 19:53 01-02-12 -0800, Sean O Seaghdha wrote:
 
 Of course, since my preferred mail program is Pegasus Mail, which can only be
 configured for one character set, I can't usually read such headers anyway.

 [Alain]  Some years ago, I was also using Pegasus Mail and I was not
 satisfied with this. I then communicated with the author directly (he lives
 in southern New Zealand); we engaged in a series of exchanges and I persuaded
 him to pass through the character set in use without conversion [in my
 case the Windows character set]... You have to use a parameter for this;
 this is the compromise he made me accept because he was really impressed by
 the SMTP 7-bit-only-headers dogma -- which does not impress me since it
 works anyway with 8-bit-clean systems [predominant nowadays in the world
 since a serious security breach, I was told, was corrected with an
 8-bit-clean-enabling SMTP patch].

I think there are a couple of different issues here.  As far as storage on
disk goes, I think this changed some time back so that now you have to use
the switch to get the old behaviour (converting messages to one code page on
disk) which was retained for compatibility with the DOS version.

You can send 8-bit mail with Pegasus by changing a setting in Options, but
when you switch it on you get a stern warning about it being "formally
illegal" and a "Comments" header is added to each message.

I have suggested from time to time over the last few years for Pegasus to be
made Unicode aware, but I get the impression it's considered "too hard" or
"too complicated" although I don't think I've actually got a reply on this
from the author, David Harris.  Since there will not be another 16-bit
Windows version and the Macintosh version has not been updated in a long
time, this leaves only the DOS and Win32 versions.  Hopefully, this will mean
that Unicode will become more of a viable option for him in the future.  At
the moment, though, he seems quite busy enough adding HTML mail composition
to version 4.

`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~:.,.:'^`~
 S e  n  S  a g h d h a   [EMAIL PROTECTED]

Calumnies are answered best with silence.  -- Ben Jonson.




(no subject)

2001-01-07 Thread Krishna Desikachary

The intent of this message is to point out some of the deficiencies in the Unicode
specifications for non-Devanagari Indic scripts.  It is a well-known (inescapable and
undeniable) fact that people writing in non-Devanagari scripts such as Telugu,
Kannada, Malayalam and others transcribe Sanskrit and Vedic texts in their own script
to study them conveniently. In fact, many well-known Vedic and Sanskrit scholars in
Andhra Pradesh, where Telugu is spoken and which is my native state, do not know how
to read Devanagari.
They have read and written these texts only in Telugu. Also, it is an established
fact that many Sanskrit manuscripts from ancient times are available only in
non-Devanagari scripts.  Given this situation, I am terribly dissatisfied that the
current Unicode specification for non-Devanagari scripts lacks many symbols required
to transcribe Sanskrit and Vedic texts properly. These include:

a) All the swara symbols required to transcribe Vedic texts (udatta, anudatta, double
udatta, etc., and the symbols used in writing the Samaveda)
b) Avagraha, and the Vocalic L and LL matra symbols
c) Half Visarga, used in grammar and other Sanskrit texts

In addition, the Unicode Standard needs to address the following features in the case of
Telugu standardization:

The Dantya (dental) ca and ja, and the vowel ligatures of these two consonants with A,
u, U, o, O, Au occur in the Telugu language. These are equivalent to ja-nukta and
ca-nukta in Hindi, but they are not included in the Telugu Unicode specification. Without
them, it is impossible to compose an authentic Telugu dictionary, and the
sorting of text will also be wrong. So these MUST be included in the Unicode spec for
Telugu.

In addition, the symbols used to notate Karnatic music should also be included in the
specification, so that any script used to transcribe Karnatic compositions can
do so correctly.



Unless an effort is made to include all these symbols in all the relevant Indic 
scripts, the existing specification is woefully 



(no subject)

2000-12-17 Thread zdzlc

Can anyone tell me where I can download the Unicode Standard?
Thank you!




(no subject)

2000-11-15 Thread nikita k

Hi,
Is there any text editor by which data can be entered
in Hindi?

Rgds,
Nikita K




(no subject)

2000-11-03 Thread G. Anders

Who can take me off the Unicode list?
I am overloaded at the moment and have no time
to take part in the group.
unsubscribe [EMAIL PROTECTED]
Thank you
Gunter





(no subject)

2000-10-11 Thread John H. Jenkins

Somebody has been playing with the wires in the room where the server 
is housed and so the server is technically up but inaccessible 
outside the server room. I'm in the process of trying to straighten 
out this tangled affair.

Meanwhile, the PDF charts are still accessible via their new home 
URL, http://www.unicode.org/charts/.

-- 
=
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://www.blueneptune.com/~tseng



(no subject)

2000-09-25 Thread



unsubscribe


(no subject)

2000-09-24 Thread woodmailit



Please remove me from this list.


(no subject)

2000-08-08 Thread C. Janardhana Gupta



I have an application that doesn't include Unicode support at all.
Given this, can I use the Uniscribe APIs in my application? The system on
which I want to run my application is Windows 98.

Specifically, is there any relationship between the Uniscribe APIs and Unicode,
and if so, what exactly is it?

Thanks

C.Janardhana Guptha
Quark, Chandigarh




(no subject)

2000-07-31 Thread Zhen Ren

Hello, all.  How do I print the superscript minus sign?  Its Unicode escape is
\u207B.  However, it is not printed correctly; an unrecognized character
appears instead.  Thanks a lot.

Zhen Ren
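
A symptom like this usually points at the output encoding or the font rather than at the escape itself. The post does not say which language or platform was in use, so the following is only a minimal Python sketch, under that assumption, for confirming what \u207B actually is and what bytes it becomes in UTF-8. The helper name describe is hypothetical.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point, official name and UTF-8 bytes of one character."""
    utf8 = ch.encode("utf-8").hex(" ").upper()
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} (UTF-8: {utf8})"

# The report line is pure ASCII, so it prints safely on any console.
print(describe("\u207B"))
```

If the name and bytes come out as expected, the remaining suspects are the encoding of the output stream and whether the display font has a glyph for the character.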





RE: Subject lines in UTF-8 mssgs? [was: Proposal to make ...]

2000-07-28 Thread Robert A. Rosenberg

At 01:41 AM 07/13/2000 -0800, [EMAIL PROTECTED] wrote:
As far as I can understand, the choice of the outgoing charset is highly
automatic in MS Outlook 2000. I suspect it depends on the combination of
characters that I (or the system) used in the various fields of the e-mail.

The problem is that the heuristics are not correct for ISO-8859-1/CP-1252. 
The selection SHOULD be:

   1) Only x00-x7F - US-ASCII
   2) x00-x7F + xA0-xFF - ISO-8859-1 [Western European (ISO)]
   3) x00-x7F + xA0-xFF + a character in the x80-x9F code point range -
      CP-1252/Windows-1252 [Western European (Windows)]

If you check the Encoding list, you will note that Western European (ISO)
and Western European (Windows) are both listed, and the selection controls whether
a message with xA0-xFF characters gets identified as ISO-8859-1 or CP-1252. The
problem is that selecting Western European (ISO) does not correct the
message's CHARSET to CP-1252 if an x80-x9F character is found in the message.
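
The three-step selection described above can be sketched in Python as follows. This is an illustration of the rule, not Outlook's actual logic. The function name pick_charset and the final UTF-8 fallback are assumptions added here. The sketch works on Unicode text rather than raw bytes, so "a character in the x80-x9F range" shows up as a character (curly quotes, the euro sign) that Latin-1 cannot encode but CP-1252 can.

```python
def pick_charset(text: str) -> str:
    """Return the narrowest charset label that can represent `text`."""
    # Try the narrowest charset first and widen only when encoding fails.
    candidates = (
        ("US-ASCII", "ascii"),        # step 1: only x00-x7F
        ("ISO-8859-1", "latin-1"),    # step 2: adds xA0-xFF
        ("windows-1252", "cp1252"),   # step 3: adds printable x80-x9F
    )
    for label, codec in candidates:
        try:
            text.encode(codec)
            return label
        except UnicodeEncodeError:
            continue
    return "UTF-8"  # assumed fallback, outside the three-step rule

for sample in ("hello", "caf\u00e9", "\u201cquoted\u201d", "\u4f60\u597d"):
    print(pick_charset(sample))
```

Labelling by the narrowest charset that fits keeps plain messages readable in older clients while still declaring CP-1252 whenever a x80-x9F character is present.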




Re: Subject lines in UTF-8 mssgs? [was: Proposal to make ...]

2000-07-14 Thread John Cowan

Chris Wendt wrote:

 This is relevant when you are running with a non-English OS locale. It
 prevents non-US-ASCII day and month names from being entered in the reply
 header, so that you are not forced to send in UTF-8 when you write in a
 script different from that of the OS locale.

How's that?  The Date: header on outgoing email is localized to the sender's locale?
That seems to be a clear-cut violation of RFC-822, and damaging to interoperability
(because I must know every possible localized month name to interpret the
header).

It would make *much* more sense to localize the Date: headers on incoming email.

-- 

Schlingt dreifach einen Kreis um dies! || John Cowan [EMAIL PROTECTED]
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,   || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.-- Coleridge (tr. Politzer)
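
The point about keeping the transmitted Date: header locale-independent and localizing only for display can be illustrated with Python's standard email.utils module. This is a sketch; the timestamp is an illustrative value, not taken from any message header in this thread.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

# Sender side: the header uses fixed English day and month abbreviations,
# whatever the sender's locale is.
sent = datetime(2000, 7, 14, 8, 51, tzinfo=timezone.utc)
header_value = format_datetime(sent)
print("Date:", header_value)

# Receiver side: parse the fixed format, then present it in whatever
# local convention the reader prefers (here: the process locale).
received = parsedate_to_datetime(header_value)
print(received.astimezone().strftime("%x %X"))
```

Because the wire format never varies, the receiver needs to know only one set of month names, which is the interoperability concern raised above.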



RE: Subject lines in UTF-8 mssgs? [was: Proposal to make ...]

2000-07-14 Thread Chris Wendt

I shouldn't have used "header". What I meant is not the message header in
the RFC 822 sense but the information out of the header that gets copied
into the message BODY on a reply.

Example right below.

-----Original Message-----
From: John Cowan [mailto:[EMAIL PROTECTED]]
Sent: Friday, July 14, 2000 8:51 AM
To: Unicode List
Subject: Re: Subject lines in UTF-8 mssgs? [was: Proposal to make ...]


Chris Wendt wrote:

 This is relevant when you are running with a non-English OS locale. It
 prevents non-US-ASCII day and month names from being entered in the reply
 header, so that you are not forced to send in UTF-8 when you write in a
 script different from that of the OS locale.

How's that?  The Date: header on outgoing email is localized to the sender's
locale?  That seems to be a clear-cut violation of RFC-822, and damaging to
interoperability (because I must know every possible localized month name to
interpret the header).

It would make *much* more sense to localize the Date: headers on incoming
email.



Re: Subject lines in UTF-8 mssgs? [was:

2000-07-13 Thread Michael (michka) Kaplan

 I forced the encoding to UTF-8 (it is supposed to be the
 default in my setting, but most of my messages arrive as
 charset="windows-1252"), and I am using some Chinese
 characters that are certainly not in my system's default
 code page:

 你好、雅朴。
 _馬可。

Note that this may not necessarily have forced UTF-8, since OE supports encodings
for Chinese characters that you could also use to send the message.

UTF-8 *is* required for languages that have no such supported encoding, like
Tamil.

<showing_off>
உலகம் பேச நினைக்கும் போது Unicode 
பேசுகிறது
</showing_off>

On the whole, I would not recommend sending mail using those other
encodings, I believe that people using OE 5.0 and later will be prompted to
install language support just by opening the e-mail! :-)

michka

(the sentence is right, by the way <g>).





Re: Subject lines....../ Lost Header?? Re: [nothing]

2000-07-13 Thread Jaap Pranger


My previous message of a few minutes ago,
with the empty "Re: "-only header (at least as I got
it back from the list server), left my home with a header
as shown below.

Any information about the whereabouts
of my lost Head leading to its recovery ...



Re: =?utf-8?B?UkU6IFN1YmplY3QgbGluZXMgaW4gVVRGLTggbXNzZ3M/IFt3YXM6?=





Jaap
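
The Subject value quoted in this message is an encoded-word of the kind defined for non-ASCII header text (charset utf-8, "B" for Base64). As a sketch of how a mail client recovers the readable text, the following uses Python's standard email.header module. The helper name decode_subject is hypothetical.

```python
from email.header import decode_header

def decode_subject(raw: str) -> str:
    """Decode any encoded-words in a header value into readable text."""
    parts = []
    for value, charset in decode_header(raw):
        if isinstance(value, bytes):
            # Unencoded runs come back as bytes with charset None.
            parts.append(value.decode(charset or "ascii", errors="replace"))
        else:
            parts.append(value)
    return "".join(parts)

raw = "Re: =?utf-8?B?UkU6IFN1YmplY3QgbGluZXMgaW4gVVRGLTggbXNzZ3M/IFt3YXM6?="
print(decode_subject(raw))
```

The Base64 payload here decodes to "RE: Subject lines in UTF-8 mssgs? [was:", i.e. the subject of the thread cut off at the same point as in the archive. A client or list server that does not decode encoded-words shows the raw form instead, and one that mishandles them may drop the text entirely.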

-- 





(no subject)

2000-07-12 Thread Akil Fahd

I am trying to create a bilingual and bidirectional (Arabic and English)
Qur'an e-book that will be compliant with the Open eBook (OEB) specification.
It is targeted at the PalmOS, but should be renderable in XML and/or
XHTML compliant browsers such as IE 5.0 and Netscape 6.0, or in any type of Open
eBook reader.

I already have the HTML files of the entire Qur'an in Arabic and English,
though I will have them proofread many times before I distribute the
completed eBook.

The Arabic pages are coded using the win-1256 (Arabic) codepage in the 
following manner:

<HTML DIR=RTL>

<head>

<META content="text/html; charset=windows-1256" http-equiv=Content-Type>
<body>

<p align="right">

<font face = "Traditional Arabic">

<font size = "5pt">

These pages show up fine (correct font and directionality) when using the IE 
5.0 browser, however when I convert them to the PalmOS, the right to left 
directionality is lost.

In order to convert the HTML pages to the OEB eBook format I'm using the
MobiPocket Publisher (home page
http://www.mobipocket.com/en/HomePage/default.asp), which creates a prc file
from the HTML files.

In order to test the conversion to the PalmOS, I'm using the PalmOS Emulator 
(running a 3.5 Palm OS IIIc rom) with the APOS 2.0 (home page 
http://www.arabicpalm.com/) and Mobipocket Reader software installed.

The above setup is being tested on Windows 98 (Arabic Enabled Edition) and 
Windows 2000 PCs.

The prc files created using this method display the Arabic font on the
emulator's Palm IIIc screen (when using the MobiPocket reader); however, the
correct direction is not enforced.

Please note that the Arabic and English text are coded in separate HTML files.

My questions are as follows:

How can I convert from cp 1256 to Unicode without doing it character by
character? Is there software that will do this?

Does the eBook spec allow for the nesting of a right-to-left language
(Arabic) inside a left-to-right language (English) on the same page?

Does anyone know if APOS is Unicode compliant?

Any advice or examples would be greatly appreciated, as I have not found any
examples of how to nest languages (with different text and directionality)
in the Palm doc or prc formats.

Akil
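
On the first question, converting a whole file from code page 1256 to Unicode is a bulk decode-and-re-encode, not a character-by-character job. A minimal Python sketch is below. The function name and the file handling are assumptions for illustration. It also rewrites the charset declaration shown in the excerpt so the page no longer claims to be windows-1256.

```python
from pathlib import Path

def convert_cp1256_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode one windows-1256 HTML file as UTF-8."""
    # Decode the whole file in one step using the cp1256 codec.
    text = src.read_bytes().decode("cp1256")
    # Keep the declared charset in step with the new byte encoding.
    text = text.replace("charset=windows-1256", "charset=utf-8")
    dst.write_bytes(text.encode("utf-8"))
```

A call such as convert_cp1256_to_utf8(Path("page.html"), Path("page.utf8.html")) would convert one page (both file names are placeholders). Re-encoding does not by itself restore right-to-left layout, which depends on the reader software.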






Re: Subject lines in UTF-8 mssgs? [was: Proposal to make ...]

2000-07-12 Thread Christopher J. Fynn


"Jaap Pranger" [EMAIL PROTECTED] wrote:

 At 16:44 +0200 2000.07.12, [EMAIL PROTECTED] wrote:

 Everybody (beginning by myself!) should probably be more careful 
 in naming subject lines, and renaming them when a reply deviates 
 from the subject.
 
 Marco,
 
 This will not help very much when you send UTF-8 messages. Your 
 Subject lines in those messages show up completely "garbled", at 
 least in my non-UTF-8-aware email client. OK, that's my problem. 
 But mostly other people's UTF-8 messages show 'neat' Subject headers.  
 What's going on, why this difference? 
 
 Jaap
 
In Outlook Express, under Tools, Options, Send, International Settings,
it is possible to specify that only English (? ASCII) is used in headers,
and under Tools, Options, Send, Plain Text Settings and Tools, Options,
Send, HTML Settings it is possible to specify whether or not 8-bit
characters may be used in message headers.

These settings seem to apply whatever encoding is used for the body 
of the message.

- Chris




RE: Subject lines in UTF-8 mssgs? [was: Proposal to make ...]

2000-07-12 Thread Chris Wendt

From: Christopher J. Fynn [mailto:[EMAIL PROTECTED]]
In Outlook Express under Tools, Options, Send,  International Settings 
it is possible to specify that only English  (? ASCII) is used in headers 

This is relevant when you are running with a non-English OS locale. It
prevents non-US-ASCII day and month names from being entered in the reply
header, so that you are not forced to send in UTF-8 when you write in a
script different from that of the OS locale.

and under Tools, Options, Send, Plain Text Settings and Tools, Options, 
Send, HTML Settings it is possible to specify whether or not 8-bit 
characters may be used in message headers.

This does not prevent non-usascii characters in the header. It only decides
if the non-usascii characters will be RFC1522 encoded or sent as raw 8-bit
bytes - each in the chosen encoding.

These settings seem to apply whatever encoding is used for the body 
of the message.

Yes, correct.
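
The difference between the two options described above, encoded-words (RFC 1522, later replaced by RFC 2047) versus raw 8-bit bytes, can be seen with Python's standard email.header module. This is a sketch with an illustrative subject string, not a reproduction of what Outlook Express emits.

```python
from email.header import Header, decode_header

subject = "Gr\u00fc\u00dfe aus Z\u00fcrich"  # illustrative non-ASCII subject

# Option 1: encoded-word form, pure ASCII on the wire.
encoded = Header(subject, charset="utf-8").encode()
print(encoded)

# Option 2: the same text as raw 8-bit bytes in the chosen encoding.
raw_8bit = subject.encode("utf-8")
print(raw_8bit)

# A receiver that understands encoded-words gets the original text back.
decoded = b"".join(value for value, _ in decode_header(encoded)).decode("utf-8")
print(decoded == subject)
```

The encoded form survives 7-bit transports and names its own charset. The raw form is shorter but leaves the receiver to guess which encoding the header bytes are in.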