[no subject]
Hello, Just a question here. The Zodiac sign Capricorn has an alternate Glyph/Symbol (see below): http://www.capricornzodiacsign.net/capricornsymbol.htm It is only vaguely similar to the glyph found in the Unicode charts and on astrological sites, and astrological software sometimes offers a choice between the two. Since every font I have checked on my computer uses a glyph close to the one in the Unicode charts (if it has Zodiac symbols at all), I am thinking that it might be best to propose this as a separate character. Is this a good idea? Also, Zodiac signs right now have Emoji representations. Would I have to submit this as an Emoji rather than a symbol? Would I have to make up a coloured Emoji Glyph? Thanks for any responses. David Faulks
[no subject]
[Note: message resent using another domain. Apparently the Unicode mailing list rejects as spam all emails posted from Gmail's webmail, even though they contain all the relevant tracking MIME headers and are regularly signed by Google with my proven identity.] 2015-03-28 12:30 GMT+01:00 Michael Norton : Thanks Doug. I did not know there exists a representative sample of the world's text. :) I do know that 400 years ago there were about 10,000 languages; now there are about 6,500. Time flies! Your frequency chart is great. The average char appearance is 2.91%. Only 34% from your list exceed 10% of it. Therefore, U+0020 is the elephant in the room (i.e. 15.05% is far above 2.91%). In fact, it's almost 50% greater than the next most-appearing character. So from the two frequency lists you've given me (my email and yours) we begin to see some patterns emerge. Provided prior data and observation, most useful patterns prevail over other more obscure ones and present a provocative opportunity for webbers out there... While this is probably out of context for most of the 700 Unicode members, I can report that it's good news. A long time ago I learned a word (or is it an acronym? It is not really an abbreviation in itself, even though it is pronounceable) used by French cryptanalysts (working on simple substitution ciphers): ESARTINULOC (some older sources gave ESANTIRULO). It is the list of the basic letters most frequently used in French, in decreasing order of frequency (ignoring case and diacritic differences). It is also used implicitly by gamers (e.g. when playing or composing crosswords, or playing games such as Scrabble(TM), where the top letters of the list have lower scoring values, which differ between French Scrabble and English Scrabble). 
That word is slightly different in English, or in the limited global counting Doug did (over an extremely limited set of source texts); of course in French the SPACE would also lead the list, ahead of that word (but it does not count for crosswords or Scrabble, even in languages that don't use spaces for word separation). More accurate statistics may be obtained from databases with plain-text search capabilities (in the structure of their index), provided they correctly track the language used and their data covers a large enough set of domains (e.g. statistics of the plain-text search engines for each **localized** edition of Wikipedia, Wiktionary, or Wikisource). Global statistics are more difficult (Wikimedia Commons is insufficiently translated, and English is over-represented there), but you can estimate the usage rate of each main language (or macrolanguage) and weight the statistics collected for each language to obtain an estimated global frequency list. But be careful: each language has its own set of collation rules, so letters that have the same primary weight in one language are distinguished and counted separately in another. You may find that a source ü or ä had its rate actually computed as UE or AE in German, but only as U or A in English or French, and this will not allow you to correctly estimate the global frequency rates of U, A and E. A simple linear transform (scalar products of the usage rates of languages and the per-language usage rates of letters) would not work: the global usage rate of E would be underestimated where it also represents the German umlaut, and both U and A would be overestimated...
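The counting caveat above can be illustrated with a minimal Python sketch. It is not the method used for any list quoted in this thread: the function names and the `GERMAN_EXPANSIONS` table are hypothetical, and the "German" policy shown is only one possible folding convention, assumed here for illustration.

```python
import unicodedata
from collections import Counter

# Hypothetical folding policy: count German umlauts as two letters.
GERMAN_EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def count_letters(text, expansions=None):
    """Count base letters, ignoring case and diacritics.

    `expansions` optionally rewrites some letters before counting
    (e.g. "ü" -> "ue"); without it, "ü" is simply counted as "U".
    """
    expansions = expansions or {}
    expanded = "".join(expansions.get(ch, ch) for ch in text.lower())
    counts = Counter()
    # NFD splits "ü" into "u" + combining diaeresis; the mark is skipped.
    for ch in unicodedata.normalize("NFD", expanded):
        if ch.isalpha() and not unicodedata.combining(ch):
            counts[ch.upper()] += 1
    return counts

def frequency_order(counts):
    """Letters from most to least frequent (ties broken alphabetically)."""
    return "".join(sorted(counts, key=lambda c: (-counts[c], c)))

sample = "Müller über"
plain = count_letters(sample)                      # ü counted as U
german = count_letters(sample, GERMAN_EXPANSIONS)  # ü counted as U + E
print(plain["E"], german["E"])
```

The same text yields a different count for E depending on the folding policy, which is why per-language tables built under different conventions cannot simply be combined by a weighted sum.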
Re: Is the Subject field of an e-mail an obvious example of plain text where no higher level protocol application is possible?
Hello Karl, On 2012/07/21 0:41, Karl Pentzlin wrote: Looking for an example of plain text which is obvious to anybody, it seems to me that the Subject field of e-mails is a good example. Common e-mail software lets you enter any text but gives you never access to any higher-level protocol. Possibly you can select the font in which the subject line is shown, but this is completely independent of the font your subject line is shown at the recipient. Thus, you transfer here plain text, and you can use exactly the characters which either Unicode provides to you, or which are PUA characters which you have agreed upon with the recipient before. In fact, the de-facto-standard regulating the e-mail content (RFC 2822, dated April 2001 http://www.ietf.org/rfc/rfc2822.txt , afaik) No. If you go to http://tools.ietf.org/html/rfc2822, you'll see Obsoleted by: 5322, Updated by: 5335, 5336. RFC 5322 is the new version, date October 2008, but doesn't change much. RFC 5335 and 5336 are experimental for encoding the Subject (and a lot of other fields) as raw UTF-8 if the email infrastructure supports it. There are Standards Track updates for these two, RFC 6531 and 6532. But what's more important for your question, at least in theory, is http://tools.ietf.org/html/rfc2231, which defines a way to add language information to header fields such as Subject:. With such information, it would stop to be plain text. In practice, RFC 2231 is not well known, and even less used, so except for detailed technical discussion, your example should be good enough. Regards, Martin. defines the content of the Subject line as unstructured (p.25), which means that is has to consist of US-ASCII characters, which in turn can denote other (e.g. Unicode) characters by the application of MIME protocols. Thus, the result is an unstructured character sequence. There is e.g. 
no possibility to include superscripted/subscripted characters in a Subject of an e-mail, unless these characters are in fact included as superscript/subscript characters in Unicode directly. Thus, proving the necessity to include a character in the text of a Subject line of an e-mail, is proving that the character has to be available as a plain text character. If, additionally, the character is used outside a closed group (which can be advised to use PUA characters), then there is a valid argument to include such a character in Unicode. Is my assumption correct? (I think of the SUBSCRIPT SOLIDUS proposed in WG2 N3980. It is in fact annoying that you cannot address DIN EN 13501 requirements in an e-mail subject line written correctly, as Unicode, although being an industry standard, until now did not listen to an industry request at this special topic.) - Karl
Is the Subject field of an e-mail an obvious example of plain text where no higher level protocol application is possible?
Looking for an example of plain text which is obvious to anybody, it seems to me that the Subject field of e-mails is a good example. Common e-mail software lets you enter any text but never gives you access to any higher-level protocol. Possibly you can select the font in which the subject line is shown, but this is completely independent of the font in which your subject line is shown to the recipient. Thus, you transfer plain text here, and you can use exactly the characters which either Unicode provides to you, or which are PUA characters which you have agreed upon with the recipient beforehand. In fact, the de-facto standard regulating e-mail content (RFC 2822, dated April 2001 http://www.ietf.org/rfc/rfc2822.txt , afaik) defines the content of the Subject line as unstructured (p.25), which means that it has to consist of US-ASCII characters, which in turn can denote other (e.g. Unicode) characters by the application of MIME protocols. Thus, the result is an unstructured character sequence. There is e.g. no possibility to include superscripted/subscripted characters in a Subject of an e-mail, unless these characters are in fact included as superscript/subscript characters in Unicode directly. Thus, proving the necessity to include a character in the text of a Subject line of an e-mail is proving that the character has to be available as a plain text character. If, additionally, the character is used outside a closed group (which can be advised to use PUA characters), then there is a valid argument to include such a character in Unicode. Is my assumption correct? (I think of the SUBSCRIPT SOLIDUS proposed in WG2 N3980. It is in fact annoying that you cannot address DIN EN 13501 requirements in an e-mail subject line written correctly, as Unicode, although being an industry standard, until now did not listen to an industry request on this special topic.) - Karl
Re: Is the Subject field of an e-mail an obvious example of plain text where no higher level protocol application is possible?
The Subject field is subject to special encoding such as Quoted-Printable or Base64, signalled by specific prefixes. This is necessary because the MIME headers specifying the mail encoding apply only to the mail body, not to the headers themselves. For this reason it is not strictly plain text. Additionally it has specific formatting conventions related to the use of spaces and continuation lines if needed. Not all mail reader agents will recognize the Quoted-Printable or Base64 signatures found in these headers (notably in: subject, from, to), but most now actually decode them properly, provided that the prefixes specify a supported charset. UTF-8 is one of those charsets that will be most frequently recognized, but ISO-8859-1 is still much more often recognized. For Chinese, or Japanese, UTF-8 is rarely used. There's no way to specify a font to render the encoded characters. When the headers contain 8-bit byte values, there's some assumption that they will be decoded with the encoding found or specified in the mail body, but this is unreliable. 2012/7/20 Karl Pentzlin karl-pentz...@acssoft.de: Looking for an example of plain text which is obvious to anybody, it seems to me that the Subject field of e-mails is a good example. Common e-mail software lets you enter any text but gives you never access to any higher-level protocol. Possibly you can select the font in which the subject line is shown, but this is completely independent of the font your subject line is shown at the recipient. Thus, you transfer here plain text, and you can use exactly the characters which either Unicode provides to you, or which are PUA characters which you have agreed upon with the recipient before.
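As a rough illustration of the prefixed encoding described above, here is a small sketch using Python's standard `email.header` module. The helper names `encode_subject` and `decode_subject` and the sample subject are made up for this example; whether a given mail client produces Base64 or Quoted-Printable for the same text may differ.

```python
from email.header import Header, decode_header, make_header

def encode_subject(text, charset="utf-8"):
    """Return an ASCII-only header value using MIME encoded-words."""
    return Header(text, charset=charset).encode()

def decode_subject(raw):
    """Turn an encoded header value back into a Unicode string."""
    return str(make_header(decode_header(raw)))

subject = "Prüfung nach DIN EN 13501"
wire = encode_subject(subject)   # something of the form =?utf-8?...?=
print(wire)
print(decode_subject(wire))
```

What travels on the wire is pure US-ASCII; the charset label and the encoding flag in the prefix tell the receiving agent how to recover the original characters.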
Re: Is the Subject field of an e-mail an obvious example of plain text where no higher level protocol application is possible?
On 7/20/2012 8:41 AM, Karl Pentzlin wrote: Looking for an example of plain text which is obvious to anybody, it seems to me that the Subject field of e-mails is a good example. By common convention, certain notational features have been relegated to styled text. Super and subscript in mathematical, chemical and other notation belongs to that class. There have been occasional calls to add certain explicit characters, but they have been either rejected or met with such chilly response on preliminary inquiry that no formal submission was ever made. Subscript and superscript are essential features of such a notation, but most people can live with not having access to the full notation in the subject line. (No mathematician expects to be able to place a fully built-up equation there, even if his software supports plain text math, as defined in UTN#28). A much stronger case than subject lines are regulatory databases with plain-text fields in their records. A German company had approached Unicode with the problem that even the in-line formulas for chemical compounds needed a few subscript characters beyond digits, in particular the Greek letters alpha, beta and gamma (not the whole alphabet). That request died before being taken up by the committee. I have no idea how that industry solved their problem; after all, the regulatory mandate didn't disappear. However, as it stands, the de-facto precedent is to not accommodate such usage by coding characters. The situation with DIN EN 13501 seems to be entirely equivalent; in fact I find it less likely that a subject line, to be intelligible and specific, would require the particular character in question than the letters needed to be able to write a full chemical formula (in the style of C₂H₆O). People just make do, writing C2H6O etc. (check chemical formula of alcohol on google, to see what I mean). [Some organic compounds also use Greek letters, I don't have an example, not being a chemist.] 
If the users for which such near plain text notations are part of their daily work were to report that subject lines, database plain text fields and other such bottlenecks are causing serious issues, then I think Unicode and WG2 should listen carefully. However, this should be something that's broadly anchored in those user communities. Let them demonstrate that there's a real practical need that outweighs the dual representation issue. A./
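For the digits-only case mentioned above (C2H6O versus C₂H₆O), a minimal Python sketch follows. The helper name `to_subscript_formula` is hypothetical, and the function naively subscripts every digit it sees, which is only appropriate for simple formulas like the one in the message.

```python
import unicodedata

# The ten subscript digits U+2080..U+2089 are already encoded in Unicode.
SUBSCRIPT_DIGITS = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")

def to_subscript_formula(formula):
    """Replace every ASCII digit with the matching subscript digit."""
    return formula.translate(SUBSCRIPT_DIGITS)

print(to_subscript_formula("C2H6O"))
print(unicodedata.name("₂"))
```

Anything beyond characters that are already encoded has no such plain-text mapping, which is the gap the thread is discussing.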
Re: Is the Subject field of an e-mail an obvious example of plain text where no higher level protocol application is possible?
2012-07-20 19:52, Philippe Verdy wrote: The Subject fi[el]d is subject to special encoding like Quoted-Printable or Base64 using specific prefixes. This is a matter of character encoding. All plain text inevitably has some encoding, and the encoding may vary without changing the plain text status. Admittedly, QP and Base64 may be interpreted as being a higher-level protocol, but they can be applied to any plain text, and I don’t think this changes plain text to non-plain. Additionally it has specific formatting conventions related to the use of spaces and continuation lines if needed. This is a real deviation from plain text principles and applies to e-mail message headers in general. As per clause 2.2.3 of RFC 2822, the header is logically a single line but may contain CR LF, which will be unfolded. Yucca
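The unfolding mentioned above (RFC 2822, section 2.2.3, which describes it as removing any CRLF that is immediately followed by whitespace) can be sketched in a few lines of Python. The function name `unfold` is an illustrative choice, and the sketch handles only this one rule, not full header parsing.

```python
import re

def unfold(header_text):
    """Remove each CRLF that is immediately followed by a space or tab."""
    return re.sub(r"\r\n(?=[ \t])", "", header_text)

folded = "Subject: This is\r\n a test"
print(unfold(folded))
```

The whitespace that followed the line break is kept, so the folded and unfolded forms denote the same logical single-line header.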
RE: Is the Subject field of an e-mail an obvious example of plain text where no higher level protocol application is possible?
A) it can use quoted-printable B) See RFC 6532/6530 - Now it can be UTF-8 :) -Shawn
Re: Is the Subject field of an e-mail an obvious example of plain text where no higher level protocol application is possible?
2012-07-20 20:19, Asmus Freytag wrote: On 7/20/2012 8:41 AM, Karl Pentzlin wrote: Looking for an example of plain text which is obvious to anybody, it seems to me that the Subject field of e-mails is a good example. By common convention, certain notational features have been relegated to styled text. Super and subscript in mathematical, chemical and other notation belongs to that class. I’m afraid I don’t quite follow. Superscripts and subscripts can be presented using styling or other higher-level protocols, or specialized superscript or subscript characters can be used, in many cases. But this does not seem to be relevant to the question whether “Subject” fields are a good example of plain text. A much stronger case than subject lines are regulatory databases with plain-text fields in their records. It’s part of the database design to decide whether fields are plain text, so I don’t quite get the point. Sometimes people would like plain text to cover things that do not exist as Unicode characters now, but that’s a different topic. If the users for which such near plain text notations are part of their daily work were to report that subject lines, database plain text fields and other such bottlenecks are causing serious issues, then I think Unicode and WG2 should listen carefully. Instead of getting into theoretical considerations of “near plain text”, I think the question is whether there is sufficient evidence of real-life needs for new subscript or superscript characters. In general, coding of new characters requires demonstrated *use* of symbols as text characters, rather than arguments about *need* to use them. But even the need is questionable: e-mail headings are supposed to be short texts that tell what the message is about, not complicated formulas. And it’s part of database design to decide that you use some fields for some purposes and make them plain text fields, instead of (somehow) allowing styling inside them. Yucca
Re: Is the Subject field of an e-mail an obvious example of plain text where no higher level protocol application is possible?
On 7/20/2012 1:34 PM, Jukka K. Korpela wrote: 2012-07-20 20:19, Asmus Freytag wrote: On 7/20/2012 8:41 AM, Karl Pentzlin wrote: Looking for an example of plain text which is obvious to anybody, it seems to me that the Subject field of e-mails is a good example. By common convention, certain notational features have been relegated to styled text. Super and subscript in mathematical, chemical and other notation belongs to that class. I’m afraid I don’t quite follow. Yeah, I think in this case you missed the point of what I was trying to say. A./
[no subject]
I know that there are some combining characters, and a lot of base characters. But, is there any way to use a base character as a combining character? Please help me! - Michael Norton (a.k.a. Flarn) E-mail address: [EMAIL PROTECTED]
[no subject]
A new translation has been posted on the Unicode website: What is Unicode? in Slovenian http://www.unicode.org/standard/translations/slovenian.html --- Magda Danish Sr. Administrative Director The Unicode Consortium 650-693-3921 [EMAIL PROTECTED]
[no subject]
From: Peter Constable [EMAIL PROTECTED] Date: Thu, 23 Sep 2004 17:14:29 -0700 Here’s the abstract for one of the presentations at ATypI next week. Will this be the every-character-has-a-story repository we’ve always wished for?
Decode Unicode! A typographic database Johannes Bergerhausen Friday 1 October | 14:15 – 15:00 Location: A-2 (Archa Hall 2) Presentation | Theme: Typographic Babylon | Duration: 45 minutes After the DNA, the ASCII-Code is the most successful code on this planet. The Unicode will even be better. Now is the right time to gather and explain the meaning, history and correct typographic use of each Unicode-Caracter. Who “invented” the full stop? When did the Infinity-Sign come into being? What’s an Ogonek? In an 18-month project in the department of Design at the University of Applied Sciences in Mainz, Germany, we are collecting images, samples and texts about each and every sign in the Code. In the near future, the project will be opened for anyone to submit their own material. In his lecture, Prof. Bergerhausen will give an introduction to code-history from ASCII to Unicode and will present the project that is supported by the Germany Federal Ministry of Education and Research.
Speaker details Johannes Bergerhausen http://www.atypi.org/08_Prague/30_program/40_speakers/view_person_html?personid=1130 Professor Fachhochschule Mainz | Germany Prof. Johannes Bergerhausen, born 1965 in Bonn, Germany, studied Visual Communication at the University of Applied Sciences in Düsseldorf. From 1993 to 2000, he lived and worked in Paris. First he collaborated with the Founders of Grapus, Gérard Paris-Clavel and Pierre Bernard, then he founded his own office. In 1998 he was awarded a grant from the French Centre National des Arts Plastiques for a typographic research project on the ASCII-Code. Lectures in Amiens, Paris, Rotterdam, Warsaw, Weimar. He returned to Germany in 2000, since 2002 he is Professor of Typography at the University of Applied Sciences in Mainz. In 2003, together with Paris-Clavel, he published the font “LeBuro” at ACME Fonts, London.
Back to the subject: Folding algorithm and canonical equivalence
There has been extensive discussion in this thread on the specifics of accent and diacritic folding. But no one has answered my point, repeated below, that there seems to be a conflict between the folding algorithm (rather than the details of specific foldings) and the principle of canonical equivalence. Specifically, it seems to breach the principle in Unicode Conformance Clause C9: "Ideally, an implementation would always interpret two canonical-equivalent character sequences identically. There are practical circumstances under which implementations may reasonably distinguish them." Are the authors of UTR #30 claiming that folding is one of those practical circumstances, or is this just an oversight? Peter Kirk On 17/07/2004 23:25, Peter Kirk wrote: I was just reviewing the UTR #30 draft in response to Rick's notice about it. And I believe I may have found a point in which the folding algorithm as given may violate the principle of canonical equivalence. But I would like some clarification from list members before providing formal input on this point. Consider a sequence made up of a base character B and two combining marks M1 and M2, in which the combining class of M1 is less than that of M2. B, M1, M2 and B, M2, M1 are canonically equivalent representations of the same sequence, but only the former is in canonical order. Suppose that a folding is defined including the operation B, M2 -> X, but no other relevant operations. When this folding is applied, according to the folding algorithms defined in sections 4.1.1 and 4.1.2 of the UTR #30 draft, in step (a) the sequence B, M2, M1 will be folded to X, M1 and will not be further changed, but the sequence B, M1, M2 will not be changed at all by the folding because the sequence B, M2 will never be found. 
(By contrast, a folding operation B, M1 -> Y will be applied to both sequences, because the canonical decomposition step converts B, M2, M1 to B, M1, M2 and the folding operation is re-applied and finds a match the second time.) The implication is that folding of two canonically equivalent strings gives different (and not canonically equivalent) results. This is not a purely theoretical point. The Diacritic Folding as specified in http://www.unicode.org/reports/tr30/datafiles/DiacriticFolding.txt includes operations like 05D1 05BC -> 05D1, i.e. BET, DAGESH -> BET, but no general rule to delete DAGESH (or any other combining marks; I think there needs to be such a rule, and I have already posted a formal response saying that). Sequences like BET, DAGESH, PATAH are very common in Hebrew text, and commonly written in this order which is logically correct and preferred by current rendering technologies, but the canonical order is in fact BET, PATAH, DAGESH; thus both sequences will be found in data depending on whether or not it has been normalised. The effect of applying Diacritic Folding exactly as specified is that BET, DAGESH, PATAH is folded to BET, PATAH, but the canonically equivalent BET, PATAH, DAGESH is unchanged. (In fact I consider that both should be folded to just BET, but that is not what the current data file specifies.) I hope I have not totally misunderstood the folding algorithm here. But it seems to me that what is missing in the algorithm is an initial step of normalising the data. The introductory text to section 4 seems to suggest that this has been avoided because folding may need to preserve the distinction between NFC and NFD data - although the algorithm as presented does not in fact do this. Since in practice the input data is not necessarily in either NFC or NFD and there is no easy way to detect which is being used, the only meaningful approach is for the user of the folding to specify whether the output of the folding should be NFC or NFD. 
Of course there might be a real requirement for a folding which, for example, removes DAGESH when combined with BET (but not with other base characters) irrespective of what other combining marks might intervene. But such foldings would need a considerably more powerful folding algorithm. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
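The BET, DAGESH, PATAH case described in this message can be reproduced with a short Python sketch. This is only an approximation based on the description above, not the UTR #30 algorithm itself: the function names are hypothetical, and the "fold, then decompose, repeat" loop is an assumed reading of the steps as summarised in the message, with a single rule (BET, DAGESH folded to BET).

```python
import unicodedata

BET, DAGESH, PATAH = "\u05D1", "\u05BC", "\u05B7"

def fold_then_decompose(text):
    """Apply the rule BET+DAGESH -> BET, decompose, repeat until stable."""
    while True:
        folded = unicodedata.normalize("NFD", text.replace(BET + DAGESH, BET))
        if folded == text:
            return folded
        text = folded

def normalize_then_fold(text):
    """Same rule, but with the input put into NFD before the first pass."""
    return fold_then_decompose(unicodedata.normalize("NFD", text))

typed = BET + DAGESH + PATAH      # common typing order
canonical = BET + PATAH + DAGESH  # canonical order (PATAH has the lower class)

# The two inputs are canonically equivalent...
same_input = unicodedata.normalize("NFD", typed) == canonical
# ...but folding without an initial normalisation treats them differently.
a, b = fold_then_decompose(typed), fold_then_decompose(canonical)
c, d = normalize_then_fold(typed), normalize_then_fold(canonical)
print(same_input, a == b, c == d)
```

With an initial NFD step the two inputs at least fold to the same result, although the DAGESH then survives in both, which matches the remark that a rule tied to a specific adjacent pair is not enough to remove the mark in general.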
Re: Back to the subject: Folding algorithm and canonical equivalence
You did point out an oversight; Asmus and I have been working on the issue. Mark - Original Message - From: Peter Kirk [EMAIL PROTECTED] To: Unicode List [EMAIL PROTECTED] Sent: Monday, July 19, 2004 13:21 Subject: Back to the subject: Folding algorithm and canonical equivalence
Re: Back to the subject: Folding algorithm and canonical equivalence
At 01:56 PM 7/19/2004, Mark Davis wrote: You did point out an oversight; Asmus and I have been working on the issue. Mark As Mark wrote, your point is taken and we've taken that onboard. However, we won't try to *edit* text on the list; that's why we are not engaging in a long discussion on the details (and we've discovered many interesting ones, wait for the next version of the text). In my replies I tend to focus on issues for which I need more information. A./ PS: Just one final comment: Ideally, an implementation would always interpret two canonical-equivalent character sequences identically. There are practical circumstances under which implementations may reasonably distinguish them. Are the authors of UTR #30 claiming that folding is one of those practical circumstances, or is this just an oversight? As it turns out, and not surprisingly, realizing that ideal for any arbitrary type of possible folding rule can get complicated (again, I won't go into details right now). There may be situations where an optimization would break canonical equivalence in the face of permissible, but unusual, if not to say 'non-sensical' input. That's what's meant by 'practical circumstances'. If the ability to 'correctly' handle combining sequences that are a random mixture of Khmer and Arabic combining marks were to result in severe runtime penalties, would you rather have a 'correct' or a fast implementation? Nobody argues that sequences that are expected to occur in realistic data, including specialized texts, definitely should be handled as expected, even where practicalities require some optimizations. So, we are all agreed.
Re: Back to the subject: Folding algorithm and canonical equivalence
On 19/07/2004 23:23, Asmus Freytag wrote: At 01:56 PM 7/19/2004, Mark Davis wrote: You did point out an oversight; Asmus and I have been working on the issue. Mark As Mark wrote, your point is taken and we've taken that on board. However, we won't try to *edit* text on the list; that's why we are not engaging in a long discussion on the details (and we've discovered many interesting ones, wait for the next version of the text). In my replies I tend to focus on issues for which I need more information. Fair enough. I just wondered if I needed to raise this one as a formal feedback issue. From what you say here, I assume not. A./ PS: Just one final comment: Ideally, an implementation would always interpret two canonical-equivalent character sequences identically. There are practical circumstances under which implementations may reasonably distinguish them. Are the authors of UTR #30 claiming that folding is one of those practical circumstances, or is this just an oversight? As it turns out, and not surprisingly, realizing that ideal for any arbitrary type of possible folding rule can get complicated (again, I won't go into details right now). There may be situations where an optimization would break canonical equivalence in the face of permissible, but unusual, if not to say 'nonsensical' input. That's what's meant by 'practical circumstances'. If the ability to 'correctly' handle combining sequences that are a random mixture of Khmer and Arabic combining marks were to result in severe runtime penalties, would you rather have a 'correct' or a fast implementation? Again, fair enough. But I would be surprised if this is a real issue with the folding algorithm. Indeed I would expect, given that decomposition, presumably to NFD, is anyway required after the first folding pass, that there would be little or no performance hit in normalising the text to be folded to NFD before the first folding pass.
Nobody disputes that sequences that are expected to occur in realistic data, including specialized texts, definitely should be handled as expected, even where practicalities require some optimizations. Yes, but I did make the point that the issue I brought up is not a purely theoretical one, but a very real one for Hebrew with the diacritic removal folding as defined. So, we are all agreed. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
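Peter's suggestion of normalising to NFD before the first folding pass can be illustrated with a minimal sketch. This is not the UTR #30 algorithm: the function name `fold_remove_diacritics`, the choice to drop every character of general category Mn, and the `form` parameter (the caller choosing NFC or NFD output, as proposed earlier in the thread) are all assumptions made for illustration.

```python
import unicodedata

def fold_remove_diacritics(text: str, form: str = "NFC") -> str:
    """Hypothetical diacritic-removal folding (an assumption, not UTR #30).

    1. Normalise the input to NFD so canonically equivalent inputs
       become the same code point sequence.
    2. Drop nonspacing marks (general category Mn).
    3. Normalise the result to the form requested by the caller.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize(form, stripped)

# Latin: precomposed e-acute and e + combining acute are canonically equivalent.
latin_composed = "\u00e9"
latin_decomposed = "e\u0301"

# Hebrew: BET + DAGESH + PATAH and BET + PATAH + DAGESH are canonically
# equivalent, because the two marks have different combining classes.
hebrew_a = "\u05d1\u05bc\u05b7"
hebrew_b = "\u05d1\u05b7\u05bc"

print(fold_remove_diacritics(latin_composed) == fold_remove_diacritics(latin_decomposed))
print(fold_remove_diacritics(hebrew_a) == fold_remove_diacritics(hebrew_b))
```

Because the input is normalised first, the two members of each canonically equivalent pair fold to the same result regardless of how the marks were ordered or composed on input.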
Subject lines that have nothing to do with message content
Personally speaking, I would have expected that a recent message on this list with the subject line Katakana_Or_Hiragana might have something to do with Japanese, Hiragana, Katakana, or at least Han, or perhaps even Asia. But no... It was about Phoenician. It would be really helpful if people could use subject lines that have something to do with the subject of the message. It just can't be that difficult for people to pick a reasonable subject line. And if you're going to go off-topic in a thread, you might consider getting a different subject line -- or at least adding a parenthetical about how you're going to go off the thread... (As usual, this is my personal opinion and doesn't reflect an official policy, etc.) Rick
RE: Subject lines that have nothing to do with message content
Of course, if ever there was a subject line that permitted the topic to wander howsoever far from where it started, the one on this thread is it. :-) Peter
(no subject)
Quoting Marion Gunn [EMAIL PROTECTED]: how to guarantee continuance, in the specific context of Irish text computing, of the traditional restriction of the Irish diacritic dot (having only one single function in Irish) to the consonants to which it belongs? A spell checker. -- Jon Hanna http://www.hackcraft.net/ it has been truly said that hackers have even more words for equipment failures than Yiddish has for obnoxious people. - jargon.txt
(no subject)
To Unicode.org: In connection with the discussion about hexadecimal characters, one might find of interest my solution to the problem. As background, I developed a code for the unique identification of all recorded knowledge and information and proposed a universal system at a conference in Tokyo in 1967. Since then, my colleagues and I have been waiting for technology to develop to the stage that would make a universal information access system an essential component of a Global Information Infrastructure. The technology is now here in bandwidth, processing speed and power, and cost of storage. Our alphanumeric code in a structured format has been supplemented with a 64-bit unique identifier for machine interaction also in a structured format. The standard keyboard would be replaced by one with 20 additional special function keys. Sixteen of these keys would have 16 color coded dots representing the hexadecimal coding. When the input is shifted to the universal code, the first two keys entered would automatically represent a Unicode character. The first 16 bits of the 17th bit field would represent the hexadecimal characters. The remaining 64-bits would identify devices, subject terms and phrases, proper names, geographic segments, documents and items in the system. The system is designed to handle both public and private information. Howard J. Hilton, Ph.D.
Major Defects in Subject Lines!
Wow... How on earth did the subject line Major Defect in Combining Classes of Tibetan Vowels turn into a discussion of Biblical Hebrew? At least, people, if you're going to transmogrify the discussion, please use a subject line such as Biblical Hebrew which someone already was wise enough to start using on some pieces of this thread. Thanks, Rick (All my own opinions, of course)
Re: Khmer encoding model (had no subject)
Quoting Marco Cimarosti [EMAIL PROTECTED]: Mijan wrote: [...] 3. There are no other cases of a Vowel+Virama combination in the Unicode encoding model. Yes, there are. Khmer. I do not understand Khmer but I see that it does not use the same 'encoding model'. Please look, you will see that you were wrong to use Khmer as an example. What do you mean by not using the same encoding model? There are actually three Indic scripts that have been encoded with a different model: Tibetan (subscript letters are encoded separately, rather than as combinations of virama + consonant), and Thai/Lao (reordrant vowel marks are encoded in visual order, rather than in phonetic order). But, AFAIK, this is not the case of Unicode Khmer, which is encoded in the same way as the scripts of India. Thank you for the correction. I said I do not understand Khmer. I was understanding that scripts not based on ISCII were using different encoding model Mijan - This mail sent through http://www.bangladesh.net
Khmer encoding model (had no subject)
Mijan wrote: [...] 3. There are no other cases of a Vowel+Virama combination in the Unicode encoding model. Yes, there are. Khmer. I do not understand Khmer but I see that it does not use the same 'encoding model'. Please look, you will see that you were wrong to use Khmer as an example. What do you mean by not using the same encoding model? There are actually three Indic scripts that have been encoded with a different model: Tibetan (subscript letters are encoded separately, rather than as combinations of virama + consonant), and Thai/Lao (reordrant vowel marks are encoded in visual order, rather than in phonetic order). But, AFAIK, this is not the case of Unicode Khmer, which is encoded in the same way as the scripts of India. _ Marco
(no subject)
Hi, I read with interest about the japhalaa debate in Bangla and I have joined you to answer this question. I understand that Unicode is supposed to represent the language, not the way it is written. This is how Bengali is currently described in Unicode, and obviously it seems to work well for the most part. I am convinced that this needs to be extended for cases that cannot be represented in Unicode or have ambiguous interpretation on how they should be rendered, as is the case of ya-phalaa. Let's consider the ra+virama+ya case. For the most part the ra+virama+ya is displayed as ya+reph. This obviously seems to be an instance of ambiguous interpretation because ra+virama+ya could also represent ra+ja-phalaa. ya+reph and ra+ja-phalaa are used in different words and have different meanings. From this you see that ja-phalaa is not equivalent to virama-ya and is better as a separate letter in Unicode. We always thought of ya-phalaa as separate anyway. Now to your questions on this: Michael Everson wrote on 02 March 2003 13:22: 1. The sequence 'Vowel+Virama+Ya...' is illogical to scholars of Bengali and indeed Indic languages in general. I refuted this yesterday by indicating that this usage is an innovation. I think that only scholars of Bengali are in the right place to answer that! 2. Such sequences are not semantically equivalent to the intended ... sentence fragment. I think Andy meant 'not equivalent to vowels with ya-phalaa' 3. There are no other cases of a Vowel+Virama combination in the Unicode encoding model. Yes, there are. Khmer. I do not understand Khmer but I see that it does not use the same 'encoding model'. Please look, you will see that you were wrong to use Khmer as an example. 4. Yaphalaa is not equivalent to 'Virama+Ya' Yes, it is, as I showed yesterday. No one can show that Virama+Ya is the same as ya-phalaa because it is not! Please understand that ya-phalaa is originally an alternative form of 'Sanskrit letter Ya'.
Nowadays 'Sanskrit letter Ya' is represented as YYA (Ya with nukta) in Bengali words. Bengali 'Ya' has a separate meaning and is pronounced 'Ja'. The origin of ya-phalaa is clear but the present day Bengali equivalent letter is not. No one can be sure if ya-phalaa is a form of Ya or YYa. I say that it is neither. Nowadays ya-phalaa has a very different purpose. It is used to alter the pronunciation of letters that precede it or vowels that come after it. 5. ISCII implementations encode these letters as separate characters corresponding to the Devanagari Candra A E. Unicode should follow the example of these implementations. No, it shouldn't. Unicode has a method for writing these sequences already and a second method for doing so should not be introduced. Use mapping tables to exchange ISCII and Unicode data. I have been taught to keep things simple when coding software. If adding letters to the Bengali code space does this, then it will be better. I hope that this helps you. Best regards, Mijan
Re: (no subject)
Mijan scripsit: Let's consider the ra+virama+ya case. For the most part the ra+virama+ya is displayed as ya+reph. This obviously seems to be an instance of ambiguous interpretation because ra+virama+ya could also represent ra+ja-phalaa. ya+reph and ra+ja-phalaa are used in different words and have different meanings. I'm responding to this message in order to isolate this point. If correct, then the current model of YA PHALAA is inadequate. -- John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan http://www.reutershealth.com Dream projects long deferred usually bite the wax tadpole. --James Lileks
Re: (no subject)
Michael Everson wrote: At 16:48 -0500 2003-03-03, John Cowan wrote: Mijan scripsit: Let's consider the ra+virama+ya case. For the most part the ra+virama+ya is displayed as ya+reph. This obviously seems to be an instance of ambiguous interpretation because ra+virama+ya could also represent ra+ja-phalaa. ya+reph and ra+ja-phalaa are used in different words and have different meanings. I'm responding to this message in order to isolate this point. If correct, then the current model of YA PHALAA is inadequate. ZWJ can be used to produce the required differentiation. If this is the way the differentiation should be made, there should probably be an explicit note to that effect in the introduction to the Bengali block. - Chris
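The ZWJ suggestion can be made concrete with a small sketch of the two code point sequences involved. The thread does not say where the ZWJ should go; the placement below (RA, ZWJ, VIRAMA, YA) is an assumption, chosen because it is the sequence that later versions of the Unicode Standard describe for ra with ya-phalaa. The variable names are illustrative only, and how each sequence is actually drawn depends on the font and rendering engine.

```python
import unicodedata

RA = "\u09b0"      # BENGALI LETTER RA
YA = "\u09af"      # BENGALI LETTER YA
VIRAMA = "\u09cd"  # BENGALI SIGN VIRAMA
ZWJ = "\u200d"     # ZERO WIDTH JOINER

# Plain sequence: conventionally displayed as ya with reph.
ya_with_reph = RA + VIRAMA + YA

# Assumed ZWJ placement to request ra with ya-phalaa instead.
ra_with_ya_phalaa = RA + ZWJ + VIRAMA + YA

for label, seq in (("ya+reph", ya_with_reph), ("ra+ya-phalaa", ra_with_ya_phalaa)):
    names = ", ".join(unicodedata.name(ch) for ch in seq)
    print(f"{label}: {names}")
```

The two strings stay distinct under normalisation, since ZWJ is not removed or reordered by NFC or NFD, so the distinction survives in stored text.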
Re: Key E00 (was: (no subject))
At 02:24 -0500 2002-02-06, [EMAIL PROTECTED] wrote: ISO keyboards have the section-sign (§) key, next to the 1 key above the tab key on the left of the keyboards. Some US keyboards (for instance the Mac PowerBook G3) don't have this key, but instead have the grave key there, while on the ISO keyboard the grave key is down next to the z. -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: Key E00 (was: (no subject))
Apple calls what I have on my desk an ISO extended keyboard. It came with my Cube. It has the section key next to the 1, and the grave key next to the z. My Powerbook has the grave key next to the 1, and no key next to the z. -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: Key E00 (was: (no subject))
In a message dated 2002-02-06 3:39:14 Pacific Standard Time, [EMAIL PROTECTED] writes: ISO keyboards have the section-sign (§) key, next to the 1 key above the tab key on the left of the keyboards. Some US keyboards (for instance the Mac PowerBook G3) don't have this key, but instead have the grave key there, while on the ISO keyboard the grave key is down next to the z. My draft copy of ISO/IEC 9995-3, acquired from: http://iquebec.ifrance.com/cyberiel/sc35wg1/SC35N0233_9995-3.pdf shows SECTION SIGN on key C02, level 2 of the common secondary group, and GRAVE ACCENT on key C12, level 1 on both the complementary Latin and common secondary groups. (Note that C12 is frequently relocated to B00, down next to the 'z' as you indicated.) In the complementary Latin group, key E00 is ASTERISK (level 1) and PLUS SIGN (level 2), while in the common secondary group it is NOT SIGN (level 1) and SOFT HYPHEN (level 2). Which ISO keyboard are you referring to? I'm not trying to be argumentative; I just got done implementing a lot of keyboards, and none of them had SECTION SIGN on key E00, so I'm curious. For those unfamiliar with ISO 9995 terminology, please refer to the above document as well as: http://iquebec.ifrance.com/cyberiel/sc35wg1/SC35N0232_9995-2.pdf and John Cowan's explanation from yesterday. -Doug Ewell Fullerton, California (address will soon change to dewell at adelphia dot net)
(no subject)
On the official Web site of the Cherokee Nation (Tahlequah, Oklahoma), there is a nice Cherokee keyboard layout that goes with the font they offer: http://www.cherokee.org/Extras/downloads/font/Keyboard.htm For key E00, level 1 (i.e. the unshifted grave-accent key), there is a little squiggly mark called Accent. I can't find any indication of the purpose of this character -- what it's supposed to accent -- but it's not encoded in Unicode. Does anyone know what this character is for, or why it wasn't encoded? I read Michael Everson's 1995 proposal for Cherokee (WG2 N1172) and couldn't find any mention of it. -Doug Ewell Fullerton, California (address will soon change to [EMAIL PROTECTED])
Re: (no subject)
Doug On the official Web site of the Cherokee Nation (Tahlequah, Doug For key E00, level 1 (i.e. the unshifted grave-accent key), there is Doug a little squiggly mark called Accent. I can't find any indication Doug of the purpose of this character -- what it's supposed to accent -- Doug but it's not encoded in Unicode. For those of us not in the know, please tell us what the heck key E00, level 1 means. - Mark Leisher, Computing Research Lab, New Mexico State University, Box 30001, Dept. 3CRL, Las Cruces, NM 88003. Orthodoxy, of whatever color, seems to demand a lifeless, imitative style. -- Politics and the English Language, George Orwell
Re: (no subject)
At 12:09 -0500 2002-02-05, [EMAIL PROTECTED] wrote: On the official Web site of the Cherokee Nation (Tahlequah, Oklahoma), there is a Cherokee keyboard, there is a nice keyboard layout that goes with the font they offer: http://www.cherokee.org/Extras/downloads/font/Keyboard.htm For key E00, level 1 (i.e. the unshifted grave-accent key), there is a little squiggly mark called Accent. I can't find any indication of the purpose of this character -- what it's supposed to accent -- but it's not encoded in Unicode. Does anyone know what this character is for, or why it wasn't encoded? I read Michael Everson's 1995 proposal for Cherokee (WG2 N1172) and couldn't find any mention of it. I've never seen it anywhere but on that web page, which I found some time ago. -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: (no subject)
At 10:55 -0700 2002-02-05, Mark Leisher wrote: Doug On the official Web site of the Cherokee Nation (Tahlequah, Doug For key E00, level 1 (i.e. the unshifted grave-accent key), there is Doug a little squiggly mark called Accent. I can't find any indication Doug of the purpose of this character -- what it's supposed to accent -- Doug but it's not encoded in Unicode. For those of us not in the know, please tell us what the heck key E00, level 1 means. It is the section-sign (§) key, next to the 1 key above the tab key on the left. Some US keyboards don't have this key, but instead have the grave key there. -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: (no subject)
Mark Leisher wrote: For those of us not in the know, please tell us what the heck key E00, level 1 means. E00 is the leftmost key on the E row, which is the fifth row from the bottom (the row containing the spacebar is A). On U.S.-style keyboards E01 is the 1 key, D01 is Q, C01 is A, B01 is Z. Level 1 means that no shift keys are in effect; Level 2 means that Shift is down, and Level 3 that AltGr (typically the right Alt key on keyboards that need it) is down. This naming scheme allows us to talk about particular keys on the keyboard without regard to what they are used for in one locale or another. ISO 9995 is the controlling standard. -- John Cowan [EMAIL PROTECTED] http://www.reutershealth.com http://www.ccil.org/~cowan I amar prestar aen, han mathon ne nen, han mathon ne chae, a han noston ne 'wilith. --Galadriel, _LOTR:FOTR_
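The key naming scheme John describes (a row letter followed by a two-digit column, plus a level) can be sketched in a few lines. The function name `parse_key` and the tables are hypothetical, and the sketch covers only rows A to E and the three levels mentioned in the message; ISO 9995 itself defines more than this.

```python
import re
from typing import Tuple

# Rows as described above: A is the spacebar row, E is the fifth row up.
ROWS = "ABCDE"

# Levels as described above (names are illustrative labels only).
LEVELS = {1: "unshifted", 2: "Shift", 3: "AltGr"}

# The U.S.-style examples given in the message.
US_EXAMPLES = {"E01": "1", "D01": "Q", "C01": "A", "B01": "Z"}

def parse_key(key_id: str) -> Tuple[int, int]:
    """Return (row number counted from the bottom, column number)."""
    match = re.fullmatch(r"([A-E])(\d{2})", key_id)
    if match is None:
        raise ValueError(f"not a key position this sketch understands: {key_id!r}")
    row_letter, column = match.groups()
    return ROWS.index(row_letter) + 1, int(column)

row, column = parse_key("E00")
print(f"E00 is on row {row} from the bottom, column {column} ({LEVELS[1]} = level 1)")
```

Column 00 sits immediately to the left of column 01, which is why E00 is the key next to the 1 key on a U.S.-style layout.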
RE: international characters in email subject line
Raghu Kolluru [EMAIL PROTECTED] wrote: Do you know of any email client which CAN do this and also display the from alias of the email in the desired charset? Lotus Notes does this (and has done so for some considerable time), although it's probably way too large for what you need. Brendan
international characters in email subject line
Greetings! I would like to send email in international charsets. I am able to send the body using the desired charset but not the subject line. Any help would be appreciated. Thanks.
Re: international characters in email subject line
What mail program are you using? Many of them (Exchange, Outlook, etc.) do not support this. Some do not even support international text in the body. michka a new book on internationalization in VB at http://www.i18nWithVB.com/ - Original Message - From: "Raghu Kolluru" [EMAIL PROTECTED] To: "Unicode List" [EMAIL PROTECTED] Sent: Monday, February 12, 2001 2:37 PM Subject: international characters in email subject line Greetings! I would like to send email in international charsets. I am able to send the body using the desired charset but not the subject line. Any help would be appreciated. Thanks.
Re: international characters in email subject line
Well, like I said Outlook does not support this -- it will only use the default system code page (b.k.a. CP_ACP) for subject lines and any other part of the header. michka - Original Message - From: "Raghu Kolluru" [EMAIL PROTECTED] To: "'Michael (michka) Kaplan'" [EMAIL PROTECTED]; "Unicode List" [EMAIL PROTECTED] Sent: Monday, February 12, 2001 3:29 PM Subject: RE: international characters in email subject line I wrote a java application which sends emails to a relay server (Postfix). My email client is outlook which does support international character sets. I can send/recieve non-ascii encoded body but not the subject line. Probably this is a question for SMTP newsgroup. Does anyone know public email address of such a group? Thanks. -Original Message- From: Michael (michka) Kaplan [mailto:[EMAIL PROTECTED]] Sent: Monday, February 12, 2001 3:21 PM To: Raghu Kolluru; Unicode List Subject: Re: international characters in email subject line What mail program are you using? Many of them (Exchange, Outlook, etc.) do not support this. Some do not even support international text in the body. michka a new book on internationalization in VB at http://www.i18nWithVB.com/ - Original Message - From: "Raghu Kolluru" [EMAIL PROTECTED] To: "Unicode List" [EMAIL PROTECTED] Sent: Monday, February 12, 2001 2:37 PM Subject: international characters in email subject line Greetings! I would like to send email in international charsets. I am able to send the body using the desired charset but not the subject line. Any help would be appreciated. Thanks.
RE: international characters in email subject line
Michael, Do you know of any email client which CAN do this and also display the from alias of the email in the desired charset? Thanks. -Original Message- From: Michael (michka) Kaplan [mailto:[EMAIL PROTECTED]] Sent: Monday, February 12, 2001 3:31 PM To: Raghu Kolluru; Unicode List Subject: Re: international characters in email subject line Well, like I said Outlook does not support this -- it will only use the default system code page (b.k.a. CP_ACP) for subject lines and any other part of the header. michka - Original Message - From: "Raghu Kolluru" [EMAIL PROTECTED] To: "'Michael (michka) Kaplan'" [EMAIL PROTECTED]; "Unicode List" [EMAIL PROTECTED] Sent: Monday, February 12, 2001 3:29 PM Subject: RE: international characters in email subject line I wrote a java application which sends emails to a relay server (Postfix). My email client is outlook which does support international character sets. I can send/recieve non-ascii encoded body but not the subject line. Probably this is a question for SMTP newsgroup. Does anyone know public email address of such a group? Thanks. -Original Message- From: Michael (michka) Kaplan [mailto:[EMAIL PROTECTED]] Sent: Monday, February 12, 2001 3:21 PM To: Raghu Kolluru; Unicode List Subject: Re: international characters in email subject line What mail program are you using? Many of them (Exchange, Outlook, etc.) do not support this. Some do not even support international text in the body. michka a new book on internationalization in VB at http://www.i18nWithVB.com/ - Original Message - From: "Raghu Kolluru" [EMAIL PROTECTED] To: "Unicode List" [EMAIL PROTECTED] Sent: Monday, February 12, 2001 2:37 PM Subject: international characters in email subject line Greetings! I would like to send email in international charsets. I am able to send the body using the desired charset but not the subject line. Any help would be appreciated. Thanks.
Re: international characters in email subject line
The email program I am using, mutt, can do this. Kind regards keld Simonsen On Mon, Feb 12, 2001 at 02:55:41PM -0800, Michael (michka) Kaplan wrote: What mail program are you using? Many of them (Exchange, Outlook, etc.) do not support this. Some do not even support international text in the body. michka a new book on internationalization in VB at http://www.i18nWithVB.com/ - Original Message - From: "Raghu Kolluru" [EMAIL PROTECTED] To: "Unicode List" [EMAIL PROTECTED] Sent: Monday, February 12, 2001 2:37 PM Subject: international characters in email subject line Greetings! I would like to send email in international charsets. I am able to send the body using the desired charset but not the subject line. Any help would be appreciated. Thanks.
Re: international characters in email subject line
On Mon, 12 Feb 2001, Michael (michka) Kaplan wrote: From: "Raghu Kolluru" [EMAIL PROTECTED] I would like to send email in international charsets. I am able to send the body using the desired charset but not the subject line. The question is so vague. If you need to get some help, you've gotta provide as much information as possible (what mail program under what OS for what character set). There are so many possibilities and nobody would wish to go thru all of them. What mail program are you using? Many of them (Exchange, Outlook, etc.) do not support this. Some do not even support international text in the body. Mozilla and Netscape 6 support entering subject header in whatever script for which input methods are available/installed in the OS (MS-Windows, MacOS, Unix/X11). In this respect, I18N of Mozilla/Netscape 6 is ahead of that of MS Outlook. The same is true of display of subject headers in scripts which happen not to be supported by the default codepage (to use MS terminology). BTW, one of the worst MUAs in terms of I18N (among the widely used) might be Eudora. BTW, most modern Unix text-based mail programs (e.g. Pine, Mutt) work fine in this regard as long as you run them under a terminal that supports input/output of the charset you want to use (for UTF-8, the newest xterm works well for a pretty large range of the BMP). Jungshik Shin
[OT]RE: international characters in email subject line
On Mon, 12 Feb 2001, Raghu Kolluru wrote: I wrote a java application which sends emails to a relay server (Postfix). When you write your java application, note that any 8-bit character in the header is explicitly prohibited (IETF STD 11/RFC 822). You need to encode them per IETF RFC 2047 (and RFC 2184, 2231). Some MTAs (mail transport agents) refuse to accept messages with 8-bit characters in the header, depending on the configuration. BTW, the header encoding is not just for working around those MTAs but also for the sake of identifying the MIME charset/encoding used and allowing the possibility of multiple MIME charsets/encodings mixed in the header (the latter might be moot when UTF-8 is exclusively used). My email client is outlook which does support international character sets. I can send/receive non-ascii encoded body but not the subject line. Probably this is a question for SMTP newsgroup. Does anyone know public email address of such a group? Usenet newsgroup comp.mail.mime is the best place to ask your question. (It has a mail-submission address as well, but I don't know it.) BTW, MS OE doesn't support it while Mozilla does support it. Jungshik Shin P.S. I'm afraid the Unicode mailing list server strips off too many header lines of messages. In this case and some other cases (e.g. when people talk about the safe 'transport' of UTF-8 messages), an 'X-Mailer:' header would be nice to have.
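The RFC 2047 header encoding mentioned above can be sketched as follows. Raghu's application is in Java; Python's standard `email.header` module is used here only because it keeps the illustration short. The helper names `encode_subject` and `decode_subject` and the sample subject are assumptions made for this example.

```python
from email.header import Header, decode_header

def encode_subject(subject: str, charset: str = "utf-8") -> str:
    """Return an ASCII-only, RFC 2047 encoded-word form of the subject."""
    return Header(subject, charset).encode()

def decode_subject(raw: str) -> str:
    """Reverse the encoding, using the charset named in each encoded word."""
    parts = []
    for value, charset in decode_header(raw):
        if isinstance(value, bytes):
            parts.append(value.decode(charset or "ascii"))
        else:
            parts.append(value)
    return "".join(parts)

original = "Grüße aus Köln"
encoded = encode_subject(original)

print(encoded)                  # 7-bit safe text of the form =?charset?encoding?...?=
print(decode_subject(encoded))  # the original subject again
```

The encoded word carries its own charset label, which is the point made above: the receiving client can tell which charset was used without guessing from the system code page.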
Re: international characters in email subject line
On 12 Feb 2001, at 15:06, Michael (michka) Kaplan wrote under the subject "Re: international characters in ema": Well, like I said Outlook does not support this -- it will only use the default system code page (b.k.a. CP_ACP) for subject lines and any other part of the header. On 12 Feb 2001, at 15:46, Jungshik Shin wrote under the subject "[OT]RE: international characters in": On Mon, 12 Feb 2001, Raghu Kolluru wrote: My email client is outlook which does support international character sets. I can send/receive non-ascii encoded body but not the subject line. [snip] BTW, MS OE doesn't support it while Mozilla does support it. This is simply not true! I know we all like to bash MS from time to time, but people really get far too carried away. I don't know if the above is true about Outlook (as my installation is stuffed as far as e-mail goes), but it is NOT TRUE about Outlook Express. OE encodes the subject line with the same encoding as the body and often (?) the From header as well. Whether or not this works for you would probably depend on what OS you are using and what language features are installed. It works for me with OE 5.50.4133.2400 on Windows NT 4.0 SP5. Of course, since my preferred mail program is Pegasus Mail, which can only be configured for one character set, I can't usually read such headers anyway. -- Seán Ó Seaghdha [EMAIL PROTECTED] Nuair a bhíonn an fíon istigh, bíonn an chiall amuigh. Seanfhocal.
Re: international characters in email subject line
On Mon, 12 Feb 2001, Sean O Seaghdha wrote: On 12 Feb 2001, Michael (michka) Kaplan wrote: Well, like I said Outlook does not support this -- it will only use the default system code page (b.k.a. CP_ACP) for subject lines and any other part of the header. On 12 Feb 2001, Jungshik Shin wrote: On Mon, 12 Feb 2001, Raghu Kolluru wrote: My email client is outlook which does support international character sets. I can send/receive non-ascii encoded body but not the subject line. [snip] BTW, MS OE doesn't support it while Mozilla does support it. This is simply not true! I know we all like to bash MS from time to time, but people really get far too carried away. I don't know if the above is true about Outlook (as my installation is stuffed as far as e-mail goes), but it is NOT TRUE about Outlook Express. OE encodes the subject line with the same encoding as the body and often (?) the From header as well. I stand corrected (thank you for correcting me). It's possible to enter whatever script is supported by the IMEs installed on your system in both the Subject (and other headers) and the body of the message. However, what I wrote about the display of the headers in scripts NOT supported by the default system code page still stands. For instance, MS OE cannot display Korean, Japanese, Chinese, Russian headers under English/French/Spanish/Italian/German MS-Windows in _the message *list* display pane_, which Mozilla can. (MS OE can display those headers for individual messages, though.) Not having checked out MS OE for a while, I was a bit confused about what is possible and what is not. Anyway, my comment and michka's have *nothing* to do with MS bashing. I was just giving what I believed to be facts, one of which was not true as it turned out. Please note that Michael (michka) Kaplan is, I guess, one of the last persons on this list to say something not true just to make MS look bad. Of course, by this I'm not implying by any means that there are some people who would do that on this list.
Jungshik Shin
Re: international characters in email subject line
On 12 Feb 2001, at 20:40, Jungshik Shin wrote under the subject "Re: international characters in ema": I stand corrected (thank you for correcting me). It's possible to enter whatever script is supported by the IMEs installed on your system in both the Subject (and other headers) and the body of the message. However, what I wrote about the display of the headers in scripts NOT supported by the default system code page still stands. For instance, MS OE cannot display Korean, Japanese, Chinese, Russian headers under English/French/Spanish/Italian/German MS-Windows in _the message *list* display pane_, which Mozilla can. (MS OE can display those headers for individual messages, though.) Thank you for your clarification. MS OE doesn't show any chars outside the system code page in the message list, only in the preview pane and message windows. -- Seán Ó Seaghdha [EMAIL PROTECTED] Ní bhíonn tréan buan. Seanfhocal.
Re: international characters in email subject line
From: "Jungshik Shin" [EMAIL PROTECTED] Please, note that Michael (michka) Kaplan, I guess is, one of the last persons on this list to say something not true just to make MS look bad. There are a few program managers in Office and Visual Studio who might disagree with this statement -- they seem to think I live to bash Microsoft. They are mistaken, sadly. But no company is above having their boneheaded decisions called out, something not everyone there understands. It's nice that you do, though. :-) Of course, by this I'm not implying by any means that there are some people who would do that on this list. It's OK, we all know that such people exist; heck, we probably all know who they are, too. As long as we don't name names, no one can claim to be offended unless they have felon's guilt or something. :-) MichKa Michael Kaplan Trigeminal Software, Inc. http://www.trigeminal.com/
Re: international characters in email subject line
On 12 Feb 2001, at 20:28, Alain LaBonté wrote under the subject "Re: international characters in ema": At 19:53 01-02-12 -0800, Sean O Seaghdha wrote: Of course, since my preferred mail program is Pegasus Mail, which can only be configured for one character set, I can't usually read such headers anyway. [Alain] Some years ago, I was also using Pegasus Mail and I was not satisfied with this. I then communicated with the author directly (he lives in southern New Zealand); we engaged in a series of exchanges and I got him to agree to carry the character set in use through without conversion [in my case the Windows character set]... You have to use a parameter for this; this is the compromise he got me to accept, because he was really impressed by the SMTP 7-bit-only-headers dogma -- which does not impress me, since it works anyway with 8-bit-clean systems [predominant nowadays in the world since a serious security breach, I was told, was corrected with an 8-bit-clean-enabling SMTP patch]. I think there are a couple of different issues here. As far as storage on disk goes, I think this changed some time back, so that now you have to use the switch to get the old behaviour (converting messages to one code page on disk), which was retained for compatibility with the DOS version. You can send 8-bit mail with Pegasus by changing a setting in Options, but when you switch it on you get a stern warning about it being "formally illegal" and a "Comments" header is added to each message. I have suggested from time to time over the last few years that Pegasus be made Unicode aware, but I get the impression it's considered "too hard" or "too complicated", although I don't think I've actually got a reply on this from the author, David Harris. Since there will not be another 16-bit Windows version and the Macintosh version has not been updated in a long time, this leaves only the DOS and Win32 versions. Hopefully, this will mean that Unicode will become more of a viable option for him in the future. 
At the moment, though, he seems quite busy enough adding HTML mail composition to version 4. Seán Ó Seaghdha [EMAIL PROTECTED] Calumnies are answered best with silence. -- Ben Jonson.
(no subject)
The intent of this message is to point out some of the deficiencies in the Unicode specifications for non-Devanagari Indic scripts. It is a well-known (inescapable and undeniable) fact that people writing in non-Devanagari scripts such as Telugu, Kannada, Malayalam and others transcribe Sanskrit and Vedic texts in their own script to study them conveniently. In fact many well-known Vedic and Sanskrit scholars in Andhra Pradesh, where Telugu is spoken and which is my native state, do not know how to read Devanagari. They have read and written these texts only in Telugu. Also, it is an established fact that many Sanskrit manuscripts from ancient times are available only in non-Devanagari scripts. Given this situation, I am terribly dissatisfied that the current Unicode specification for non-Devanagari scripts lacks many symbols required to transcribe Sanskrit and Vedic texts properly. These include: a) all the swara symbols required to transcribe Vedic texts (udatta, anudatta, double udatta, etc., and the symbols used in writing the Samaveda); b) the Avagraha and the Vocalic L and LL matra symbols; c) the half Visarga, used in grammar and other Sanskrit texts. In addition, the Unicode Standard needs to address the following feature in the Telugu standardization: the Dantya (dental) ca and ja, and the vowel ligatures of these two consonants with A, u, U, o, O, Au, occur in the Telugu language. These are equivalent to ja-nukta and ca-nukta in Hindi, but they are not included in the Telugu Unicode specification. Without them, it is impossible to compose an authentic Telugu dictionary, and the sorting of text will also be wrong. So these MUST be included in the Unicode spec for Telugu. In addition, the symbols used to denote Karnatic music should be included in the specification of symbols, so that any script transcribing Karnatic compositions can do so correctly. 
Unless an effort is made to include all these symbols in all the relevant Indic scripts, the existing specification is woefully
(no subject)
Who can tell me where I can download the Unicode Standard? Thank you!!
(no subject)
Hi, Is there any text editor by which data can be entered in Hindi? Rgds, Nikita K
(no subject)
Who can take me off from the unicode list ? I have an overflow for the moment and no time to take part of the group. unsubscribe [EMAIL PROTECTED] Thank you Gunter
(no subject)
Somebody has been playing with the wires in the room where the server is housed and so the server is technically up but inaccessible outside the server room. I'm in the process of trying to straighten out this tangled affair. Meanwhile, the PDF charts are still accessible via their new home URL, http://www.unicode.org/charts/. -- = John H. Jenkins [EMAIL PROTECTED] [EMAIL PROTECTED] http://www.blueneptune.com/~tseng
(no subject)
unsubscribe
(no subject)
please remove me from this list
(no subject)
I have an application that doesn't include Unicode support at all. Given this, can I use the Uniscribe APIs in my application? The system on which I want to run my application is Windows 98. Specifically, is there any relationship between the Uniscribe APIs and Unicode, and if so, what exactly is it? Thanks C.Janardhana Guptha Quark, Chandigarh
(no subject)
Hello, all. How do I print the superscript minus sign? The unicode for this is \u207B. However, it is not printed correctly. Instead, it is an unrecognized character. Thanks a lot. Zhen Ren
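A quick way to narrow down a problem like this is to separate the character from its display. The question does not say which language or output device is involved, so the following is only a minimal Python sketch under that assumption: it confirms that \u207B is a valid code point and shows its UTF-8 bytes. The helper name `describe` is made up for illustration. If these checks pass, the "unrecognized character" is more likely caused by the output encoding or by a font that lacks the glyph.

```python
import unicodedata

SUPERSCRIPT_MINUS = "\u207b"  # the code point given in the question


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# The character is valid Unicode; whether it *displays* depends on the
# encoding used for output and on the font having a glyph for it.
utf8_bytes = SUPERSCRIPT_MINUS.encode("utf-8")

if __name__ == "__main__":
    print(describe(SUPERSCRIPT_MINUS))
    print(utf8_bytes.hex())
```

Note that the character cannot be encoded at all in a legacy code page such as ISO-8859-1, which is one common reason for a placeholder appearing in its place.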
RE: Subject lines in UTF-8 mssgs? [was: Proposal to make ...]
At 01:41 AM 07/13/2000 -0800, [EMAIL PROTECTED] wrote: As far as I can understand, the choice of the outgoing charset is highly automatic in MS Outlook 2000. I suspect it depends on the combination of characters that I (or the system) used in the various fields of the e-mail. The problem is that the heuristics are not correct for ISO-8859-1/CP-1252. The selection SHOULD be: 1) only x00-x7F: US-ASCII; 2) x00-x7F + xA0-xFF: ISO-8859-1 [Western European (ISO)]; 3) x00-x7F + xA0-xFF + a character in the x80-x9F code point range: CP-1252/Windows-1252 [Western European (Windows)]. If you check the Encoding list, you will note that Western European (ISO) and Western European (Windows) are both listed, and the selection controls whether a message with xA0-xFF characters gets ID'ed as ISO-8859-1 or CP-1252. The problem is that selecting Western European (ISO) does not correct the message's CHARSET to CP-1252 if an x80-x9F character is found in the message.
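The three-step selection described above can be sketched in a few lines. This is only an illustration in Python, not a description of what Outlook actually does: the function name `pick_western_charset` is hypothetical, and the final fallback to UTF-8 for text outside all three sets is an assumption added here, not part of the rule in the message.

```python
def pick_western_charset(text: str) -> str:
    """Pick the narrowest charset label for `text` (illustrative sketch)."""
    # 1) Only x00-x7F: plain US-ASCII is enough.
    if all(ord(c) < 0x80 for c in text):
        return "us-ascii"
    # 2) x00-x7F plus xA0-xFF: ISO-8859-1 covers it.
    if all(ord(c) < 0x80 or 0xA0 <= ord(c) <= 0xFF for c in text):
        return "iso-8859-1"
    # 3) Anything CP-1252 places in its x80-x9F range (euro sign, curly
    #    quotes, ...) needs the windows-1252 label.
    try:
        text.encode("cp1252")
        return "windows-1252"
    except UnicodeEncodeError:
        # Assumption: fall back to UTF-8 when none of the three fit.
        return "utf-8"


if __name__ == "__main__":
    for sample in ("hello", "caf\u00e9", "\u20ac5", "\u4f60\u597d"):
        print(repr(sample), "->", pick_western_charset(sample))
```

The point of step 3 is that a message labelled ISO-8859-1 but containing bytes in x80-x9F is mislabelled: in ISO-8859-1 those bytes are control codes, not the printable characters the sender typed.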
Re: Subject lines in UTF-8 mssgs? [was: Proposal to make ...]
Chris Wendt wrote: This is relevant when you are running with a non-English OS locale. It will prevent entering non-usascii characters for day and month names in the reply header so as to not force you to send in UTF-8 in case you write in a different script than the OS locale is. How's that? The Date: header on outgoing email is localized to the sender's locale? That seems to be a clear-cut violation of RFC-822, and damaging to interoperability (because I must know every possible localized month name to interpret the header). It would make *much* more sense to localize the Date: headers on incoming email. -- Schlingt dreifach einen Kreis um dies! || John Cowan [EMAIL PROTECTED] Schliesst euer Aug vor heiliger Schau, || http://www.reutershealth.com Denn er genoss vom Honig-Tau, || http://www.ccil.org/~cowan Und trank die Milch vom Paradies.-- Coleridge (tr. Politzer)
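The interoperability argument can be illustrated with a small sketch: keep the Date: value on the wire in the fixed English form, and localize only when showing it to the reader. This is an assumption-laden Python example, not anything from Outlook Express; `wire_date`, `display_date` and the `FRENCH_MONTHS` table are hypothetical names introduced for illustration.

```python
from email.utils import formatdate, parsedate_to_datetime
from typing import Sequence

# Hypothetical display table; a real client would take this from the locale.
FRENCH_MONTHS = [
    "janvier", "février", "mars", "avril", "mai", "juin",
    "juillet", "août", "septembre", "octobre", "novembre", "décembre",
]


def wire_date(epoch_seconds: float) -> str:
    """Format a Date: value with fixed English day and month names."""
    return formatdate(epoch_seconds, usegmt=True)


def display_date(header_value: str, month_names: Sequence[str]) -> str:
    """Parse a standard Date: value and render it with localized month names."""
    dt = parsedate_to_datetime(header_value)
    return f"{dt.day} {month_names[dt.month - 1]} {dt.year}"


if __name__ == "__main__":
    header = wire_date(0)
    print(header)
    print(display_date(header, FRENCH_MONTHS))
```

With this split, the receiving side needs to know only the twelve English month abbreviations, whatever the sender's locale is.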
RE: Subject lines in UTF-8 mssgs? [was: Proposal to make ...]
I shouldn't have used "header". What I meant is not the message header in the RFC 822 sense but the information out of the header that gets copied into the message BODY on a reply. Example right below. -----Original Message----- From: John Cowan [mailto:[EMAIL PROTECTED]] Sent: Friday, July 14, 2000 8:51 AM To: Unicode List Subject: Re: Subject lines in UTF-8 mssgs? [was: Proposal to make ...] Chris Wendt wrote: This is relevant when you are running with a non-English OS locale. It will prevent entering non-usascii characters for day and month names in the reply header so as to not force you to send in UTF-8 in case you write in a different script than the OS locale is. How's that? The Date: header on outgoing email is localized to the sender's locale? That seems to be a clear-cut violation of RFC-822, and damaging to interoperability (because I must know every possible localized month name to interpret the header). It would make *much* more sense to localize the Date: headers on incoming email.
Re: Subject lines in UTF-8 mssgs? [was:
I forced the encoding to UTF-8 (it is supposed to be the default in my setting, but most of my messages arrive as charset="windows-1252"), and I am using some Chinese characters that are certainly not in my system's default code page: 你好、雅朴。 _馬可。 Note that this may not necessarily have forced UTF-8, since OE supports encodings for Chinese characters that you could also use to send the message. UTF-8 *is* required for languages that have no such encoding, like Tamil. <showing_off> உலகம் பேச நினைக்கும் போது Unicode பேசுகிறது </showing_off> On the whole, I would not recommend sending mail using those other encodings; I believe that people using OE 5.0 and later will be prompted to install language support just by opening the e-mail! :-) michka (the sentence is right, by the way <g>).
Re: Subject lines....../ Lost Header?? Re: [nothing]
My previous message of a few minutes ago with the empty "Re: " --only Header (at least as I got it back from the listserver) left my home with a Header as shown below. Any information about the whereabouts of my lost Head leading to its recovery ... Re: =?utf-8?B?UkU6IFN1YmplY3QgbGluZXMgaW4gVVRGLTggbXNzZ3M/IFt3YXM6?= Jaap --
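The header quoted above is not lost, only encoded: it is a MIME encoded-word (charset utf-8, base64) that a mail client without support for it shows verbatim. As a minimal sketch, Python's standard `email.header.decode_header` can turn it back into text; the wrapper name `decode_mime_header` is hypothetical.

```python
from email.header import decode_header

# The encoded-word exactly as quoted in the message above.
ENCODED = "=?utf-8?B?UkU6IFN1YmplY3QgbGluZXMgaW4gVVRGLTggbXNzZ3M/IFt3YXM6?="


def decode_mime_header(value: str) -> str:
    """Decode any encoded-words in a header value into one string."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Encoded-words come back as bytes plus their declared charset;
            # unencoded runs may come back as bytes with charset None.
            parts.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            parts.append(chunk)
    return "".join(parts)


if __name__ == "__main__":
    print(decode_mime_header(ENCODED))
```

Decoded, the word reads "RE: Subject lines in UTF-8 mssgs? [was:", that is, the start of this thread's subject, cut off at the opening bracket.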
(no subject)
I'm trying to create a bilingual and bidirectional (Arabic and English) Qur'an e-Book that will be compliant with the Open eBook (OEB) specification. This is targeted at the PalmOS, but should be renderable in XML and/or XHTML compliant browsers such as IE 5.0 and Netscape 6.0 or any type of Open eBook reader. I already have the entire Qur'an as HTML files in Arabic and English - though I will have them proofread many times before I distribute the completed eBook. The Arabic pages are coded using the win-1256 (Arabic) codepage in the following manner: <HTML DIR=RTL> <head> <META content="text/html; charset=windows-1256" http-equiv=Content-Type> <body> <p align="right"> <font face="Traditional Arabic"> <font size="5pt"> These pages show up fine (correct font and directionality) when using the IE 5.0 browser; however, when I convert them for the PalmOS, the right-to-left directionality is lost. In order to convert the HTML pages to the OEB eBook format I'm using the MobiPocket Publisher (home page http://www.mobipocket.com/en/HomePage/default.asp), which creates a prc file from the HTML files. In order to test the conversion to the PalmOS, I'm using the PalmOS Emulator (running a 3.5 Palm OS IIIc ROM) with APOS 2.0 (home page http://www.arabicpalm.com/) and the Mobipocket Reader software installed. The above setup is being tested on Windows 98 (Arabic Enabled Edition) and Windows 2000 PCs. The prc files created using this method display the Arabic font on the emulator's Palm IIIc screen (when using the MobiPocket reader); however, the correct direction is not enforced. Please note that the Arabic and English texts are coded in separate HTML files. My questions are as follows: How can I convert from cp 1256 to Unicode without doing it character by character? Is there software that will do this? Does the eBook spec allow for the nesting of a right-to-left language (Arabic) inside a left-to-right language (English) on the same page? 
Does anyone know if APOS is Unicode compliant? Any advice or examples would be greatly appreciated, as I have not found any examples of how to nest languages (with different text and directionality) within the Palm doc or prc formats. Akil
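On the first question, converting a whole file from code page 1256 to Unicode does not have to be done character by character; any tool with a cp1256 mapping table can do it in one pass. Below is a minimal Python sketch of that idea. The function name `convert_cp1256_html` is hypothetical, and rewriting the charset declaration inside the file is an assumption about what the converted pages would need; nothing here has been tested against MobiPocket Publisher or APOS.

```python
import re


def convert_cp1256_html(data: bytes) -> bytes:
    """Re-encode windows-1256 HTML bytes as UTF-8 (illustrative sketch)."""
    # Decode the whole file in one step using the cp1256 mapping table.
    text = data.decode("cp1256")
    # Keep the in-file declaration in step with the new encoding.
    text = re.sub(r"charset=windows-1256", "charset=utf-8", text,
                  flags=re.IGNORECASE)
    return text.encode("utf-8")


if __name__ == "__main__":
    sample = ('<META content="text/html; charset=windows-1256" '
              'http-equiv=Content-Type><p align="right">\u0633\u0644\u0627\u0645</p>')
    converted = convert_cp1256_html(sample.encode("cp1256"))
    print(converted.decode("utf-8"))
```

To process real files, the bytes would be read with `open(path, "rb")` and the result written back the same way. Conversion only changes the encoding; it does not by itself address the lost right-to-left direction on the Palm side.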
Re: Subject lines in UTF-8 mssgs? [was: Proposal to make ...]
"Jaap Pranger" [EMAIL PROTECTED] wrote: At 16:44 +0200 2000.07.12, [EMAIL PROTECTED] wrote: Everybody (beginning with myself!) should probably be more careful in naming subject lines, and renaming them when a reply deviates from the subject. Marco, This will not help very much when you send UTF-8 messages. Your Subject lines in those messages show up completely "garbled", at least in my non-UTF-8-aware email client. OK, that's my problem. But mostly other people's UTF-8 messages show 'neat' Subject headers. What's going on, why this difference? Jaap In Outlook Express, under Tools, Options, Send, International Settings it is possible to specify that only English (? ASCII) is used in headers, and under Tools, Options, Send, Plain Text Settings and Tools, Options, Send, HTML Settings it is possible to specify whether or not 8-bit characters may be used in message headers. These settings seem to apply whatever encoding is used for the body of the message. - Chris
RE: Subject lines in UTF-8 mssgs? [was: Proposal to make ...]
From: Christopher J. Fynn [mailto:[EMAIL PROTECTED]] In Outlook Express under Tools, Options, Send, International Settings it is possible to specify that only English (? ASCII) is used in headers This is relevant when you are running with a non-English OS locale. It will prevent entering non-US-ASCII characters for day and month names in the reply header, so as not to force you to send in UTF-8 in case you write in a different script than that of the OS locale. and under Tools, Options, Send, Plain Text Settings and Tools, Options, Send, HTML Settings it is possible to specify whether or not 8-bit characters may be used in message headers. This does not prevent non-US-ASCII characters in the header. It only decides whether the non-US-ASCII characters will be RFC 1522 encoded or sent as raw 8-bit bytes, in either case in the chosen encoding. These settings seem to apply whatever encoding is used for the body of the message. Yes, correct.
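The difference between the two header forms mentioned above can be made concrete with a small sketch. This is an illustration using Python's standard `email.header.Header`, not a reproduction of what Outlook Express emits; the subject text and the choice of ISO-8859-1 are made-up examples.

```python
from email.header import Header, decode_header

# Hypothetical subject containing non-US-ASCII characters.
subject = "Gr\u00fc\u00dfe aus Z\u00fcrich"
charset = "iso-8859-1"

# Form 1: MIME encoded-word. The result is pure ASCII and carries its
# charset label with it, so it survives 7-bit-only transports.
encoded_word = Header(subject, charset).encode()

# Form 2: raw 8-bit bytes in the chosen encoding. Nothing in the header
# says which charset the bytes are in; the reader has to guess.
raw_8bit = subject.encode(charset)

if __name__ == "__main__":
    print("Subject:", encoded_word)
    print(b"Subject: " + raw_8bit)
    # A MIME-aware client can recover the text from form 1.
    chunk, label = decode_header(encoded_word)[0]
    print(chunk.decode(label))
```

Both forms carry the same bytes in the same encoding, which matches the remark above that the setting changes only how the characters are transported, not whether they appear.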