At 11:17 PM 7/17/2004, John Cowan wrote:
Peter Kirk scripsit:
But I think the best thing to do is to drop *all* Hebrew
combining marks; the result of this is valid unpointed Hebrew.
I agree.
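The folding suggested above can be sketched in a few lines. This is only an illustration, assuming that "Hebrew combining marks" means every nonspacing mark (General_Category Mn) in the Hebrew block; the function name and the sample word are made up for the example and are not from the thread.

```python
import unicodedata

# Assumption: "Hebrew combining marks" is read here as every nonspacing mark
# (General_Category Mn) in the Hebrew block, U+0590..U+05FF.
HEBREW_BLOCK = range(0x0590, 0x0600)


def strip_hebrew_marks(text: str) -> str:
    """Drop Hebrew points and accents; keep letters, punctuation and other scripts."""
    return "".join(
        ch
        for ch in text
        if not (ord(ch) in HEBREW_BLOCK and unicodedata.category(ch) == "Mn")
    )


# Pointed "shalom": shin, qamats, shin dot, lamed, vav, holam, final mem.
pointed = "\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD"
unpointed = strip_hebrew_marks(pointed)
print(unpointed)
```

Marks from other scripts (a Latin combining acute, say) pass through untouched, which matches the case where the Hebrew folding is invoked without other scripts being affected.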
OK, in my last message I was confused; this was Peter's suggestion and Jony
had seconded it.
I take it
At 05:28 AM 7/18/2004, Peter Kirk wrote:
I can see that there might be cases when the Hebrew folding should be
invoked without other scripts being affected. But I think that anyone
applying a general accent or diacritic folding would expect this to
include all Hebrew (and Arabic, Syriac etc)
At 05:25 AM 7/18/2004, Peter Kirk wrote:
I accept that there might be some script-specific cases in which
particular accents should not be removed. The breve in Cyrillic i kratkoe
might be an example; but then this might be rather too language-specific
as well. But these should be clearly
At 07:53 PM 7/18/2004, Jony Rosenne wrote:
By this logic, I cannot see why you lump Latin/Greek/Cyrillic together.
Latin/Greek/Cyrillic share the fact that for searches you may want to
remove accents, but, except for very unusual circumstances, it's not a good
idea to transform text permanently.
At 01:56 PM 7/19/2004, Mark Davis wrote:
You did point out an oversight; Asmus and I have been working on the issue.
Mark
As Mark wrote, your point is taken and we've taken that onboard. However,
we won't try to *edit* text on the list, that's why we are not engaging in
a long discussion on the
At 11:11 AM 8/5/2004, Peter Kirk wrote:
In TUS 4.0 Section 5.3, p.111, the following is stated of default
ignorable code points:
These characters are also ignored except with respect to specific,
defined processes; for example, ZERO WIDTH NON-JOINER is ignored in
collation. ... For more
At 10:04 AM 8/6/2004, Marcin 'Qrczak' Kowalczyk wrote:
I don't like perpetuating the myth that Unicode is a 16-bit encoding
and UCS-2 can represent all Unicode characters
Neither do I. I've replied to John offline with extensive comments. He's on
a reasonably tight deadline, so he probably
At 12:49 AM 9/8/2004, Philippe Verdy wrote:
And still no decision if this invisible base character will be added or
not. It's just a public review for now,
Well, hold your horses for a bit here.
If something's out of review, there won't be a decision until the review is
over.
Anything that has
At 05:53 PM 9/8/2004, Mike Ayers wrote:
From:
[EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]
On Behalf Of Asmus Freytag
Sent: Wednesday, September 08, 2004 2:25 PM
If something's out of review, there won't be a decision until
the review is over.
I'm sorry, but I can't make sense
At 11:06 AM 9/9/2004, Mike Ayers wrote:
how to color parts of characters, is out of scope.
A given diacritic can be a part of a character's glyph,
but _combining characters_ are characters, not merely part
of characters.
Therefore, in principle, the question of how to encode text
streams that
It might be worth stepping back and asking the question: What is the
purpose of publishing word-breaking behavior as part of the Unicode Standard?
The answer to this question is neither easy nor obvious. Part of the
problem is that what constitutes a 'word' is subject to tailoring. In
certain
At 05:21 PM 9/14/2004, Anto'nio Martins-Tuva'lkin wrote:
On 2004.09.14, 17:06, Jörg Knappen [EMAIL PROTECTED] wrote:
My classic for this situation is the german -burg abbreviature often
seen in cartography: It is -bg. with breve between b and g.
Why not U+0062 U+035D U+0067 ? I guess that the
I've updated Unibook to version 4.0.1
The latest version reads more property files and can display some of the
new 4.0.1 properties.
There are new ways to combine properties.
As before, you can cut and paste either the character code or the character
name of a selected character to the clipboard.
At 04:32 AM 9/16/2004, Marion Gunn wrote:
Lovely browser. Is it possible to obtain a Mac-friendly version?
mg
I'm glad you like it. I've been told previously by experienced Mac users
that it runs fine with Virtual PC on the Mac.
A./
PS: To those of you who downloaded 4.0.1 already, the zip
Just a note of thanks to many of you who have sent me useful feedback. I
also found out that my update to the archive had gone awry, but am happy
to point out now that
http://www.unicode.org/unibook/Unibook-4.0.1.zip
is finally the correct version.
A./
At 11:37 AM 9/19/2004, D. Starner wrote:
To: [EMAIL PROTECTED]
Subject: RE: Saudi-Arabian Copyright sign
Jorg Knappen writes:
On Sun, 19 Sep 2004, Jon Hanna wrote:
Looks like {U+062D, U+20DD}
Yes, it does look like that. But it forms a separate entity, just like its
precedents COPYRIGHT SIGN
At 06:09 PM 9/19/2004, D. Starner wrote:
Asmus Freytag writes:
Given the nature of the symbol in question, I would personally see no
reason to object to encoding it - especially given the current and
projected lack of availability of other alternatives.
It's a simple combining character
At 10:32 AM 9/20/2004, you wrote:
Michael Everson schrieb:
I would like to see a range of samples in several publications
published in several languages and more than one country. That would
make a stronger case for it.
I'll be able to dig out some more samples from other books published by
At 11:50 AM 9/20/2004, Eric Muller wrote:
But the real obstacle for a generative approach is QA: if as a font vendor
you want to ensure some level of quality, then it is hard to avoid human
work essentially proportional to the number of base+mark *combinations*
you claim to support. If you
At 10:55 PM 9/20/2004, Doug Ewell wrote:
Jörg Knappen knappen at uni dash mainz dot de wrote:
I see a precedent in Unicode to treat Copyright-like sign differently
from simple encircled letters:
Unicode takes precautions not to encode the same character twice.
Therefore, superscript digits 2
At 03:58 AM 9/21/2004, Peter Kirk wrote:
On 20/09/2004 19:21, Asmus Freytag wrote:
...
PS for named sequences:
See: http://www.unicode.org/reports/tr34
Draft Data:
http://www.unicode.org/Public/4.1-Update/NamedCompositeEntities-4.1.0d4.txt
(the last part of the file name may change
At 07:22 PM 10/10/2004, James Kass wrote:
Most people begin counting with one, although in the recent
past some computer technicians have begun to begin counting
with zero. This is bound to cause problems and discrepancies.
Not counting from zero leads to weird situations at times, such
as the
At 01:42 PM 10/13/2004, Eric Muller wrote:
It has interesting consequences: e.g. U+2FA1C is canonically equivalent to
U+9F3B, so the BMP is not closed under canonical equivalence, so no
conformant system could make its repertoire exactly the BMP.
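The equivalence quoted above can be checked with a normalization library. The following is a small sketch using Python's standard unicodedata module (not something used in the thread); the variable names are illustrative only.

```python
import unicodedata

compat = "\U0002FA1C"  # the supplementary-plane code point cited above
unified = "\u9F3B"     # the BMP ideograph it is said to be equivalent to

# Canonically equivalent strings normalize to the same result in NFD and NFC.
nfd = unicodedata.normalize("NFD", compat)
nfc = unicodedata.normalize("NFC", compat)

# True when a non-BMP code point normalizes to a single BMP code point.
crosses_planes = ord(compat) > 0xFFFF and len(nfd) == 1 and ord(nfd) <= 0xFFFF
print(nfd == unified, nfc == unified, crosses_planes)
```

So a system whose repertoire was exactly the BMP would accept U+9F3B while rejecting a canonically equivalent string.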
We should have thought of that sooner - what a
Otto Stolz wrote:
As has been said before, in this thread (by Jörg Knappen, IIRC), the
little bow in the -burg abbreviation stems from the u stripped
together with the r.
In German handwriting it used to be common to place a mark above
the letter 'u', to distinguish it from 'n'. When I first saw
At 06:04 PM 9/30/2004, Michael Everson wrote:
see no reason given for us not to unify the handwritten symbol we have
seen with BREVE ABOVE. In the environment described, apparently bg is
taken as an abbreviation for berg, and bg (with breve) is being used as
an abbreviation for burg. The
please ignore.
A./
At 09:48 PM 11/1/2004, Doug Ewell wrote:
Philippe Verdy verdy underscore p at wanadoo dot fr wrote:
Visual entry should never be used. It was used for some legacy
encodings to render text on devices that don't implement the Bidi
algorithm and can only render text as LTR. Nobody enters RTL text
At 08:29 PM 11/6/2004, Doug Ewell wrote:
You've received nine public responses: one from a genuine font designer,
two (or more, depending on interpretation) from people who have designed
fonts at some time but don't identify themselves as font designers,
and the remainder from people who
At 05:14 PM 11/8/2004, Michael Everson wrote:
At 00:47 + 2004-11-09, Peter Kirk wrote:
The aim of Unicode standardisation is surely to define a single and
unambiguous representation of text.
Not at all, not in the least bit. It's to provide encoding for the world's
writing systems. It is
At 10:21 AM 11/14/2004, Doug Ewell wrote:
Throughout all of this, I had completely missed the fact that the Tech
Note for CESU-8 had been upgraded to a Tech Report, two and a half years
ago, in fact. Perhaps I was in denial. Anyway, that ... invalidates many
of my comments...
Noted.
CESU-8 is
At 10:01 PM 11/14/2004, Doug Ewell wrote:
Asmus Freytag asmusf at ix dot netcom dot com wrote:
There are some UTF-8/UTF-16 interoperability aspects that are
addressed by CESU-8. These concerns are real, and affect multi-
component architectures that must interchange data across component
At 01:45 PM 11/15/2004, Philippe Verdy wrote:
Deprecated does not mean that it is not used. This interface remains
accessible when working with internal class file format. I don't
understand however why the storage format of the string constants pool was
not changed when the class format was
At 11:10 AM 11/21/2004, Doug Ewell wrote:
Actually, of course, the only way to *guarantee* that readers will see
the right glyphs is to chuck HTML altogether and create a PDF file.
And that's a task that needs to be approached with some care as well.
The UTC and WG2 constantly get PDF documents
At 04:36 AM 11/24/2004, Peter Kirk wrote:
I understand that the proposed INVISIBLE CHARACTER was rejected at the
recent UTC meeting. I presume that the intention is that NBSP should be
used instead.
At the moment, NBSP is the only sanctioned base character without 'ink'.
There are cases of words
At 04:23 PM 11/23/2004, Chris Jacobs wrote:
Now, this implies that UTF-8 does interpret U+ as an ASCII NULL
control char.
This is incompatible with using it as a string terminator.
Except that it's up to you how to interpret the C0 control codes in Unicode.
You can do it according to ISO 6429
At 04:53 PM 11/24/2004, Peter Kirk wrote:
On 24/11/2004 22:23, Peter Kirk wrote:
On 24/11/2004 22:00, Asmus Freytag wrote:
...
The sequence SPACE NBSP *does* not allow a break after the SPACE under
the line breaking rules we publish in UAX#14.
I tried to change does not into *does* and missed
The fact is, once you dedicate the top bits in a pipe to some purposes,
you've narrowed the width of the pipe. That's what happened to those
systems that implemented a 7-bit pipe for ASCII by using the top bit for
other purposes.
And everybody seems to agree that when you serialize such an
At 04:23 PM 11/26/2004, Peter Kirk wrote:
As I understand it (and I asked for confirmation of this but have not
received it), according to the current version of UAX #14 there is no
break opportunity between SPACE and NBSP, because rule LB11b precedes rule
LB12, although there is a note Many
At 11:13 AM 11/26/2004, Philippe Verdy wrote:
Note however that the ZWJ prohibits breaking, despite in French there's a
possible hyphenation at the first occurrence, where it is also a syllable
break, but not for the second occurrence that occurs in the middle of the
second syllable.
None of the
At 01:26 PM 11/27/2004, Philippe Verdy wrote:
But it's true that the United States have delegated several times their
official international representation to the Unicode Consortium, acting on
behalf of the US government for some decisions or some limited domains
(this is valid because Unicode
At 07:44 PM 11/27/2004, Doug Ewell wrote:
The problem, as Addison pointed out, is that if you use these forms in
text, most searching and sorting operations will fail to recognize them.
That's not the only problem. In some languages other ligatures, such as
fj might be as commonly needed as fi -
At 04:58 PM 11/27/2004, John Hudson wrote:
Mark E. Shoulson wrote:
Well, that's the difference under discussion. The plain text would
seem to be either the qere or the ketiv (but not the combined blended
form), since each of those is somewhat sensible.
Is there some place in the standard where
At 10:10 AM 11/28/2004, Peter Kirk wrote:
And I will remember not to implement the official standard whenever I come
across such a note, but rather to avoid mis-applied conservatism by
following everyone else in breaking the standard.
I would have phrased it as: ... in following everyone else in
Wachs-tube (growth tube)
Not the common reading of this. However, a growth tube or growing tube
might be an implement in some specialized context. But note that such
compounds might also be formed with 'Wuchs-', perhaps even preferentially so.
Therefore, reading 'Wachs-' as wax, as Otto
At 02:14 PM 11/29/2004, Kenneth Whistler wrote:
By the way, Google is your friend. If you want to get
information about such things, googling for it is a
good way to start. I suggest reading:
http://encyclopedia.thefreedictionary.com/Chinese%20writing%20system
As Richard Cook has pointed out, the
At 09:56 PM 12/2/2004, Doug Ewell wrote:
I use ... and UTF-32 for most internal processing that I write
myself. Let people say UTF-32 is wasteful if they want; I don't tend to
store huge amounts of text in memory at once, so the overhead is much
less important than one code unit per character.
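To make the trade-off above concrete, here is a rough sketch that counts code units per encoding form for a short string. The helper name and the sample text are made up for illustration.

```python
def code_units(text: str) -> dict:
    """Count code units needed for text in each Unicode encoding form."""
    return {
        "UTF-8": len(text.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "UTF-32": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }


# One character each from the 1-, 2-, 3- and 4-byte UTF-8 ranges.
sample = "a\u00E9\u20AC\U0001D11E"
counts = code_units(sample)
print(len(sample), counts)
```

Only the UTF-32 count always equals the number of code points; the price is four bytes for each of them.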
At 11:52 PM 12/6/2004, Jony Rosenne wrote:
In chapter 8, regarding Hebrew, the standard says:
Positioning. Marks may combine with vowels and other points, and there are
complex typographic rules for positioning these combinations.
I understand that this sentence should be regarded as being
At 09:50 PM 12/6/2004, John Hudson wrote:
I don't know. I try to avoid politics, if possible. The significance of
what I'm saying is that you have made a good start in your proposal, that
it has some shortcomings, and that I hope to be able to help put something
more complete together.
It
At 12:50 PM 12/10/2004, Kenneth Whistler wrote:
Tim Greenwood asked:
... a perfectly normal linguistic process of
attributive disambiguation of a term which had grown ambiguous
in usage.
Is that like the 'Please RSVP' that I see all too often? Or should
that not be excused?
*grins* Well,
At 04:32 PM 12/23/2004, James Kass wrote:
Public Review Issue # 59 concerning danda and double danda
doesn't mention the Limbu script specifically.
The double danda, at least, is used in the Limbu script.
See the exhibit on page 12 of N2410.PDF. It's also listed
in the Limbu punctuation shown on
On 5/31/2010 12:33 PM, Tulasi wrote:
Thanks Mark for posting the links!
My posting was based on
http://www.unicode.org/consortium/directors.html
where in the bottom it said Unicode Inc.
Looks like the elected members from consortium
http://www.unicode.org/consortium/consort.html
forms Unicode
On 5/31/2010 2:12 PM, V. M. Kumaraswamy wrote:
Hello all,
Just a clarification on UNICODE.
Is UNICODE a STANDARD
Yes, Unicode (The Unicode Standard), is indeed a standard.
And no, the use of ALL CAPS is discouraged. The
proper spelling is Unicode.
that needs to be followed by all
On 6/1/2010 1:37 PM, John Dlugosz wrote:
Why does the code chart call the plain Greek letter (upper and lower
case) “LAMDA” rather than “LAMBDA”? The latter is used in other places
where a glyph is based on the lambda, e.g. “U+019B LATIN SMALL LETTER
LAMBDA WITH STROKE”
Names sometimes
On 6/1/2010 4:14 PM, Mark Crispin wrote:
Is it really necessary to have this sort of pedagogical discussions on
the Unicode list?
Is this character name misspelled?
Is Unicode a for-profit company?
Who owns the Unicode font?
etc. etc.
Perhaps we need to have a
On 6/1/2010 6:04 PM, Mark Crispin wrote:
I don't think that the unicode list should be used for the type of
questions that have polluted it recently.
That list unicode@unicode.org is open for general questions.
It has no formal standing as far as the business of the Consortium
is concerned, and
On 6/1/2010 8:04 PM, Kannan Goundan wrote:
I'm trying to come up with a compact encoding for Unicode strings for
data serialization purposes. The goals are fast read/write and small
size.
Why not use SCSU?
You get the small size and the encoder/decoder aren't that complicated.
You get the
On 6/2/2010 11:46 AM, Jonathan Rosenne wrote:
Although this mail was not addressed to me, I did read it. Sue me.
The terms of use for the Unicode mail list essentially state that these
types of boilerplate are null and void as far as Unicode is concerned.
You will find the following in
On 6/2/2010 3:28 PM, John Dlugosz wrote:
If anyone can “null and void” it, I wonder why companies bother to put
such things in people’s outgoing mail. I would have thought they could
come up with a proper netiquette version, but they just don’t care.
These things are bogus, because they
SCSU is a pass-through for ASCII, plus it handles the common mix of
ASCII plus 96 local characters (Latin-1, Greek, Cyrillic, Thai, etc)
really fast. Go look at the sample code. If you take that as starting
point for optimization, I think you'll be fine.
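The pass-through behaviour can be illustrated with a deliberately tiny sketch. It is not the sample code referred to above and not a full SCSU encoder: it assumes the encoder never leaves its initial state (single-byte mode, with the default dynamic window starting at U+0080, as I understand UTS #6) and refuses anything else. The function name is hypothetical.

```python
# Controls that single-byte mode passes through unchanged (NUL, TAB, LF, CR).
PASS_THROUGH_CONTROLS = {0x00, 0x09, 0x0A, 0x0D}


def scsu_initial_state(text: str) -> bytes:
    """Encode text using only SCSU's initial state; raise for anything else.

    Assumption: the default active window begins at U+0080, so bytes
    0x80..0xFF stand for U+0080..U+00FF. A real encoder would emit tag bytes
    to switch windows or modes instead of raising.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp <= 0x7F or cp in PASS_THROUGH_CONTROLS:
            out.append(cp)  # ASCII pass-through
        elif 0x80 <= cp <= 0xFF:
            out.append(cp)  # offset into the default window at U+0080
        else:
            raise ValueError(f"U+{cp:04X} needs a window or mode switch")
    return bytes(out)


encoded = scsu_initial_state("Gr\u00FC\u00DFe")
print(encoded)
```

Under that assumption, Latin-1 text costs one byte per character with no tags at all, which is the fast path described above.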
On 6/4/2010 8:34 AM, Mark Davis ☕ wrote:
In a compression format, that doesn't matter; you can't expect random
access, nor many of the other features of UTF-8.
The minimal expectation for these kinds of simple compression is that
when you write a string with a particular /write/ method, and
On 6/7/2010 4:26 PM, Masaaki Shibata wrote:
I'm studying the UAX #14 (5.2.0) and testing my code against
LineBreakTest.txt. And I found some test cases on this text file seem
to be contradictory to the rules on the document.
For example, LB25 explicitly prohibits breaking between CP and PO,
Can we stop double posting on Unicode and Unicore list?
People on the unicode list cannot reply to people on the other list,
and vice versa (unless they happen to be members of both lists).
Thanks.
A./
On 6/14/2010 1:18 PM, Mark E. Shoulson wrote:
On 06/14/2010 02:15 PM, Asmus Freytag wrote:
On 6/14/2010 9:21 AM, Stephen Slevinski wrote:
Plain text SignWriting should be able to write actual sign language,
such as hello world.
You could equally well insist that it should be possible
On 6/17/2010 7:24 PM, Tulasi wrote:
What is equivalent ISO/IEC
ISO/IEC what?
There are hundreds of ISO/IEC standards, of which dozens are character
encoding standards.
for U+0278 LATIN SMALL LETTER PHI (ɸ)?
Or do Unicode ISO/IEC use different number name for same letter/symbol?
On 6/26/2010 5:41 PM, Doug Ewell wrote:
Regarding the inability to distinguish 8859-15 heuristically from
8859-1, I understand the problem when there are no tags or other
hints, or for cases like Windows-1252 text declared to be 8859-1, but
it seems unlikely to me that there is much text
The one argument that I find convincing is that too many implementations
seem set to disallow generic combination, relying instead on fixed
tables of known/permissible combinations.
In that situation, a formally adopted character with the clearly stated
semantic of is expected to actually
On 6/28/2010 11:38 AM, Mark Davis ☕ wrote:
The problem with slavishly following the charset parameter is that it
is often incorrect. However, the charset parameter is a signal into
the character detection module, so the charset is correctly supplied
from the message then the results of the
I'd like to second Mark.
There is a lot of information in the Standard, including the UAXs, and
the Unicode Character Database that would help answer your questions.
The volunteers associated with the Unicode effort have worked hard
putting all that information together - so use it, instead
Andreas,
I think we all realize your frustration with well-meaning software.
Because tags can be wrong for no fault of the human originating the
document,
I fully understand that Google might want to attempt to improve the user
experience in such situations.
The problem is that doing so
On 7/24/2010 3:00 PM, Bill Poser wrote:
On Sat, Jul 24, 2010 at 1:00 PM, Michael Everson ever...@evertype.com wrote:
Digits can be scattered randomly about the code space and it wouldn't make any
difference.
Having written a library for performing conversions between Unicode
strings
The short answer to Karl's question is that there will not be an
absolute guarantee.
The long answer is that, partly for the reasons he's mentioned, this
won't be a practical problem.
A. Most of the living scripts that are in wide use have been encoded,
including whatever digits are in use.
On 7/25/2010 6:05 PM, Martin J. Dürst wrote:
On 2010/07/26 4:37, Asmus Freytag wrote:
PPS: a very hypothetical tough case would be a script where letters
serve both as letters and as decimal place-value digits, and with modern
living practice.
Well, there actually is such a script, namely
On 7/26/2010 12:13 PM, Mark Davis ☕ wrote:
I agree that having it stated at point of use is useful - and we do
that in other cases covered by stability clauses; but we can only
state it IF we have the corresponding stability policy.
Mark,
The statement in your but clause really isn't correct.
On 7/27/2010 3:02 PM, Kenneth Whistler wrote:
Karl Williamson asked:
Subject: Why does EULER CONSTANT not have math property and PLANCK CONSTANT
does?
They are U+2107 and U+210E respectively.
Because U+210E PLANCK CONSTANT is, to quote the standard,
simply a mathematical
On 7/28/2010 2:02 AM, Kent Karlsson wrote:
Den 2010-07-28 09.50, skrev Jukka K. Korpela jkorp...@cs.tut.fi:
André Szabolcs Szelp wrote:
Generally, for the decimal point . (U+002E FULLSTOP) and , (U+002C
COMMA) is used in the SI world. However, earlier conventions could use
different
On 7/28/2010 10:09 AM, Murray Sargent wrote:
Contextual rendering is getting to be more common thanks to adoption of OpenType features. For example, both MS Publisher 2010 and MS Word 2010 support various contextually dependent OpenType features at the user's discretion. The choice of glyph for
On 7/28/2010 10:13 PM, Martin J. Dürst wrote:
Sequences of numeric Kanji are also used in names and word-plays, and
as sequences of individual small numbers.
But the same applies to our digits. A very simple example is to use
them as a ruler in plain text:
1 2 3
On 7/28/2010 9:32 PM, Doug Ewell wrote:
Murray Sargent murrays at exchange dot microsoft dot com wrote:
It's worth remembering that plain text is a format that was
introduced due to the limitations of early computers. Books have
always been rendered with at least some degree of rich text. And
On 8/2/2010 5:04 PM, Karl Pentzlin wrote:
I have compiled a draft proposal:
Proposal to add Variation Sequences for Latin and Cyrillic letters
The draft can be downloaded at:
http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic2.pdf (4.3 MB).
The final proposal is intended to be submitted
On 8/4/2010 1:30 PM, verdy_p wrote:
Asmus Freytag wrote:
The Fraktur problem is one where one typestyle requires additional
information (e.g. when to select long s) that is not required for
rendering the same text in another typestyle. If it is indeed desirable
(and possible) to create
Philippe,
Text typeset in Fraktur contains more information than text typeset in
Antiqua. That means, there are some places where there are some (mild)
ambiguities in representation in the Antiqua version. Not enough to
bother a human reader who can use deep context to read the text
correctly,
On 8/5/2010 3:47 AM, William_J_G Overington wrote:
On Wednesday 4 August 2010, Asmus Freytag asm...@ix.netcom.com wrote:
However, there's no need to add variation sequences to
select an *ambiguous* form. Those sequences should be
removed from the proposal.
Are you here talking about
On 8/6/2010 2:03 AM, William_J_G Overington wrote:
On Thursday, 5 August 2010, Kenneth Whistler k...@sybase.com wrote:
I am thinking of where a poet might specify an ending version of a glyph at the
end of the last word on some lines, yet not on others, for poetic effect. I
think that it
The first discussions that led to the current formulation of the bidi
algorithm easily go back 20 years by now. There's some value in not
re-stating a specification - even if a new formulation could be found to
be 100% equivalent. That value lies in the fact that any reader can
tell, by
On 9/18/2010 8:36 AM, abysta wrote:
Hello.
I need a dot to separate words into syllables. What should I use, 00B7 or 2027,
and why?
2027 is explicitly intended to be used to show syllables as is done in
dictionaries. You don't make it explicit in your query, but it sounds
like that is
On 9/18/2010 10:56 AM, Lorna Priest wrote:
U+00B7 MIDDLE DOT is semantically ambiguous and has (partly
therefore) varying renderings, and it might be used as a replacement
for U+2027 if the latter cannot be used reliably.
What about using U+02D1 - half triangular colon?
Why not use
On 10/11/2010 9:49 PM, Janusz S. Bień wrote:
On Mon, 11 Oct 2010 announceme...@unicode.org wrote:
The newly finalized Unicode Version 6.0 adds 2,088 characters,
What is the current total? Are other statistics available
somewhere?
The announcement gives a link to click
Ken,
some comments, and a few suggestions near the end.
On 10/12/2010 4:56 PM, Kenneth Whistler wrote:
Karl Williamson asked:
The Unicode standard only gives numeric values to rational numbers. Is
the reason for this merely because of the difficulty of representing
irrational ones?
No.
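For readers who want to see what the numeric values look like in practice, here is a small sketch using Python's unicodedata module. That module returns floats, so the helper (a hypothetical name) converts them back to exact fractions; the limit of 1000 on the denominator is an arbitrary choice for the example.

```python
import unicodedata
from fractions import Fraction


def numeric_value(ch: str):
    """Return a character's numeric value as a Fraction, or None if it has none."""
    value = unicodedata.numeric(ch, None)
    if value is None:
        return None
    # unicodedata reports a float; recover the rational it approximates.
    return Fraction(value).limit_denominator(1000)


print(numeric_value("\u00BD"))  # VULGAR FRACTION ONE HALF
print(numeric_value("\u2153"))  # VULGAR FRACTION ONE THIRD
print(numeric_value("\u03C0"))  # GREEK SMALL LETTER PI
```

Characters such as the Greek letter pi carry no numeric value at all, even though they are commonly used to denote a number.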
On 10/16/2010 10:38 AM, suzuki toshiya wrote:
Hi,
I've never heard any comments about the reservation
of the codepoints to make the code chart structure
similar among multiple scripts, neither positive nor negative.
So your comment is interesting. Could you tell me more
about what kind of
On 10/17/2010 7:01 AM, Michael D. Adams wrote:
This is something that not even the C++ and Java reference
implementations do (though it appears that the C++ implementation of
the W rules was originally derived from a regular expression as it
uses state tables, but if so it is undocumented).
On 10/17/2010 10:59 AM, Michael D. Adams wrote:
The biggest challenge was not in creating those tables, but in
understanding the nuances of the rules, by the way.
Two questions so I can understand better.
First, by nuances do you mean the nuances of how the rules interact
(which I think would
On 11/4/2010 5:46 PM, Doug Ewell wrote:
Markus Scherer wrote:
While processing 16-bit Unicode text which is not assumed to be
well-formed UTF-16, you can treat (decode) an unpaired surrogate as a
mostly-inert surrogate code point. However, you cannot unambiguously
encode a surrogate code
On 11/5/2010 7:02 AM, Doug Ewell wrote:
Asmus Freytag asmusf at ix dot netcom dot com wrote:
I'm probably missing something here, but I don't agree that it's OK
for a consumer of UTF-16 to accept an unpaired surrogate without
throwing an error, or converting it to U+FFFD, or otherwise raising
If you want to get that point across to a general audience, you could
use a more colloquial term, albeit one that itself derives from mathematics.
Text that can be completely expressed in ASCII fits into something
(ASCII) that works as a lowest common denominator of a large number of
On 11/14/2010 12:57 PM, Doug Ewell wrote:
Jim Monty jim dot monty at yahoo dot com wrote:
Japanese kana (the J in CJK) and Korean syllables (the K in
CJK) both have different normalization forms. What do ideographs
have to do with anything? I didn't mention ideographs; you did.
The term CJK
On 11/15/2010 2:24 PM, Kenneth Whistler wrote:
FA47 is a compatibility character, and would have a compatibility mapping.
Faulty syllogism.
Formally correct answer but only because of something of a design flaw
in Unicode. When the type of mapping was decided on, people didn't fully
expect
On 11/15/2010 5:43 PM, Kenneth Whistler wrote:
Perhaps someone would like to make a detailed proposal to
the UTC for how to fix the text and charts?;-)
Ken,
having shown yourself the master of detail in your reply, I think you've
appointed yourself.
A round of applause for Ken!
See how
On 11/18/2010 8:04 AM, Peter Constable wrote:
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf
Of André Szabolcs Szelp
AFAIR the reservations of WG2 concerning the encoding of Jangalif
Latin Ь/ь as a new character were not in view of Cyrillic Ь/ь, but
rather in
On 11/18/2010 11:15 PM, Peter Constable wrote:
If you'd like a precedent, here's one:
Yes, I think discussion of precedents is important - it leads to the
formulation of encoding principles that can then (hopefully) result in
more consistency in future encoding efforts.
Let me add the
On 11/22/2010 4:15 AM, Michael Everson wrote:
It boils down to this: just as there aren’t technical or usability reasons that
make it problematic to represent IPA text using two Greek characters in an
otherwise-Latin system,
Yes there are. Sorting multilingual text including Greek and IPA