RE: IPA Null Consonant

2003-06-04 Thread Kent Karlsson
Jim Allan wrote: > Kent Karlsson posted on the use of slashed zero for empty set: > > > Yes... A horrible glyph for denoting the empty set, if I may say so. > > No > > offence intended. Please use the glyph available via the command > > \varnothing (a misleading n

RE: IPA Null Consonant

2003-06-03 Thread Kent Karlsson
Karl Pentzlin wrote: ... > At present, Unicode has not a character which fulfills this need > uniquely and unanimously (as this thread shows). > If there was a need to include such a character into Unicode, this > would have happened long before (considering the many linguists here), > or at least

RE: IPA Null Consonant

2003-06-03 Thread Kent Karlsson
Ken, Thanks for your thorough explanation! Finally something that is at least partially convincing! (Of that it *sometimes* is a borrowing of *symbolism* from set theory notation, that is, nothing more.) > Gustav Leunbach (1973), Morphological Analysis as a Step in > Automated Syntacti

RE: IPA Null Consonant

2003-06-03 Thread Kent Karlsson
> > The empty set symbol is a math symbol, not expected to ever occur (properly) > > in a word-like context. Capital O with stroke, however, is a > > letter, and can easily and without any problems occur in a word-like context. > > Which is exactly why it would be a terrible choice to indicate n

RE: IPA Null Consonant

2003-06-03 Thread Kent Karlsson
Jim Allan wrote: > Ken Whistler posted: > > > And what I pointed > > out earlier is that, in *linguistic* usage, the slashed zero > > glyph is clearly an acceptable glyphic variant of the > > empty set symbol. So to claim it is "completely unrelated" > > is to manifestly ignore actual practice.

RE: IPA Null Consonant

2003-06-03 Thread Kent Karlsson
> > > > - Ø [LATIN CAPITAL LETTER O WITH STROKE] and ø [LATIN > > > SMALL LETTER O > > > > WITH STROKE] are both ruled out as their semantics is > > > totally wrong. > > > > Not at all (as seen by example Jarkko quoted!). In Danish > > and Norwegian, > > yes. But in Swedish and Finnish t

RE: IPA Null Consonant

2003-06-03 Thread Kent Karlsson
Peter Constable wrote: > According to > http://www.unicode.org/Public/UNIDATA/StandardizedVariants.html, there is > no variation sequence < 2205, FE00 > defined. Somebody needs to tell the > author(s) of this page that they can't make up their own variation-selector > sequences. Indeed. And I ha

RE: IPA Null Consonant

2003-05-30 Thread Kent Karlsson
> >> Surely a 0 with stroke, not a O with stroke. > > > >The empty set sign was originally definitely the Norwegian/Danish letter > >CAPITAL O WITH STROKE. > > I do not believe you. > > >It never was related at all to a ZERO with stroke. > > Why not? Zero and emptiness are closely related. D

RE: IPA Null Consonant

2003-05-30 Thread Kent Karlsson
Michael Everson wrote: > >(Remember that the empty set symbol really was an O with stroke, > >originally!) > > Surely a 0 with stroke, not a O with stroke. The empty set sign was originally definitely the Norwegian/Danish letter CAPITAL O WITH STROKE. It never was related at all to a ZERO with

RE: IPA Null Consonant

2003-05-30 Thread Kent Karlsson
John Cowan wrote: > > I have yet to see anyone quote a linguistic texts that *explicitly* says that > > they use the empty set symbol for this "empty" linguistic entity. > > Well, a linguistics paper I read yesterday (citation on request) definitely > used the slashed-circle, aka empty set sign,

RE: IPA Null Consonant

2003-05-29 Thread Kent Karlsson
Peter Constable wrote > > But I doubt you will find any linguist who would consider the Norwegian > > capital slashed O as anything other than a kludge replacement for > > either the standard round empty set symbol or the slashed zero symbol. > > Hear! Hear! Then use a slashed zero (; which see

RE: IPA Null Consonant

2003-05-29 Thread Kent Karlsson
Jim Allan wrote: > Kent Karlsson posted: > > > And I (still!) very strongly disagree. The empty set symbol stands > > for the empty set (also written {}). But there is no set here, let alone > > an empty one. Possibly an empty string (of phonetic symbols?). > &g

RE: IPA Null Consonant

2003-05-29 Thread Kent Karlsson
Behalf Of > [EMAIL PROTECTED] > Sent: Tuesday, May 27, 2003 2:32 PM > To: [EMAIL PROTECTED]; [EMAIL PROTECTED] > Subject: RE: IPA Null Consonant > > > > Kent, the symbol used in linguistics is not the Danish capital vowel; > > it is the empty set symbol. A rather categorical statement from M

RE: Dutch IJ, again

2003-05-29 Thread Kent Karlsson
Kenneth Whistler quoted and wrote: > > > From: "Anto'nio Martins-Tuva'lkin" <[EMAIL PROTECTED]> > > > > On 2003.05.25, 00:00, Philippe Verdy <[EMAIL PROTECTED]> wrote: > > > > > even if the Dutch language considers it as a single letter, in a > > > > > way similar to the Spanish "ch" > > > > > >

RE: IPA Null Consonant

2003-05-29 Thread Kent Karlsson
> I absolutely concur with Peter, Michael, and Lukas that > U+2205 EMPTY SET > is the correct and intended character to deal with this semantic > of null morphemes and other linguistic "zeroes" in technical > linguistic representation. And I (still!) very strongly disagree. The empty set symbol

RE: Several BOMs in the same file

2003-03-25 Thread Kent Karlsson
> In that case, removing the BOM that would end up somewhere in the > middle is the natural thing to do, just as removing the EOF marker > at the end of the first file is. There is no "EOF marker" at the end of a file. At least not in modern file systems. There is no NULL, CTRL-Z, or CTRL-D

RE: Several BOMs in the same file

2003-03-25 Thread Kent Karlsson
> You command above would now expand to something like this: > > cat -R UTF-16 -F UTF-16LE file1 -F Big-5 file2 > file3 > > Provided with information about the input encodings and the > expected output > encoding, "cat" could now correctly handle BOM's, endianness, new-line > conventions,

RE: Several BOMs in the same file

2003-03-25 Thread Kent Karlsson
> Let's say that I have two files, namely file1 & file2, in any Unicode > encoding, both starting with a BOM, and I compile them into > one by using > > cat file1 file2 > file3 For POSIX implementations, this concatenates the octets (bytes) in the two files, whether either of them is text in U
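The byte-level behaviour described in this snippet can be sketched in Python. This is only an illustration: the helper name concat_bytes and the sample contents are made up here, and UTF-8 is assumed for both inputs. It shows that a plain octet concatenation leaves the second signature in the middle of the result, and that a decoder which strips a signature strips only the leading one.

```python
import codecs

def concat_bytes(*chunks: bytes) -> bytes:
    """Join byte strings the way a byte-level `cat` would: no interpretation at all."""
    return b"".join(chunks)

# Two hypothetical UTF-8 "files", each starting with a BOM (signature).
file1 = codecs.BOM_UTF8 + "abc".encode("utf-8")
file2 = codecs.BOM_UTF8 + "def".encode("utf-8")

file3 = concat_bytes(file1, file2)

# "utf-8-sig" removes only a signature at the very start of the data;
# the second BOM survives as U+FEFF inside the text.
text = file3.decode("utf-8-sig")
```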

RE: U+00D0, U+01b7 -- variants or distinct chars?

2003-03-18 Thread Kent Karlsson
> > use, though there is some potential motivation: the need to provide a > > contrast between capitals for 00f0 and 0256 in a single language -- I > > If I may chime in, I agree on the need to distinguish letters like > these visually. It doesn't have to be in one language; any occasion > on whi

RE: Ligatures

2003-03-13 Thread Kent Karlsson
  probably didn't come out right. I never meant to say moving the characters apart was the best solution. Moving only the offending accent mark rather than the entire (composite) character might help in some cases, but this technique also should be used with care. Like in the ca

RE: Ligatures (was: FAQ entry)

2003-03-11 Thread Kent Karlsson
> I don't see how that's possible in the general case. In particular, ï > just about has to be wider than i (except in a monowidth font, obviously), > or the dots will collide with whatever's nearby. Similarly with i-macron. A diaeresis or macron over i or j can be narrower than when over most

RE: Ligatures (was: FAQ entry)

2003-03-11 Thread Kent Karlsson
> I agree with you; on the one hand, the examples mentioned like få > and fè and so on don't look very nice as is and could use a little > correction; but they would benefit more from adding a pixel or so of I was thinking more about high resolution (where pixels are so small you nearly cannot se

RE: Ligatures (qj)

2003-03-10 Thread Kent Karlsson
Joop Jagers: > The problem of overlapping glyphs should IMO not be solved by creating > ligatures, but by kerning the offending pairs of glyphs. You mean negative kerning (tweaking them apart)? That is almost certain to create horrible glyph spacing for many fonts. Note that by "ligature" we don

RE: Ligatures (was: FAQ entry)

2003-03-10 Thread Kent Karlsson
> > Yes, and qj. And similarly, f has overlappings with several more > > letters, so you would need ligatures for fb, fh, fk, fþ etc. But then > > where would it end? > > I suspect it would end when you start talking about > combinations like qj > and fþ that are unlikely to appear in natural la

RE: FAQ entry

2003-03-07 Thread Kent Karlsson
> Actually, it is of orthographic significance: it is not > uncommon for good fonts to have an fj ligature. That's typography, not orthography. But I would appreciate it if more fonts had an fj ligature, and (e.g.) a gj ligature too (in some fonts gj otherwise have overlapping glyphs). /ken

RE: FAQ entry

2003-03-07 Thread Kent Karlsson
> > Typographically, it's a ligature either way. > > You mean that both ae and ij should be called ligatures, > although one is fused and the other isn't? No. What I'm trying to say is that the names do not really matter. While there is a strive to give "good" names to characters, they sometimes

RE: FAQ entry

2003-03-07 Thread Kent Karlsson
> > E.g., it is quite legitimate to render, e.g. LIGATURE FI as > an f followed > > by an i, no ligation, whereas that is not allowed for the ae > > ligature/letter, nor for the oe ligature. > > How do you know that? Either "Caesar" or "Cæsar" is good Latin. That's the other way around. Ligatin

RE: FAQ entry (was: Looking for information on the UnicodeData file)

2003-03-07 Thread Kent Karlsson
> > For instance, the Danish ae (U+00E6) is not designated a ligature, > > It was in Unicode 1.0; I think politics were involved in that one. > In Latin use, ae is most certainly a ligature, and likewise in the > languages (including English) that have borrowed words involving it. > In Danish use,

RE: FAQ entry (was: Looking for information on the UnicodeData file)

2003-03-07 Thread Kent Karlsson
The names do NOT always provide correct descriptions of the characters. This is especially true for "digraph" and "ligature" (and in the case of U+00E6 too), as well as (e.g.) SCRIPT CAPITAL P, which is neither script, nor capital (it's lowercase), though it is a p... In addition, there are diff

RE: Caron / Hacek?

2003-03-07 Thread Kent Karlsson
> > By the way, although Unicode calls it a cedilla, the > correct form to use > > with G is the disconnected, 'under comma' form. > > Ah yes, the cedillas; now these are ambiguous! > What is the "correct form" for cedillas under N, K, L, R, S > and T? What should these look like? Well, the e

RE: Ya-phalaa

2003-03-06 Thread Kent Karlsson
> >> Moreover, RA + VIRAMA + YA cannot represent "Ra-yaphalaa" > as Ra+Virama > >> is relied upon as being representative of Reph. > >> For example, in the Indic OpenType specifications, you will > see that a > >> Ra+Virama is recognised as reph before any other > processing is applied. > > > >

RE: Reph and Khmer encoding model

2003-03-04 Thread Kent Karlsson
> I understand that unicode is supposed to represent the > language, not the way it is written. No, Unicode is supposed to be able to represent the written form. (Of course.) ... > Let's consider the ra+virama+ya case. In the mostpart the > ra+virama+ya is > displayed as ya+reph. This obviou

RE: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-02 Thread Kent Karlsson
Michael (michka) Kaplan: ... > then the conversion will simply strip the errant characters. Note that > either solution meets the needs of refusal to interpret the errant > sequences. Simply stripping the errant byte sequences means that they are each interpreted as the empty string of character
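The difference between stripping an errant byte sequence and refusing or marking it can be sketched with Python's built-in UTF-8 decoder. The sample bytes are made up for illustration; the error-handler names ("strict", "ignore", "replace") are Python's, not anything prescribed in the thread.

```python
# One stray byte (0xFF can never occur in well-formed UTF-8) between "ab" and "cd".
data = b"ab\xffcd"

# Stripping: the errant byte is effectively treated as an empty string of characters.
stripped = data.decode("utf-8", errors="ignore")

# Marking: the errant byte is turned into U+FFFD REPLACEMENT CHARACTER.
marked = data.decode("utf-8", errors="replace")

# Refusing: the default strict handler raises instead of producing text.
try:
    data.decode("utf-8")
    refused = False
except UnicodeDecodeError:
    refused = True
```

After stripping, the result is indistinguishable from the decoding of b"abcd", which is the concern raised in the snippet.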

RE: discovering code points with embedded nulls

2003-02-06 Thread Kent Karlsson
> From what I'm hearing from you all is that a null in UTF-8 is > for termination and termination only. > Is this correct? No, NULL is a character (actually a control character) among many others. However, many C/C++ APIs (mis)use NULL as a string terminator since NULL isn't very useful for othe
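A small Python sketch of the point above, with made-up sample text: U+0000 encodes in UTF-8 as a single zero byte and round-trips like any other character, while code that follows the C string-terminator convention only sees the part before it. The split call below merely imitates that convention; it is not a real C API.

```python
text = "ab\u0000cd"                 # five characters, the third is U+0000
encoded = text.encode("utf-8")      # U+0000 becomes the single byte 0x00

# Round trip: the NUL character is preserved like any other.
round_tripped = encoded.decode("utf-8")

# Imitation of a NUL-terminated C string view of the same bytes.
c_style_view = encoded.split(b"\x00", 1)[0]
```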

RE: Suggestions in Unicode Indic FAQ

2003-02-03 Thread Kent Karlsson
> > No, with proper reordering (and "normal" display mode), the e-matra at > > the beginning of the second word would appear to be last glyph of the > > first "word". Similarly, for the second case, the e-matra glyph would > > have come to the left of the pa. The fluent reader (ok, not me...)

RE: Suggestions in Unicode Indic FAQ

2003-02-03 Thread Kent Karlsson
> --- Kent Karlsson <[EMAIL PROTECTED]> wrote: > > > > > > No fallback rendering is coming into picture with your explanation. > > > > Yes, there is. A character sequence (say) > > is very unlikely to have a ligature, specially adapted (and fit

RE: Suggestions in Unicode Indic FAQ

2003-01-31 Thread Kent Karlsson
Keyur Shroff wrote: ... > > No fallback rendering is coming into picture with your explanation. Yes, there is. A character sequence (say) is very unlikely to have a ligature, specially adapted (and fitting) adjustment points, or similar. The rendering would in that sense need to use a fallba

RE: Suggestions in Unicode Indic FAQ

2003-01-30 Thread Kent Karlsson
> Let me give a proper example this time. Consider a "Vowel Sign E" [U+0947] > appearing after any non-consonant character. This sign is generally > attached to the consonants. It has zero advance width with negative left > side bearing in the font. Ok. > Clearly, since in this case the sign is

RE: Suggestions in Unicode Indic FAQ

2003-01-30 Thread Kent Karlsson
> > I don't know where you find support for that position in that text. > > Can you please quote? There are no "invalid base consonants" for > > any dependent vowel (for Indic scripts; similarly for any > > other script). > > Actually, there is a mention of displaying combining marks on dotted

RE: Suggestions in Unicode Indic FAQ

2003-01-29 Thread Kent Karlsson
Keyur Shroff wrote > Kent Karlsson <[EMAIL PROTECTED]> wrote: > > > > A space followed by a dependent vowel sign should display just the > > dependent vowel sign, no dotted circle. Indeed, (except for a "show > > invisibles" mode, or a "chara

RE: Indic Devanagari Query

2003-01-29 Thread Kent Karlsson
> > I wouldn't go so far. The fact that clusters belong together is something > > that can be handled by the software. Collation and other data processing > > needs to deal with such issues already for many other languages. See > > http://www.unicode.org/reports/tr10 on the collation algorithm.

RE: Suggestions in Unicode Indic FAQ

2003-01-29 Thread Kent Karlsson
> The [new] INV character in Unicode can also be used for displaying dependent > vowel matras without dotted circle. A space followed by a dependent vowel sign should display just the dependent vowel sign, no dotted circle. Indeed, (except for a "show invisibles" mode, or a "character chart" dis

RE: Unicode Standards for Indic Scripts

2003-01-08 Thread Kent Karlsson
Marco Cimarosti wrote: > I downloaded the first two issues, but I see no mention to any formal > proposal to Unicode. Perhaps it's in the third one. Not that I have found anyway (I haven't read all the hundreds of pages, just e-flipped through...). However, all three issues contain modified code

RE: XTF-Morse (was RE: UTF-Morse)

2002-11-22 Thread Kent Karlsson
Why so ASCII-biased? ;-) See http://www.qsl.net/dk5ke/intcode.html. /kent k

Roman numerals (was: RE: Lowercase numerals)

2002-11-21 Thread Kent Karlsson
> > The same is true for M which had, amongst its many early > forms, a form close > > to (I), [...] > > > Cf. U+2180 in . > > Note that V is half an X (the upper half), > L emerged from half a C (the upper half of its original form), > D is half a U+2

RE: Confused by the difference between Case Mapping Charts and SpecialCasing.txt (U+0130)

2002-11-19 Thread Kent Karlsson
Not a stupid question at all. The reason SpecialCasing.txt changes the case mapping for dotted uppercase I is as follows: Take any two strings that are *canonically equivalent*. One in Normal Form C (maximally composed) and one in Normal Form D (decomposed). Now map the two strings to lowercas
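The snippet is cut off, but the experiment it sets up can be tried directly. The sketch below assumes the argument continues along the lines that the two lowercased results should stay canonically equivalent; it uses Python's str.lower(), which applies the full (SpecialCasing) mapping for U+0130, and the standard unicodedata module.

```python
import unicodedata

composed = "\u0130"                                   # LATIN CAPITAL LETTER I WITH DOT ABOVE (NFC)
decomposed = unicodedata.normalize("NFD", composed)   # "I" followed by U+0307 COMBINING DOT ABOVE

lower_composed = composed.lower()       # full case mapping: "i" + U+0307
lower_decomposed = decomposed.lower()   # "I" -> "i", the combining dot is unchanged

# Compare the two results after normalizing both to the same form.
still_equivalent = (
    unicodedata.normalize("NFC", lower_composed)
    == unicodedata.normalize("NFC", lower_decomposed)
)
```

With a simple one-to-one mapping of U+0130 to a plain "i", the first result would lose the dot while the second would keep it, so the two would no longer be canonically equivalent.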

RE: ct, fj and blackletter ligatures

2002-11-07 Thread Kent Karlsson
> Kent Karlsson wrote: > > (Subword boundaries are likely hyphenation > > points, whereas occurrences of ff, fi etc. elsewhere are > > unlikely hyphenation points.) > > I am sorry to always contradict you I don't think we always contradict each other! ;-) Indeed

RE: Names for UTF-8 with and without BOM - pragmatic

2002-11-07 Thread Kent Karlsson
> Initial for each piece, as each is assumed to be a complete > text file before concatenation. Nothing > prevents copy/cp/cat and other commands from recognizing > Unicode signatures, for as long as they > don't claim to preserve initial U+FEFF. Yes there is, in a formal sense, for cat and c

RE: Names for UTF-8 with and without BOM - pragmatic

2002-11-06 Thread Kent Karlsson
> True, UTF-16 files do need a signature. Eh, no! "UTF-16BE" and "UTF-16LE" files (or whatever kind of text data element) do not have any signature/BOM. Not even files (somehow) labelled "UTF-16" need have a signature/BOM, without a BOM they are then the same as if it was labelled "UTF-16BE".
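The rule stated here (data labelled "UTF-16" may carry a BOM, and without one is read as big-endian) can be written out by hand. This is a minimal sketch; the function name decode_utf16_labelled is hypothetical, and it deliberately does not rely on any library's default behaviour for BOM-less UTF-16.

```python
import codecs

def decode_utf16_labelled(data: bytes) -> str:
    """Decode data labelled "UTF-16": honour a leading BOM, else assume big-endian."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # No signature: treated the same as if the data were labelled "UTF-16BE".
    return data.decode("utf-16-be")

no_bom = decode_utf16_labelled(b"\x00a")
with_be_bom = decode_utf16_labelled(b"\xfe\xff\x00a")
with_le_bom = decode_utf16_labelled(b"\xff\xfea\x00")
```

Data labelled "UTF-16BE" or "UTF-16LE" would skip the BOM check entirely and decode with the named byte order.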

RE: In defense of Plane 14 language tags (long)

2002-11-06 Thread Kent Karlsson
> I think it's time to remember the limited purpose for which Plane 14 > tagging was created: plain-text protocol messages. The idea is that Well, not really. The Plane 0E (!) tag characters were invented solely for "political" reasons for ONE IETF working group. But not even that one IETF WG

RE: ct, fj and blackletter ligatures

2002-11-06 Thread Kent Karlsson
> > Firstly, the claim that there must be no ligation over subword > > boundaries is made only for German. > > It is also valid for Slovak and Czech. ok. I still wonder a bit why. It does not help the reader in any significant way, esp. when many different words are spelled the same quite r

RE: ct, fj and blackletter ligatures

2002-11-05 Thread Kent Karlsson
> German is indeed a special case, and there are various ideas > for how best > to handle German ligation. Clearly, inserting ZWJ where one > wanted ligation > -- or, perhaps, ZWNJ where one didn't want it -- is an > option. Using ZWNJ is probably a better solution, Why would not SOFT HYPHEN

RE: Character identities

2002-10-31 Thread Kent Karlsson
de purely for Norwegian, why not display ö as ø, as is the convention? This is *exactly* the same situation as with ä vs. a^e. I say, let the *"author"* decide in all these cases, and let that decision stand, *regardless of font changes*. [There is an implicit qualification there, bu

RE: RE: Character identities

2002-10-30 Thread Kent Karlsson
> Sütterlin does use a macron over "m" and "n" to indicate that > the letter should be doubled So should a Sütterlin font then by default replace mm with an m-macron glyph? Or should the "author" decide which orthography to use? /Kent K

RE: Character identities

2002-10-30 Thread Kent Karlsson
> >Marco: It is o.k. (in a German-specific context) to display > >an umlaut as a macron (or a tilde, or a little e above), > >since that is what Germans do. > > > >Kent: It is *not* o.k. -- that constitutes changing a character. > > Kent can't be right here. > > 1. We have all see

RE: Character identities

2002-10-30 Thread Kent Karlsson
> I insist that you can talk about character-to-character > mappings only when > the so-called "backing store" is affected in some way. No, why? It is perfectly permissible to do the equivalent of "print(to_upper(mystring))" without changing the backing store ("mystring" in the pseudocode); to_

RE: Character identities

2002-10-29 Thread Kent Karlsson
Marco, Standard orthography, and orthography that someone may choose to use on a sign, or in handwriting, are often not the same. And I did say that current font technologies (e.g. OT) do not actually do character to character mappings, but the net effect is *as if* they did (if, and I h

RE: Character identities

2002-10-29 Thread Kent Karlsson
> -Original Message- > From: Marco Cimarosti [mailto:marco.cimarosti@;essetre.it] > Sent: den 28 oktober 2002 16:23 > To: 'Kent Karlsson'; Marco Cimarosti > Cc: [EMAIL PROTECTED] > Subject: RE: Character identities > > > Kent Karlsson wrote

RE: Character identities

2002-10-28 Thread Kent Karlsson
... > > For this reason it is quite impermissible to render the > > combining letter small e as a diaeresis > > So far so good. There would be no reason for doing such a thing. ... > > or, for that matter, the diaeresis as a combining > > letter small e (however, you see the latter version > > some

RE: Character identities

2002-10-25 Thread Kent Karlsson
>... Like it or not, superscript e *is* the > same diacritic > that later became "¨", so there is absolutely no violation of > the Unicode > standard. Of course, this only applies to German. Font makers, please do not meddle with the author's intent (as reflected in the text of the document!). Just

RE: Character identities

2002-10-24 Thread Kent Karlsson
font(s) that display and similar cases properly. The latter would be very welcome! /Kent K > On Thu, Oct 24, 2002 at 11:46:04AM +0200, Kent Karlsson wrote: > > Please don't. "a^e" is . > > Which is great, if you're a scholar trying to accurately reprodu

RE: Character identities

2002-10-24 Thread Kent Karlsson
> Kent Karlsson wrote: > > And it is easy for Joe User to make a simple (visual...) > > substitution cipher by just switching to a font with the > > glyphs for letters (etc.) permuted. Sure! I think it > > would be a bad idea to call it a "Unicode font" t

RE: Character identities

2002-10-24 Thread Kent Karlsson
erman uses), but different level 2 weights (as is appropriate, since this difference is (usually) more significant than case distinctions). /Kent Karlsson

RE: Devanagari variations

2002-03-07 Thread Kent Karlsson
> implementations might > not recognise a sequence like < consonant, vowel, nukta > as > valid. For > instance, I understand that if Uniscribe encountered such a > sequence, it > would assume you've left out a consonant immediately before > the nukta, > and it would display a dotted circ

RE: How to make "oo" with combining breve/macron over pair?

2002-03-07 Thread Kent Karlsson
> However, it might make sense to make an implementation guideline > that would constrain any such mechanism to double diacritics and > suggest that people move to generic markup mechanisms if they > need more. Thus: > > X CGJ X CGJ combining-breve > > But not: > > X CGJ X CGJ X CGJ combining-

RE: How to make "oo" with combining breve/macron over pair?

2002-03-04 Thread Kent Karlsson
... > >The problem here is that the COMBINING GRAPHEME JOINER only affects > >*enclosing* combining marks (and combining marks *following* an > >enclosing one). > > I do not know what you mean by this. The CGJ makes what it is joining > into a single entity. So if you add a diacritic to it, it

RE: How to make "oo" with combining breve/macron over pair?

2002-03-04 Thread Kent Karlsson
> -Original Message- > From: Michael Everson ... > At 20:39 -0800 2002-03-03, Dan Wood wrote: > >Hi, > > > >I'm not finding hints of this in any of the FAQ or "where's my > >character" docs I'm trying to create (or find) the "oo" pair > >with a combining macron (0304) and combinin

RE: Beta version

2002-01-31 Thread Kent Karlsson
> > Of course, e.g., a, , and a should be > ordered the same > > at the primary level for the Nordic languages. > > "ä", "æ", "a ¨-above", and "a e-above" should all be sorted > the same in > Swedish, no matter whether they're written in capital or > small letters. Of > course (?), the "e-ab

RE: Beta version

2002-01-30 Thread Kent Karlsson
COMBINING RING ABOVE and COMBINING LATIN SMALL LETTER O look different (small "true" ring vs. an o-shape (rarely a "true" circle) a bit larger than the small ring). The latter is a historic precursor to the former. COMBINING DIAERESIS and COMBINING LATIN SMALL LETTER E really look different, tho

RE: Plane One use, was Re: HTML Validation

2001-12-18 Thread Kent Karlsson
There is no such thing as an "astral character" in Unicode or 10646. But someone did suggest that as a name for non-BMP characters before one settled on the term "supplementary character". /kent k > -Original Message- > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On > Beha

RE: Unicode 1.0 names for control characters

2001-12-04 Thread Kent Karlsson
None of the C0 and in particular C1 names are really from Unicode 1.0. They are from ISO/IEC 6429. Now from ISO/IEC 6429:1992 (Third edition), rather than the second edition. Technically the same standard is available as Fifth edition of ECMA-48, 1991; ftp://ftp.ecma.ch/ecma-st/Ecma-048.pdf. Th

RE: Are these characters encoded?

2001-12-03 Thread Kent Karlsson
Summary answer to the question in the subject line: yes. As I tried to express as succinctly as possible before is that: 1) & and o̲ (underlined o, sometimes used as an abbreviation for 'och', as is 'o.' (dictionaries) and 'o', and even 'å') is definitely not a glyph variant issue, they ar

RE: Are these characters encoded?

2001-12-02 Thread Kent Karlsson
> >> 1.) Swedish ampersand (see "&.bmp"). It's an "o" (for > "och", i.e. "and") > >> with a line below. In handwritten text it is almost > always used instead of > >> &, in machine-written text I don't think I've ever seen it. > > > >This might be a character in its own right, as different
