Re: Upside Down Fu character

2012-01-09 Thread Peter Cyrus
This discussion has veered close enough to my pet project (the Shwa
script) to comment.

I plan to implement it in the PUA, both to demonstrate its value and
viability and to find the problems and correct them before they get
frozen into Unicode (if ever).  I accept that means an eventual
recoding, which I hope can be applied without changing the last two
hex digits of the encoding, e.g. changing just the pages.  That seems
to me a reasonable and viable route to inclusion.
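
A sketch of that arithmetic in Python, with invented page values (neither
base below is a real assignment):

    # Hypothetical recoding that changes only the "page": the low two hex
    # digits of each code point survive the move out of the PUA.
    PUA_BASE = 0xE800      # invented provisional PUA page
    FINAL_BASE = 0x1E800   # invented final page

    def recode(cp: int) -> int:
        """Map a provisional code point to the final page, keeping the
        offset within the 256-code-point page (the last two hex digits)."""
        assert PUA_BASE <= cp < PUA_BASE + 0x100
        return FINAL_BASE + (cp & 0xFF)

    old = "\uE812\uE845"
    new = "".join(chr(recode(ord(c))) for c in old)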

What worries me is the number of technical decisions that will have
been made long before an application for inclusion into Unicode.  It
seems to me there is a risk that those decisions will then become
written in stone for compatibility reasons, and will never have been
exposed to the debate they would engender on this list, for instance.
There seems to be a gap in the process.

One solution would be to institute a sandbox in the PUA, inspired by
the ConScript Unicode Registry except with non-permanent entries.
There, new characters and scripts could enjoy widespread use and
delayed stability until they were ready for inclusion, while at the
same time profiting from the scrutiny of the Unicode community.

If this Unicode Prep existed, you could assign Upside-Down Fu to it
without much hesitation, and see who uses it in plaintext for a few
years.  If accepted, it would have to be recoded, but that might be
the least of all evils.


On Mon, Jan 9, 2012 at 9:23 PM, Asmus Freytag asm...@ix.netcom.com wrote:
 On 1/9/2012 2:52 AM, vanis...@boil.afraid.org wrote:

 From: Asmus Freytag asmusf_at_ix.netcom.com

 I have no opinion on the Upside-down FU ideograph as a candidate for
 encoding, but I think any analysis of its merits needs to be more
 nuanced than what your message seemed to imply.

 A./

 While I generally agree with your more nuanced view on this matter, Asmus,
 I'm afraid I have to disagree in this particular case. The upside-down Fu
 has been used decoratively for a thousand years (it's a Chinese pun), and
 if anyone wanted to use it in plain text, they would have by now. With a
 character of such antiquity, there really is no question of computer
 technology suppressing its use. Put simply, people have either used this
 character in plain text, or they haven't. If someone can dig up a couple of
 example texts, then there's no question. If nobody can find those example
 texts, I think that speaks volumes about the utility of the character and
 its suitability for encoding.

 -Van


 Van,

 I wrote "I have no opinion..."

 Reading your reply may nudge me closer to having an opinion :)

 And, for the record, I think what you wrote is rather nuanced.

 If there's a smoking gun, I'm sure that would settle the encodability
 question, and given the history of the character, you make a good
 argument that searching for plain text in pre-digital technologies
 is feasible, and appropriate.

 Still, I'm interested in the general issue - what to do about a
 (hypothetical) character or a hypothetical new use
 for an otherwise existing character that doesn't have the
 benefit of first having been around for thousands of years
 in an age of hand-lettering or hot-metal print.

 If Unicode wants to be the only game in town, and if non-digital
 text is disappearing as a medium, how does one address
 innovation without leaving broken (non-supportable) digital data
 as a prerequisite?

 Currency symbols have been given an exemption from this chicken
 and egg conundrum, because everyone realizes that using temporary
 encodings for them until their use is established is unreasonable.

 So, my question remains, are there any other avenues besides
 hot-metal printed text and compatibility encodings to demonstrate
 that a character (not this example, but in general) is a viable candidate?

 A./




Re: Sorting and Volapük

2012-01-01 Thread Peter Cyrus
German does both, so there may be a CLDR locale for the choice you need.
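
For the Volapük order itself, a tailored collator is straightforward; a
minimal PyICU sketch, assuming PyICU is installed (the rule string is my
own guess, since no CLDR locale exists):

    from icu import RuleBasedCollator

    # ä after a, ö after o, ü after u, each as a separate primary letter;
    # <<< keeps the lowercase/uppercase pairs together at the tertiary level.
    rules = "&a < ä <<< Ä &o < ö <<< Ö &u < ü <<< Ü"
    collator = RuleBasedCollator(rules)

    words = ["ad", "äd", "od", "öd", "ud", "üd", "zod"]
    print(sorted(words, key=collator.getSortKey))
    # expected: ['ad', 'äd', 'od', 'öd', 'ud', 'üd', 'zod']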

On Sun, Jan 1, 2012 at 4:27 PM, Michael Everson ever...@evertype.com wrote:
 Swedish and Finnish treat ä and ö as separate letters of the alphabet, but 
 sort them at the end after z.

 Volapük sorts a ä b c d e f g h i j k l m n o ö p r s t u ü v w x y z, with ä 
 a separate letter after a, ö separate after o, and ü separate after u.

 There is as yet no CLDR locale for Volapük... does anyone know if any other 
 language treats ä/ö/ü in the same way?

 Michael Everson * http://www.evertype.com/







Re: Sorting and German (was: Sorting and Volapük)

2012-01-01 Thread Peter Cyrus
Sounds like Michael could use the Austrian system.
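
Otto's two DIN 5007 schemes (quoted below) correspond to collations ICU
already ships for German; a quick check with PyICU, assuming it is
installed. The Austrian scheme he describes would instead need a custom
tailoring along the lines of the Volapük sketch in the other thread.

    from icu import Collator, Locale

    dictionary = Collator.createInstance(Locale("de"))  # Ä sorts as A
    phonebook = Collator.createInstance(
        Locale.createFromName("de@collation=phonebook"))  # Ä sorts as AE

    names = ["Mauer", "Möller", "Muller", "Müller", "Offenbach"]
    print(sorted(names, key=dictionary.getSortKey))
    print(sorted(names, key=phonebook.getSortKey))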

On Sun, Jan 1, 2012 at 6:46 PM, Otto Stolz otto.st...@uni-konstanz.de wrote:
 Happy New Year,

 on Sun, Jan 1, 2012 at 4:27 PM,
 Michael Everson ever...@evertype.com wrote:
  Volapük sorts [...] ä a separate letter after a, ö separate after o,
  and ü separate after u.
  does anyone know if any other language treats ä/ö/ü in the same way?


 Am 2012-01-01 16:54, schrieb Peter Cyrus:

 German does both,


 Not really.

 According to DIN 5007,
 German features two different sort orders:
 • In lists of personal names, Ä, Ö, Ü may be sorted
  as AE, OE, and UE, respectively; this order is
  mainly used in telephone directories.
 • In dictionaries and encyclopedias, Ä, Ö, Ü are sorted
  as A, O, and U, respectively.

 As encyclopedias may well comprise personal names,
 the scope of the former scheme is not well defined,
 imho, and I stick to the latter one, whenever I have
 to sort a list.

 In both schemes, ß is sorted as SS.

 In both schemes, true A (or AE, respectively) goes before Ä,
 iff two sort keys are otherwise identical; likewise for
 Ö, Ü, and ß.

 In Austria, a third scheme is used in telephone directories
 (but not in the yellow pages): Here, Ä, Ö, and Ü are
 indeed treated as distinct letters, to go between A and B,
 O and P, and U and V, respectively; and ß is treated as a
 distinct pair of letters, to go between SS and ST.

 Best wishes,
  Otto Stolz






empty codepoints at 00 and FF?

2011-12-12 Thread Peter Cyrus
I'm sure this is explained somewhere, but I can't find it.

Must blocks leave the first and last codepoints unassigned?
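
They don't have to be, as far as I can tell; a quick probe of a few block
edges with Python's unicodedata (results depend on the Unicode version
your Python build ships):

    import unicodedata

    # Cyrillic (U+0400..U+04FF) is assigned at both edges; other blocks,
    # like Oriya (U+0B00..), leave the first code point unassigned.
    for cp in (0x0400, 0x04FF, 0x0B00, 0x26FF):
        try:
            print(f"U+{cp:04X}", unicodedata.name(chr(cp)))
        except ValueError:
            print(f"U+{cp:04X}", "unassigned")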


Re: missing characters: combining marks above runs of more than 2 base letters

2011-11-18 Thread Peter Cyrus
Ken, you mention defined markup constructions, but nothing would prevent
specialized rendering software from, for example, connecting a left half
mark with the corresponding right half mark via titlo, even though the text
is still only plain text with no markup, right?  The titlo would simply not
display as such in the absence of the right software.

On Fri, Nov 18, 2011 at 8:03 PM, Ken Whistler k...@sybase.com wrote:

 On 11/17/2011 11:28 PM, Philippe Verdy wrote:

 Could the Unicode text specify that a left half mark, when it is
 followed by a right half-mark on the same line, has to be joined ? And
 which character can we select in a font to mark the intermediate
 characters between them ?


 No.

 This kind of stuff is not plain text. Mathematicians and musical scorers
 long ago got over the notion that marking of scoped constructs (with beams
 and ties in music, and similar kinds of scoping for expressions in math)
 could be plain text.

 People who score text, whether metricians, prosodic analysts, or
 phoneticians, need to learn the same lesson. Unicode is not a repository
 for text scoring hacks, with the expectation that all of the rendering
 implementations will quietly incorporate this kind of complexity into
 their already complex requirements for plain text rendering of writing
 systems.

 People who need to score text will have to make use of specialized
 rendering software and defined markup constructions, just like the
 mathematicians and musicians do.

 --Ken





Re: missing characters: combining marks above runs of more than 2 base letters

2011-11-18 Thread Peter Cyrus
I mention it because I think plain text could at least indicate what
_should_ display (instead of what *does* display), and a rich environment
could make the same text look great.

For a long time to come, I think we'll all need to write text that displays
adequately as plain text even in the absence of OpenType advanced
typography features.

On Fri, Nov 18, 2011 at 8:46 PM, Ken Whistler k...@sybase.com wrote:

 On 11/18/2011 11:21 AM, Peter Cyrus wrote:

 Ken, you mention defined markup constructions, but nothing would
 prevent specialized rendering software from, for example, connecting a left
 half mark with the corresponding right half mark via titlo, even though the
 text is still only plain text with no markup, right?  The titlo would
 simply not display as such in the absence of the right software.


 Correct. Specialized rendering software can pretty much do whatever its
 programmers want it to do.

 But there would be no reason to limit that to what it could do with the
 hacky left- and right-half marks, either.

 Specialized rendering software could detect a sequence <letter, combining
 titlo, letter, letter>, decide the three letters constituted a Cyrillic
 number, and draw the titlo over all three letters as well. Or a specialized
 Cyrillic font could contain ligatures which would do the same, without
 requiring specialized code in a rendering engine.
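
 A minimal sketch of that detection, purely illustrative (U+0483 is
 COMBINING CYRILLIC TITLO; the three-letter scope rule is an assumption,
 not anything the standard specifies):

    import re

    TITLO = "\u0483"  # COMBINING CYRILLIC TITLO
    # <letter, combining titlo, letter, letter>: the titlo is encoded once,
    # but would be drawn across all three letters.
    scope = re.compile(r"\w" + TITLO + r"\w{2}")

    def titlo_scopes(text):
        """Spans a specialized renderer might draw one long titlo over."""
        return [m.span() for m in scope.finditer(text)]

    # Cyrillic 441: У (400) with titlo, М (40), А (1)
    print(titlo_scopes("\u0443\u0483\u043C\u0430"))  # [(0, 4)]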

 The problem would be that there will be people who would expect such
 specialized rendering to be specified *in* the standard and be supported
 by *non*-specialized rendering engines and fonts, because their
 multi-letter titlos don't display correctly when posted on websites and
 viewed by people who don't have specialized rendering software or
 specialized fonts.

 That's when the answer has to be no. At that point, the responsibility
 really falls on
 the folks who need to score text to define the higher-level protocols to
 do so, and then
 convince the people who want to support that kind of text convention to do
 the
 implementation(s) required to make it happen.

 --Ken




more flexible pipeline for new scripts and characters

2011-11-16 Thread Peter Cyrus
I've only been on this list for some months, and I only came to it with my
own little project in mind, but it occurs to me, as I follow all these
threads, that Unicode might benefit from a more flexible process of
adaptation, of Unicodification.  The model would be an asymptotic approach
to standardization that tolerated an amount of change in inverse proportion
to time elapsed.

In other words, people could propose a new script or character and, rather
than have it discussed before encoding and then encoded in permanence, with
no possibility even to correct obvious errors as in U+FE18, it would be
provisionally accepted but still subject to modification as implementors
worked with it.  Hopefully, most mistakes would be unearthed early and
corrections applied before much text had been encoded.  As time passed and
the encoding became more stable, the size of mistake open to correction
would shrink, e.g. to spelling errors, until the encoding was frozen and
declared permanent.

My thought is that some of the problems that I've seen discussed might have
been discovered and addressed had a community been using the proposed
standard before it became immutable.  In the current process, that
transition may occur too early to be useful.  It may be easier to fix all
the existing text if very little time has passed than to fix all future
text forever.

This idea could also be extended to new characters and scripts that might
or might not make it into Unicode: Unicode could offer a provisional
acceptance that allowed users to demonstrate the utility of the proposed
changes once they're in Unicode, even if they're later modified or
withdrawn.  This policy might have prevented the recoding of Tengwar,
Cirth, Shavian, Phaistos Disc and Deseret as they moved from the PUA to the
SMP.

It seems to me that the current policy is intended to offer implementors a
guarantee of our best effort to save them the trouble of chasing down
problems, but in fact they might prefer to have some problems fixed early
(while the developers still remember the application) than to have to take
unfixed problems into account forever.

This idea is WAY beyond my expertise, but I thought I'd mention it for you
all to consider.


Re: more flexible pipeline for new scripts and characters

2011-11-16 Thread Peter Cyrus
I guess what I'm proposing is that the proposed allocations be implemented,
so that problems may be unearthed, even as the users accept that the
standard is still only provisional.

On Wed, Nov 16, 2011 at 3:25 PM, Asmus Freytag asm...@ix.netcom.com wrote:

 Peter,

 in principle, the idea of a provisional status is a useful concept
 whenever one wants to publish something based on potentially doubtful or
 possibly incomplete information. And you are correct that, in principle,
 such an approach could be most useful whenever there's no possibility of
 correcting some decision taken in standardization.

 Unicode knows the concept of a provisional property, which works roughly
 in the manner you suggested. However, for certain types of information to
 be standardized, in particular the code allocation and character names, it
 would be rather problematic to have extended provisional status. The reason
 is that once something is exposed in an implementation, it enables users to
 create documents. These documents would all have to be provisional,
 because they would become obsolete once a final (corrected or improved)
 code allocation were made.

 The whole reason that some aspects of character encoding are write once
 (can never be changed) is to prevent such obsolete data in documents.

 Therefore, the only practical way is that of having a bright line between
 proposed allocations (that are not implemented and are under discussion)
 and final, published allocations that anyone may use. Instead of a
 provisional status, the answer would seem to lie in making the details of
 proposed allocations more accessible for review during the period where
 they are under consideration and balloting in the standardization committee.

 One possible way to do that would be to make repertoire additions subject
 to the Public Review process.

 Another would be for more interested people to become members and to
 follow submissions as soon as they hit the Unicode document registry.

 The former is much more labor-intensive and I suspect not something the
 Consortium could easily manage with the existing funding and resources. The
 latter would have the incidental benefit of adding to the funding for the
 work of the Consortium via membership fees.

 A./



Re: definition of plain text

2011-10-17 Thread Peter Cyrus
Perhaps the idea of something embedded in the text that then controls
the display of the subsequent run of text is the very definition of
markup, whether or not that markup is a special character or an
ASCII sequence like </span><span style="gait:xxx;"> or </span><span
style="font:xxx;">.

On Mon, Oct 17, 2011 at 1:07 AM, Richard Wordingham
richard.wording...@ntlworld.com wrote:
 On Sun, 16 Oct 2011 21:37:20 +0200
 Peter Cyrus pcy...@alivox.net wrote:

 Perhaps, awkwardly.  But that is ultimately equivalent to marking the
 gait on every letter, in which case I probably wouldn't need to
 distinguish between initial and non-initial letters.

 If you allow C(R)V(C) as a 'fixed' syllable structure, the
 location of the syllable boundary in words like /tatrat/ would be
 significant, as in Thai.  /tata/ would also be awkward if you had null
 initials, as, again, some claim for Thai.  (There are languages that
 need to be analysed as having a phonemic contrast between null initials
 and initial glottal stops, even if German isn't one of them.)

 You might be able to handle syllable breaks just by having an
 optional syllable break character, analogous to CGJ and ZWSP.

 Marking gait on every letter may not be necessary, but gait-selecting
 characters present issues.  They're analogous to the deprecated numeric
 shape selectors U+206E and U+206F, whose use is strongly discouraged.
 These characters need explicit support in rendering engines, which is
 an argument against gait-selecting characters.  You might be able to
 propagate gait by contextual substitution, *if* you could propagate it
 through automatic line breaks.

 Richard.






Re: definition of plain text

2011-10-17 Thread Peter Cyrus
It's been done already: the International Phonetic Alphabet.  If we
all just wrote in that, it would make Unicode much easier to
implement, too.

I'm just working on Plan B, just in case.

On Mon, Oct 17, 2011 at 8:48 PM, Ken Whistler k...@sybase.com wrote:
 On 10/17/2011 1:23 AM, Peter Cyrus wrote:

 Perhaps the idea of something embedded in the text that then controls
 the display of the subsequent run of text is the very definition of
 markup, whether or not that markup is a special character or an
 ASCII sequence like </span><span style="gait:xxx;"> or </span><span
 style="font:xxx;">.

 Yep.

 And FWIW, rather than invent a new conscript for what you are attempting to
 do, my recommendation would be to simply use a strictly defined phonetic
 subset of the already-encoded Unicode characters, and then use XML markup
 to define gait. Your marked-up text would be more voluminous, but
 the chances of it displaying decently and being processable with widely
 available tools would be much better.
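
 A minimal sketch of that suggestion, with a hypothetical gait attribute
 (the <run> element and its values are invented, not an existing schema):

    import xml.etree.ElementTree as ET

    # IPA text in already-encoded characters; gait lives in markup.
    doc = ET.fromstring(
        '<text>'
        '<run gait="alphabet">həˈloʊ</run>'
        '<run gait="block">ni˧˥ xaʊ˨˩</run>'
        '</text>'
    )
    for run in doc.iter("run"):
        print(run.get("gait"), run.text)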

 --Ken







Re: definition of plain text

2011-10-17 Thread Peter Cyrus
Your idea of propagation seems worth exploring - thanks!

On Mon, Oct 17, 2011 at 1:07 AM, Richard Wordingham
richard.wording...@ntlworld.com wrote:
 On Sun, 16 Oct 2011 21:37:20 +0200
 Peter Cyrus pcy...@alivox.net wrote:

 Perhaps, awkwardly.  But that is ultimately equivalent to marking the
 gait on every letter, in which case I probably wouldn't need to
 distinguish between initial and non-initial letters.

 If you allow C(R)V(C) as a 'fixed' syllable structure, the
 location of the syllable boundary in words like /tatrat/ would be
 significant, as in Thai.  /tata/ would also be awkward if you had null
 initials, as, again, some claim for Thai.  (There are languages that
 need to be analysed as having a phonemic contrast between null initials
 and initial glottal stops, even if German isn't one of them.)

 You might be able to handle syllable breaks just by having an
 optional syllable break character, analogous to CGJ and ZWSP.

 Marking gait on every letter may not be necessary, but gait-selecting
 characters present issues.  They're analogous to the deprecated numeric
 shape selectors U+206E and U+206F, whose use is strongly discouraged.
 These characters need explicit support in rendering engines, which is
 an argument against gait-selecting characters.  You might be able to
 propagate gait by contextual substitution, *if* you could propagate it
 through automatic line breaks.

 Richard.






definition of plain text

2011-10-14 Thread Peter Cyrus
Is there a definition or guideline for the distinction between plain
text and rich text?

For example, in the expression 3², the exponent is a single character,
superscript two.  Semantically, this expression is equivalent to
3^2, using a visible character to indicate exponentiation and then
leaving the exponent in normal notation.  Both seem to me clear
examples of plain text.

But if the circumflex were replaced by an invisible character that
meant the following number should be superscripted, would that still
be plain text?  Or would it be formatting that should be relegated to
markup?

What about a character that inhibited the composition of following
Hangul jamo into a syllable?  That seems to me to be markup, but if it
could be replaced by a medial ZWNJ, I'm no longer sure.

Is the ZWNJ another tricky case?  One could say that it's an invisible
formatting character whose role is simply to control how other
characters are displayed, and that it should therefore be markup.  For that
matter, perhaps the normal space is a type of markup, especially when
it triggers the use of a final variant in the previous character.

Finally, aren't the LTR and RTL characters markup?  What if we wanted
characters that put a run of text into vertical directionality?
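
For comparison, the existing directional controls already carry exactly
this kind of in-band state, visible in their bidi classes via Python's
unicodedata:

    import unicodedata

    # RLE opens an embedding that PDF closes; LRM and RLM are one-shot marks.
    for ch in "\u202B\u202C\u200E\u200F":
        print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))
    # RLE, PDF, L, R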

One candidate guideline would be that plain text never include
anything that affects non-adjacent characters.  But isn't that just
the equivalent of requiring repetition of markup for each character?
For example, if you wanted to write 3²⁽ⁿ⁺¹⁾ with m instead of n, the
plain text would be 3^2^(^m^+^1^), using ^ as a superscripting prefix.
If that is acceptable as plain text, then perhaps the Unicode
superscripted characters should all decompose into a superscripting
prefix.
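
As it happens, Unicode already tags the encoded superscripts with a
<super> marker in their compatibility decompositions, though as a
character property rather than a prefix character:

    import unicodedata

    print(unicodedata.decomposition("²"))   # <super> 0032
    print(unicodedata.decomposition("ⁿ"))   # <super> 006E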

Maybe I just need more sleep...




Re: definition of plain text

2011-10-14 Thread Peter Cyrus
Ken, your explanation seems more permissive than I had anticipated.

Your example of 3<sup>2</sup> would seem to me at risk of behaving
in unforeseen ways if, for instance, it were split up.  Wouldn't it
match a string up>2?  Wouldn't it fail to match 3²?  I guess I
thought that plain text should be more canonical.

I was also being coy about my real question, but perhaps I shouldn't
have been.  I'm working on a conscript intended as a universal
phonetic script, and even though it is unlikely ever to merit
inclusion in Unicode, I'd like to design it to Unicode's standards.
(For the curious, the work done to date is online at
www.shwascript.org.)

One particularity of this script is that it is written in different
gaits, depending on the phonology of the language.  Languages with
open syllables, like most Niger-Congo or Austronesian languages, would
write it as a syllabary.  Languages with fixed syllables, like
Chinese, Korean or Vietnamese, would write it as blocks, like Hangul.
Languages with variable syllables, like most Indo-European languages,
would write it as an alphabet.  And Afro-Asiatic languages would write
the vowels as diacritics to highlight the triliteral roots.  But all
these gaits would use the same underlying letters, and the same
underlying Unicode PUA characters.

The obvious way to encode this is to add a set of invisible characters
that specify the gait of the following run of plain text.  Each would
also serve as the end-of-run character for the preceding run.  This
solution seems to me analogous to the use of LTR and RTL to mark runs
for directionality, but I don't know enough about the UBA to know
where the pitfalls are or whether a better solution is feasible.
These gait characters would be ignored in search, which is the desired
behavior.
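
A sketch of that run model, with invented PUA gait selectors (none of
these code points mean anything today):

    # Each selector opens a run and implicitly closes the previous one.
    GAITS = {
        "\uE000": "alphabet",
        "\uE001": "syllabary",
        "\uE002": "block",
        "\uE003": "abjad",
    }

    def split_runs(text, default="alphabet"):
        """Split text into (gait, run) pairs; selectors are consumed,
        so search over the remaining text ignores them."""
        runs, gait, buf = [], default, []
        for ch in text:
            if ch in GAITS:
                if buf:
                    runs.append((gait, "".join(buf)))
                    buf = []
                gait = GAITS[ch]
            else:
                buf.append(ch)
        if buf:
            runs.append((gait, "".join(buf)))
        return runs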

Alternatives might include markup or even different fonts, but the
gaits seem to me as much part of the text as the letters themselves.
Writers will have to explicitly change gaits when they want to embed a
Chinese name in an English text, for example.  It seems unwieldy to
capture that information at the keyboard and then package it
separately for encoding, transmission and rendering.  Nor does
considering it as a case distinction seem elegant.

May I ask for advice?

On Fri, Oct 14, 2011 at 9:17 PM, Ken Whistler k...@sybase.com wrote:
 On 10/14/2011 11:47 AM, Joó Ádám wrote:

 Peter asked what the Unicode Consortium considers plain text, i.e.
 what principles it applies when deciding whether to encode a certain
 element or aspect of writing as a character. In turn, you thoroughly
 explained that plain text is what the Unicode Consortium considered to
 be plain text and encoded as characters.

 Correct. And basically, that is what it comes down to.

 One cannot look at *rendered* text and somehow know, a priori, exactly
 how that text should be represented in characters. (In the case of most
 of what is still being considered for encoding, rendered text means
 historic, non-digitally printed materials, because there isn't any
 character encoding for it in the first place, and hence no compatibility
 encoding issues.)

 Sure, there are some general principles which apply:

 1. We don't represent font size differences by means of encoded characters.

 2. We don't represent text coloration differences by means of encoded
 characters.

 3. We don't represent pictures by means of encoded characters.

 and so on. Add your favorites.

 But character encoding as a process engaged in by character encoding
 committees (in this case, the UTC and SC2/WG2) is an art form which
 needs to balance: existing practice, if any; graphological analysis of
 writing systems; complexity of implementation for proposed solutions
 to encoding; architectural consistency across the entire standard;
 linguistic politics in user communities; and even national body politics
 involved in voting on amendments to the standard.

 It is impossible to codify that process in a set of a priori, axiomatic
 principles about what is and is not plain text, and then sit in committee
 and run down some check list and determine, logically, what exactly
 is and is not a character to be encoded. People can wish all they
 want that it were that way, but it ain't.

 So yeah, what the Unicode Consortium considers to be plain text is
 what can be represented by a sequence of Unicode characters, once
 those characters ended up standardized and published in the standard.

 You can't start at the other end, define exactly what plain text is, and
 then
 pick and choose amongst the already standardized characters based
 on that definition. Given the universal (including historic) scope of the
 Unicode Standard, that way lies madness.

 --Ken