Re: Could U+E0001 LANGUAGE TAG become undeprecated please? There is a good reason why I ask

2020-02-10 Thread Mark E. Shoulson via Unicode

On 2/10/20 6:14 PM, Sławomir Osipiuk via Unicode wrote:

As for "concatenation of such plain text sequences" where each sequence is in a 
different language, I must again ask: Is there a system that actually does this, that 
does not have a higher-level protocol that can carry metadata about the natural language 
of the text sequences?
Indeed, it seems to me that concatenating such sequences *is* in itself 
a higher-level protocol.  After all, it isn't  "plain text" anymore when 
you have to suppress printing out some of it.  And we already have other 
higher-level protocols that can do the job about as efficiently.  So at 
least this particular application would be a solution to a problem 
that's already been solved.


~mark



Re: A neat description of encoding characters

2019-12-02 Thread Mark E. Shoulson via Unicode

On 12/2/19 7:01 AM, Costello, Roger L. via Unicode wrote:

From the book titled "Computer Power and Human Reason" by Joseph Weizenbaum,
pp. 74-75


It's a reasonably good explanation of binary numbers and "encoding" in a 
more usual sense than we use it here in Unicode-land.  Actually makes 
for a basis to move on to discussing information theory.  But when 
Unicodites say "encoding", they mean stuff like UTF-8 vs UTF-16, which 
is kind of a different kettle of macaroons.


~mark



Re: Is the Unicode Standard "The foundation for all modern software and communications around the world"?

2019-11-19 Thread Mark E. Shoulson via Unicode
It says "foundation", not "sum total, all there is."  I don't think this 
is much overreach.  MAYBE it counts as "enthusiastic", but not misleading.


Why so concerned with these minutiæ? Were you in fact misled?  (Doesn't 
sound like it.)  Do you know someone who was, or whom you fear would 
be?  What incorrect conclusions might they draw from that 
misunderstanding, and how serious would they be?  Doesn't sound like 
this is really anything serious even if you were right.


~mark

On 11/19/19 1:59 PM, Costello, Roger L. via Unicode wrote:


Hi Folks,

Today I received an email from the Unicode organization. The email 
said this: (italics and yellow highlighting are mine)


/The Unicode Standard is the foundation for all modern software and 
communications around the world, including all modern operating 
systems, browsers, laptops, and smart phones—plus the Internet and Web 
(URLs, HTML, XML, CSS, JSON, etc.)./


That is a remarkable statement! But is it entirely true? Isn’t it 
assuming that everything is text? What about binary information such 
as JPEG, GIF, MPEG, WAV; those are pretty core items to the Web, 
right? The Unicode Standard is silent about them, right? Isn’t the 
above quote a bit misleading?


Roger





SHEQEL and L2/19-291

2019-07-24 Thread Mark E. Shoulson via Unicode

  
  
Just looking at document L2/19-291,
  https://www.unicode.org/L2/L2019/19291-missing-currency.pdf
  "Currency signs missing in Unicode" by Eduardo Marín Silva.  And
  I'm wondering why he feels it necessary for the Unicode standard
  to say that a more correct spelling for the Israeli currency would
  be "shekel" (and not "sheqel").  What criterion is being used that
  makes this "more correct"?  I think it's more popular and common,
  so maybe that's it.  But historically and linguistically, "sheqel"
  is more accurate.  The middle letter is ק, U+05E7 HEBREW LETTER
  QOF (which is not "more correctly" KOF), from the root ש־ק־ל
  Sh.Q.L meaning "weight".  It's true that Modern Hebrew does not
  distinguish K and Q phonetically in speech; maybe that is what is
  meant?  Still, the "historical" transliteration of QOF with Q is
  very widespread, and I believe occurs even on some coins/bills
  (could be wrong here; is this what is meant by "more correct"? 
  That "shekel" is what is used officially on the currency and I am
  misremembering?)


Just wondering about this, since it seems to be stressed in the
  document.


~mark

  



Re: Unicode "no-op" Character?

2019-07-03 Thread Mark E. Shoulson via Unicode
What you're asking for, then, is completely possible and achievable—but 
not in the Unicode Standard.  It's out of scope for Unicode, it sounds 
like.  You've said you realize it won't happen in Unicode, but it still 
can happen.  Go forth and implement it, then: make your higher-level 
protocol and show its usefulness and get the industry to use and honor 
it because of how handy it is, and best of luck with that.


~mark

On 7/3/19 2:22 PM, Ken Whistler via Unicode wrote:



On 7/3/2019 10:47 AM, Sławomir Osipiuk via Unicode wrote:


Is my idea impossible, useless, or contradictory? Not at all.


What you are proposing is in the realm of higher-level protocols.

You could develop such a protocol, and then write processes that 
honored it, or try to convince others to write processes to honor it. 
You could use PUA characters, or non-characters, or existing control 
codes -- the implications for use of any of those would be slightly 
different, in practice, but in any case would be an HLP.


But your idea is not a feasible part of the Unicode Standard. There 
are no "discardable" characters in Unicode -- *by definition*. The 
discussion of "ignorable" characters in the standard is nuanced and 
complicated, because there are some characters which are carefully 
designed to be transparent to some, well-specified processes, but not 
to others. But no characters in the standard are (or can be) ignorable 
by *all* processes, nor can a "discardable" character ever be defined 
as part of the standard.


The fact that there are a myriad of processes implemented (and 
distributed who knows where) that do 7-bit ASCII (or 8-bit 8859-1) 
conversion to/from UTF-16 by integral type conversion is a simple 
existence proof that U+000F is never, ever, ever, ever going to be 
defined to be "discardable" in the Unicode Standard.


--Ken
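
For concreteness, here is a minimal sketch (illustrative only, not from
Ken's message) of the kind of integral type conversion he means: each
Latin-1 byte is widened unchanged into a UTF-16 code unit.

    # Illustrative only: "conversion by integral type conversion" means
    # each Latin-1 byte is zero-extended into a 16-bit code unit, so a
    # control code like U+000F passes through untouched; nothing ever
    # discards it.
    def latin1_to_utf16_units(data: bytes) -> list[int]:
        return [b for b in data]  # 0x00..0xFF map to U+0000..U+00FF

    assert latin1_to_utf16_units(b"a\x0fb") == [0x0061, 0x000F, 0x0062]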






Re: Unicode "no-op" Character?

2019-07-03 Thread Mark E. Shoulson via Unicode
I think the idea being considered at the outset was not so complex as 
these (and indeed, the point of the character was to avoid making these 
kinds of decisions). There was a desire for some reason to be able to 
chop up a string into equal-length pieces or something, and some of 
those divisions might wind up between bases and diacritics or who knows 
where else.  Rather than have to work out acceptable places to place the 
characters, the request was for a no-op character that could safely be 
plopped *anywhere*, even in the middle of combinations like that.


~mark

On 6/23/19 4:24 AM, Richard Wordingham via Unicode wrote:

On Sat, 22 Jun 2019 23:56:50 +
Shawn Steele via Unicode  wrote:


+ the list.  For some reason the list's reply header is confusing.

From: Shawn Steele
Sent: Saturday, June 22, 2019 4:55 PM
To: Sławomir Osipiuk 
Subject: RE: Unicode "no-op" Character?

The original comment about putting it between the base character and
the combining diacritic seems peculiar.  I'm having a hard time
visualizing how that kind of markup could be interesting?

There are a number of possible interesting scenarios:

1) Chopping the string into user perceived characters.  For example,
the Khmer sequences of COENG plus letter are named sequences.  Akin to
this is identifying resting places for a simple cursor, e.g. allowing it
to be positioned between a base character and a spacing, unreordered
subscript.  (This last possibility overlaps with rendering.)

2) Chopping the string into collating elements.  (This can require
renormalisation, and may raise a rendering issue with HarfBuzz, where
renormalisation is required to get marks into a suitable order for
shaping.  I suspect no-op characters would disrupt this
renormalisation; CGJ may legitimately be used to affect rendering this
way, even though it is supposed to have no other effect* on rendering.)

3) Chopping the string into default grapheme clusters.  That
separates a coeng from the following character with which it
interacts.

*Is a Unicode-compliant *renderer* allowed to distinguish diaeresis
from the umlaut mark?

Richard.





Re: Unicode "no-op" Character?

2019-07-03 Thread Mark E. Shoulson via Unicode
Um... How could you be sure that process X would get the no-ops that 
process W wrote?  After all, it's *discardable*, like you said, and the 
database programs and libraries aren't in on the secret.  The database 
API functions might well strip it out, because it carries no meaning to 
them. Unless you can count on _certain_ programs not discarding it, and 
then you'd need either specialty libraries or some kind of registry or 
terminology for "this program does NOT strip no-ops" vs ones that do... 
But then they wouldn't be discardable, would they?  Not by 
non-discarding programs.  Which would have to have ways to pass them 
around between themselves.


Moreover, as you say, what about when Process Z (or its companions) 
comes along and is using THE SAME MECHANISM for something utterly 
different?  How does it know that process W wasn't writing no-ops for 
it, but was writing them for Process X?  And of course, Z will trash 
them and insert its own there, and when process X comes to read it, they 
won't be there. You'd need to make sure that NOBODY is allowed to touch 
the string between *pairs* of generators and consumers of no-ops, 
specifically designated for each other.


Yes, this is about consensual acts between responsible processes W and 
X, but that's exactly what the PUA is for: being assigned meaning 
between consenting processes. And they are not discardable by 
non-consenting processes, precisely because they mean something to 
someone.  If your no-ops carry meaning, they are going to need to be 
preserved and passed around and not thrown away.  If they carry no 
meaning, why are you dealing with them?  Yes, PUA characters are 
annoying and break up grapheme clusters and stuff.  But they're the only 
way to do what you're trying to do.


~mark

On 7/3/19 11:44 AM, Sławomir Osipiuk via Unicode wrote:


A process, let’s call it Process W, adds a bunch of U+000F to a string 
it received, or built, or a user entered via keyboard. Maybe it’s to 
packetize. Maybe to mark every word that is an anagram of the name of 
a famous 19th-century painter, or that represents a pizza topping. 
Maybe something else. This is a versatile character. Process W is done 
adding U+000F to the string. It stores it in a database UTF-8 encoded 
field. Encoding isn’t a problem. The database is happy.


Now Process X runs. Process X is meant to work with Process W and it’s 
well-aware of how U+000F is used. It reads the string from the 
database. It sees U+000F and interprets it. It chops the string into 
packets, or does a websearch for each famous painter, or it orders 
pizza. The private meaning of U+000F is known to both Process X and 
Process W. There is useful information encoded in-band, within a 
limited private context.


But now we have Process Y. Process Y doesn’t care about packets or 
painters or pizza. Process Y runs outside of the private context that 
X and W had. Process Y translates strings into Morse code for 
transmission. As part of that, it replaces common words with 
abbreviations. Process Y doesn’t interpret U+000F. Why would it? It 
has no semantic value to Process Y.


Process Y reads the string from the database. Internally, it clears 
all instances of U+000F from the string. They’re just taking up space. 
They’re meaningless to Y. It compiles the Morse code sequence into an 
audio file.


But now we have Process Z. Process Z wants to take a string and mark 
every instance of five contiguous Latin consonants. It scrapes the 
database looking for text strings. It finds the string Process W 
created and marked. Z has no obligation to W. It’s not part of that 
private context. Process Z clears all instances of U+000F it finds, 
then inserts its own wherever it finds five-consonant clusters. It 
stores its results in a UTF-16LE text file. It’s allowed to do that.


Nothing impossible happened here. Let’s summarize:

Processes W and X established a private meaning for U+000F by 
agreement and interacted based on that meaning.


Process Y ignored U+000F completely because it assigned no meaning to it.

Process Z assigned a completely new meaning to U+000F. That’s 
permitted because U+000F is special and is guaranteed to have no 
semantics without private agreement and doesn’t need to be preserved.


There is no need to escape anything. Escaping is used when a character 
must have more than one meaning (i.e. it is overloaded, as when it is 
both text and markup). U+000F only gets one meaning in any context. In 
a new context, the meaning gets overridden, not overloaded. That’s 
what makes it special.


I don’t expect to see any of this in official Unicode. But I take 
exception to the idea that I’m suggesting something impossible.
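
For what it's worth, the mechanics are easy to sketch; the word-marking
convention below is purely hypothetical, standing in for whatever
"private meaning" W and X agree on.

    # Toy rendering of the W/X/Y scenario; the convention is invented.
    NOOP = "\u000f"

    def process_w(text: str) -> str:
        # W marks each word start under its private agreement with X
        return " ".join(NOOP + word for word in text.split(" "))

    def process_x(marked: str) -> list[str]:
        # X, in on the agreement, recovers the marked words
        return [w.lstrip(NOOP) for w in marked.split(" ")
                if w.startswith(NOOP)]

    def process_y(marked: str) -> str:
        # Y assigns no meaning to U+000F and simply clears it
        return marked.replace(NOOP, "")

    s = process_w("order a pizza")
    assert process_x(s) == ["order", "a", "pizza"]
    assert process_y(s) == "order a pizza"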


*From:*Philippe Verdy [mailto:verd...@wanadoo.fr]
*Sent:* Wednesday, July 03, 2019 04:49
*To:* Sławomir Osipiuk
*Cc:* unicode Unicode Discussion
*Subject:* Re: Unicode "no-op" Character?

Your goal is **impossible** to reach with Unicode. Assume such 
character is "adde

Watermarking with Apostrophes

2019-06-17 Thread Mark E. Shoulson via Unicode

  
  
An interesting application of Unicode confusables...


https://www.tomsguide.com/us/google-stealing-song-lyrics-genius,news-30370.html


There's all kinds of wacky steganography you could do with all
  the lookalike characters (See also
  https://metacpan.org/pod/Acme::Bleach), but of course it would
  only fool visual inspection, and be very difficult (or impossible)
  to decode visually, while a computer would never be taken in.
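
The Genius watermark itself was this exact trick: straight versus
curly apostrophes, alternating in a pattern that spells a message in
Morse code. A toy sketch of the principle (one bit per apostrophe):

    # Toy confusable steganography: U+0027 encodes 0, U+2019 encodes 1.
    STRAIGHT, CURLY = "\u0027", "\u2019"

    def hide(text: str, bits: str) -> str:
        it = iter(bits)
        return "".join(
            CURLY if ch == STRAIGHT and next(it, "0") == "1" else ch
            for ch in text)

    def reveal(text: str) -> str:
        return "".join("1" if ch == CURLY else "0"
                       for ch in text if ch in (STRAIGHT, CURLY))

    assert reveal(hide("don't can't won't", "101")) == "101"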


~mark

  



MIRROR emoji

2019-05-06 Thread Mark E. Shoulson via Unicode

  
  
I peek in on the various proposals on the document register from
  time to time, and it is only with some effort that I restrain
  myself from sending this list some sort of checklist with my
crotchety-old-man opinions on most of them as if anyone cared.  But
  I do have a thought to raise regarding the proposed MIRROR emoji.


It's just that it seems to me that a wall-mirror and a
  hand-mirror have rather different connotations and are so unalike
  in kind that it would be weird not to disunify them, or at the
  very least to stipulate clearly which one of them is being
  accepted.  Some of the meanings/connotations of MIRROR in the
  proposal seem to me to be mostly true of HAND MIRRORs, not so much
  ones on the wall or freestanding.  These include vanity and
  primping.


Just a thought.  It looks like the accepted sample glyph is a
  freestanding or wall mirror; I just think that it should be
  stipulated in the notes.  Not sure how well I could defend that
  feeling.


~mark

  



Re: Symbols of colors used in Portugal for transport

2019-04-29 Thread Mark E. Shoulson via Unicode

On 4/29/19 3:34 PM, Doug Ewell via Unicode wrote:

Hans Åberg wrote:
  

The guy who made the artwork for Heroes is completely color-blind,
seeing only in a grayscale, so they agreed he coded the colors in
black and white, and then that was replaced with colors.
  
Did he use this particular scheme? That is something I would expect to
see on the scheme's web site, and would probably be good evidence for a
proposal.


And what about existing schemes, such as have already been in use even 
by the esteemed company present on this very list, and in several fonts, 
for the same purpose?  See 
https://en.wikipedia.org/wiki/Hatching_(heraldry)



I do see several awards related to the concept, but few examples where
this scheme is actually in use, especially in plain text.
  
I'm not opposed to this type of symbol, but I like to think the classic
rule about "established, not ephemeral" would still apply.


Indeed.

If there were encoded mere color patches (like, say, colored circles, 
possibly in the U+1F534 range or something; just musing here), would 
those already count as encoding these sorts of things, as 
black-and-white font designers would be likely to interpret them in some 
readable fashion, perhaps with hatching? Is it better to have the color 
be canonical and the hatched design a matter of design, or have a set of 
hatched circles with fixed hatching?


~mark



Re: Emoji Haggadah

2019-04-17 Thread Mark E. Shoulson via Unicode

On 4/16/19 11:52 PM, James Kass via Unicode wrote:


> http://historyview.blogspot.com/2011/10/yukaghir-girl-writes-love-letter.html


According to a comment, the Yukaghir love letter as semasiographic 
communication was debunked by John DeFrancis in 1989 who asserted that 
it was merely a prop in a Yukaghir parlor game.  Perhaps that 
debunking was in the very book cited by Martin J. Dürst earlier in 
this thread.


The blog page comment went on to say that Geoffrey Sampson, who wrote 
the book from which the blogger learned of the Yukaghir love letter, 
published a retraction in 1994.


Thank you.  I read about it in Sampson's book, but had not heard about 
the debunking or the retraction.


Almost too bad; it seems to work so well.  The closest thing I know to 
something like that, expressing ideas but not language-dependent, would 
be mathematical notation.


~mark



Re: Emoji Haggadah

2019-04-16 Thread Mark E. Shoulson via Unicode

On 4/16/19 4:00 AM, James Kass via Unicode wrote:


On 2019-04-16 7:09 AM, Martin J. Dürst via Unicode wrote:

All the examples you cite, where images stand for sounds, are typically
used in some of the oldest "ideographic" scripts. Egyptian definitely
has such concepts, and Han (CJK) does so, too, with most ideographs
consisting of a semantic and a phonetic component.


Using emoji as rebus puzzles seems harmless enough but it defeats the 
goals of those emoji proponents who want to see emoji evolve into a 
universal form of communication because phonetic recognition of 
symbols would be language specific.  Users of ancient ideographic 
systems typically shared a common language where rebus or phonetic 
usage made sense to the users.  (Of course, diverse CJK user 
communities were able to adapt over time.)


All of the reviews of this publication on the page originally linked 
seemed positive, so it appears that people are having fun with emoji.  
But I suspect that this work would be jibber-jabber to any non-English 
speaker unfamiliar with the original Haggadah. No matter how otherwise 
fluent they might be in emoji communication.


You are certainly correct that you need to be an English-speaker to read 
it.  Knowing the original (and Hebrew) helps, and maybe sometimes is 
necessary too (How can Rabbi Akiva be translated as 🐇👠??  Well, 
"rabbit" for "Rabbi" [English-speaking knowledge], and "Akiva" comes 
from the root AYIN-QOF-BET, meaning "heel" [Hebrew knowledge]).  There 
is a section in the back that purports to explain the workings of some 
of this, but I actually haven't read it, and have been avoiding it.  
Just working it out on my own.  The back of the book also has the actual 
text in both Hebrew and English, and sometimes I'll look there to see 
what the English was that they were translating to get whatever it was 
they got to.


I think the notion that emoji could evolve into a "universal form of 
communication" is unrealistic.  Emoji are in many ways *definitionally* 
culture-specific, far from culturally neutral (at best they can try to 
be kinda inclusive, but that only goes so far.)  Crafting specific 
sentences to meet the demands of a language-speaking population needs 
more than the cute-looking symbols.  It also needs boring ones to 
express their relationships, or at least some cool way to join them 
together (see the famous "Yukaghir Love Letter"; one description here: 
historyview.blogspot.com/2011/10/yukaghir-girl-writes-love-letter.html). 
At any rate, emoji are not designed or selected with completeness for 
communication in mind.  For them to fill that role, there would have to 
be some work done on figuring out what's missing, etc.  (see also a 
whole slew of conlang projects from the zany to the scholarly (but 
mostly zany) attempting to distill all meaning down to a ridiculously 
small set of symbols for expressing anything.  What's coming to mind to 
me right now is aUI, which if I recall correctly had all of 
communication boiled down to 36 symbols—of which 10 were numerals).


It's still kinda fun to work out what the book is trying to say, though...


~mark



Re: Emoji Haggadah

2019-04-15 Thread Mark E. Shoulson via Unicode
Yes.  But the sentences aren't just symbolic representations of the 
concepts or something.  They are frequently direct 
transcriptions—usually by puns—for *English* sentences, so left-to-right 
makes sense.  So for example, the phrase "🕉️⌛️🕉️" translates "The LORD 
our God".  For whatever reason, the author decided to go with 🕉️ for 
"God" and such, and the hourglass in the middle is for "our", which 
sounds like "hour".  See?  Ugh.  I think he uses 🇺🇸 for "us" (U.S. = 
us). In the story of the five Rabbis discussing the laws in Bnei Brak, 
for one thing the word "Rabbi" is transcribed 🐇 ("rabbit" instead of 
"rabbi"), and it says they were in "👦👦🌩" (boy - boy - 
cloud-with-lightning).  The two boys for "sons" (which translates the 
word "Bnei" in the name of the city), and the lightning, "barak" in 
Hebrew, is for "brak", the second part of the name. The front cover, 
which you can see on the amazon page... That 🐚 (shell) in the title?  
Because it's saying "Haggadah shel Pesach", the Hebrew word "shel" 
meaning "of."  The author's name?  🍸🎀♥♢♣♠ (or whatever the exact 
ordering is): "Martin Bodek", that is martini-glass, bow, and the four 
suits of a DECK of cards.  Sorry; see what I mean about getting carried 
away by being able to read the silly thing?  Anyway.  The sentences are 
definitely ENGLISH sentences, not Hebrew or any sort of language-neutral 
semasiography or whatever, so LTR ordering makes sense (to the extent 
any of this makes sense.)


~mark

On 4/15/19 10:56 PM, Beth Myre via Unicode wrote:

This is amazing.

It's also really interesting that he decided to make the sentences 
read left-to-right.


On Mon, Apr 15, 2019 at 10:05 PM Tex via Unicode <unicode@unicode.org> wrote:


Oy veh!

*From:* Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of* Mark
E. Shoulson via Unicode
*Sent:* Monday, April 15, 2019 5:27 PM
*To:* unicode@unicode.org
*Subject:* Emoji Haggadah

The only thing more disturbing than the existence of The Emoji
Haggadah
(https://www.amazon.com/Emoji-Haggadah-Martin-Bodek/dp/1602803463/)
is the fact that I'm starting to find that I can read it...

~mark





Emoji Haggadah

2019-04-15 Thread Mark E. Shoulson via Unicode

  
  
The only thing more disturbing than the existence of The Emoji
  Haggadah
  (https://www.amazon.com/Emoji-Haggadah-Martin-Bodek/dp/1602803463/)
  is the fact that I'm starting to find that I can read it...


~mark

  



Re: Encoding colour (from Re: Encoding italic)

2019-02-13 Thread Mark E. Shoulson via Unicode

On 2/12/19 12:05 PM, Kent Karlsson via Unicode wrote:

On 2019-02-12 03:20, "Mark E. Shoulson via Unicode" wrote:


On 2/11/19 5:46 PM, Kent Karlsson via Unicode wrote:

Continuing to look deep into the crystal ball, doing some more
hand swirls...

...

...

The scheme quoted (far) below (from wjgo_10009), or anything like it,
will NEVER be part of Unicode!

Not in Unicode, but I have to say I'm intrigued by the idea of writing
HTML with tag characters (not even necessarily "restricted" HTML: the
whole deal).  This does NOT make it possible to write "italics in plain
text," since you aren't writing plain text.  But what you can do is
write rich text (HTML) that Just So Happens to look like plain text when
rendered with a plain-text-renderer (and maybe there could be
plain-text-renderers that straddle the line, maybe supporting some
limited subset of HTML and doing boldface and italics or something).

And so would ESC/command sequences as such, if properly skipped for display.
If some are interpreted, those would affect the display of other characters.
Just like "HTML in tag characters" would. A show invisibles mode would
display both ESC/command sequences as well as "HTML in tag characters"
characters.
Very true.  Maybe the explicitness of HTML appealed to me; escape 
sequences feel more like... you know, computer "codes" and all. (which 
of course is what all this is anyway!  So what's wrong with that?)

BUT, this would NOT be a Unicode feature/catastrophe at all.  This would
be purely the decision of the committee in charge of HTML/XML and
related standards, to decide to accept Unicode tag characters as if they
were ASCII for the purposes of writing XML tags/attributes &c.  It's

I have no say on HTML/CSS, but I would venture to predict that those
who do have a say, would not be keen on that idea. And XML tags in
general need not be in ASCII. And... identifiers in CSS need not
be in pure ASCII either... And attribute values, like filenames
including those that refer to CSS files (CSS is preferably stored
separately from the HTML/XML), certainly need not be pure ASCII.

So, no, I'd say that that idea is completely dead.


You're probably right, and CSS is practically a different animal, and I 
guess at best one would have to settle for a stripped-down version of 
HTML (in which case, why bother?)  And again, all this is before we even 
consider other issues; I can't shake the feeling that there are security 
nightmares lurking inside this idea.


~mark


Re: Encoding colour (from Re: Encoding italic)

2019-02-11 Thread Mark E. Shoulson via Unicode

On 2/11/19 5:46 PM, Kent Karlsson via Unicode wrote:

Continuing to look deep into the crystal ball, doing some more
hand swirls...

...

...

The scheme quoted (far) below (from wjgo_10009), or anything like it,
will NEVER be part of Unicode!


Not in Unicode, but I have to say I'm intrigued by the idea of writing 
HTML with tag characters (not even necessarily "restricted" HTML: the 
whole deal).  This does NOT make it possible to write "italics in plain 
text," since you aren't writing plain text.  But what you can do is 
write rich text (HTML) that Just So Happens to look like plain text when 
rendered with a plain-text-renderer  (and maybe there could be 
plain-text-renderers that straddle the line, maybe supporting some 
limited subset of HTML and doing boldface and italics or something.)  
BUT, this would NOT be a Unicode feature/catastrophe at all.  This would 
be purely the decision of the committee in charge of HTML/XML and 
related standards, to decide to accept Unicode tag characters as if they 
were ASCII for the purposes of writing XML tags/attributes &c.  It's 
totally nothing to do with Unicode, unless the XML folks want Unicode to 
change some properties on the tag chars or something.  I think it's a... 
fascinating idea, and probably has *disastrous* consequences lurking 
that I haven't tried to think of yet, but it's not a Unicode idea.


~mark



Re: Proposal for BiDi in terminal emulators

2019-01-30 Thread Mark E. Shoulson via Unicode

On 1/30/19 8:58 AM, Egmont Koblinger via Unicode wrote:

There's another side to the entire BiDi story, though. Simple
utilities like "echo", "cat", "ls", "grep" and so on, line editing
experience of your shell, these kinds. It's absolutely not feasible to
add BiDi support to these utilities. Here the only viable approach is
to have the terminal emulator do it.


How will "ls -l" possibly work?  This is an example of the "table" 
layout you were already discussing.


I think us command-line troglodytes just have to deal with not having a 
whole lot of BiDi support.  There's simply no way any terminal emulator 
could possibly know what makes sense and what doesn't for a given line 
of text, coming from some random program.  Your "grep" could be grepping 
from a file with ANY layout, not necessarily one conducive to terminal 
layout, and so on.


~mark



Re: Ancient Greek apostrophe marking elision

2019-01-28 Thread Mark E. Shoulson via Unicode

On 1/28/19 3:58 PM, Richard Wordingham via Unicode wrote:

Interestingly, bringing this word breaker into line with TUS in the UK
may well be in breach of the Equality Act 2010.

Richard.


OK, I've got to ask: how would that be?  How would this impinge on 
anyone's equality on the basis of "age, disability, gender reassignment, 
marriage and civil partnership, pregnancy and maternity, race, religion 
or belief, sex, and sexual orientation"? (quote from WP)



~mark



Re: Ancient Greek apostrophe marking elision

2019-01-28 Thread Mark E. Shoulson via Unicode

On 1/28/19 2:31 AM, Mark Davis ☕️ via Unicode wrote:


But the question is how important those are in daily life. I'm not 
sure why the double-click selection behavior is so much more of a 
problem for Ancient Greek users than it is for the somewhat larger 
community of English users. Word selection is not normally as 
important an operation as line break, which does work as expected.


This is a good point.  Bottom line is that word-selection, at least, is 
not going to be _exactly_ right.  Oh, and for another example, note that 
Esperanto also regularly (in poetry, anyway) uses a word-final 
apostrophe (of some kind) to indicate elision of the final -o of a 
nominative singular noun, or the -a of the article "la".  What shall we 
say to Esperantists who can't correctly select the third word in «al la mond’ 
eterne militanta / Ĝi promesas sanktan harmonion»?  I guess "Suck it up 
and deal with it."  And that may indeed be the answer.


~mark


Re: Ancient Greek apostrophe marking elision

2019-01-28 Thread Mark E. Shoulson via Unicode

On 1/27/19 4:30 PM, Philippe Verdy via Unicode wrote:
For Volapük, it looks much more like U+02BE (right half ring modifier 
letter)

than like U+02BC (apostrophe "modifier" letter).
according to the PDF on 
https://archive.org/details/cu31924027111453/page/n12



No, I don't think it's 02BE (especially since it goes in the other 
direction.  You mean 02BF.  But I don't think it's that either).  Note 
the thickness at the top.  That isn't a half-ring.  It's pretty clearly 
an 02BD on that page, whereas on the page before, it's just as clearly 
an 02BB.  Or I guess another lesson to be learned is they weren't 
terribly picky.  Which I guess is good, because I don't want to have to 
fret about "gee, we need a boldface 02BB for capitalized Volapük..."  
There's a reason they dropped that letter.


~mark


Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread Mark E. Shoulson via Unicode

On 1/27/19 11:08 AM, Michael Everson via Unicode wrote:

It is a letter. In “can’t” the apostrophe isn’t a letter. It’s a mark of 
elision.  I can double-click on the three words in this paragraph which have 
the apostrophe in them, and they are all whole-word selected.


That doesn't work when I try it: I double-click on the "a" in "can’t" 
and get only the "can" selected.


This does not necessarily prove anything; my software (Thunderbird) is 
arguably doing it wrong.


~mark


Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread Mark E. Shoulson via Unicode
Well, sure; some languages work better with some fonts.  There's nothing 
wrong with saying that 02BC might look the same as 2019... but it's 
nice, when writing Hawaiian (or Klingon for that matter) to use a bigger 
glyph. That's why they pay typesetters the big bucks (you wish): to make 
things look good on the page.


I recall in early Volapük, ʼ was a letter (presumably 02BC), with value 
/h/.  And the "capital" ʼ was the same, except bolder: see 
https://archive.org/details/cu31924027111453/page/n11 (entry 4, on the 
left-hand page).


~mark

On 1/27/19 12:23 AM, Asmus Freytag via Unicode wrote:

On 1/26/2019 6:25 PM, Michael Everson via Unicode wrote:
the 02BC’s need to be bigger or the text can’t be read easily. In our 
work we found that a vertical height of 140% bigger than the quotation 
mark improved legibility hugely. Fine typography asks for some other 
alterations to the glyph, but those are cosmetic.

If the recommended glyph for 02BC were to be changed, it would in no case 
impact adversely on scientific linguistics texts. It would just make the mark a 
bit bigger. But for practical use in Polynesian languages where the character 
has to be found alongside the quotation marks, a glyph distinction must be made 
between this and punctuation.


It somehow seems to me that an evolution of the glyph shape of 02BC in 
a direction of increased distinction from U+2019 is something that 
Unicode has indeed made possible by a separate encoding. However, that 
evolution is a matter of ALL the language communities that use U+02BC 
as part of their orthography, and definitely NOT something where 
Unicode can be permitted to take a lead. Unicode does not *recommend* 
glyphs for letters.


However, as a publisher, you are of course free to experiment and to 
see whether your style becomes popular.


There is a concern though, that your choice may appeal only to some 
languages that use this code point and not become universally accepted.


A./






Re: Encoding italic

2019-01-23 Thread Mark E. Shoulson via Unicode
There is something deliciously simple, elegant... and kinda... 
rebellious? about doing this.  And it wouldn't even be in purview of 
Unicode.  "Yep, my HTML-renderer treats characters E0020..E007F just 
exactly the same as 0020..007F, 'cept that it won't render 'em."  And you 
can send HTML text that looks for all the world like plain text to any 
normal Unicode-conformant viewer.  Now, the security issues of being 
able to write "invisible" JavaScript, or rather, Yet Another way you 
need to look at and reveal possible code, are a headache for someone 
else.  Viewed like this, you might do better taking this suggestion to 
W3C and having them amend the HTML/XML specs so that E0020..E007F are 
non-rendering synonyms for 0020..007F.  It wouldn't be a Unicode thing 
anymore, just changing the definition of HTML.  (I'm not saying it would 
be a GOOD idea, mind you.)


~mark

On 1/22/19 10:43 PM, James Kass via Unicode wrote:


Nobody has really addressed Andrew West's suggestion about using the 
tag characters.


It seems conformant, unobtrusive, requiring no official sanction, and 
could be supported by third-partiers in the absence of corporate 
interest if deemed desirable.


One argument against it might be:  Whoa, that's just HTML.  Why not 
just use HTML?  SMH


One argument for it might be:  Whoa, that's just HTML!  Most everybody 
already knows about HTML, so a simple subset of HTML would be 
recognizable.


After revisiting the concept, it does seem elegant and workable. It 
would provide support for elements of writing in plain-text for anyone 
desiring it, enabling essential (or frivolous) preservation of 
editorial/authorial intentions in plain-text.


Am I missing something?  (Please be kind if replying.)

On 2019-01-20 10:35 AM, Andrew West wrote:


A possibility that I don't think has been mentioned so far would be to
use the existing tag characters (E0020..E007F). These are no longer
deprecated, and as they are used in emoji flag tag sequences, software
already needs to support them, and they should just be ignored by
software that does not support them. The advantages are that no new
characters need to be encoded, and they are flexible so that tag
sequences for start/end of italic, bold, fraktur, double-struck,
script, sans-serif styles could be defined. For example start and end
of italic styling could be defined as the tag sequences <i> and </i>
(E003C E0069 E003E and E003C E002F E0069 E003E).

Andrew
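
The mapping Andrew describes is mechanical: every ASCII character has a
tag-character twin at U+E0000 plus its code point. A quick sketch:

    # Map ASCII markup into the tag-character block (U+E0000 + code
    # point); <i> and </i> become the sequences Andrew lists.
    def to_tag_sequence(markup: str) -> str:
        return "".join(chr(0xE0000 + ord(c)) for c in markup)

    assert [hex(ord(c)) for c in to_tag_sequence("<i>")] == [
        "0xe003c", "0xe0069", "0xe003e"]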





Re: Encoding italic (was: A last missing link)

2019-01-23 Thread Mark E. Shoulson via Unicode

On 1/22/19 6:26 PM, Kent Karlsson via Unicode wrote:

Ok. One thing to note is that escape sequences (including control sequences,
for those who care to distinguish those) probably should be "default
ignorable" for display. Requiring, or even recommending, them to be default
ignorable for other processing (like sorting, searching, and other things)
may be a tall order. So, for display, (maximal) substrings that match:

\u001B[\u0020-\u002F]*[\u0030-\u007E]|
(\u001B\u005B|\u009B)[\u0030-\u003F]*[\u0020-\u002F]*[\u0040-\u007E]

should be default ignorable (i.e. invisible, but a "show invisibles" mode
would show them; not interpreted ones should be kept, even if interpreted
ones need not, just (re)generated on save). That is as far as Unicode
should go.


So it isn't just "these characters should be default ignorable", but 
"this regular expression is default ignorable."  This gets back to 
"things that span more than a character" again, only this time the 
"span" isn't the text being styled, it's the annotation to style it.  
The "bash" shell has special escape-sequences (\[ and \]) to use in 
defining its prompt that tell the system that the text enclosed by them 
is not rendered and should not be counted when it comes to doing 
cursor-control and line-editing stuff (so you put them around, yep, the 
escape sequences for coloring or boldfacing or whatever that you want in 
your prompt). That would seem to be at least simpler than a big ol' 
regexp, but really not that much of an improvement.  It also goes to 
show how things like this require all kinds of special handling, 
even/especially in a "simple" shell prompt (which could make a strong 
case for being "plain text", though, yes, terminal escape codes are a 
thing.)
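
To make Kent's pattern concrete, here is one way it comes out as a
Python regex, along with the "how wide is this really" computation that
bash's \[ and \] hints let readline skip (my sketch; I'm assuming the
usual ECMA-48 reading of his expression):

    import re

    # CSI branch first, so the leftmost-alternation rule prefers the
    # longer control-sequence match over a bare two-character ESC match.
    ESC_OR_CSI = re.compile(
        "(?:\u001B\\[|\u009B)"
        "[\u0030-\u003F]*[\u0020-\u002F]*[\u0040-\u007E]"
        "|\u001B[\u0020-\u002F]*[\u0030-\u007E]")

    def display_width(s: str) -> int:
        # Count only what actually prints, ignoring escape sequences
        return len(ESC_OR_CSI.sub("", s))

    assert display_width("\u001B[1;32muser@host\u001B[0m$ ") == \
        len("user@host$ ")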


~mark


Re: Encoding italic

2019-01-23 Thread Mark E. Shoulson via Unicode

On 1/19/19 3:34 PM, James Kass via Unicode wrote:


On 2019-01-19 6:19 PM, wjgo_10...@btinternet.com wrote:

> It seems to me that it would be useful to have some codes that are
> ordinary characters in some contexts yet are control codes in 
others, ...


Italics aren't a novel concept.  The approach for encoding new 
characters is that  conventions for them exist and that people *are* 
exchanging them, people have exchanged them in the past, or that 
people demonstrably *need* to exchange them.


Excluding emoji, any suggestion or proposal whose premise is "It seems 
to me that it would be useful if <there were characters supporting 
that>..." is doomed to be deemed out of scope for the standard.


This was the quote I had been looking for, sorry James and Asmus.  It 
isn't the first time it's been pointed out here.


~mark



Re: Encoding italic (was: A last missing link)

2019-01-23 Thread Mark E. Shoulson via Unicode

On 1/19/19 1:19 PM, wjgo_10...@btinternet.com via Unicode wrote:


Well, a variation sequence character is being used for requesting 
emoji display (is that a control code?), so it seems there is no lack 
of precedent to use one for italics. It seems that someone only has to 
say 'out of scope' and then that is the veto for any consideration of 
a new idea for ISO/IEC 10646 or The Unicode Standard. There seems to 
be no way for a request to the committee to consider a widening of the 
scope to even be put before the committee if such a request is from 
someone outside the inner circle.


You make it sound like there's been invented some magical incantation 
that *anyone* can use to quash all discussion on a particular (your) 
topic.  It doesn't just take someone saying "out of scope."  It also has 
to *be* out of scope!  If someone chants the incantation, but I can 
persuasively argue that no, it IS in scope, then the spell fails.  
Requesting the scope of Unicode be widened is not like other discussions 
being had here, so it makes sense that it should be treated differently, 
if treated at all. There were discussions and agreements made as to the 
scope of Unicode, long ago.  And just like you can't petition to change 
a character name, no matter how wrong it is, asking the Unicode 
consortium to redefine itself on your say-so is not going to be taken 
seriously either.  Out of scope means just that: it isn't something 
we're discussing.  Discussing how to change the scope so that 
whatever-it-is IS in scope is a very large undertaking, and would need a 
tremendous groundswell of support from all the major stakeholders in 
Unicode, so you should probably start there.  Get Microsoft and Google 
and various national bodies on your side, not just to say "um, ok, 
maybe," but to actively argue with you that the scope needs to be 
changed.  Or that there needs to be, as Asmus says, another, 
supplemental standard.  Raise popular support, write petitions, get 
signatures, all that fun stuff. "But so many of the people I would want 
to talk to about this are right here on this list!" you say?  Be that as 
it may, it doesn't mean the list has to grant you a platform.  Change 
the world on your own dime.




It seems to me that it would be useful to have some codes that 


See, once you start a proposal like that, you're already looking down 
the wrong end of the Unicode scope.  This is exactly what Asmus (I 
think) said in a quote I can't seem to find, repeating it for the n+1st 
time: Unicode isn't here to encode cool new ideas that would be cool and 
new.  It's here for writing what people already do.  You want a standard 
that does something else?  That's another thing.  It's as appropriate to 
demand that Unicode support these things as it would be to go to OSHA or 
the Bureau of Weights and Measures or the Académie Française and tell 
them you want some new letters...


~mark



Re: Encoding italic (was: A last missing link)

2019-01-20 Thread Mark E. Shoulson via Unicode

On 1/19/19 10:14 PM, James Kass via Unicode wrote:


(In the event that a persuasive proposal presentation prompts the 
possibility of italics encoding...)

Possible approaches include:

1 - Liberating the italics from the Members Only Math Club
...which has been an ongoing practice since they were encoded.  It 
already works, but the set is incomplete and the (mal)practice is 
frowned upon.  Many of the older "shortcomings" of the set can now be 
overcome with combining diacritics.  These italics decompose to ASCII.
Provides italics the same way that ASCII provides letters.  You can use 
them with any alphabet you want, as long as it's Latin.  (Or Greek, 
true).  Essentially requires doubling of huge chunks of the Unicode 
repertoire.

2 - Character level
Variation selectors work with today's tech.  Default ignorable 
property suggests that apps that don't want to deal with them won't.  
Many see VS as pseudo-encoding.  Stripping VS leaves ASCII behind.
This, or something like this, is IMO the only possibility that has any 
chance at all.


As "food for thought" questions, if a persuasive case is presented for 
encoding italics, and excluding 4, which approach would have the least 
impact on the rich-text world?  Which would have the least impact on 
existing plain-text technology?  Which would be least likely to 
conflict with Unicode principles/encoding model?


#2.
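
To spell out why #2 survives searching: a variation selector is
default-ignorable, so a search layer that strips it still matches the
underlying letters. (Hypothetical sketch; no VS is actually designated
for italics.)

    # Pretend VS1 after a letter requested italic display (it doesn't).
    ITAL = "\ufe00"

    def strip_variation_selectors(s: str) -> str:
        # Drop the VS1..VS16 block, U+FE00..U+FE0F
        return "".join(ch for ch in s
                       if not ("\ufe00" <= ch <= "\ufe0f"))

    styled = "".join(ch + ITAL for ch in "cat")
    assert strip_variation_selectors(styled) == "cat"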

~mark



Re: Encoding italic

2019-01-18 Thread Mark E. Shoulson via Unicode

On 1/17/19 1:27 AM, Martin J. Dürst via Unicode wrote:


This lead to the layering we have now: Case distinctions at the
character level, but style distinctions at the rich text level. Any good
technology has layers, and it makes a lot of sense to keep established
layers unless some serious problem is discovered. The fact that Twitter
(currently) doesn't allow styled text and that there is a small number
of people who (mis)use Math alphabets for writing italics,... on Twitter
doesn't look like a serious problem to me.
How small a number?  How big?  I don't know either.  To mention Second 
Life again, which is pretty strongly defensible as a plain-text 
environment (with some exceptions, as for hyperlinks), I note that the 
viewers for it (and the servers?) don't seem to support Unicode 
characters outside of the BMP.  Which leads to the flip-side of the "gappy" 
mathematical alphabets: you can say SOME things in italic or fraktur or 
double-struck... but only if they have the correct few letters that 
happen to be in the BMP already. Obviously, this can and should be 
blamed on incomplete Unicode support by the software vendors, but it 
still matters in the same way that "incomplete" markup support (i.e. 
none) matters to Twitter users: people make do with what they have, and 
will (mis)use even the few characters they can, though that leads to odd 
situations (see earlier list of display names.)


~mark



Re: Encoding italic (was: A last missing link)

2019-01-18 Thread Mark E. Shoulson via Unicode

On 1/16/19 7:16 AM, Andrew Cunningham via Unicode wrote:
HI Victor, an off list reply. The contents are just random thoughts 
sparked by an interesting conversation.


On Wed, 16 Jan 2019 at 22:44, Victor Gaultney via Unicode 
mailto:unicode@unicode.org>> wrote:



- It finally, and conclusively, would end the decades of the mess
in HTML that surrounds <i> and <em>.


I am not sure that would fix the issue, more likely compound the issue 
making it even more blurry what the semantic purpose is. HTML5 makes 
both <i> and <em> semantic ... and by the definition the style of the 
elements is not necessarily italic.  <em> for instance would be script 
dependant, <i> may be partially script dependant when another 
appropriate semantic tag is missing. A character/encoding level 
distinction is just going to compound the mess.


A good point, too.  While italics are being used sort of as an example, 
what the "evidence" really is for (and by evidence I mean what I alluded 
to at the end of my last post, over centuries of writing) is that people 
like to *emphasize* things from time to time.  It's really more the 
semantic side of "this text should be read louder."  So not so much 
"italic marker" but "emphasis marker."


But... that ignores some other points made here, about specific meanings 
attached to italics (or underlining, in some settings), like 
distinguishing book or movie titles (or vessel names) from common or 
proper nouns.  Is it better to lump those with emphasis as "italic", or 
better to distinguish them semantically, as "emphasis marker" vs "title 
marker"?  And if we did the latter, would ordinary folks know or care to 
make that distinction?  I tend to doubt it.



My main point in suggesting that Unicode needs these characters is
that italic has been used to indicate specific meaning - this text
is somehow special - for over 400 years, and that content should
be preserved in plain text.


Underlying, bold text, interletter spacing, colour change, font style 
change all are used to apply meaning in various ways. Not sure why 
italic is special in this sense. Additionally without encoding the 
meaning of italic, all you know is that it is italic, not what 
convention of semantic meaning lies behind it.


Um... yeah.  That's what I meant, also.



And I am curious on your thoughts, if we distinguish italic in 
Unicode, encode some way of specifying italic text, wouldn't it make 
more sense to do away with italic fonts all together? and just roll 
the italic glyphs into the regular font?


Eh.  Fonts are not really relevant to this.  Unicode already has more 
characters than you can put into a single font.  It's just as sensible, 
still, to have italic fonts and switch to them, just like you have to 
switch to your Thai font when you hit Thai text that your default font 
doesn't support.  (However, this knocks out the simplicity of using 
OpenType to handle it, as has been suggested.)


~mark


Re: Encoding italic (was: A last missing link)

2019-01-18 Thread Mark E. Shoulson via Unicode

On 1/16/19 6:23 AM, Victor Gaultney via Unicode wrote:


Encoding 'begin italic' and 'end italic' would introduce difficulties 
when partial strings are moved, etc. But that's no different than with 
current punctuation. If you select the second half of a string that 
includes an end quote character you end up with a mismatched pair, 
with the same problems of interpretation as selecting the second half 
of a string including an 'end italic' character. Apps have to deal 
with it, and do, as in code editors.


It kinda IS different.  If you paste in half a string, you get a 
mismatched or unmatched paren or quote or something.  A typo, but a 
transient one.  It looks bad where it is, but everything else is 
unaffected.  It's no worse than hitting an extra key by mistake. If you 
paste in a "begin italic" and miss the "end italic", though, then *all* 
your text from that point on is affected!  (Or maybe "all until a 
newline" or some other stopgap ending, but that's just damage-control, 
not damage-prevention.)  Suddenly, letters and symbols five 
words/lines/paragraphs/pages look different, the pagination is all 
altered (by far more than merely a single extra punctuation mark, since 
italic fonts generally are narrower than roman).  It's a disaster.


No.  This kind of statefulness really is beyond what Unicode is designed 
to cope with.  Bidi controls are (almost?) the sole exception, and even 
they cause their share of headaches.  Encoding separate _text_ 
italics/bold is IMO also a disastrous idea, but I'm not putting out 
reasons for that now.  The only really feasible suggestion I've heard is 
using a VS in some fashion. (Maybe let it affect whole words instead of 
individual characters?  Makes for fewer noisy VSs, but introduces a 
whole other host of limitations (how to italicize part of a word, how to 
italicize non-letters...) and is also just damage-control, though stronger.)


Apps (and font makers) can also choose how to deal with presenting 
strings of text that are marked as italic. They can choose to present 
visual symbols to indicate begin/end, such as /this/. Or they can 
present it using the italic variant of the font, if available.


At which point, you have invented markdown.  Instead of making Unicode 
declare it, just push for vendors everywhere to recognize /such 
notation/ as italics (OK, I know, you want dedicated characters for it 
which can't be confused for anything else.)



- Those who develop plain text apps (social media in particular) don't 
have to build in a whole markup/markdown layer into their apps


With the complexity of writing an social media app, a markup layer is 
really the least of the concerns when it comes to simplifying.


- Misuse of math chars for pseudo-italic would likely disappear

- The text runs between markers remain intact, so they need no special 
treatment in searching, selecting, etc.


- It finally, and conclusively, would end the decades of the mess in 
HTML that surrounds <i> and <em>.


Adding _another_ solution to something will *never* "conclusively end" 
anything.  On a good day, you can hope it will swamp the others, but 
they'll remain at least in legacy.  More likely, it will just add one 
more way to be confused and another side to the mess.  (People have 
pointed out here about the difficulties of distinguishing or 
not-distinguishing between HTML-level  and putative plain-text 
italics.  And yes, that is an issue, and one that already exists with 
styling that can change case and such.  As with anything, the question 
is not whether there are going to be problems, but how those problems 
weigh against potential benefits.  That's an open question.)


My main point in suggesting that Unicode needs these characters is 
that italic has been used to indicate specific meaning - this text is 
somehow special - for over 400 years, and that content should be 
preserved in plain text.


There is something to this: people have been *emphasizing* text in some 
fashion or another for ages.  There is room to call this plain text.


~mark



Re: wws dot org

2019-01-18 Thread Mark E. Shoulson via Unicode

On 1/17/19 1:50 PM, Frédéric Grosshans via Unicode wrote:


On a side note, the site considers Visible Speech a living script, 
which surprised me. This information is indeed in the 
Wikipedia infobox and implied by its “HMA status” on the Berkeley SEI 
page, but the text of the wikipedia page says “However, although 
heavily promoted [...] in 1880, after a period of a dozen years or so 
in which it was applied to the education of the deaf, Visible Speech 
was found to be more cumbersome [...] compared to other methods, and 
eventually faded from use.”


My (cursory) research failed to show a more recent date for the system 
than this “dozen years or so [past 1880]”. Is there any 
indication of the system being used later? (say, any date in the 20th 
century)


I just got email a few days ago from someone who wants to use it on an 
album cover...


But on the whole I think you are correct; I have not seen much use or 
even study of it (outside of my own and a very few others) in recent 
times.  And I *still* have to submit a proposal for it to be included in 
Unicode.


~mark



Re: A last missing link for interoperable representation

2019-01-14 Thread Mark E. Shoulson via Unicode

(sorry for multiple responses...)

On 1/13/19 10:00 PM, Martin J. Dürst via Unicode wrote:

On 2019/01/14 01:46, Julian Bradfield via Unicode wrote:

On 2019-01-12, Richard Wordingham via Unicode  wrote:

On Sat, 12 Jan 2019 10:57:26 +0000 (GMT)
And what happens when you capitalise a word for emphasis or to begin a
sentence?  Is it no longer the same word?

Indeed. As has been observed up-thread, the casing idea is a dumb one!
We are, however, stuck with it because of legacy encoding transported
into Unicode. We aren't stuck with encoding fonts into Unicode.

No, the casing idea isn't actually a dumb one. As Asmus has shown, one
of the best ways to understand what Unicode does with respect to text
variants is that style works on spans of characters (words,...), and is
rich text, but things that work on single characters are handled in
plain text. Upper-case is definitely for most part a single-character
phenomenon (the recent Georgian MTAVRULI additions being the exception).
Not just an exception, but an exception that proves the rule.  It's 
precisely because plain-text distinctions, generally speaking, should be 
at the letter level as Asmus says that there was so much shouting about 
MTAVRULI.  That these are exceptional demonstrates the existence of the 
rule.

But even most adults won't know the rules for what to italicize that
have been brought up in this thread. Even if they have read books that
use italic and bold in ways that have been brought up in this thread,
most readers won't be able to tell you what the rules are. That's left
to copy editors and similar specialist jobs.
I don't think there's really a case to be made that italics are or 
should work the same as capitals, or that they are justified for the 
same reasons that capitals are justified.  And the use-cases show how 
people are using them: not necessarily for Chicago Manual of Style 
mandated purposes, but for emphasis of varying kinds.

There was a time when computers (and printers in particular) were
single-case. There was some discussion about having to abolish case
distinctions to adapt to computers, but fortunately, that wasn't necessary.
Abolishing case I could see as a hassle, and we have become somewhat 
dependent on it for other things.  But it was a bad idea to start with.



~mark



Re: A last missing link for interoperable representation

2019-01-14 Thread Mark E. Shoulson via Unicode

On 1/13/19 10:00 PM, Martin J. Dürst via Unicode wrote:

On 2019/01/14 01:46, Julian Bradfield via Unicode wrote:

On 2019-01-12, Richard Wordingham via Unicode  wrote:

On Sat, 12 Jan 2019 10:57:26 +0000 (GMT)
And what happens when you capitalise a word for emphasis or to begin a
sentence?  Is it no longer the same word?

Indeed. As has been observed up-thread, the casing idea is a dumb one!
We are, however, stuck with it because of legacy encoding transported
into Unicode. We aren't stuck with encoding fonts into Unicode.

No, the casing idea isn't actually a dumb one.


Well, for me, when I say or said that the "casing idea" is a dumb one, I 
don't mean how Unicode handled it.  Unicode is quite correct in encoding 
capitals distinctly from lowercase, both for computer-historical reasons 
and others you mention.  I think the idea of having case in alphabets 
_in the first place_ was a bad move.  It's a "mistake" that happened 
centuries ago.


~mark


Re: A last missing link for interoperable representation

2019-01-14 Thread Mark E. Shoulson via Unicode
In some of this discussion, I'm not sure what is being proposed or 
forbidden here... I don't know that anyone is advocating removing the 
"don't use these for words!" warning sticker on the mathematical 
italics.  The closest-to-sensible suggestions I've heard are things like 
a VS to italicize a letter, a combining italicizer so to speak (this is 
actually very similar to the emoji-style vs text-style VS sequences).  
*If* the VS is ignored by searches, as apparently it should be and some 
have reported that it is, then VS-type solutions would NOT be a problem 
when it comes to searches (and don't go whining about legacy software.  
If Unicode had to be backward-compatible with everything we wouldn't 
have gone beyond ASCII).  So I'm not sure what you mean when you speak 
of "Unicode italics".  Do you mean using the mathematical italics as 
we've been seeing?  Or having a whole new plane of italic characters for 
everything that could conceivably be italicized?  Those would probably 
both be mistakes, I agree.
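
(If it helps make "ignored by searches" concrete, here's a toy sketch in 
Python of my own devising, not anything a shipping search engine does: 
comparison after dropping variation selectors.  The "italic a" below is 
hypothetical, since no italic VS exists; I just borrow VS1 for the demo.)

    # Toy sketch: compare text with variation selectors treated as
    # ignorable, which is what a VS-based italicizer would need from search.
    VS = {chr(c) for c in range(0xFE00, 0xFE10)}        # VS1..VS16
    VS |= {chr(c) for c in range(0xE0100, 0xE01F0)}     # VS17..VS256

    def strip_vs(s):
        return ''.join(ch for ch in s if ch not in VS)

    # A hypothetical "italic a" written as a + VS1 still matches plain "a":
    print(strip_vs("a\uFE00bc") == "abc")   # True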


~mark

On 1/14/19 5:58 PM, David Starner via Unicode wrote:

On Mon, Jan 14, 2019 at 2:09 AM Tex via Unicode  wrote:

The arguments against italics seem to be:

·Unicode is plain text. Italics is rich text.

·We haven't had it until now, so we don't need it.

·There are many rich text solutions, such as html.

·There are ways to indicate or simulate italics in plain text including 
using underscore or other characters, using characters that look italic (eg 
math), etc.

·Adding Italicization might break existing software

·The examples of  existing Unicode characters that seem to represent 
rich text (emoji, interlinear annotation, et al) have justifications.

There generally shouldn't be multiple ways of doing things. For
example, if you think that searching for certain text in italics is
important, then having both HTML italics and Unicode italics are going
to cause searches to fail or succeed unexpectedly, unless the
underlying software unifies the two systems (an extra complexity).
Searching for certain italicized text could be done today in rich text
applications, were there actual demand for it.
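
(A quick illustration of the unification problem, using nothing but 
Python's standard unicodedata module: NFKC happens to fold the math-italic 
letters back to ASCII, which is about the only tool a search has today.)

    import unicodedata

    # U+1D456 U+1D461 U+1D44E U+1D459 U+1D456 U+1D450: "italic" in math italics
    fancy = "\U0001D456\U0001D461\U0001D44E\U0001D459\U0001D456\U0001D450"

    print("italic" in fancy)                                  # False: a naive search misses it
    print("italic" in unicodedata.normalize("NFKC", fancy))   # True: NFKC folds the styling away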


·Plain text still has tremendous utility and rich text is not always an 
option.

Where? Twitter has the option of doing rich text, as does any closed
system. In fact, Twitter is rich text, in that it hyperlinks web
addresses. That Twitter has chosen not to support italics is a choice.
If users don't like this, they could go another system, or use
third-party tools to transmit rich text over Twitter. The use of
underscores or <i> markings for italics would be mostly
compatible with human twitterers using the normal interface.

Source code is an example of plain text, and yet adding italics into
comments would require but a trivial change to editors. If the user
audience cared, it would have been done. In fact, I suspect there
exist editors and environments where an HTML subset is put into
comments and rendered by the editors; certainly active links would be
more useful in source code comments than italics.

Lastly, the places where I still find massive use of plain text are
the places this would hurt the most. GNU Grep's manpage shows no sign
that it supports searching under any form of Unicode normalization.
Same with GNU Less. Adding italics would just make searching plain
text documents more complex for their users. The domain name system
would just add them to the ban list, and they'd be used for spoofing
in filenames and other less controlled but still sensitive
environments.





Re: A last missing link for interoperable representation

2019-01-14 Thread Mark E. Shoulson via Unicode

On 1/14/19 4:21 PM, Asmus Freytag via Unicode wrote:

On 1/14/2019 2:08 AM, Tex via Unicode wrote:


Perhaps the question should be put to twitter, messaging apps, 
text-to-voice vendors, and others whether it will be useful or not.


If the discussion continues I would like to see more of a 
cost/benefit analysis. Where is the harm? What will the benefit to 
user communities be?


The "it does no harm" is never an argument "for" making a change. It's 
something of a necessary, but not a sufficient condition, in other words.


More to the point, if there were platforms (like social media) that 
felt an urgent need to support styling without a markup language, and 
could articulate that need in terms of a proposal, then we would have 
something to discuss. (We might engage them in a discussion of the 
advisability of supporting "markdown", for example).


Short of that, I'm extremely leery of "leading" standardization; that 
is, encoding things that "might" be used.


It is certainly true that Unicode should not be (and wasn't, before 
emoji) in the business of encoding things that "could be used", but 
rather, was for encoding things that *were* used.  This, naturally, 
poses a chicken-and-egg problem which has been complained about by 
several people in the past (including me).  Still, there are ways to 
show that things that haven't been encoded are still being "used", as 
people make shift to do what they can to use the script/notation, like 
using PUA or characters that aren't QUITE right, but close...  And in 
fairness, I'd have to say that the use of mathematical italics would 
count in that regard.  It's hard to dispute that there is a demand for 
it, just by looking at how people have been trying to do it!  So I'm 
starting to think this is not really "leading" standardization, but 
rather following up and, well, standardizing it, replacing ad-hoc 
attempts with a standard way to do things, just as Unicode is supposed 
to do.


~mark


As for the abuse of math alphabetics. That's happening whether we like 
it or not, but at this point represents playful experimentation by the 
exuberant fringe of Unicode users and certainly doesn't need any 
additional extensions.






Re: A last missing link for interoperable representation

2019-01-14 Thread Mark E. Shoulson via Unicode

On 1/14/19 5:08 AM, Tex via Unicode wrote:


This thread has gone on for a bit and I question if there is any more 
light that can be shed.


BTW, I admit to liking Asmus' definition of rich text as functions that 
span text.



Me too.  There are probably some exceptions or weird corner-cases, but 
it seems to be a really good encapsulation of the distinction which I 
had never seen before.


~mark



Re: A last missing link for interoperable representation

2019-01-14 Thread Mark E. Shoulson via Unicode

On 1/14/19 4:45 AM, Martin J. Dürst via Unicode wrote:

Hello James, others,

From the examples below, it looks like a feature request for Twitter
(and/or Facebook). Blaming the problem on Unicode doesn't seem to be
appropriate.


I think what people here are doing is not blaming the problem on 
Unicode, but rather blaming the _solution_ on Unicode, for better or worse.


~mark



Re: A last missing link for interoperable representation

2019-01-12 Thread Mark E. Shoulson via Unicode
Just to add some more fuel for this fire, I note also the highly popular 
(in some places) technique of using Unicode letters that may have 
nothing whatsoever to do with the symbol or letter you mean to 
represent, apart from coincidental resemblance and looking "cool" 
enough.  This happens a lot on Second Life, where you can set your 
"display name" distinct from your "user name", but the display name 
appears to be limited to Unicode *letters* and some punctuation, mostly, 
and certainly can't be outside the BMP.  So for a sampling from stuff 
I've heard of...


ΑbiΑИØ SŦээlSØul
ΛPΉӨD
ΛИƓĿƐĪƇ  Ɗє ℓα ℜudǝ ωђitmαη
ΛЯℂӨƧ BΛПDΣЯΛƧ
ღLɪɴᴅᴀღ
ђÅℵℵƔ Fashionablez ℬãŋќş Ķhaгg
єσηα MιяєƖуηη
ℒυςノσυʂ ツ .
乙u 乙u
尺αмση ℓυιѕ αуαℓα
mღn
ᄊムレo
Ɩ'M ŦЯØЦßĿЄ ƧЄƝƖȤЄƝ ƓƠƬƊƛMMƖƬ™
øקςøги вαℓℓѕ ßⱥţţïţuđє
Ąşђεгöη ĄĶЯĨ Ğrєץ
Đ尺ѦႺΘȠ
đ σ  ℜ ι ค ℵ :.
ĦΔZΔRĐ
ʕ·ᴥ·ʔ
ϮJΩƧӇƲΔϮ
ϯcH ℭℛℯȡĩȵŧă
ⓁợⒼαℵ
亗 Amy 亗
ßяуⒶℵ GяєуωσLƒ
тαקקαt Wuηđǝяレǝ
کhäşhι ℰղcαηϯäɖσƦ
ۣღۣۜ Jarah Sparksۣღۣۜ
ઇઉ fleur ઇઉ
໓яαкє ςнυяςн
ڰۣღ- Pandora Barbarosڰۣღ-
ஐ tenayah ஐ-x-
ღⒹムяк 丂σuℓ™ღ
ץlđє Ͼђץlɠє
Լסяє ℳססɗү
עΨ Gatatem ђαвίв Ψיע

I could do more searching... Some of these things are even more common 
than shown here.  Using ღ for a heart ♡ is extremely widespread, and 
decorations like 亗 and Ϯ abound.  Note some decorations involving ღ with 
some Arabic(!) combining characters. Note the use of Hebrew and Arabic 
and CJK and other characters to represent Latin letters to which they 
bear only a passing resemblance.  There are also a lot of names in all 
small-caps or all full-width (I didn't include any examples of just that 
because they seemed so ordinary), or "inverted"  ·uoı̣ɥsɐɟ ꞁɐnsn əɥʇ uı̣


I don't know what, precisely, this argues for or against.  Would people 
deny that this is an "abuse" of the character-set, even though people 
are doing it and it works for them?  The medium is pretty indisputably 
plain-text.  Should all this kind of thing be somehow made to "work" for 
these creative, if mystifying, people? These are clearly pretty far-out 
examples (though not extreme, compared to what's out there, nor 
uncommon, from what I have been told.)


This discussion has been very interesting, really.  I've heard what I 
thought were very good points and relevant arguments from both/all 
sides, and I confess to not being sure which I actually prefer.  Just 
giving you more to think about...


~mark



Re: A last missing link for interoperable representation

2019-01-10 Thread Mark E. Shoulson via Unicode

On 1/10/19 6:43 PM, James Kass via Unicode wrote:


The first step would be to persuade the "powers that be" that italics 
are needed.  That seems presently unlikely.  There's an entrenched 
mindset which seems to derive from the fact that pre-existing 
character sets were based on mechanical typewriting technology and 
were further limited by the maximum number of glyphs in primitive 
computer fonts.


The second step would be to persuade Unicode to encode a new character 
rather than simply using an existing variation selector character to 
do the job.


A perhaps more affirmative step, not necessarily first but maybe, would 
be to write up a proposal and submit it through channels so the "powers 
that be" can respond officially.


~mark



Re: A last missing link for interoperable representation

2019-01-09 Thread Mark E. Shoulson via Unicode

On 1/9/19 4:25 AM, David Starner via Unicode wrote:



Honestly, I could argue that case should not be encoded. It would 
simplify so much processing of Latin script text, and most of the time 
case-sensitive operations are just wrong. Case is clearly a headache 
that has to be dealt with in plain text, but it certainly doesn't 
encourage me to add another set of characters that are basically the 
same but not.


I completely agree.  Casing of letters (in general, I mean) was a 
horrible mistake and is way more trouble than it’s worth.  Too late to 
fix it, and given how entrenched it is it did kind of have to be 
encoded, but it’s such a bad idea.  And then other alphabets see it and 
think “hey, we need capitals too!” and you get capitals for all the IPA 
extensions and Cherokee and so on... Ugh.


~mark



Re: A last missing link for interoperable representation

2019-01-09 Thread Mark E. Shoulson via Unicode

On 1/9/19 12:33 AM, David Starner via Unicode wrote:



Is there any way to preserve The Art of Computer Programming except as 
a PDF or its TeX sources? Grabbing a different book near me, I don't 
see any way to preserve them except as full-color paged reproductions. 
Looking at one data format, it uses bold, italics, and inversion 
(white on black), in sans-serif, serif and script fonts; certainly in 
lines like "Treasure standard (+1 starknife)" (where "Treasure" is set 
in bold and "+1 starknife" in italics), offering plain 
"Treasure standard (+1 starknife)" is completely insufficient.


Can some books be mostly handled with Unicode plain text and italics? 
Sure. HTML can handle them quite nicely. I'd say even they will have 
headers that are typographically distinguished and should optimally be 
marked in a transcription.


The line I used to say about this is “there’s no such thing as plain 
text on paper.”  The concept of “plain text” vs markup or styling is 
purely in the digital domain.  On physical artifacts, it’s just ink on 
wood-pulp, and the only “real” description of the page is a graphic image.


~mark



Re: A last missing link for interoperable representation

2019-01-09 Thread Mark E. Shoulson via Unicode

On 1/9/19 2:30 AM, Asmus Freytag via Unicode wrote:


English use of italics on isolated words to disambiguate the reading 
of some sentences is a convention. Everybody who does it, does it the 
same way. Not supported in plain text.


German books from the Fraktur age used Antiqua for Latin and other 
foreign terms. Definitely a convention that was rather universally 
applied (in books at least). Not supported in plain text.


Aren't there printing conventions that indicate this type of 
"contrastive stress" using letterspacing instead of font style?  I'm 
s u r e I've seen it in German and other Latin-written languages, and 
also even occasionally in Hebrew, whose experiments with italics tend 
not to be encouraging.


~mark



Re: Aleph-umlaut

2018-11-12 Thread Mark E. Shoulson via Unicode
You know, you're right (as is Beth), and I don't know why I'm arguing 
the point.  It's something I've been working on: I shouldn't defend a 
position JUST because it's _my_ position, and yet that's just what I did.


So, yes, it certainly does seem essentially German.  I couldn't say why 
they chose to write this part in German, or why they chose to transcribe 
it in Hebrew letters, really.  I assumed Yiddish probably because of the 
context and the alphabet used, but there's no reason for it not to be 
German.  Now, the pamphlet originated from Kloizenberg, i.e. 
https://en.wikipedia.org/wiki/Cluj-Napoca which is in Romania, but 
German was probably enough of a lingua franca (after all, Yiddish 
developed from it for that reason).  And the text being basically German 
would explain the aleph-umlaut which was the start of all this, though 
it doesn't so much need an "explanation": it's interesting enough that 
it's _there_.  Also interesting that no other umlauted letters were 
considered distinct enough to be transcribed so (or else they just 
happened not to show up).  There are probably mildly interesting things 
(depending on your interests) to be gleaned from studying the 
transliterations, such as how they seemed to use ע for word-final "e" in "die" 
in some places but א in others, etc.


Anyway, still interesting, I thought.

~mark

On 11/11/18 8:28 PM, Asmus Freytag via Unicode wrote:


I agree with Beth that the text reads like a transcription of a 
standard German text, not like a transcription of Yiddish, small 
infidelities in vowel/consonant renderings notwithstanding. These are 
either because the transcription conventions deliberately make some 
substitutions (presumably there's no Hebrew letter that would directly 
match an "ü", so they picked "i") or because the writer, while trying 
to capture standard German in this instance, is aware of a different 
orthography. The result, before Beth tweaked it, would resemble a bit 
a phonetic transcription of someone speaking standard German with a 
Yiddish accent. The fact that there are no differences in grammar and 
the phrasing is absolutely natural for written German is what confirms 
the identification as German, rather than Yiddish text.


Just because Yiddish is closely related to German doesn't mean that 
you can simply write the former with standard German phonetics and 
have it match a text in standard German to the point where there's no 
distinction. I think the sample is long enough and involved enough to 
give quite decent confidence in discriminating between these two 
Germanic languages. Grammar, phrasing and word choice are in that 
sense much better indicators than pure spelling; just as people trying 
to assume some foreign accent will give themselves away by faithfully 
maintaining the underlying structure of the language - that even works 
if the "accent" includes a few selected bits of "foreign" word order 
or grammar. In those artificial examples, there's rarely the kind of 
subtle mistake that a true non-native will make.


A./





Re: Aleph-umlaut

2018-11-11 Thread Mark E. Shoulson via Unicode

On 11/11/18 6:00 PM, Asmus Freytag (c) via Unicode wrote:

On 11/11/2018 1:37 PM, Hans Åberg wrote:

On 11 Nov 2018, at 22:16, Asmus Freytag via Unicode  wrote:

On 11/11/2018 12:32 PM, Hans Åberg via Unicode wrote:
One should not rely too much on these autotranslation tools, but it may 
be quicker using some OCR program and then correcting by hand, than 
entering it all by hand. The setup did not admit transliterating 
Hebrew script directly into German. It seems that the translator 
program recognizes it as Yiddish, though it might be as a result of 
an assumption it makes. 



Well, the OCR does a much better job than the "translation".


Agreed:




The German translation it gives:
Unsere Sünde kommt von der Seite der Verletzten, nachdem sie darauf gewartet 
hat, erwartet zu werden, und nachdem sie die Vorstellungen dieser rabbinischen 
Andachten kennengelernt haben, haben sie begonnen, mit der Motivation zu 
schließen:



This is simply utter nonsense and does not even begin to correlate 
with the transliteration.


Yeah, that looks like word salad even to me and my tiny knowledge of 
German.  The first words are definitely "Wir sind," for example.



And in English:
Our sin is coming out of the side of the injured side, after waiting to be 
expected, and having the concepts of these rabbinical devotiones, they have 
begun to agree with the motivation:



In fact, the English translation makes somewhat more sense. For 
example, "Gegenpartei" in many legal contexts (which this sample 
isn't, by the way) can in fact be translated as "injured party", which 
in turn might correlate with an "injured side" as rendered. However 
"Seite der Verletzten" makes no sense in this context, unless there's 
a Hebrew word that accidentally matches and got picked up.


The pamphlet seems to be referring to forming some sort of sub-community 
or group as a "gegenpartei," I think.


The actual content of the work is not a deep mystery, really.

~mark

(I'm suspicious that some of the auto translation does in fact work 
like many real translations which often are not direct, but involve an 
intermediate language - simply because it's not possible to find 
sufficient translators between random pairs of languages.).



From the original Hebrew script, in case someone wants to try out more 
possibilities:
וויר זינד אונס דעססען בעוואוסט דאסס פֿאָן זייטע דער גע־ געפארטהיי וועדער רייע , 
נאך איינזיכט צו ערווארטען איזט אונד דאסט זיא דיא קאַנסעקווענצען דיעזער 
ראבבינישען גוטאכטען פֿאָן זיך אבשיטטעלען ווערדען מיט דער מאָטיווירונג , דאסס :


I don't know what that will tell you. You have a rendering that 
produces coherent text which closely matches a phonetic 
transliteration. What else do you hope to learn?


A./





Re: Aleph-umlaut

2018-11-11 Thread Mark E. Shoulson via Unicode

On 11/11/18 4:16 PM, Asmus Freytag via Unicode wrote:

On 11/11/2018 12:32 PM, Hans Åberg via Unicode wrote:



Wir sind uns dessen bewusst, dass von Seite der Gegenpartei weder Reue(?), noch 
Einsicht zu erwarten ist und dass sie die Konsequenzen dieser rabbinischen 
Gutachten von sich abschüttelen werden mit der Motivierung, dass:

vir zind auns dessen bevaust dass fon zeyte der ge- gefarthey veder reye , nakh 
eynzikht tsu ervarten izt aund dast zya dya kansekventsen dyezer rabbinishen 
gutakhten fon zikh abshittelen verden mit der motivirung ,  dass :



This agrees rather well with Beth's retranslation.

Mapping "z" to "s",  "f" to "v" and "v" to "w" would match the way 
these pronunciations are spelled in German (with a few outliers like 
"izt" for "ist", where the "s" isn't voiced in German). There's also a 
clear convention of using "kh" for "ch" (as in English "loch" but also 
for other pronunciation of the German "ch"). The one apparent mismatch 
is "ge- gefarthey" for "Gegenpartei". Presumably what is 
transliterated as "f" can stand for phonetic "p". "Parthey" might be 
how Germans could have written "Partei" in earlier centuries (when 
"th" was commonly used for "t" and "ey" alternated with "ei", as in my 
last name).  So, perhaps it's closer than it looks, superficially.


I think that really IS a "p"; elsewhere in the document they seem to be 
quite careful to put a RAFE on top of the PEH when it means "f", and not 
using a DAGESH to mark "p".  There definitely does seem to be usage of 
TET-HEH for "th"; in the Hebrew text at the beginning it talks about the 
אורטה׳ community—took me a bit to work out that was an abbreviation for 
"Orthodox".


From context, "Reue" is by far the best match for "Reye" and seems to 
match a tendency elsewhere in the sample where the transliteration, if 
pronounced as German, would result in a shifted quality for the vowels 
(making them sound more Yiddish, for lack of a better description).


That word is hard to read in the original, hence the "?" in the 
transliteration.  It isn't clear if it's YOD YOD or YOD VAV and the VAV 
is missing its body (the head looks different than it should if it were 
a YOD).  Which would match your "Reue" fairly well—except that they 
generally use AYIN for "e", not "YOD".


"abschüttelen" - here the second "e" would not be part of Standard 
German orthography. It's either an artifact of the transcription 
system or possibly reflects that the writer is familiar with a 
different spelling convention (to my eyes the spelling "abshittelen" 
looks somehow more Yiddish, but I'm really not familiar enough with 
that language).


The ü is, of course, not in the text in the original; it's just "i".  
German ü wound up as "i" in Yiddish, in most cases.


~mark


Re: Aleph-umlaut

2018-11-11 Thread Mark E. Shoulson via Unicode

On 11/11/18 3:32 PM, Hans Åberg via Unicode wrote:

Taking a picture in the Google Translate app, and then pasting the Hebrew 
character string it identifies into translate.google.com for Yiddish gives the 
text:


Wir sind uns dessen bewusst, dass von Seite der Gegenpartei weder Reue(?), noch 
Einsicht zu erwarten ist und dass sie die Konsequenzen dieser rabbinischen 
Gutachten von sich abschüttelen werden mit der Motivierung, dass:

vir zind auns dessen bevaust dass fon zeyte der ge- gefarthey veder reye , nakh 
eynzikht tsu ervarten izt aund dast zya dya kansekventsen dyezer rabbinishen 
gutakhten fon zikh abshittelen verden mit der motivirung ,  dass :


Yeah, you have to be careful of auto-transliterating, if that's what 
you're using for this transliteration.  The third word is definitely not 
"auns"; the alef at the beginning is a "shtumer-alef", a *silent* letter 
used in Yiddish a little like a mater lectionis, now that I think about 
it: it's a nominal (but void) consonant used as a place-holder to hold 
the vowel  (Hebrew allows words to start with a vocalic vav, only when 
it's used as a conjunction, but Yiddish does not, generally.  Nor a 
vocalic yod or double-yod or vav-yod diphthong.)  Interesting that you 
have "*zya dya" there (those are silent as well; the words are just "zi 
di"); it looks like elsewhere in the document they spell it with a more 
precise transliteration, strictly using AYIN for "e", not ALEF as here.


~mark



Re: Aleph-umlaut

2018-11-10 Thread Mark E. Shoulson via Unicode
Oh yeah, fun fact about this document that I linked at the outset: I 
found it like 10 years ago when researching something unrelated... it 
apparently is a ruling opposing an earlier announcement by another group 
of Rabbis, declaring it void.  And looking at the rabbis they say are 
supporting them in this decision, I see they mention Rabbi Joseph Rosen, 
chief Rabbi of "Wisloch".  And I think to myself, "How interesting.  I 
have a great-grandfather who was named Rabbi Joseph Rosen, chief Rabbi 
of a town called Swisloch" (with an S; presumably an error in the 
pamphlet.)  I checked with my father; the timing is about right, would 
have been shortly before he came to America.  The Internet moves in 
mysterious ways.


~mark



Re: Aleph-umlaut

2018-11-10 Thread Mark E. Shoulson via Unicode

On 11/10/18 10:28 AM, Beth Myre via Unicode wrote:

Hi Everyone,

Are we sure this is actually Yiddish?  To me it looks like it could be 
German transliterated into the Yiddish/Hebrew alphabet.


I can spend a little more time with it and put together some examples.

Beth


Is there really a difference?  In a very real sense, Yiddish *IS* a form 
of German (I'm told it's Middle High German, but I don't actually have 
knowledge about that), with a strong admixture of Hebrew and Russian and 
a few other languages, and which is usually written using Hebrew 
letters.  There's probably something like a continuum with "Yiddish" and 
"German" as ranges or points.


Is the text *standard* German written with Hebrew letters?  I don't 
think so.  Let's see, on the next-to-last page, end of first paragraph, 
I see the phrase אויטאָריטא̈טען בעקרא̈פֿטינג, which would transliterate 
to "oytoritäten bekräfting"—with umlauted "a", but "oy-" instead of 
"au-" at the beginning.  OK, I know in German "au" can be pronounced 
"oy-" sometimes (I think), but at least 
https://en.wiktionary.org/wiki/Autorit%C3%A4t implies that this isn't 
the usual/standard pronunciation (I make no claims as to expertise in 
German).  The text is littered with terms like בי״ד, abbreviation for 
Hebrew בית דין, "house of judgment" or legal court, pronounced in 
Yiddish "beisdin", or פסק (can't be German as it has no vowels!) meaning 
"legal decision," from Hebrew—Hebrew-derived words in Yiddish do not 
change their spelling, as a rule.  There are definitely German spelling 
features that are not found in later spellings, for example, double 
letters in German are written double in the Yiddish spelling too, which 
is quite unusual (you're used to letters in Hebrew never being silent or 
even geminate, but always having at least a semi-syllable sound between 
like letters, except in special cases, so it seems striking to see אללע 
for a simple two syllables).


So I'm not sure if there's a *real* answer to your question, but it does 
look to me like this isn't "normal" German, at least.  And would it 
matter, anyway?


~mark




Re: Aleph-umlaut

2018-11-10 Thread Mark E. Shoulson via Unicode

On 11/10/18 1:25 AM, James Kass via Unicode wrote:


In the last pages of the text linked by Mark E. Shoulson, both the 
gershayim and the aleph-umlaut are shown.  A quick look didn't find 
any other base letter with the combining umlaut.


Indeed; there is no shortage of use of the GERSHAYIM, used as it 
normally is, to indicate abbreviations.  The umlaut on the alef is used 
specifically in the Yiddish parts, to be an umlaut (the word with the 
GERSHAYIM on the top line is an abbreviation for the phrase for a legal 
court or authority; the word on the second line transliterates 
apparently to "bestätigt"; someone with better German than me can make 
more sense of it.  The example I sent at first used the word 
"legalität", which even I can understand as "legality" or something like 
that.)  I think the Yiddish at the time may already not have had ö or ü 
sounds, so had no need to transliterate those (or maybe there just 
happened not to be a need for them in this text); certainly I see 
Yiddish spellings like אויפֿ־ ("oyf-") where German would have "auf".


~mark



Re: Aleph-umlaut

2018-11-10 Thread Mark E. Shoulson via Unicode

On 11/9/18 7:02 PM, Tex via Unicode wrote:

My notes on Hebrew numbers on 
http://www.i18nguy.com/unicode/hebrew-numbers.html  include:

"Using letters for numbers, there is the possibility of confusion as to whether a 
string of letters is a word or a numerical value. Therefore, when numbers are used with 
text, punctuation marks are added to distinguish their numerical meaning. Single 
character numbers (numbers less than 10) add the punctuation character geresh after the 
numeric character. Larger numbers insert the punctuation character gershayim before the 
last character in the number."

So perhaps Alef with diaeresis is a collapsed form of Alef followed by 
Gershayim when it is used as a numeric value. I wonder if that may also occur 
for other values.


I don't know that it's a "collapsed" form.  I think the double-dotted 
form is just an alternate one, and one that was more popular in older 
times.  Standardized Hebrew numerical usage would be to use a GERESH 
(not a GERSHAYIM) after an ALEF to indicate a thousand; GERSHAYIM is 
used before the last letter in a number that is "large" generally in the 
sense of the number of letters (i.e. more than one or two).  Since 
GERESH is also used for single-letter numbers, this means that א׳ could 
mean "one" (much more common) or "one thousand".  The GERESH-after 
becomes useful in something like the full number of the year, ה׳תשע״ט 
where it sets off the initial 5, making it 5000 (this notation is not 
place-value, but there is a usual ordering, so technically it would 
(usually) be understandable even without the punctuation marks, due to 
the out-of-order placement of the initial HE).
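
(To make the arithmetic concrete, here's a toy evaluator of my own, not 
any library: it treats the letters before a GERESH as thousands and 
skips the GERSHAYIM, which carries no value.)

    # Toy Hebrew-numeral evaluator (my own sketch).
    VALUES = {'א': 1, 'ב': 2, 'ג': 3, 'ד': 4, 'ה': 5, 'ו': 6, 'ז': 7,
              'ח': 8, 'ט': 9, 'י': 10, 'כ': 20, 'ל': 30, 'מ': 40, 'נ': 50,
              'ס': 60, 'ע': 70, 'פ': 80, 'צ': 90, 'ק': 100, 'ר': 200,
              'ש': 300, 'ת': 400}

    def hebrew_value(s):
        total = pending = 0
        for ch in s:
            if ch == '\u05F3':            # GERESH: what preceded it is thousands
                total += pending * 1000   # (so this toy reads א׳ as 1000, not 1;
                pending = 0               #  the ambiguity noted above is real)
            elif ch != '\u05F4':          # GERSHAYIM adds no value; skip it
                pending += VALUES.get(ch, 0)
        return total + pending

    print(hebrew_value('ה\u05F3תשע\u05F4ט'))   # 5779 = 5*1000 + 400+300+70+9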


Again, what interested me about this usage was that it really *was* an 
umlaut.  But yes, there are other situations where such a thing could 
happen.


~mark



Re: Aleph-umlaut

2018-11-10 Thread Mark E. Shoulson via Unicode

On 11/9/18 6:25 PM, Marius Spix via Unicode wrote:

Dear Mark,

I found another sample here:
https://www.marketscreener.com/BRILL-5240571/pdf/61308/Brill_Report.pdf

On page 86 it says that the aleph with diaresis is a number with
the value 1000.


That's true, I've heard of that, and even occasionally seen it. And 
sometimes in old printings things like a diaeresis or a dot above were 
used where later Hebrew uses a U+05F3 HEBREW PUNCTUATION GERESH or 
U+05F4 HEBREW PUNCTUATION GERSHAYIM.  I think what struck me about this 
one was that this was not just something that looked like a 
diaeresis/umlaut, it really WAS an umlaut, a direct transcoding of the 
a-umlaut in Latin letters into aleph-umlaut in Hebrew letters.



Yet another usage in a mathematical context of an aleph with umlaut can
be found here, however they used U+2135 ALEF SYMBOL instead of U+05D0
HEBREW LETTER ALEF. This is not related to the value 1000, as the umlaut
is used to mark the second derivative.
https://de.slideshare.net/StephenAhiante/dynamics-modelling-design-of-a-quadcopter
(page 28-29 or slide 41-42)


Kind of an odd usage, since ALEF SYMBOL is usually used for transfinite 
cardinals, as in ℵ₀, and you don't normally take time-derivatives of 
those.  But mathematicians love using weird symbols for whatever they 
like.  This is the mathematical usage of two-dots-above, as you note.



~mark


Aleph-umlaut

2018-11-09 Thread Mark E. Shoulson via Unicode

  
  
Noticed something really fascinating in an old pamphlet I was
  reading.  It's from 1922, in Hebrew mostly but with some Yiddish
  at the end.  The Yiddish spelling is not according to more modern
  standardization, but seems to be significantly more faithful to
  the German spellings of the same words, replacing Latin letters
  with Hebrew ones more than respelling phonetically.  And there are
  even places where it appears they represented a German ä with a
  Hebrew aleph—with an umlaut!  Actually it looks a little more like
  a double acute accent but that's surely a style choice, since it
  obviously is mapping to an umlaut.





(Note also the spelling דיע, a calque for German "die", where
  modern Yiddish would spell it phonetically as די.)


I do NOT think this needs any special encoding, btw.  I would
  probably encode this as simply U+05D0 U+0308 (א̈).  Combining
  symbols do not (necessarily) belong to a specific alphabet, and
  the fact that most fonts would render this badly is a different
  issue.  I just thought the people here might find it interesting.


(Link is
http://rosetta.nli.org.il/delivery/DeliveryManagerServlet?dps_pid=IE36609604&_ga=2.182410660.2074158760.1541729874-1781407841.1541729874
  look at the last few pages.)


~mark

  



Re: Private Use areas

2018-08-31 Thread Mark E. Shoulson via Unicode

On 08/28/2018 04:26 AM, William_J_G Overington via Unicode wrote:

Hi
  
Mark E. Shoulson wrote:
  

I'm not sure what the advantage is of using circled characters instead of plain 
old ascii.
  
My thinking is that "plain old ascii" might be used in the text encoded in the file. Sometimes a file containing Private Use Area characters is a mix of regular Unicode Latin characters with just a few Private Use Area characters mixed in with them. So my suggestion of using circled characters is for disambiguation purposes. The circled characters in the PUAINFO sequence would not be displayed if a special software program were being used to read in the text file, then act upon the information that is encoded using the circled characters.


What if circled characters are used in the text encoded in the file?  
They're characters too, people use them and all.  Whenever you designate 
some characters to be used in a way outside their normal meaning, you 
have the problem of how to use them *with* their normal meaning.  So 
there are various escaping schemes and all.  So in XML, all characters 
have their normal meanings—except <, >, and &, which mean something 
special and change the interpretations of other nearby characters (so 
"bold" is a word in English that appears in the text, but "" is 
part of an instruction to the renderer that doesn't appear in the 
text.)  And the price is that those three characters have to be 
expressed differently (< > &).  I don't really see what you 
gain by branding some large swath of unicode ("circled characters") as 
"special" and not meaning their usual selves, and for that matter making 
these hard-to-type characters *necessary* for using your scheme, when 
you could do something like what XML does, and say "everything between < 
and > is to be interpreted specially, and there, these characters have 
the following meanings" and then have some other way of expressing those 
two reserved characters.  (not saying you need to do it XML's way, but 
something like that: reserve a small number of characters that have to 
be escaped, not some huge chunk.)
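
(For instance, with Python's standard library; this is the whole cost of 
XML's approach, and any XML toolkit behaves the same way.)

    from xml.sax.saxutils import escape, unescape

    s = 'bold <b> & co.'
    print(escape(s))                    # bold &lt;b&gt; &amp; co.
    print(unescape(escape(s)) == s)     # True: the three escapes round-trip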
  
My thinking is that using this method just adds some encoded information at the start of the text file and does not require the whole document to become designated as a file conformant to a particular markup format.


That's another way of saying that this is a markup format which accepts 
a large variety of plain texts.  Because you ARE talking about making a 
"particular markup format," just a different and new one.


I guess there's not even any reason for me to argue the point, though, 
since it is up to you how to design your markup language, and you can 
take advice (or not) from anyone you like.  Draw up some design, find 
some interested people, start a discussion, and work it out.  (but not 
here; this list is for discussing Unicode.)


~mark


Re: Private Use areas

2018-08-31 Thread Mark E. Shoulson via Unicode

On 08/28/2018 11:58 AM, William_J_G Overington via Unicode wrote:

Asmus Freytag wrote:


There are situations where an ad-hoc markup language seems to fulfill a need that is not 
well served by the existing full-fledged markup languages. You find them in internet 
"bulletin boards" or services like GitHub, where pure plain text is too 
restrictive but the required text styles purposefully limited - which makes the syntactic 
overhead of a full-featured mark-up language burdensome.

I am thinking of such an ad-hoc special purpose markup language.

I am thinking of something like a special purpose version of the FORTH computer 
language being used but with no user definitions, no comparison operations and 
no loops and no compiler. Just a straight run through as if someone were typing 
commands into FORTH in interactive mode at a keyboard. Maybe no need for spaces 
between commands. For example, circled R might mean use Right-to-left text 
display.


That starts to sound no longer "ad-hoc", but that is not a well-defined 
term anyway.  You're essentially describing a special-purpose markup 
language or protocol, or perhaps even programming language.  Which is 
quite reasonable; you should (find some other interested people and) 
work out some of  the details and start writing up parsers and such

I am thinking that there could be three stacks, one for code points and one for 
numbers and one for external reference strings such as for accessing a web page 
or a PDF (Portable Document Format) document or listing an International 
Standard Book Number and so on. Code points could be entered by circled H 
followed by circled hexadecimal characters followed by a circled character to 
indicate Push onto the code point stack. Numbers could be entered in base 10, 
followed by a circled character to mean Push onto the number stack. A later 
circled character could mean to take a certain number of code points (maybe 
just 1, or maybe 0) from the character stack and a certain number of numbers 
(maybe just 1, or maybe just 0) from the number stack and use them to set some 
property.

It could all be very lightweight software-wise, just reading the characters of 
the sequence of circled characters and obeying them one by one just one time 
only on a single run through, with just a few, such as the circled digits, each 
having its meaning dependent upon a state variable such as, for a circled 
digit, whether data entry is currently hexadecimal or base 10.


I still don't see why you're fixated on using circled characters. You're 
already dealing with a markup-language type setup, why not do what other 
markup schemes do?  You reserve three or four characters and use them to 
designate when other characters are not being used in their normal sense 
but are being used as markup.  In XML, when characters are inside '<>' 
tags, they are not "plain text" of the document, but they mean other 
things—perhaps things like "right-to-left" or "reference this web page" 
and so forth, which are exactly the kinds of things you're talking about 
here.  If you don't want to use plain ascii characters because then you 
couldn't express plain ascii in your text, you're left with exactly the 
same problem with circled characters: you can't express circled 
characters in your text.  While that is a smaller problem, it can be 
eliminated altogether by various schemes used by XML or RTF or 
lightweight markup languages.  Reserve a few special characters to give 
meanings to the others, and arrange for ways to escape your handful of 
reserved characters so you can express them.  More straightforward to 
say "you have to escape <, >, and & characters" than to say "you have to 
escape all circled characters."


Anyway, this is clearly a whole new high-level protocol you need (or 
want) to work out, which would *use* Unicode (just like XML and JSON 
do), but doesn't really affect or involve it (Unicode is all about the 
"plain text").  Kind of getting off-topic, but get some people interested 
and start a mailing list to discuss it.  Good luck!


~mark


Re: Private Use areas

2018-08-27 Thread Mark E. Shoulson via Unicode
But there's nothing wrong with proposing a higher-level protocol; 
indeed, that's what Ken Whistler was saying: you need a protocol to 
transmit  this information.  It's metadata, so it will perforce be a 
higher-level protocol of some kind, whether transmitting actually 
out-of-band or reserving a piece of the file for metadata.  That's 
fine.  I'm not sure what the advantage is of using circled characters 
instead of plain old ascii.  You have to set off your reserved area 
somehow, and I don't think using circled chars is the least obtrusive 
way to do it.  You could use XML; that would be pretty well-suited to 
the task, but maybe it's overkill.  If all you need is to reference some 
"standard" PUA interpretation (per James Kass' take on this, not William 
Overington's), then just a header like "[PUA1]" would work just 
fine.  (Compare emacs with things like "-*- coding: utf-8 -*-" or 
whatever.)
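
(A sketch of how trivial the reader's side could be; "[PUA1]" is of 
course a made-up tag here, standing in for whatever registry label such 
a convention would actually pick.)

    # Hypothetical: peel a one-line "[PUA1]"-style magic header off a text.
    def split_pua_header(text):
        first, _, rest = text.partition('\n')
        if first.strip().startswith('[PUA') and first.strip().endswith(']'):
            return first.strip('[] \t'), rest   # header tag, remaining plain text
        return None, text                        # no header: it's all plain text

    tag, body = split_pua_header('[PUA1]\nText with \uE000 in it.')
    print(tag)   # PUA1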


For larger chunks of meta-info, XML might be a good choice, but even 
then, it could be an XML *header* to an otherwise ordinary text file.  
Yes, you'd have to delimit it somehow, and probably have a top header (a 
"magic number") to signal the protocol, but that's doable.  For 
applications not supporting this protocol, such a setup is probably 
easier for the eye to skip past (even if it's long) than a bunch of 
circled letters.


A protocol like that is outside of Unicode's scope (just like XML is), 
but it's certainly something you could write up and try to standardize 
and get used, with or without the support of ISO. People are coming up 
with file formats all the time (and if you really want to used circled 
characters, go ahead.  That's something for you to consider in the 
design phase of the project).


~mark


On 08/27/2018 05:20 PM, Rebecca Bettencourt via Unicode wrote:


> That sounds like a non-conformant use of characters in
the U+24xx block.

Well, you are an expert on these things and I do not
understand as to with what it would be non-conformant.


A conformant process must interpret ⓅⓊⒶⒹⒶⓉⒶ as the characters ⓅⓊⒶⒹⒶⓉⒶ 
and not as a signal to process what follows as anything other than 
plain text.


What you are proposing is a higher-level protocol, whether you realize 
it or not. Unfortunately your higher-level protocol has a serious flaw 
in that it cannot represent the string "ⓅⓊⒶⒹⒶⓉⒶ". Also, seeing a bunch 
of circled alphanumeric characters in a document ⓘⓢ◯ⓕⓐⓡ◯ⓕⓡⓞⓜ◯ⓤⓝⓞⓑⓣⓡⓤⓢⓘⓥⓔ.


There are plenty of already-existing higher-level protocols (you 
mentioned one: XML) that could be used to provide information about 
PUA characters, and they are all much better suited to that purpose 
than what you are proposing.






Re: Private Use areas

2018-08-27 Thread Mark E. Shoulson via Unicode

On 08/27/2018 05:18 PM, James Kass via Unicode wrote:

William Overington wrote,



On Mon, Aug 27, 2018 at 12:59 AM, William_J_G Overington
 wrote:


Use circled digits U+24EA, U+2460 .. U+2469, and Circled Latin letters
U+24B6 .. U+24E9.

Use U+2473 as if it were a circled space.

ⓌⒽⓎ◯ⓃⓄⓉ◯ⓊⓈⒺ◯ⓉⒽⒺ◯ⒸⒾⓇⒸⓁⒺⒹ◯ⓈⓅⒶⒸⒺ◯
ⒻⓄⓇ◯ⓉⒽⒺ◯ⒸⒾⓇⒸⓁⒺⒹ◯ⓈⓅⒶⒸⒺ?


And what's wrong with the ASCII digits?


~mark



Re: Aw: Re: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process)

2018-08-23 Thread Mark E. Shoulson via Unicode

On 08/23/2018 06:48 AM, Asmus Freytag (c) via Unicode wrote:

On 8/23/2018 3:28 AM, "Jörg Knappen" wrote:

Asmus,
I know your style of humor, but to keep it straight:
All known human languages, even Piraha, have pronouns for "I" and "you".


And languages like Japanese, tend to use them - mostly not.

Even if the concepts are known, and can be named, there are deep 
differences across languages concerning the need  or conventions for 
demarcating them with words in any given context.


Replacing words by symbols is not going to fix this - the only way to 
get a 'universal' system of symbolic expression is to invent a new 
language, with its own conventions for use of these symbols in any 
given context.




It isn't like replacing words with symbols hasn't been tried... I think 
Francis Lodwick had a "universal symbology" like this in the works in 
the 1600s.


~mark



Re: Aw: Re: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process)

2018-08-23 Thread Mark E. Shoulson via Unicode
Still, pronouns may be universal, but their features aren't... Pronouns 
in Japanese are not a closed class, and it is not uncommon to use a 
person's name/title instead of "you".  Happens in English and other 
languages too, with extremely formal speech, even down to conjugating 
with 3rd-person verb forms.  (it's really cool to see the mid-sentence 
back-and-forth shifting in Biblical Hebrew, e.g. Genesis chapter 44.)  
All of which is to say, as Asmus did, that even "I" and "you" are not 
interchangeable pieces between languages, easily symbolized by a single 
"fits-all-languages" placeholder.


~mark

On 08/23/2018 06:28 AM, "Jörg Knappen" via Unicode wrote:

Asmus,
I know your style of humor, but to keep it straight:
All known human languages, even Piraha, have pronouns for "I" and "you".
--Jörg Knappen
*Sent:* Monday, 20 August 2018 at 16:20
*From:* "Asmus Freytag via Unicode" 
*To:* unicode@unicode.org
*Subject:* Re: Thoughts on working with the Emoji Subcommittee (was 
Re: Thoughts on Emoji Selection Process)


What about languages that don't have or don't use personal pronouns. 
Their speakers might find their use odd or awkward.


The same for many other grammatical concepts: they work reasonably 
well if used by someone from a related language, or for linguists 
trained in general concepts, but languages differ so much in what they 
express explicitly that if any native speaker transcribes the features 
that are exposed (and not implied) in their native language it may not 
be what a reader used to a different language is expecting to see.


A./





Re: Private Use areas

2018-08-21 Thread Mark E. Shoulson via Unicode

On 08/21/2018 02:03 PM, Ken Whistler via Unicode wrote:



On 8/21/2018 7:56 AM, Adam Borowski via Unicode wrote:

On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote:

On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote:

Is there a block of RTL PUA also?

No.

Perhaps there should be?


This is a periodic suggestion that never goes anywhere--for good 
reason. (You can search the email archives and see that it keeps 
coming up.)


Presuming that this question was asked in good faith...


Yeah, I know there has been talk about such things, and I also knew that 
whether or not there was an RTL block (which I did not remember for 
certain), there weren't going to be any *changes* in the PUA, and we 
were going to have to make do with what there was.  There's no way to 
anticipate all the possible properties people would want in the PUA, 
though I remember thinking it was probably wrong to make the PUA 
*strongly* LTR; I know there's a not-strongly flavor too.


Best we can do is shout loudly at OpenType tables and hope to cram in 
behavior (or at least appearance, which is more likely all we can get) 
that vaguely resembles what we're after.  And that's not SO awful, given 
what we're dealing with.




As I see it, the only feasible way for people to get specialized 
behavior for PUA ranges involves first ceasing to assume that somehow 
they can jawbone the UTC into *standardizing* some ranges for some 
particular use or another. That simply isn't going to happen. People 
who assume this is somehow easy, and that the UTC are a bunch of 
boneheads who stand in the way of obvious solutions, do not -- I 
contend -- understand the complicated interplay of character 
properties, stability guarantees, and implementation behavior baked 
into system support libraries for the Unicode Standard.


The whole point of the PUA is that it *isn't* standardized (by the 
UTC).  It might have been nice to make some more varied choices of 
things that couldn't be left unspecified, but you're still going to wind 
up with "but there aren't any PUA codepoints that are JUST what I 
need!"  And, as said, it's too late now.


~mark


Re: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...))

2018-08-20 Thread Mark E. Shoulson via Unicode

On 08/20/2018 03:12 PM, Mark Davis ☕️ via Unicode wrote:

> ... some people who would call a PUA solution either batty
> or crazy.

I don't think it is either batty or crazy. People can certainly use 
the PUA to interchange text (assuming that they have downloaded fonts 
and keyboards or some other input method beforehand), and it can 
definitely serve as a proof of concept. Plain symbols — with no 
interactions between them (like changing 
shape with complex scripts), no combining/non-spacing marks, no case 
mappings, and so on — are the best possible case for PUA.


It is kind of a bummer, though, that you can't experiment (easily? or at 
all?) in the PUA with scripts that have complex behavior, or even 
not-so-complex behavior like accents & combining marks, or RTL direction 
(here, also, am I speaking true?  Is there a block of RTL PUA also?  I 
guess there's always RLO, but meh.)  Still, maybe it doesn't really 
matter much: your special-purpose font can treat any codepoint any way 
it likes, right?


~mark



Re: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process)

2018-08-20 Thread Mark E. Shoulson via Unicode

On 08/20/2018 10:30 AM, James Kass via Unicode wrote:

As The Universal Character Set, it should be able to support the needs
of all users.  And with the Private Use Areas, it does.


Here, I agree with you.  This kind of experimentation is exactly what 
the PUA is for, especially for these putative "universal pictographic 
systems" which will need space to hold the whole system, since the 
individual signs won't mean much unless you understand the system (which 
I know I said was an argument against encoding them at all, but that's 
the point of the PUA: see if you can get some traction, if people really 
DO find it useful, etc. Then you can make me eat my words.)  I think 
it's been suggested a few times.


Go forth into the PUA, and make it yours, then!

~mark



Re: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process)

2018-08-20 Thread Mark E. Shoulson via Unicode

On 08/20/2018 10:20 AM, Asmus Freytag via Unicode wrote:

On 8/20/2018 7:09 AM, James Kass via Unicode wrote:

Leo Broukhis responded to William Overington:


I decided that trying to design emoji for 'I' and for 'You' seemed
interesting so I decided to have a go at designing some.

Why don't we just encode Blissymbolics, where pronouns are already
expressible as abstract symbols, and emojify them?

Emoji enthusiasts seeking to devise a universal pictographic set might
be well-advised to build from existing work such as Blissymbolics.

I think William Overington's designs are clever, though.  Anyone who
has ever studied a foreign language (or even their own language) would
easily and quickly recognize the intended meanings of the symbols once
they understand the derivation.

What about languages that don't have or don't use personal pronouns. 
Their speakers might find their use odd or awkward.


The same for many other grammatical concepts: they work reasonably 
well if used by someone from a related language, or for linguists 
trained in general concepts, but languages differ so much in what they 
express explicitly that if any native speaker transcribes the features 
that are exposed (and not implied) in their native language it may not 
be what a reader used to a different language is expecting to see.




Most of the emoji are heavily dependent on a presumed culture anyway.  
The smiley-faces maybe could be argued to be cross-cultural (facial 
expressions are the same for all people—well, mostly), though even then 
the styling is cultural.  But a lot of the rest are 
culture-dependent—and that's fine and how it should be, IMO.


That said, I think William Overington's designs are generally opaque and 
incomprehensible.  James Kass says, "Anyone who has ever studied a 
foreign language (or even their own language) would easily and quickly 
recognize the intended meanings of the symbols *once they understand the 
derivation*." (emphasis added).  Well, yeah, once you tell me what 
something means, I know what it means!  The point of emoji is that they 
already make some sort of "obvious" sense—admittedly, to those who are 
in the covered culture.  (You can't say the same would be true of 
pronoun emoji for linguists, because no linguist would ever look at 
those symbols and think, "Oh right!  Pronouns!"  Yes, they'll make sense 
*once explained* and once you're told they're pronouns, but that's not 
the same thing.)


Moreover, they are once again an attempt to shoehorn Overington's pet 
project, "language-independent sentences/words," which are still 
generally deemed out of scope for Unicode.


~mark


Re: Unicode 11 Georgian uppercase vs. fonts

2018-07-30 Thread Mark E. Shoulson via Unicode
O blessed gods of writing, you mean yet *another* script wants (wanted?) 
to commit the mistake of bicamerality?  Just quit while you're ahead!


~mark

On 07/27/2018 10:14 AM, Khaled Hosny via Unicode wrote:

On Fri, Jul 27, 2018 at 02:02:07PM +0100, Michael Everson via Unicode wrote:

1) Show evidence of titlecasing in Hebrew or Arabic.

FWIW, there was a case system for Arabic used at some point in Egypt,
called “crown letters”, and introduced under the direction of king Fuad
and was used in some capacity in official documents till the end of the
monarch:
https://en.wikipedia.org/wiki/Crown_Letters_and_Punctuation_and_Their_Placements
http://hibastudio.com/wp-content/uploads/2014/01/ar458.jpg

Regards,
Khaled





Re: Fwd: RFC 8369 on Internationalizing IPv6 Using 128-Bit Unicode

2018-04-02 Thread Mark E. Shoulson via Unicode
Whew!  Thanks for explaining the joke! Everyone here really thought they 
were serious.  Maybe you should write to the authors of the RFC and 
explain to them that their growth-function is incorrect.  I'm sure 
they'd be glad of the correction.


~mark

On 04/02/2018 09:49 PM, Philippe Verdy via Unicode wrote:
It's fun to consider the introduction (after emojis) of imojis, 
amojis, umojis and omojis for individual people (or named pets), alien 
species (E.T. wants to be able to call home with his own language and 
script!), unknown things, and obfuscated entities. Also fun for new 
"trollface" characters. In fact you could represent every individual 
or even every single atom in the universe that has ever been created 
since the Big Bang!


But unlike peoples and social entities, characters to encode don't 
grow exponentially but linearly, at a slowing speed. 


Re: Fwd: RFC 8369 on Internationalizing IPv6 Using 128-Bit Unicode

2018-04-02 Thread Mark E. Shoulson via Unicode

On 04/02/2018 08:52 PM, J Decker via Unicode wrote:



On Mon, Apr 2, 2018 at 5:42 PM, Mark E. Shoulson via Unicode 
<unicode@unicode.org> wrote:


For unique identifiers for every person, place, thing, etc,
consider
https://en.wikipedia.org/wiki/Universally_unique_identifier
which are indeed 128 bits.

What makes you think a single "glyph" that represents one of these
3.4⏨38 items could possibly be sensibly distinguishable at any
sort of glance (including long stares) from all the others?  I
have an idea for that: we can show the actual *digits* of some
encoding of the 128-bit number.  Then just inspecting for a
different digit will do.


there's no restirction that it be one character cell in size... 
rendered glyphs could be thousands of pixels wide...


Yes, but at that point it becomes a huge stretch to call it a 
"character".  It becomes more like a "picture" or "graphic" or 
something.  And even then, considering the tremendohunormous number of 
them we're dealing with, can we really be sure each one can be uniquely 
recognized as the one it's *supposed* to be, by everyone?


~mark




Re: Fwd: RFC 8369 on Internationalizing IPv6 Using 128-Bit Unicode

2018-04-02 Thread Mark E. Shoulson via Unicode
For unique identifiers for every person, place, thing, etc, consider 
https://en.wikipedia.org/wiki/Universally_unique_identifier which are 
indeed 128 bits.
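
(Trivially checkable, e.g. with Python's standard uuid module:)

    import uuid

    u = uuid.uuid4()
    print(u)                            # random each run
    print(u.int.bit_length() <= 128)    # True: a UUID is a 128-bit integer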


What makes you think a single "glyph" that represents one of these 
3.4⏨38 items could possibly be sensibly distinguishable at any sort of 
glance (including long stares) from all the others?  I have an idea for 
that: we can show the actual *digits* of some encoding of the 128-bit 
number.  Then just inspecting for a different digit will do.


Now, what about a registry for "important" (and 
not-necessarily-important) UUIDs for key things and people, which 
associates them with an image of some kind?  Some sort of global icon?  
And indeed, perhaps used for Internet-of-Things-like things?  Not 
necessarily a bad idea—but decidedly outside of the scope of Unicode.  
(Maybe you could even assign your beloved sentences to some UUIDs and 
stick them in such a registry.  Again, who knows, maybe a decent idea.  
But it ain't Unicode.)


~mark

On 04/02/2018 02:15 PM, William_J_G Overington via Unicode wrote:

Doug Ewell wrote:


Martin J. Dürst wrote:
  

Please enjoy. Sorry for being late with forwarding, at least in some
parts of the world.
  

Unfortunately, we know some folks will look past the humor and use this
as a springboard for the recurring theme "Yes, what *will* we do when
Unicode runs out of code points?"

An interesting thing about the document is that it suggests a Unicode code 
point for an individual item of a particular type, what the document terms an 
imoji.

This being beyond what Unicode encodes at present.

I wondered if this could link in some ways to the Internet of Things.




Re: 0027, 02BC, 2019, or a new character?

2018-01-24 Thread Mark E. Shoulson via Unicode

On 01/24/2018 09:29 PM, Shriramana Sharma via Unicode wrote:
On 24-Jan-2018 00:25, "Doug Ewell via Unicode" wrote:


I think it's so cute that some of us think we can advise Nazarbayev on
whether to use straight or curly apostrophes or accents or x's or
whatever. Like he would listen to a bunch of Western technocrats.


Sir why this assumption that everyone here is "western"? I'm situated 
at an even more eastern longitude than Kazakhstan.
It hardly matters. As the intent here is to comment on Nazarbayev's 
putative view of these discussions, it's quite likely he would write the 
whole lot of us off as "Western technocrats" no matter what our longitudes.


~mark


Observations and rants

2018-01-17 Thread Mark E. Shoulson via Unicode

  
  
I've been keeping up with the Document Register and following some
of the discussions therein, especially regarding Emoji, and I remain
surprised with some of the bizarre suggestions.  Not random people
suggesting weird emoji, but things from the ESC and UTC!

A lot can be summed up in the simple question, "Why doesn't anyone
listen to Charlotte Buff?"  She has written, from what I have seen,
pretty cogently and clearly on various minefields that emoji
decisions are likely to be treading, and has been quite the opposite
of "entitled" or demanding of strange special characters.  For the
most part, she has warned *against* encoding too much, lest covering
some special cases open up demand for covering all of them.  What is
the UTC thinking, that hair variation can be reasonably covered, in
initial cases, by RED/CURLY/NONE/WHITE?  Are these supposed to be
orthogonal?  In which case, the instant question is how to handle
hair that is red and curly.  If they can be combined, how do we
handle CURLY+NONE?  Why are the more common colors (brown, black,
blond) left for future study?

Regarding
https://www.unicode.org/L2/L2018/18027-wg2-fdbk-response.pdf, I also
don't really see what the point of SOFTBALL is as distinct from
BASEBALL, especially since softballs I've seen are often the same
color as baseballs, and can be distinguished mainly by size—and
neither size _nor color_ is something reliably encoded in a Unicode
character (characters with colors in their names notwithstanding. 
Glyphs are canonically foreground-vs-background markings, from what
I can tell.)  A similar issue of color might apply to NAZAR, which,
without blue coloring, looks mostly like some form of target.

MAGNET kind of has to be horseshoe-shaped.  Otherwise it's just a
cylinder.  The horseshoe shape is indeed already obsolete, as it was
used mainly to keep old alnico magnets from demagnetizing
themselves, and modern magnets (especially NdFeB magnets) have no
need of such gentling.  But cartoons for decades represented magnets
with the distinctive horseshoe shape (with the ends marked off, so
it doesn't look like an actual shoe-for-a-horse), and any kid in the
appropriate culture would instantly recognize such a shape as a
magnet.  (Cultural specificity of emoji is a longer rant I may
inflict on this list.  Briefly, nothing about emoji is "culturally
neutral" and it's ridiculous to make them so.)  Even though
technology changes, our symbols often remain.  Telephones are no
longer shaped like U+1F4DE TELEPHONE RECEIVER or U+1F57D RIGHT HAND
TELEPHONE RECEIVER or U+260E BLACK TELEPHONE or U+1F57E WHITE
TOUCHTONE TELEPHONE, but the symbols are widely recognized. 
ERLENMEYER FLASK is certainly a strong and recognizable science
symbol.

This circles back again to some of Charlotte Buff's points.  A
"bride" is a culturally-recognizable symbol, with specific graphic
cues (bridal gown, which is unlike other dresses in use; veil) that
we associate with a bride and weddings, specifically.  The "male
counterpart" of a bride is not a "man in tuxedo."  A man in a tuxedo
is a man in a tuxedo, and might be the head waiter.  It happens that
Western culture lacks recognizable cues that signify "man about to
be married"; that symbol just isn't representable the way "bride"
is.  Show any European or American (and probably Japanese) kid a
picture of a bride, and they'll say it's a bride, no matter what
other pictures you showed them before.  Show them a picture of a
groom—without any other context—and they might recognize it as a
waiter, or a maitre-d', or a person going to a fancy ball...  Much
the same argument regarding kid-recognizability can be made with
respect to Santa Claus/Father Christmas vs "Mother Christmas."  Any
time of the year, in any context, any kid will recognize Santa
Claus.  Mrs. Claus, on the other hand, shown out of context, would
be a real stumper.  I could easily see someone guessing she was
Mother Goose.  (Actually the whole "show it to some random kids"
test is not a bad idea for judging the sensibility of proposed
emoji.)

There are a bunch of cultural visual cues we recognize for a variety
of things, some of which aren't yet emoji (or emoji sequences) but
likely could/should be.  Certainly something like BURGLAR or THIEF
would be recognizable by the telltale mask and hat (person + mask
might be a good sequence—if we had a "mask" emoji, which we don't
and likely should); CONVICT by the striped clothes.  (Not all the
things we like to talk about are things we approve of.)

OK, enough rambling for now.  Back to your usual discussions.

~mark
  



Re: PETSCII mapping?

2017-04-06 Thread Mark E. Shoulson

On 04/06/2017 08:07 AM, Rebecca T wrote:
Here’s a copy of the Teletext character set; it includes box-drawing 
characters
for all combinations of a 2×3 grid of cells. 2⁶ = 64 characters, so we 
might

need a new block.

[1]: http://www.galax.xyz/TELETEXT/CHARSET.HTM

My old TRS-80 also did "graphics" like this, with 64 characters covering 
2×3 cells. That was even how it worked when you were setting individual blocks.  The 
smallest "pixel" you could control in graphics was one of these ⅙ths of 
a character cell, and wouldn't you know it? As soon as you set one in a 
cell occupied by some other character, the other character would disappear.
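
For concreteness, a minimal Python sketch of that 2×3-cell scheme, assuming 
the TRS-80 convention of mapping the six cells to the low six bits of a 
character code starting at 0x80 (the bit order below is an illustrative 
assumption, not a verified hardware spec):

    # Map a 2x3 grid of on/off cells to a TRS-80-style graphics code.
    # Assumed bit order: bit 0 = top-left, bit 1 = top-right,
    # bit 2 = middle-left, bit 3 = middle-right,
    # bit 4 = bottom-left, bit 5 = bottom-right.
    def sextant_code(cells):
        """cells: 3 rows of 2 booleans each, top row first."""
        mask = 0
        for bit, on in enumerate(cell for row in cells for cell in row):
            if on:
                mask |= 1 << bit
        return 0x80 + mask  # 2**6 = 64 codes, 0x80 through 0xBF

    # All six cells on gives the last code in the range:
    print(hex(sextant_code([[True, True], [True, True], [True, True]])))  # 0xbf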


Not positive these count as plain text, but there's a decent argument 
for it.


~mark



Re: PETSCII mapping?

2017-04-06 Thread Mark E. Shoulson

On 04/05/2017 05:25 PM, Rebecca T wrote:


As time goes on, “not in widespread use” will become a flimsier and 
flimsier

argument against inclusion
Indeed.  This is the chicken-and-egg problem, and you are not the first 
to (rightly) point it out as a flimsy excuse.  Thanks for bringing it up 
again, though: people still seem to go back to it a lot.


~mark


Re: Unicode Emoji 5.0 characters now final

2017-03-28 Thread Mark E. Shoulson
Kind of have to agree with Doug here. Either support the mechanism or 
don't.  Saying "well, you CAN do this if you WANT to" always 
implies a "...but you probably shouldn't."  Why even bother making it a 
possibility?


On 03/28/2017 02:41 PM, Doug Ewell wrote:

"Even though it is possible to support the US states, or any subset of
them, implementations don’t have to." Well, of course they don't.
Implementations don't have to support the three British flags either if
they don't want to, or any national flags or other emoji, or any
particular character for that matter. The superfluous statement is
easily reduced to "Don't do this."


That's a pretty good re-statement.

~mark


Re: Encoding of old compatibility characters

2017-03-28 Thread Mark E. Shoulson

On 03/28/2017 09:09 AM, Asmus Freytag wrote:

On 3/28/2017 4:00 AM, Ian Clifton wrote:

I’ve used ⏨ a couple of times, without explanation, in my own
emails—without, as far as I’m aware, causing any misunderstanding.


Works especially well, whenever it renders as a box with 23E8 inscribed!

A./


I ⬚ Unicode.

~mark



Re: Encoding of old compatibility characters

2017-03-28 Thread Mark E. Shoulson
I don't think I want my text renderer to be *that* smart.  If I want ⏨, 
I'll put ⏨.  If I want a multiplication sign or something, I'll put 
that.  Without the multiplication sign, it's still quite understandable, 
more so than just "e".


It is valid for a text rendering engine to render "g" with one loop or 
two.  I don't think it's valid for it to render "g" as "xg" or "-g" or 
anything else.  The ⏨ character looks like it does.  You don't get to 
add multiplication signs to it because you THINK you know what I'm 
saying with it.  And using 20⏨ to mean "twenty base ten" sounds 
perfectly reasonable to me also.


~mark

On 03/28/2017 05:33 AM, Philippe Verdy wrote:
Ideally a smart text renderer could as well display that glyph with a 
leading multiplication sign (a mathematical middle dot) and implicitly 
convert the following digits (and sign) as real superscript/exponent 
(using contextual substitution/positioning like for Eastern 
Arabic/Urdu), without necessarily writing the 10 base with smaller 
digits.
Without it, people will want to use 20⏨ to mean it is the decimal 
number twenty and not hexadecimal number thirty two.


2017-03-28 11:18 GMT+02:00 Frédéric Grosshans:


Le 28/03/2017 à 02:22, Mark E. Shoulson a écrit :

Aw, but ⏨ is awesome!  It's much cooler-looking and more
visually understandable than "e" for exponent notation. In
some code I've been playing around with I support it as a
valid alternative to "e".


I Agree 1⏨3 times with you on this !

Frédéric






Re: Encoding of old compatibility characters

2017-03-27 Thread Mark E. Shoulson

On 03/27/2017 05:46 PM, Frédéric Grosshans wrote:
An example of a legacy character successfully  encoded recently is ⏨ 
U+23E8 DECIMAL EXPONENT SYMBOL, encoded in Unicode 5.2.
It came from the Soviet standard GOST 10859-64 and the German standard 
ALCOR. And was proposed by Leo Broukhis in this proposal 
http://www.unicode.org/L2/L2008/08030r-subscript10.pdf . It follows a 
discussion on this mailing list here 
http://www.unicode.org/mail-arch/unicode-ml/y2008-m01/0123.html, where 
Ken Whistler was already sceptical about the usefulness of this encoding. 
Aw, but ⏨ is awesome!  It's much cooler-looking and more visually 
understandable than "e" for exponent notation.  In some code I've been 
playing around with I support it as a valid alternative to "e".
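
Not that code itself, but a minimal sketch of what such support might look 
like in Python, simply treating U+23E8 as an exponent marker:

    def parse_number(s: str) -> float:
        """Parse a numeric literal, accepting U+23E8 as an exponent marker."""
        return float(s.replace('\u23E8', 'e'))

    assert parse_number('1\u23E83') == 1000.0     # 1⏨3 == 1e3
    assert parse_number('6.02\u23E823') == 6.02e23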


~mark


Re: "A Programmer's Introduction to Unicode"

2017-03-13 Thread Mark E. Shoulson
A word ending in A *or* AA preceding a word beginning in A *or* AA will 
all coalesce to a single AA in Sanskrit.  That's four possibilities, and 
that doesn't count a word ending in a consonant preceding a word 
beginning in AA, which would be written the same.  My memory is rusty, 
so I should actually be looking things up, but I think these are valid 
constructions:


न + अगच्छत्  →  नागच्छत्
न + आगच्छत्  → नागच्छत्

(and indeed, आगच्छत् is the upasarga आ plus अगच्छत्, so there too the A 
+ AA coalesced.)  I should probably find you examples for all the other 
possibilities.  Sanskrit external vowel sandhi is comparatively 
straightforward (compared to consonant sandhi), and it frequently loses 
information.  A *or* AA plus I is E; A *or* AA plus U is O (you need A + 
O to get AU).
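
A toy Python sketch of that many-to-one collapse, using just the rules 
stated here in rough transliteration (a for अ, aa for आ; a sketch, not a 
sandhi engine):

    # External vowel sandhi: (word-final, word-initial) -> coalesced vowel.
    SANDHI = {
        ('a', 'a'): 'aa', ('a', 'aa'): 'aa',
        ('aa', 'a'): 'aa', ('aa', 'aa'): 'aa',  # four inputs, one output
        ('a', 'i'): 'e', ('aa', 'i'): 'e',
        ('a', 'u'): 'o', ('aa', 'u'): 'o',
        ('a', 'o'): 'au',
    }

    def join(stem1, final, initial, stem2):
        """Join two words across an external vowel-sandhi boundary."""
        return stem1 + SANDHI[(final, initial)] + stem2

    # na + agacchat and na + aagacchat both surface as naagacchat:
    print(join('n', 'a', 'a', 'gacchat'))   # naagacchat
    print(join('n', 'a', 'aa', 'gacchat'))  # naagacchat

Recovering the original word boundary from the surface form is ambiguous by 
construction, which is exactly the difficulty with marking boundaries there.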


~mark


On 03/13/2017 06:26 PM, Manish Goregaokar wrote:

Do you have examples of AA being split that way (and further reading)?
I think I'm aware of what you're talking about, but would love to read
more about it.
-Manish


On Mon, Mar 13, 2017 at 2:47 PM, Richard Wordingham
 wrote:

On Mon, 13 Mar 2017 23:10:11 +0200
Khaled Hosny  wrote:


But there are many text operations that require access to Unicode code
points. Take for example text layout, as mapping characters to glyphs
and back has to operate on code points. The idea that you never need
to work with code points is too simplistic.

There are advantages to interpreting and operating on text as though it
were in form NFD.  However, there are still cases where one needs
fractions of a character, such as word boundaries in Sanskrit, though I
think the locations are liable to be specified in a language-specific
form.  U+093E DEVANAGARI VOWEL SIGN AA can have a word boundary in it
in at least 4 ways.

Richard.





Re: Soyombo empty letter frame

2017-01-04 Thread Mark E. Shoulson

On 01/04/2017 04:18 PM, eduardo marin wrote:


The Soyombo proposal is beautiful, but it is missing a very important 
character in my opinion: 
http://www.unicode.org/L2/L2015/15004-soyombo.pdf 



Encoding an empty letter frame will allow for its proper description 
in plain text (as it is clear in the proposal itself), it could be 
used as a stylized cursor in text processors and also we could make 
zwj sequences such that combining with consonants makes it only render 
the nucleus.




According to the proposal:

   In the proposed encoding a combination of frame and nucleus is
   considered an atomic letter. This approach enhances the
   conceptualization and identification of letters in the script; for
   instance, the letter ‘ka’ refers inherently to the fully-formed (X)
   and not to the nucleus (X).

In other words, they are explicitly rejecting the model considering the 
"frame" as an item in its own right.  I realize that you are not calling 
for redefining all the letters in terms of frame+nucleus, but encoding 
the frame seems to be something the proposers deliberately decided 
against doing.  In calling for encoding the frame (and why just one 
frame?  Wouldn't you want both the "closed" and "open" ones?), I think 
you really are going against what seems to be a design principle of the 
proposers.  Which of course you are completely entitled to do: just that 
you probably are better off talking it over with the proposers directly, 
to learn their thinking and so they can learn yours.


~mark


Re: Manatee emoji?

2016-11-23 Thread Mark E. Shoulson

On 11/23/2016 10:15 AM, James Kass wrote:

http://patch.com/florida/southtampa/petition-drive-aims-raise-manatee-awareness-adorable-way

If enough people sign the petition, will Unicode add a manatee emoji?
And, how about wolverines and lemmings?  Are any petitions underway
for them?  How many signatures on a petition would be needed before
Unicode would consider adding a non-existent character to the
repertoire?

Aren't many emoji "non-existent[sic]" characters prior to their adoption?

~mark


Re: The (Klingon) Empire Strikes Back

2016-11-15 Thread Mark E. Shoulson

On 11/15/2016 08:29 PM, Michael Everson wrote:

Mark,

No need to be defensive.

Tengwar and Cirth are in there because *I* put them there *long ago*, and the 
argument made was the nature of Tolkien’s work and study of it. That remains 
valid for keeping there, for one day the Tolkien Estate may revise its view on 
the matter.

Maybe a version of the Roadmap had Klingon in it. I don’t recall. I’d’ve been 
the one to have put it there. There are records. It doesn’t matter, though. 
When lack of use of Klingon made UTC remove it from consideration, it would 
have been removed.


The defensiveness was not that Tolkienian scholarship was deemed 
"worthy", but more that Klingon's apparently was not.  There was a 
Roadmap with pIqaD on it, and indeed you were the one who put it there.  
Nick Nicholas, in 
https://web.archive.org/web/20120307231609fw_/http://www.tlg.uci.edu/~opoudjis/Klingon/piqad.html 
credits you with a "delightful move of defiance" for replacing pIqaD 
with Sarati when it was removed.



The Roadmaps are really of no consequence. They’re useful, but they have no 
status and are subject to any kind of change before ballotting ends.


Getting pIqaD off the "not-roadmapped" list is more important, both 
symbolically and, as Ken Whistler says, practically.


~mark


Re: The (Klingon) Empire Strikes Back

2016-11-15 Thread Mark E. Shoulson

On 11/15/2016 08:26 PM, Shawn Steele wrote:

As I understand the issue, the problem is less of whether or not it is legal, 
than whether or not Paramount might sue.  Whether Unicode wins or not, it would 
still cost money to defend.


There ought to be laws against suits brought just to intimidate.  I 
think there are.  But yes, they aren't easy to prove or enforce.

I was wondering like Mark Davis mentioned if there were some sort of companies 
that sold bonds for this kind of thing (though that might be out of KLI's 
budget.)

Being afraid of a no answer probably isn't going to inspire confidence.  But maybe you 
could do a combination of the above.  Get someone to give you a legal opinion and then 
present that to Paramount with a "hey, they said this was probably legal anyway, but 
we wanted to ask nicely to be sure."


Not so much "afraid" of a no answer, but would rather not give the sense 
that we even thought that one was an option.  And for a company that 
makes its living from IP, they usually don't even have to bother 
listening to the whole question: "Say, can we use your—" "No!"  (This is 
probably also partly due to the way the laws are structured).


Your idea is a good one, though.  Get a legal opinion and maybe *inform* 
Paramount of it, and ask if they'd like to be involved in sanctioning 
it.  If spun right, it could even be sold as offering them the 
opportunity to get in on this, magnanimously offering them the privilege 
of giving their blessing...


~mark


Re: The (Klingon) Empire Strikes Back

2016-11-15 Thread Mark E. Shoulson

On 11/15/2016 08:15 PM, Ken Whistler wrote:


On 11/15/2016 10:21 AM, Asmus Freytag wrote:
Finally, I really can't understand the reluctance to place anything 
in the roadmap. An entry in the roadmap is not a commitment to 
anything - many scripts listed there face enormous obstacles before 
they could even reach the stage of a well-founded proposal. And, 
until such a proposal exists, there's no formal determination that a 
script has a truly separate identity and meets the bar for encoding.


The barrier to putting it in the roadmap is that pIQaD is 
currently listed on *not*-the-roadmap:


http://www.unicode.org/roadmaps/not-the-roadmap/

as Mark Shoulson has been repeatedly pointing out.

It would be inconsistent to add it to the SMP roadmap unless we delete 
it from not-the-roadmap.


And the reason that step has been stuck is because the UTC is still on 
record with a nonapproval notice for the Klingon script from 2001. 
(Based on Consensus 87-M3.)


http://www.unicode.org/alloc/nonapprovals.html

So figure it out, folks. First bring to the UTC a proposal to reverse 
87-M3. (Not to *encode* pIQaD yet -- just, on the basis of the new, 
more mature proposal, to *entertain* appropriate discussion about 
suitability for encoding, by rescinding the prior determination of 
nonapproval.) If *that* proposal passed, then the nonapproval notice 
would also be dropped. If the nonapproval notice is dropped, the 
not-the-roadmap entry would be dropped. And if that is dropped, then 
the Roadmap committee would dig around for a tentative allocation 
slot, pending the determination of outcome for any other issues. Which 
then could focus on the next obstacle, which is IP and the unresolved 
risk of litigation.


So now the problem *isn't* the IP.  All along I've been saying that 
UTC needs to decide that pIqaD *should* be encoded first, without 
consideration of the IP issues, and *then* we can worry about dealing 
with the IP.  And the answers I got were all about how we can't do 
*anything* until this IP stuff is dealt with.  And now Ken Whistler 
comes and says what I said in the first place!  At least someone was 
paying attention.


So... Now it's not enough to propose that pIqaD get encoded, like any 
other script would need.  First we need a proposal to *permit* a 
proposal for encoding?  Um.  OK.  What should such a thing look like?  
Perhaps something like the document I submitted, showing lots of usage 
and asking if it could be considered now?  I originally wasn't going to 
append the full proposal to the document, but it was suggested to me 
that it would be expected.


Should I split the document up into two pieces and re-submit the two 
halves, one as a proposal, and one for permission to consider the 
proposal?  Would that satisfy the requirements?


In any case, folks should stop with the "Unfair! Unfair!" stuff, and 
just set to work, step-by-step, to deal with the items noted above. "A 
Klingon is trained to use everything around them to their advantage." 
O.k., I've just provided something useful -- go for it. And you won't 
even need a cloaking device.


I've been working with whatever I could find all along.  The unfairness 
is a recognized fact, apparently, that can finally be faced and fixed, 
or so I hope.  I'm trying to get this done; best I can do is answer the 
questions put to me and look how other scripts in similar situations 
(like Tolkien scripts) have done what they did.


~mark


Re: The (Klingon) Empire Strikes Back

2016-11-15 Thread Mark E. Shoulson

On 11/15/2016 07:47 PM, Michael Everson wrote:

A body of a particular kind of scholarship surrounds Tolkien’s oeuvre. That’s 
probably the reason.

Michael Everson


Ah.  So it *is* a matter of "some literature is better than others."  I 
repeat here all the stuff I said in my response to Asmus' letter.  Since 
when did Unicode get in the business of deciding whose literature was 
important and whose wasn't?  And what do they base their decisions on?  
How much Klingon correspondence and conversation did the UTC sift 
through in order to reach its learned conclusion that Klingon-speakers 
don't do anything "scholarly"?


Do you guys even hear how ridiculously bigoted this all sounds?

~mark



Re: The (Klingon) Empire Strikes Back

2016-11-15 Thread Mark E. Shoulson

On 11/15/2016 07:31 PM, Mark Davis ☕️ wrote:
> However, it appears relatively settled that one cannot claim 
copyright in an alphabet...


We know that these parties tend to be litigious, so we have to be 
careful. "relatively settled" is not good enough.


We do not want to be the ones responsible (and liable) for making a 
determination as to whether that is settled. Nor do we want to pay the 
legal fees necessary to make a water-tight determination.


That is why if there is any question as to the IP issues, we leave it 
up to the proposers to get absolutely rock-solid clearance (eg from 
the Tolkien estate for Tengwar, or from Paramount for Klingon). The 
only other alternative I can think of is if the proposers provide 
indemnification for any legal costs that could obtain from a legal 
suit of us or our vendors.


Mark


How about legal counsel on the matter?

We're a little hesitant of asking Paramount/CBS about this, because of 
course, asking means that we think maybe they can say no, and we don't 
want to imply that.  So I'm thinking/hoping maybe we can do some 
research by a qualified legal expert (and not us armchair-lawyers, 
"yeah, it looks pretty settled to me...") to make a determination.


I'm trying to find out some more information about the KLI's pIqaD font, 
which it has been using and distributing for decades, during some of 
which time it was licensed by Paramount, and which apparently was *not* 
covered in the licensing agreements—precisely because typefaces are 
*not* copyrightable in the US!  (I thought they were, though... like I 
said, I'm trying to find out more about this.)  And all that time 
without objection from Paramount.  Not a slam-dunk argument, but it's 
something.


~mark


Re: The (Klingon) Empire Strikes Back

2016-11-15 Thread Mark E. Shoulson

On 11/15/2016 01:21 PM, Asmus Freytag wrote:

On 11/15/2016 9:22 AM, Peter Constable wrote:


Klingon _/should not/_ be encoded so long as there are open IP 
issues. For that reason, I think it would be premature to place it in 
the roadmap.



Peter,

I certainly sympathize with the fact that the Consortium wants to 
avoid being drawn into litigation, and that even litigation based on 
unsustained IP claims could be costly.


However, it appears relatively settled that one cannot claim copyright 
in an alphabet; one of the roles of the Unicode Consortium in this 
regard would be to reach a formal decision whether this is, in fact, 
an alphabet/script (and one that, based on the usual criteria of 
usage, is acceptable for encoding).


Ducking this particular determination serves no-one.


Thanks, Asmus.

I can understand the UTC's caution: you don't want to open yourself up 
to litigation—even if you eventually win.  But this also is likely not 
going to be the first time that there is this kind of legal hold on 
something encodable.  I note that Blissymbolics, according to Wikipedia, 
*does* have a copyright (as opposed to "maybe they might think they do") 
and yet it, too, is roadmapped. If I didn't know better (and I don't), I 
might think there was some sort of bias against Klingon.


Finally, I really can't understand the reluctance to place anything in 
the roadmap. An entry in the roadmap is not a commitment to anything - 
many scripts listed there face enormous obstacles before they could 
even reach the stage of a well-founded proposal. And, until such a 
proposal exists, there's no formal determination that a script has a 
truly separate identity and meets the bar for encoding.


NOT being called out for being unencodable would be a step up for 
Klingon, at least, let alone the roadmap.


PS: the "real" reason that Klingon was never put in the roadmap (as I 
recall discussions in the early years) was not so much the question 
whether IP issues existed/could be resolved, but the fear that adding 
such an "invented" and "frivolous" script would undermine the 
acceptance of Unicode. Given the way Unicode is invested in 
"frivolous" communication systems of very recent origin (emoji), that 
original argument surely doesn't apply :)


Yes, of course, though it's nice to have someone say it out loud. You do 
of course realize that that sentiment is *precisely* as offensive as 
"Unicode shouldn't encode African scripts, because only darkies use them 
anyway, and we wouldn't want to be seen as supporting *those* people."  
Bigotry is bigotry, even when applied to fans.  Essentially, the claim 
is "we shouldn't encode those, not because nobody uses them, but because 
nobody *important* uses them."


I was talking to someone once about Unicode, and explained that they 
were responsible for encoding emoji, etc.  And he scoffed at that, "why 
encode those?  who uses those anyway?"  I said, "Millions of people 
around the world use them every day in tweets and instant messages..." 
"Yeah, but I mean, aside from that!"  The question is, who out there who 
is *important* is using them for *important* things.  And if the UTC has 
to get in the business of judging what qualifies as "important" 
communication, you're going to need a lot more members, just to go 
through everything being printed. (Why encode chess pieces?  Only chess 
nerds use them, and I don't care about chess.  Go piece signs?  Nobody 
*I* talk to uses those.  And don't even get me started on pictures of 
baseballs.  And only goyim would need a picture of a breaded shrimp...)


It's refreshing to hear it finally admitted in full.  I always felt that 
if people are going to act unfairly, they should at least say "yes, 
we're acting unfairly, because you don't deserve fairness." Then they 
can explain why fairness is undeserved.


~mark


Re: The (Klingon) Empire Strikes Back

2016-11-15 Thread Mark E. Shoulson

On 11/15/2016 12:22 PM, Peter Constable wrote:


Klingon _/should not/_ be encoded so long as there are open IP issues. 
For that reason, I think it would be premature to place it in the roadmap.


Then why is tengwar there, and Klingon proclaimed "unsuitable" for 
encoding?  Everyone's telling me the situation is the same with tengwar, 
and yet it isn't.  What is it about Tolkien scripts that makes them 
suitable and pIqaD not?  Artistic interest doesn't count.


I'm not trying to get tengwar/cirth *demoted*, but I would like someone 
to explain to me why some fandoms/scripts seem to be better than others.



~mark



Re: The (Klingon) Empire Strikes Back

2016-11-13 Thread Mark E. Shoulson

On 11/10/2016 02:34 PM, Mark Davis ☕️ wrote:

The committee doesn't "tentatively approve, pending X".

But the good news is that I think it was the sense of the committee 
that the evidence of use for Klingon is now sufficient, and the rest 
of the proposal was in good shape (other than the lack of a date), so 
really only the IP stands in the way.


Fair enough.  There have, I think, been other cases of this sort of 
informal "tentative approval", usually involving someone from UTC 
telling the proposer, "your proposal is okay, but you probably need to 
change this..."  And that's about the best I could hope for at this 
point anyway.  So it sounds like (correct me if I'm wrong) there is at 
least unofficial recognition that pIqaD *should* be encoded, and that 
it's mainly an IP problem now (like with tengwar), and possibly some 
minor issues that maybe hadn't been addressed properly in the proposal.


Can we get pIqaD removed from 
http://www.unicode.org/roadmaps/not-the-roadmap/ then?  And (dare I ask) 
perhaps enshrined someplace in http://www.unicode.org/roadmaps/smp/ 
pending further progress with Paramount?


I would suggest that the Klingon community work towards getting 
Paramount to engage with us, so that any IP issues could be settled.


I'll see what we can come up with; have to start somewhere.  There is a 
VERY good argument to be made that Paramount doesn't actually have the 
right to stop the encoding, as you can't copyright an alphabet (as we 
have seen), and they don't have a current copyright to "Klingon" in this 
domain, etc., and it may eventually come down to these arguments.  
However, I recognize that having a good argument on your side, and 
indeed even having the law on your side, does not guarantee smooth 
sailing when the other guys have a huge well-funded legal department on 
their side, and thus I understand UTC's reluctance to move forward 
without better legal direction. But at least we can say we've made 
progress, can't we?


~mark



Mark

On Thu, Nov 10, 2016 at 10:33 AM, Shawn Steele wrote:


More generally, does that mean that alphabets with perceived
owners will only be considered for encoding with permission from
those owner(s)?  What if the ownership is ambiguous or unclear?

Getting permission may be a lot of work, or cost money, in some
cases.  Will applications be considered pending permission,
perhaps being provisionally approved until such permission is
received?

Is there specific language that Unicode would require from owners
to be comfortable in these cases?  It makes little sense for a
submitter to go through a complex exercise to request permission
if Unicode is not comfortable with the wording of the permission
that is garnered.  Are there other such agreements that could
perhaps be used as templates?

Historically, the message pIqaD supporters have heard from Unicode
has been that pIqaD is a toy script that does not have enough
use.  The new proposal attempts to respond to those concerns,
particularly since there is more interest in the script now.  Now,
additional (valid) concerns are being raised.

In Mark’s case it seems like it would be nice if Unicode could
consider the rest of the proposal and either tentatively approve
it pending Paramount’s approval, or to provide feedback as to
other defects in the proposal that would need to be addressed for
consideration.  Meanwhile Mark can figure out how to get
Paramount’s agreement.

-Shawn

*From:* Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of* Peter Constable
*Sent:* Wednesday, November 9, 2016 8:49 PM
*To:* Mark E. Shoulson <m...@kli.org>; David Faulks <davidj_fau...@yahoo.ca>
*Cc:* Unicode Mailing List <unicode@unicode.org>
*Subject:* RE: The (Klingon) Empire Strikes Back

*From:* Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of* Mark E. Shoulson
*Sent:* Friday, November 4, 2016 1:18 PM

>At any rate, this isn't Unicode's problem…

You saying that potential IP issues are not Unicode’s problem does
not in fact make it not a problem. A statement in writing from
authorized Paramount representatives stating it would not be a
problem for either Unicode, its members or implementers of Unicode
would make it not a problem for Unicode.

Peter






Re: The (Klingon) Empire Strikes Back

2016-11-13 Thread Mark E. Shoulson

On 11/09/2016 11:49 PM, Peter Constable wrote:


*From:*Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of 
*Mark E. Shoulson

*Sent:* Friday, November 4, 2016 1:18 PM


> At any rate, this isn't Unicode's problem…

You saying that potential IP issues are not Unicode’s problem does not 
in fact make it not a problem. A statement in writing from authorized 
Paramount representatives stating it would not be a problem for either 
Unicode, its members or implementers of Unicode would make it not a 
problem for Unicode.


Peter

That's a fair point; any problems arising from this *would* affect 
Unicode.  I guess what I was trying to say is that such an issue, while 
a problem once encoding proceeds, should not affect the determination of 
whether or not the encoding is *warranted*.


~mark



Re: The (Klingon) Empire Strikes Back

2016-11-13 Thread Mark E. Shoulson

On 11/08/2016 06:58 AM, Julian Bradfield wrote:

On 2016-11-08, Mark E. Shoulson  wrote:

I've heard that there are similar questions regarding tengwar and cirth,
but it is notable that UTC *did* see fit to consider this question for
them and determine that they were worthy of encoding (they are on the
roadmap), even though they have not actually followed through on that
yet, perhaps because of these very IP concerns.  Notably, pIqaD is not

The Tolkien Estate considers that the tengwar constitute a work of
art, and it's not willing to see them in Unicode, because this would
hinder its ability to pursue people using tengwar for what it
considers inappropriate purposes. (I finally asked them a couple of
years ago for permission to encode, based on Michael Everson's draft
proposal from yonks ago, and that's the summary of their reply.)


I've said it before: if we could get pIqaD at least on the same footing 
as tengwar, that would be a step in the right direction. Saying they're 
in a similar fix is (currently) blatantly contradicted by the facts, and 
we might as well clear up whatever *else* it is that's holding pIqaD 
back, and then see about IP problems.


It sounds like some progress is being made on this front.

~mark


Re: The (Klingon) Empire Strikes Back

2016-11-07 Thread Mark E. Shoulson

Thanks, Asmus.

The document from the copyright office is pretty explicit and final, and 
it is pretty clear that you can't copyright an *alphabet*, that is 
*characters*.  You can copyright *glyphs* (a font), but that is another 
matter entirely.


I've heard that there are similar questions regarding tengwar and cirth, 
but it is notable that UTC *did* see fit to consider this question for 
them and determine that they were worthy of encoding (they are on the 
roadmap), even though they have not actually followed through on that 
yet, perhaps because of these very IP concerns.  Notably, pIqaD is not 
only not on  the roadmap, it is specifically listed on the "Not on the 
Roadmap" page as an example of something that was not deemed worthy of 
being on the roadmap. If it's an IP issue, then someone will have to 
explain to me why it applies so asymmetrically to Tolkien and Klingon 
(and Blissymbolics, for that matter).  And yes, these are not the only 
writing systems with these issues and will not be the last.  One way or 
another, the question will have to be faced and dealt with; ignoring it 
won't help.


~mark

On 11/06/2016 09:16 PM, Asmus Freytag wrote:

On 11/6/2016 2:22 PM, David Starner wrote:



On Fri, Nov 4, 2016 at 10:42 AM David Faulks wrote:


There is another issue of course, which I think could be a huge
obstacle: the Trademark/Copyright issue. Paramount claims
copyright over the entire Klingon language (presumably including
the script). The issue has recently gone to court. Encoding
criteria for symbols (and this likely extends to letters) is
against encoding them without the permission of the
Copyright/Trademark holder.


The US copyright office will not register letters for copyright: cf. 
http://web.archive.org/web/20160304062736/http://www.ipmall.info/hosted_resources/CopyrightAppeals/2004/Mark%20Hendricksen.pdf

So the copyright issue is not relevant here.


On the face of it, the cited statement seems to very broadly reject 
the copyrightability of alphabets and writing systems, tracing that 
decision back to statements of intent around the copyright legislation.


Given that, I'd tend to concur with Doug that UTC should feel free to 
discuss this on the merit, but that in the case of a positive outcome 
the Consortium would of course have counsel review this issue. Given 
that this won't be the only writing system for which the original 
invention post-dates modern IP laws, it would probably be good to have 
some clarity here.


A./





Re: The (Klingon) Empire Strikes Back

2016-11-06 Thread Mark E. Shoulson

On 11/04/2016 05:02 PM, Doug Ewell wrote:

Mark E. Shoulson wrote:


At any rate, this isn't Unicode's problem. Unicode would not be
creating anything in Klingon anyway!

Well, to be fair, I thought IPR was the primary reason Unicode had never
encoded the Apple logo either. I doubt that whether Unicode intended to
use such a character themselves was a factor. (Of course, users who
really wanted that character encoded are probably using 🍎 or 🍏
now.)
  
--

Doug Ewell | Thornton, CO, US | ewellic.org


The Apple logo is just that: a logo.  Unicode is/used to be explicitly 
NOT in the business of encoding logos, and only peripherally in the 
business of encoding cute Wingdings and icons.  pIqaD is an *alphabet* 
for writing a *language*; that's a whole different situation, and one 
that is squarely in what Unicode is all about doing.  "Should" the Apple 
logo have been encoded?  Possibly, though there are a lot of reasons not 
to which do not depend specifically on IP (we'd have to encode all the 
other emblems of all the other computer companies also... not to mention 
gasoline companies, cereal companies...) Should pIqaD be encoded?  It is 
my claim that it should, and that reasons not to are (mainly) limited to 
IP considerations.  In which case, IP considerations need to be 
addressed, yes, but they should not pre-determine the decision of 
whether or not it's worthy of inclusion.



~mark



Re: The (Klingon) Empire Strikes Back

2016-11-04 Thread Mark E. Shoulson
I know of the Axanar flap.  I'm not sure that Paramount was *seriously* 
saying "we own everything anyone ever says or will say in this 
language."  What they said was more "you used Klingon in your story, and 
Klingon is our language, therefore your story is infringing on our 
stuff."  So while it's true they *might* make that claim, I don't know 
that they *have*.


All of which is neither here nor there; it's something they could say.  
The LCS wrote an amicus brief, which is linked to from my document, by 
the way, arguing that very point, which the judge dismissed without 
prejudice on the grounds that he wasn't going to be addressing that 
issue (so he may not have seen it as critical to Paramount's case 
either).  A claim as bald and universal as the way I worded it above is 
practically indefensible logically, intuitively, and legally (Sun 
invented Java, but can they claim every Java program???)  At any rate, 
this isn't Unicode's problem.  Unicode would not be creating anything in 
Klingon anyway!  Just encoding letters used to write it.  Now, those 
letter-shapes might (for all I know) have legal strings attached, and 
what's more, the word "Klingon" is definitely owned and claimed by 
Paramount, which might cause problems with naming the block.


Really, though, that isn't what UTC should be deciding.  The question is 
whether or not to encode pIqaD: is it a writing system that people use 
or have used in the past to communicate (that's the main criterion, 
right?  Unicode is supposed to contain "all" alphabets).  If there are 
additional issues outside of UTC's purview that raise difficulties, 
those will have to be heard and addressed. But decide to act first, 
*then* see what obstacles need to be overcome.


~mark

On 11/04/2016 01:41 PM, David Faulks wrote:

On Thu, 11/3/16, Mark Shoulson  wrote:
Subject: The (Klingon) Empire Strikes Back
   

At the time of writing this letter it has not yet hit the UTC
Document Register, but I have recently submitted a document
revisiting the ever-popular issue of the encoding of Klingon
"pIqaD".  The reason always given why it could not be
encoded was that it did not enjoy enough usage, and so I've
collected a bunch of examples to demonstrate that this is not
true (scans and also web pages, etc.)  So the issue comes
back up, and time to talk about it again.

There is another issue of course, which I think could be a huge obstacle: the 
Trademark/Copyright issue. Paramount claims copyright over the entire Klingon 
language (presumably including the script). The issue has recently gone to 
court. Encoding criteria for symbols (and this likely extends to letters) is 
against encoding them without the permission of the Copyright/Trademark holder.

Is Paramount endorsing your proposal?




~mark

David Faulks
  
  

  
  





Re: The (Klingon) Empire Strikes Back

2016-11-03 Thread Mark E. Shoulson
Yes, it isn't unique to Klingon, I never said it was, and who cares that 
Latin also has it??  We weren't talking about Latin!


~mark

On 11/03/2016 08:06 PM, Philippe Verdy wrote:
2016-11-04 0:43 GMT+01:00 Mark Shoulson:


3. For my part, I've invented a pair of ampersands for Klingon
(Klingon has two words for "and": one for joining verbs/sentences
and one for joining nouns (the former goes between its
"conjunctands", the latter after them)), from ligatures of the
letters in question.

That is not new to Klingon, and it exists also in Classical Latin :

- the coordinator "et" between words, for simple cases; this 
translates as "and" in English...
- the "-que" suffix at end of the second word which may be far after 
the first one (which could be in another prior sentence, or implied by 
the context and not given explicitly); this translates as the adverb 
"also" in English... I've seen that suffix abbreviated as a "q" with a 
tilde above, or a slanted tilde mark attached above, or an horizontal 
tilde crossing the leg of the q below... Sorry I can't remember the 
name of these abbreviation marks.







Re: Wogb3 j3k3: Pre-Unicode substitutions for extended characters live on

2016-10-16 Thread Mark E. Shoulson
I have the rare good fortune to see John Cowan on a near-daily basis 
(except this month, with all the Jewish Holidays); I'll forward your 
message on.


~mark

On 10/16/2016 01:08 PM, Marcel Schneider wrote:

On 11 Oct 2016 09:48:00 -0700, Doug Ewell wrote:
[…]

You mentioned mobile devices, but also mentioned ISO/IEC 9995 and 14755,
which seem to deal primarily with computer keyboards.

On Windows, John Cowan's Moby Latin keyboard [1] allows the input of
more than 800 non-ASCII characters, including the two mentioned in your
post (ɛ and ɔ):

AltGr+p, o 0254 LATIN SMALL LETTER OPEN O
AltGr+p, e 025B LATIN SMALL LETTER OPEN E

Moby Latin is a strict superset of the standard U.S. English keyboard;
that is, none of the standard keystrokes were redefined, unlike
keyboards such as United States-International which tend to redefine
keys for ASCII characters that look like diacritical marks, making
adoption difficult. There are also versions of Moby based on the
standard U.K. keyboard.

[1]
http://recycledknowledge.blogspot.com/2013/09/us-moby-latin-keyboard-for-windows.html


U.S. Moby Latin and Whacking Latin keyboard driver packages
are not available any more. What happened?
Neither can John Cowanʼs home page be accessed:
http://home.ccil.org/%7Ecowan/XML/
Though the Chester County Interlink host is not down.
Still the ReadMe can be accessed, from another domain:
http://www.smo.uhi.ac.uk/gaidhlig/sracan/Whacking/MobyLatinKeyboard.html





Re: Why incomplete subscript/superscript alphabet ?

2016-10-10 Thread Mark E. Shoulson

On 10/10/2016 05:36 PM, Julian Bradfield wrote:

On 2016-10-10, Michael Everson  wrote:


Apparently it’s used to good effect in mathematics, though a great
deal of TeX material appears printed and has an obvious “TeX” feel

It's for printing, so of course it appears printed. The obvious TeX
feel is the result of using the default style, which arises from
Knuth's personal taste in mathematical typesetting, with Lamport's
(abominable) taste in structural layout on top. There are tens of
thousands of journals and books produced with LaTeX, in hundreds or
thousands of styles.

Among publishers you may have heard of, Addison-Wesley, CUP, Elsevier,
John Benjamins, OUP, Princeton UP, Wiley all use LaTeX for a
significant proportion of their output. They're all professionals.

To me, the main "TeX" feel that TeX-printed things tend to share is 
Knuth's distinctive Computer Modern font, not necessarily structure.  
You can typeset amazing things in TeX (viz. the Comparing Torah that 
Michael published for me); limitations there are mostly of your own making.


(I haven't really been able to keep up with this thread in general, though.)

~mark


Re: Moving The Hebrew Extended Block Into The SMP

2016-05-10 Thread Mark E. Shoulson
Oh yeah.  I also wonder a bit about things like the "half-letters" that 
were used sometimes in early Hebrew printing to fill out space left at 
the end of a line.  They would often write part of the next word, the 
first few letters, but maybe the last letter was missing part of it, or 
just random semi-characters (things like a SHIN with only two heads 
show up a lot, or even complete SHINs). http://xkcd.com/1676/ got me 
thinking of it.  They're probably not encodable... or are they?


I'll have to find some example scans.  If it's as common as I say, that 
should be easy... unless I'm wrong about that, which I guess would make 
the whole question easier too.


~mark


Re: The Hebrew Extended (Proposed) Block

2016-05-10 Thread Mark E. Shoulson
Sounds like a plan; most additional Hebrew characters can probably 
safely live in the SMP, as they are not all that common (except, of 
course, TETRAGRAMMATON, which I'll be writing another proposal about).


What Samaritan vowel and accent points did we miss when we did Samaritan 
the first time around?  We tried to be pretty comprehensive with it, 
including contact with the user community and inspecting books and MSS.


Somewhere I have a list of signs I started making by reading an entry in 
an encyclopedia (Encyclopedia Judaica?) s.v. "Masorah". Ah, found it.  
Various lines, strokes, dots, colons, pairs of dots in assorted 
configurations around letters (Palestinian and Babylonian vowel points, 
etc)...  A bunch of combining letters (COMBINING SAMEKH ABOVE, etc), 
some not exactly normal (SLANTED NUN ABOVE)... I think I had about 
sixty.  But it isn't particularly well-organized or researched.


There is also the "Expanded" Tiberian cantillation system I have seen 
mentioned (in Yeivin's book on Masorah for example, in the part on 
accents, para. #220).  It seems to distinguish things like different 
flavors of MUNAH; I have never really found much about it, so I don't 
know if it needs special graphemes.  The only examples in the Yeivin 
book that I see appear to use existing symbols in combinations (e.g. 
MUNAH plus a MERKHA KEFULA for a "mekarbel").


What other Hebrew characters have you got in mind?  Could be 
interesting.  Are you considering symbols for PETUHA and SETUMA 
pericopes in your "typesetting" section?  Are those fit to be encoded?  
I think they've been mentioned before, but it's hard to show that they 
are anything other than specialized uses of PEH and SAMEKH (unless we're 
talking about using them as formatters, and then they're pretty 
definitely out of scope).


~mark

On 05/10/2016 07:55 PM, Robert Wheelock wrote:

Hello again, y’all!

¡BAD NEWS! (CRUCIALLY IMPORTANT):  The Unicode Consortium has assigned 
OTHER characters into the U+0860-U+08FF area in the BMP of 
Unicode—Malayalam extended additional characters for Garshuni, and 
more additional Arabic characters.


We’ll need to find a DIFFERENT subblock to plant down our Hebrew 
extended characters...  either somewhere ELSE within the BMP, 
_or_ somewhere within either SMP areas 1 or 2.
It’ll be the same arrangement originally planned for the U+0860 
area—but relocated and expanded upon!


·Additional characters for correct typesetting of Hebrew
·Hebrew Palestinian vowel and pronunciation points
·The small superscript signs /śin/ and /shin/ for the letter /shin/
·Hebrew Palestinian cantillation
·Hebrew Babylonian vowel and pronunciation points
·Hebrew Babylonian cantillation
·Hebrew Samaritan vowel and pronunciation points
·Additional Hebrew characters for other Jewish languages
A new TXT listing of this subblock (with the new CORRECT location) 
will be forthcoming. STAY TUNED!







Re: precomposed polytonic Greek characters with macrons and other diacritics

2016-02-08 Thread Mark E. Shoulson

On 02/08/2016 01:47 PM, James Tauber wrote:


I'd be interested if others have tackled similar issues outside of Greek.

James


Keep in mind that in pointed Hebrew (or Arabic (or for that matter 
Devanagari)), practically every letter is like this, since each vowel is 
a diacritical, from a typographical point of view.  Though perhaps not 
considered in the same way that Greek considers its accented letters.


~mark



Re: Proposal for German capital letter "ß"

2015-12-09 Thread Mark E. Shoulson

On 12/09/2015 06:49 PM, Hans Meiser wrote:

Yes, they do it wrong because (1) they don't know better and (2) they let their 
software convert lower case text into upper case (a feature nearly every 
typographic software provides).

Yet, if we let the majority of illiterate people decide what's right and what's 
wrong we could as easily decide to have 2 + 2 = 5.

Here's an official text of today's correct rules on how to write a capital 
"ß" (it's in German):

http://www.duden.de/sprachwissen/rechtschreibregeln/doppel-s-und-scharfes-s


I remember when we went through all this the first time around, encoding 
ẞ in the first place.  People were saying "But the Duden says no!!!"  
And someone then pointed out, "Please close your Duden and cast your 
gaze upon ITS FRONT COVER, where you will find written in inch-high 
capitals plain as day, "DER GROẞE DUDEN" 
(http://www.typografie.info/temp/GrosseDuden.jpg)  So in terms of 
prescription vs description, the Duden pretty much torpedoes itself.


~mark


Re: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)

2015-10-22 Thread Mark E. Shoulson
It's nice that you've written proposals.  I suppose the various groups 
will pick it up and get back to you as they usually do.  But if they say 
"no, you're out of scope" again, it probably means that you're out of 
scope, and submitting another proposal of the same thing will not make 
it any more in-scope.


I have no idea why deposition with the British Library is in any way 
significant or even relevant.  It's nice to mail documents to people who 
will save them, yes.


You ask about these same questions often.  Often enough that some have 
been banned as topics of conversation here.  You've been doing it for 
years.  "Can the scope of Unicode change?" you ask. At this point, I 
suggest that you act as if the answer is "No!" and move on, without 
trying to force Unicode to become a partner in your research.  Even if 
the answer is "Maybe" it's not the kind of thing you can be *sure* will 
happen.  You need to proceed in a way that doesn't depend on other 
things beyond your control.


You want to join Unicode as an official member and try to change its 
scope from the inside, where you can even vote?  Be my guest.


You can't proceed with your research without a multinational standards 
committee changing *its entire scope and outlook* just to accommodate 
you?  Then you're going about research wrong. "Unicode and the 
International Standard with which it is synchronized are the standards" 
you say?  Obviously not, since Unicode has said that it doesn't encode 
what you want.  So it is NOT the standard for the things you want to use 
it for.  It's the standard for other things.


Do doctors insist that the WHO completely change its focus so their 
research can be included?  Other researchers the world over are doing 
their thing without asking ISO, Unicode, ANSI, DIN, or for that matter the 
IAEA, to change to suit them.  I have not heard of other cases like this, 
which doesn't mean there aren't any, but it probably means there aren't 
many, and I haven't heard any standards organizations announcing changes 
based on requests like this, either.


This is not the standard you were looking for.  Find another or make 
your own (or both), like a responsible researcher and scientist.


~mark

On 10/22/2015 05:21 AM, William_J_G Overington wrote:

Mark E. Shoulson wrote:


Unicode isn't doing what you want?  Make your own standard.  Make it standard 
for *your* stuff.  Get people to like it and use it.

Unicode and the International Standard with which it is synchronized are the 
standards.

I submitted a rewritten document on Monday 19 October 2015.

The document is available on the web.

http://www.users.globalnet.co.uk/~ngo/a_preliminary_proposal_to_encode_two_base_characters.pdf

It is linked from the following web page.

http://www.users.globalnet.co.uk/~ngo/library.htm

The document has been deposited, as an email attachment, with the British 
Library for Legal Deposit and a receipt received.

Here is a link about Legal Deposit in the United Kingdom.

http://www.bl.uk/aboutus/legaldeposit/index.html

William Overington

22 October 2015







Re: The scope of Unicode (from Re: How can my research become implemented in a standardized manner?)

2015-10-21 Thread Mark E. Shoulson

On 10/16/2015 01:10 PM, William_J_G Overington wrote:


I have been considering how to make progress with trying for my 
research to become implemented in a standardized manner.


I have been informed that a group of people have examined the document 
that I submitted and determined that it is out of scope for UTC.


There are millions of people on this great globe doing all kinds of 
research into all kinds of things.  Most of them somehow manage to do so 
without requiring an international standards body to change its workings 
and basic outlook to accommodate them.  It staggers the imagination that 
your research simply cannot be done without the cooperation of Unicode, 
and moreover, that you have the nerve to ask for it to change its 
*entire scope* just so that your personal  project, stalled by your own 
hand, can move forward.


Learn how all those millions of people out there manage to do their work 
and further their research without calling on multinational bodies to 
bend to their whims.  It must be possible, everyone else seems to be 
able to do it.  The only thing stopping your research from progressing 
and standardizing is you.  Unicode isn't doing what you want?  Make your 
own standard.  Make it standard for *your* stuff.  Get people to like it 
and use it.  You cannot expect Unicode to change to be what you want any 
time in the foreseeable future; make do without it.


Please.  Grow up and take responsibility for your own research and stop 
trying to bend Unicode into what YOU think it should be, when the clear 
consensus is that it isn't.  The rest of us are tired of having to 
answer this question (or see it answered) over and over.


~mark



Re: a suggestion new emoji .

2015-08-19 Thread Mark E. Shoulson
And is there an emoji for GRAIN OF SALT?  (Actually, that could almost 
be useful... or even just a geometric CUBE...)


~mark

On 08/19/2015 01:10 PM, William_J_G Overington wrote:


Mark Davis wrote:

> As far as petitions go, we take them with a sizable grain of salt.

Who, exactly, precisely, is "we" please?

William Overington

19 August 2015






