Re: comma ellipses

2019-10-07 Thread Asmus Freytag (c) via Unicode
Now you are introducing research - that kills all the fun . . . (oops , , ,)

A./

On 10/6/2019 10:39 PM, Tex wrote:


Just for additional info on the subject:

https://www.theguardian.com/science/2019/oct/05/linguist-gretchen-mcculloch-interview-because-internet-book

“…I’ve been spending a fair bit of time recently with the comma 
ellipsis, which is three commas (,,,) instead of dot-dot-dot. I’ve 
been looking at it for over a year and I’m still figuring out what’s 
going on there. There seems to be something but possibly several 
somethings.


One use is by older people who, in some cases where they would use the 
classic ellipsis, use commas instead. It’s not quite clear if that’s a 
typo in some cases, but it seems to be more systematic than that. 
Maybe they’re preferring the comma because it’s a little bit easier to 
see if you’re on the older side, and your vision is not what it once 
was. Or maybe they just see the two as equivalent. It then seems to 
have jumped the shark into parody form. There’s a Facebook group in 
which younger people pretend to be baby boomers, and one of the 
features people use there is this comma ellipsis. And then in some 
circles there also seems to be a use of comma ellipses that is very, 
very heavily ironic. But what exactly the nature is of that heavy 
irony is still something that I’m working on figuring out….”


*From:*Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of 
*Asmus Freytag via Unicode

*Sent:* Sunday, October 6, 2019 10:21 PM
*To:* unicode@unicode.org
*Subject:* Re: comma ellipses

On 10/6/2019 8:21 PM, Garth Wallace via Unicode wrote:

It’s deliberately incorrect for humorous effect. It gets used, but
making it “official” would almost defeat the purpose.

Well then it should encode a "typographically incorrect" comma ellipsis :)

A./

On Sun, Oct 6, 2019 at 5:02 PM Asmus Freytag via Unicode
<unicode@unicode.org> wrote:

On 10/6/2019 4:05 PM, Tex via Unicode wrote:

Now that comma ellipses (,,,) are a thing (at least on
social media) do we need a character proposal?

Asking for a friend,,, J

tex

I thought the main reason we ended up with the period (dot)
one is because it was originally needed for CJK-style fixed
grid layout purposes. But I could be wrong.

What's the current status of the 3-dot ellipsis? Does it get
used? Do we have autocorrect for it? If so, that would argue
that implementers have settled and any derivative usage
(comma) should be kept compatible.

A./





Re: Alternative encodings for Malayalam “nta”

2019-10-06 Thread Asmus Freytag (c) via Unicode

On 10/6/2019 11:57 AM, 梁海 Liang Hai wrote:

Folks,

(Microsoft Peter and Andrew, search for “Windows” in the document.)

(Asmus, in the document there’s a section 5, /ICANN RZ-LGR 
situation/—let me know if there’s some news.)


The issue, as it affects domain names, has been brought to the authors 
of the Malayalam Root Zone LGR proposal, the Neo-Brahmi Generation 
Panel; however, there is no new status to report at this time. I would 
appreciate it if you could keep me updated on any details of the UTC 
decision (particularly those that do not make the rather terse UTC minutes).


A./




This is a pretty straightforward document about the notoriously 
problematic encoding of Malayalam /nta/. I always wanted to properly document this, so finally here it is:


L2/19-345

*Alternative encodings for Malayalam "nta"*
Liang Hai
2019-10-06


Unfortunately, as <chillu n, virama, rra> has already become the de facto 
standard encoding, now we have to recognize it in the Core Spec. It’s 
a bit like another Tamil /srī/ situation.


An excerpt of the proposal:

Document the following widely used encoding in
the Core Specification as an alternative representation for
Malayalam [glyph] (<chillu n, virama, rra>) that is a
special case and does not suggest any productive rule in the
encoding model:




Best,
梁海 Liang Hai
https://lianghai.github.io





Re: Alternative encodings for Malayalam “nta”

2019-10-06 Thread Asmus Freytag (c) via Unicode

Have you submitted that response as a UTC document?
A./

On 10/6/2019 2:08 PM, Cibu wrote:
Thanks for addressing this. Here is my response: 
https://docs.google.com/document/d/1K6L82VRmCGc9Fb4AOitNk4MT7Nu4V8aKUJo_1mW5X1o/


In summary, my take is:

The sequence <chillu n, virama, rra> for ൻ്റ should not be 
legitimized as an alternate encoding, but should be recognized 
as a prevailing non-standard legacy encoding.



On Sun, Oct 6, 2019 at 7:57 PM 梁海 Liang Hai wrote:


Folks,

(Microsoft Peter and Andrew, search for “Windows” in the document.)

(Asmus, in the document there’s a section 5, /ICANN RZ-LGR
situation/—let me know if there’s some news.)

This is a pretty straightforward document about the notoriously
problematic encoding of Malayalam /nta/. I always wanted to
properly document this, so finally here it is:

L2/19-345

*Alternative encodings for Malayalam "nta"*
Liang Hai
2019-10-06


Unfortunately, as <chillu n, virama, rra> has already become the de
facto standard encoding, now we have to recognize it in the Core
Spec. It’s a bit like another Tamil /srī/ situation.

An excerpt of the proposal:

Document the following widely used encoding in
the Core Specification as an alternative representation for
Malayalam [glyph] (<chillu n, virama, rra>) that
is a special case and does not suggest any productive rule in
the encoding model:




Best,
梁海 Liang Hai
https://lianghai.github.io





Re: What is the time frame for USE shapers to provide support for CV+C ?

2019-08-07 Thread Asmus Freytag (c) via Unicode

On 8/7/2019 5:08 PM, Andrew Glass wrote:


Shaping domain names is a new requirement. It would be good to 
understand the specific cases that are falling in the gap here.


Domain names are simply strings, but the protocol enforces normalization 
to NFC. In some situations, it might be possible for a browser, for 
example, to have access to the user-provided string, but I can see any 
number of situations where the actual string (as stored in the DNS) 
would need to be displayed.
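A small illustration of the point in Python (a sketch only; the label is invented): once the protocol has normalized a label to NFC, the user's original code point sequence is gone, so whatever renders the stored string must cope with the normalized form.

    import unicodedata

    typed  = "e\u0301xample"                      # 'e' + U+0301, as a user might type it
    stored = unicodedata.normalize("NFC", typed)  # what the DNS would hold: precomposed é

    print(typed == stored)   # False: the original code point sequence is unrecoverable
    print(stored)            # éxample -- a renderer only ever sees this form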


For the scenario, it does not matter whether it's NFC or NFD, what 
matters is that some particular un-normalized state would be lost; and 
therefore it would be bad if the result is that the string can no longer 
be rendered correctly.


This matters in particular because the strings in question would be 
identifiers, where accurate recognition is paramount.

A./

*From:*Unicode  *On Behalf Of *Asmus 
Freytag via Unicode

*Sent:* 07 August 2019 14:19
*To:* unicode@unicode.org
*Subject:* Re: What is the time frame for USE shapers to provide 
support for CV+C ?


What about text that must exist normalized for other purposes?

Domain names must be normalized to NFC, for example. Will such strings 
display correctly if passed to USE?


A./

On 8/7/2019 1:39 PM, Andrew Glass via Unicode wrote:

That's correct, the Microsoft implementation of USE spec does not normalize 
as part of the shaping process.

Why? Because the ccc system for non-Latin scripts is not a good mechanism 
for handling the complex requirements of these writing systems, and the effects of 
ccc-based normalization can disrupt authors' intent. Unfortunately, because we 
cannot fix ccc values, shaping engines at Microsoft have ignored them. 
Therefore, the recommendation for passing text to USE is to not normalize.

By the way, at the current time, I do not have a final consensus from Tai 
Tham experts and community on the changes required to support Tai Tham in USE. 
Therefore, I've not been able to make the changes proposed in this thread.

Cheers,

Andrew

-----Original Message-----

From: Richard Wordingham
Sent: 07 August 2019 13:29
To: Richard Wordingham via Unicode
Cc: Andrew Glass
Subject: Re: What is the time frame for USE shapers to provide support for CV+C ?

On Tue, 14 May 2019 03:08:04 +0100, Richard Wordingham via Unicode wrote:

On Tue, 14 May 2019 00:58:07 +0000, Andrew Glass via Unicode wrote:

Here is the essence of the initial changes needed to support CV+C.
Open to feedback.

   *   Create new SAKOT class: SAKOT (Sk), based on UISC = Invisible_Stacker
   *   Reduced HALANT class: now only HALANT (H), based on UISC = Virama
   *   Updated Standard cluster model:

[< R | CS >] < B | GB > [VS] (CMAbv)* (CMBlw)* (< < H | Sk > B | SUB
[VS] (CMAbv)* (CMBlw)*)* [MPre] [MAbv] [MBlw] [MPst] (VPre)*
(VAbv)* (VBlw)* (VPst)* (VMPre)* (VMAbv)* (VMBlw)* (VMPst)* (Sk
B)* (FAbv)* (FBlw)* (FPst)* [FM]

This next question does not, I believe, affect HarfBuzz.  Will NFC
code render as well as unnormalised code?  In the first example above,
 normalises to , which does not match any portion of the regular
expression.

Could someone answer this question, please?  The USE documentation
("CGJ handling will need to be updated if USE is modified to support
normalization") still implies that the USE does not respect canonical
equivalence.

Richard.
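Richard's question can be illustrated with any pair of combining marks whose canonical combining classes are out of order; a sketch in Python (the Thai pair here is a stand-in, not his original example):

    import unicodedata

    # KO KAI + MAI EK (U+0E48, ccc=107) + SARA U (U+0E38, ccc=103), in that order:
    s = "\u0E01\u0E48\u0E38"
    nfc = unicodedata.normalize("NFC", s)
    print([f"U+{ord(c):04X}" for c in nfc])
    # ['U+0E01', 'U+0E38', 'U+0E48'] -- canonical reordering swapped the two marks,
    # so a shaper grammar written against the unnormalized order no longer matches.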





Re: Removing accents and diacritics from a word

2019-07-17 Thread Asmus Freytag (c) via Unicode

On 7/17/2019 11:25 AM, Sławomir Osipiuk wrote:


“Transliteration”?

Maybe more generic than what you’re looking for. Used for the process 
of producing the “machine readable zone” on passports:


https://www.icao.int/publications/Documents/9303_p3_cons_en.pdf (see 
section 6, page 30)


“Accent folding” or “diacritic folding” is used in some places. String 
folding is “A string transform F, with the property that repeated 
applications of the same function F produce the same output: F(F(S)) = 
F(S) for all input strings S”. Accent folding is a special case of that.


https://unicode.org/reports/tr23/#StringFunctionClassificationDefinitions

https://alistapart.com/article/accent-folding-for-auto-complete/
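A minimal sketch of such a fold in Python, assuming that stripping combining marks after NFD decomposition is the desired behavior (it does not cover one-to-two mappings like Å to AA):

    import unicodedata

    def fold_diacritics(s: str) -> str:
        # Decompose canonically, drop combining marks (category Mn), recompose.
        decomposed = unicodedata.normalize("NFD", s)
        stripped = "".join(ch for ch in decomposed
                           if unicodedata.category(ch) != "Mn")
        return unicodedata.normalize("NFC", stripped)

    print(fold_diacritics("São Tomé"))   # Sao Tome
    # Idempotent, as a folding must be: F(F(S)) == F(S)
    assert fold_diacritics(fold_diacritics("São Tomé")) == fold_diacritics("São Tomé")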

Diacritic folding. Thanks. Just didn't think of the operation as folding 
the way it came up, but that's what it is.


A./


*From:*Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of 
*Asmus Freytag via Unicode

*Sent:* Wednesday, July 17, 2019 13:38
*To:* Unicode Mailing List
*Subject:* Removing accents and diacritics from a word

A question has come up in another context:

Is there any linguistic term for describing the process of removing 
accents and diacritics from a word to create its “base form”, e.g. São 
Tomé to Sao Tome?


The linguistic term "string normalization" seems less than ideal 
in a computing context.


Any ideas?

A./







Re: Removing accents and diacritics from a word

2019-07-17 Thread Asmus Freytag (c) via Unicode

On 7/17/2019 11:37 AM, Tex wrote:


Asmus, are you including the case where an accented character maps to 
two unaccented characters?


e.g. Å to AA or Ä to AE

Is that covered by the same term? It's not a simple 
"typewriter/telegraph" fallback.





*From:*Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of 
*Asmus Freytag (c) via Unicode

*Sent:* Wednesday, July 17, 2019 11:07 AM
*To:* Norbert Lindenberg
*Cc:* Unicode Mailing List
*Subject:* Re: Removing accents and diacritics from a word

On 7/17/2019 11:02 AM, Norbert Lindenberg wrote:

“Misspelling”?

Not helpful. Anybody have a serious suggestion?

A./

On Jul 17, 2019, at 10:37, Asmus Freytag via Unicode <unicode@unicode.org> wrote:

A question has come up in another context:

Is there any linguistic term for describing the process of removing 
accents and diacritics from a word to create its “base form”, e.g. São Tomé to 
Sao Tome?

The linguistic term "string normalization" seems less than ideal 
in a computing context.

Any ideas?

A./





Re: Removing accents and diacritics from a word

2019-07-17 Thread Asmus Freytag (c) via Unicode

On 7/17/2019 11:02 AM, Norbert Lindenberg wrote:

“Misspelling”?


Not helpful. Anybody have a serious suggestion?

A./





On Jul 17, 2019, at 10:37, Asmus Freytag via Unicode  
wrote:

A question has come up in another context:

Is there any linguistic term for describing the process of removing accents and 
diacritics from a word to create its “base form”, e.g. São Tomé to Sao Tome?

The linguistic term "string normalization" seems less than ideal in a 
computing context.

Any ideas?

A./








Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Asmus Freytag (c) via Unicode

On 2/9/2019 1:40 PM, Egmont Koblinger wrote:

On Sat, Feb 9, 2019 at 10:10 PM Asmus Freytag via Unicode
 wrote:


I hope though that all the scripts can be supported with more or less
compromises, e.g. like it would appear in a crossword. But maybe not.

See other messages: not.

For the crossword analogy, I can see why it's not good. But this
doesn't mean there aren't any other ideas we could experiment with.



"all...scripts" is the issue.  We know how to handle text for all 
scripts and what complexities one has to account for in order to do 
that. You can back off some corner cases or (slightly) degrade things, 
but even after you are done with that, there will be scripts where the 
"more or less compromises" forces by the design parameters you gave will 
mean an utterly unacceptable display.


That said, there are scripts that had "passable" typewriter 
implementations and it may be possible to tweak things to approach that 
level of support. I don't know for sure; it depends on the details for each 
script.





Or do you mean to say that because it can't be made perfect, there's
no point at all in partially improving? I don't think I agree with
that.



It's more a question of being upfront with your goal.

At this point I understand it as accepting some design parameters as 
fundamental and seeing whether there are some tweaks that allow more 
scripts to work with or to "survive" given the constraints.


That's not a totally useless effort, but it is a far cry from Unicode's 
universal support for ALL writing systems.


A./

PS: also we have been seriously hijacking a thread related to bidi









Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Asmus Freytag (c) via Unicode

On 2/9/2019 11:48 AM, Egmont Koblinger wrote:

Hi Asmus,


On quick reading this appears to be a strong argument why such emulators will
never be able to be used for certain scripts. Effectively, the model described 
works
well with any scripts where characters are laid out (or can be laid out) in 
fixed
width cells that are linearly adjacent.

I'm wondering if you happen to know:

Are there any (non-CJK) scripts for which a mechanical typewriter does
not exist due to the complexity of the script?


Egmont,

are you excluding CJK because of the difficulty handling a large
repertoire with mechanical means? However, see:

https://en.wikipedia.org/wiki/Chinese_typewriter




Are there any (non-CJK) scripts for which crossword puzzles don't exist?

For scripts where these do exist, is it perhaps an acceptable tradeoff
to keep their limitations in the terminal emulator world as well, to
combine the terminal emulator's power with these scripts?



I agree with you that crossword puzzles and scrabble have a similar
limitation to the design that you sketched for us. However, take a script
that is written in syllables (each composed of 1-5 characters, say).

In a "crossword" I could write this script so that each syllable occupies
a cell. It would be possible to read such a puzzle, but trying to use
such a draconian technique for running text would be painful, to say the
least. (We are not even talking about pretty, here.)

Here's an example for Hindi:
https://vargapaheli.blogspot.com/2017/
I don't read Hindi, but 5 vertical in the top puzzle, cell 2, looks like
it contains both a consonant and a vowel.

To force Hindi into crossword mode you need to segment the string into
syllables, each having a variable number of characters, and then assign
a single display position to each. Now some syllables are wider than
others, so you could use the single/double width paradigm. The result
may be somewhat legible for Devanagari, but even some of the closely
related scripts may not fit that well.


Now there are some scripts where the same syllable can be written in more
than one form, the forms differing by how the elements are fused (or
sometimes not fused) into a single shape. Sometimes these differences are
more "stylistic", more like an 'fi' ligature in English; sometimes they
really indicate different words, or one of the forms is simply not correct
(like trying to spell lam-alif in Arabic using two separate letters).

I'm sure there are scripts that work rather poorly (effectively not at
all) in crossword mode. The question then becomes one of goals.

Are you defining as your goal to have some kind of "line by line"
display that can survive any Unicode text thrown at it, or are you
trying to extend a given design with rather specific limitations, so
that it survives / can be used with just a few more scripts than
European + CJK?



Honestly, even with English, all I have to do is "cat some_text_file",
and chances are that a word is split in half at some random place
where it hits the right margin. Even with just English, a terminal
emulator isn't something that gives me a grammatically and
typographically super pleasing or correct environment. It gives me
something that I personally find grammatically and typographically
"good enough", and in the mean time a powerful tool to get my work
done.



The discrepancies would be more like throwing random blank spaces in the
middle of every word, writing letters out of order, or overprinting. So,
more fundamental, not just "not perfect".

To give you an idea, here is an Arabic crossword. It uses the isolated
shape of all letters and writes all words unconnected. That's two things
that may be acceptable for a puzzle, but not for text output.

http://www.everyday-arabic.com/2013/12/crossword1.html

(try typing 3 vertical as a word to see the difference - it's 4x U+062A)


Obviously the more complex the script, the more tradeoffs there will
be. I think it's a call each user has to make whether they prefer a
terminal emulator or a graphical app for a certain kind of task. And
if terminal emulators have a lower usage rate in these scripts, that's
not necessarily a problem. If we can improve by small incremental
changes, sure, let's do. If we'd need to heavily redesign plenty of
fundamentals in order to improve, it most likely won't happen.

You may begin to see the limitations, and that they may well prevent you
from reaching even your limited goal for speakers of at least three of
the top ten languages worldwide.

A./



Re: Encoding italic

2019-01-25 Thread Asmus Freytag (c) via Unicode

On 1/25/2019 3:49 PM, Andrew Cunningham wrote:
Assuming some mechanism for italics is added to Unicode, when 
converting between the new plain text and HTML there is insufficient 
information to correctly convert to HTML. Many elements may have 
italic styling, and there would be no meta information in Unicode to 
indicate the appropriate HTML element.




So, we would be creating an interoperability issue.

A./





On Friday, 25 January 2019, wjgo_10...@btinternet.com via Unicode wrote:


Asmus Freytag wrote;

Other schemes, like a VS per code point, also suffer from
being different in philosophy from "standard" rich text
approaches. Best would be as standard extension to all the
messaging systems (e.g. a common markdown language, supported
by UI).     A./


Yet that claim of what would be best would be stateful and
statefulness is the very thing that Unicode seeks to avoid.

Plain text is the basic system and a Variation Selector mechanism
after each character that is to become italicized is not stateful
and can be implemented using existing OpenType technology.
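What such a scheme would look like, as a purely hypothetical sketch in Python (no italic variation sequences are standardized; the selector below is chosen arbitrarily for illustration):

    ITALIC_VS = "\uFE01"   # stand-in selector; NOT a registered variation sequence

    def vs_italicize(text: str) -> str:
        # Per the scheme described above: append the selector after each
        # character that is to become italicized.
        return "".join(ch + ITALIC_VS for ch in text)

    print(vs_italicize("word"))   # 'w\ufe01o\ufe01r\ufe01d\ufe01'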

If an organization chooses to develop and use a rich text format
then that is a matter for that organization and any changing of
formatting of how italics are done when converting between plain
text and rich text is the responsibility of the organization that
introduces its rich text format.

Twitter was just an example that someone introduced along the way,
it was not the original request.

Also this is not only about messaging. Of primary importance is
the conservation of texts in plain text format, for example, where
a printed book has one word italicized in a sentence and the text
is being transcribed into a computer.

William Overington
Friday 25 January 2019



--
Andrew Cunningham
lang.supp...@gmail.com 







Re: Encoding italic

2019-01-25 Thread Asmus Freytag (c) via Unicode

On 1/25/2019 1:06 AM, wjgo_10...@btinternet.com wrote:

Asmus Freytag wrote;

Other schemes, like a VS per code point, also suffer from being 
different in philosophy from "standard" rich text approaches. Best 
would be as standard extension to all the messaging systems (e.g. a 
common markdown language, supported by UI). A./


Yet that claim of what would be best would be stateful and 
statefulness is the very thing that Unicode seeks to avoid. 


All rich text is stateful, and rich text is very widely used; cut-and-paste 
tends to work rather well among applications that support it, 
as do conversions of entire documents. Trying to duplicate it with "yet 
another mechanism" is a doubtful achievement, even if it could be made 
"stateless".


A./



Re: Encoding italic

2019-01-24 Thread Asmus Freytag (c) via Unicode

On 1/24/2019 11:14 PM, Tex wrote:


I am surprised at the length of this debate, especially since the 
arguments are repetitive…


That said:

Twitter was offered as an example, not the only example just one of 
the most ubiquitous. Many messaging apps and other apps would benefit 
from italics. The argument is not based on adding italics to twitter.


Most apps today have security protections that filter or translate 
problematic characters. If the proposal would cause “normalization” 
problems, adding the proposed characters to the filter lists or 
substitution lists would not be a big burden.


The biggest burden would be to the apps that would benefit, to add 
italicizing and editing capabilities.


The "normalization" is when you import to rich text, you don't want 
competing formatting instructions. Getting styled character codes 
normalized to styling of character runs is the most difficult, that's 
why the abuse of math italics really is abuse in terms of interoperability.
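The abuse in question, sketched in Python for lowercase Latin only (the mapping to the Mathematical Alphanumeric Symbols block is real; the function itself is illustrative), together with a hint of why it misbehaves under normalization:

    import unicodedata

    def math_italicize(s: str) -> str:
        # Map a-z onto U+1D44E..U+1D467; 'h' is a hole (U+1D455 is unassigned,
        # because italic h was encoded earlier as U+210E PLANCK CONSTANT).
        out = []
        for c in s:
            if c == "h":
                out.append("\u210E")
            elif "a" <= c <= "z":
                out.append(chr(0x1D44E + ord(c) - ord("a")))
            else:
                out.append(c)
        return "".join(out)

    styled = math_italicize("italics")
    print(styled)                                 # 𝑖𝑡𝑎𝑙𝑖𝑐𝑠
    print(unicodedata.normalize("NFKC", styled))  # 'italics' -- compatibility
                                                  # folding silently discards the styling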


Other schemes, like a VS per code point, also suffer from being 
different in philosophy from "standard" rich text approaches. Best would 
be as standard extension to all the messaging systems (e.g. a common 
markdown language, supported by UI).


A./


tex

*From:*Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of 
*Asmus Freytag via Unicode

*Sent:* Thursday, January 24, 2019 10:34 PM
*To:* unicode@unicode.org
*Subject:* Re: Encoding italic

On 1/24/2019 9:44 PM, Garth Wallace via Unicode wrote:

But the root problem isn't the kludge, it's the lack of
functionality in these systems: if Twitter etc. simply implemented
some styling on their own, the whole thing would be a moot point.
Essentially, this is trying to add features to Twitter without
waiting for their development team.

Interoperability is not an issue, since in modern computers
copying and pasting styled text between apps works just fine.

Yep, that's what this is: trying to add features to some platforms 
that could very simply be added by the  respective developers while in 
the process causing a normalization issue (of sorts) everywhere else.


A./





Re: New ideas (from: wws dot org)

2019-01-16 Thread Asmus Freytag (c) via Unicode

On 1/16/2019 9:30 AM, wjgo_10...@btinternet.com wrote:

Asmus Freytag wrote as follows:

 PS: of course, if a contemplated change, such as the one alluded to, 
should be ill advised, its negative effects could have wide ranging 
impacts...but that's not the topic here.


If you object to encoding italics please say so and if possible please 
provide some reasons.


It's not the topic of this thread. Let's keep the discussion in one place.

A./








Re: Aleph-umlaut

2018-11-11 Thread Asmus Freytag (c) via Unicode

On 11/11/2018 1:37 PM, Hans Åberg wrote:

On 11 Nov 2018, at 22:16, Asmus Freytag via Unicode  wrote:

On 11/11/2018 12:32 PM, Hans Åberg via Unicode wrote:

On 11 Nov 2018, at 07:03, Beth Myre via Unicode 
  wrote:

Hi Mark,

This is a really cool find, and it's interesting that you might have a relative 
mentioned in it.  After looking at it more, I'm more convinced that it's German 
written in Hebrew letters, not Yiddish.  I think that explains the umlauts.  
Since the text is about Jewish subjects, it also includes Hebrew words like you 
mentioned, just like we would include beit din or p'sak in an English text.

Here's a paragraph from page 22:


Actually page 21.





I (re-)transliterated it, and it reads:


Taking a picture in the Google Translate app, and then pasting the Hebrew 
character string it identifies into translate.google.com for Yiddish gives the 
text:



Wir sind uns dessen bewusst, dass von Seite der Gegenpartei weder Reue(?), noch 
Einsicht zu erwarten ist und dass sie die Konsequenzen dieser rabbinischen 
Gutachten von sich abschüttelen werden mit der Motivierung, dass:


vir zind auns dessen bevaust dass fon zeyte der ge- gefarthey veder reye , nakh 
eynzikht tsu ervarten izt aund dast zya dya kansekventsen dyezer rabbinishen 
gutakhten fon zikh abshittelen verden mit der motivirung ,  dass :

This agrees rather well with Beth's (re-)transliteration.
Mapping "z" to "s", "f" to "v" and "v" to "w" would match the way these
pronunciations are spelled in German (with a few outliers like "izt" for
"ist", where the "s" isn't voiced in German). There's also a clear
convention of using "kh" for "ch" (as in English "loch", but also for other
pronunciations of the German "ch"). The one apparent mismatch is "ge-
gefarthey" for "Gegenpartei". Presumably what is transliterated as "f" can
stand for phonetic "p". "Parthey" might be how Germans could have written
"Partei" in earlier centuries (when "th" was commonly used for "t" and "ey"
alternated with "ei", as in my last name). So, perhaps it's closer than it
looks, superficially.

From context, "Reue" is by far the best match for "Reye" and seems to match
a tendency elsewhere in the sample where the transliteration, if pronounced
as German, would result in a shifted quality for the vowels (making them
sound more Yiddish, for lack of a better description).

"abschüttelen" - here the second "e" would not be part of Standard German
orthography. It's either an artifact of the transcription system or
possibly reflects that the writer is familiar with a different spelling
convention (to my eyes the spelling "abshittelen" looks somehow more
Yiddish, but I'm really not familiar enough with that language).

But still, the text is unquestionably intended to be in German.

One should not rely too much on these autotranslation tools, but it may be 
quicker using some OCR program and then correcting by hand than entering it 
all by hand. The setup did not admit transliterating Hebrew script directly 
into German. It seems that the translator program recognizes it as Yiddish, 
though it might be as a result of an assumption it makes.



Well, the OCR does a much better job than the "translation".



The German translation it gives:
Unsere Sünde kommt von der Seite der Verletzten, nachdem sie darauf gewartet 
hat, erwartet zu werden, und nachdem sie die Vorstellungen dieser rabbinischen 
Andachten kennengelernt haben, haben sie begonnen, mit der Motivation zu 
schließen:



This is simply utter nonsense and does not even begin to correlate with 
the transliteration.




And in English:
Our sin is coming out of the side of the injured side, after waiting to be 
expected, and having the concepts of these rabbinical devotiones, they have 
begun to agree with the motivation:



In fact, the English translation makes somewhat more sense. For example, 
"Gegenpartei" in many legal contexts (which this sample isn't, by the 
way) can in fact be translated as "injured party", which in turn might 
correlate with an "injured side" as rendered. However "Seite der 
Verletzten" makes no sense in this context, unless there's a Hebrew word 
that accidentally matches and got picked up.


(I'm suspicious that some of the auto translation does in fact work like 
many real translations, which often are not direct but involve an 
intermediate language - simply because it's not possible to find 
sufficient translators between random pairs of languages.)




 From the original Hebrew script, in case someone wants to try out more 
possibilities:
וויר זינד אונס דעססען בעוואוסט דאסס פֿאָן זייטע דער גע־ געפארטהיי וועדער רייע , 
נאך איינזיכט צו ערווארטען איזט אונד דאסט זיא דיא קאַנסעקווענצען דיעזער 
ראבבינישען גוטאכטען פֿאָן זיך אבשיטטעלען ווערדען מיט דער מאָטיווירונג , דאסס :


I don't know what that will tell you. You have a rendering that produces 
coherent text which closely matches a phonetic transliteration. What 
else do you hope to learn?


A./



Re: A sign/abbreviation for "magister"

2018-10-31 Thread Asmus Freytag (c) via Unicode

On 10/31/2018 4:11 PM, Khaled Hosny wrote:

On Wed, Oct 31, 2018 at 03:32:09PM -0700, Asmus Freytag via Unicode wrote:

On 10/31/2018 9:03 AM, Khaled Hosny via Unicode wrote:

A while ago I was localizing some application to Arabic and the developer
"helpfully" used m² for square meter, but that does not work for Arabic
because there is no superscript ٢ in Unicode, so I had to contact the
developer and ask for markup to be used for the superscript so that I
can use it as well.

This just pushes the issue down one level.

Because it assumes that the presence/absence of markup is locale-independent.

For translation of general text I know this is not true. There are instances
where some words in certain languages are customarily italicized in a way that
is not lexical, therefore not something where the source language would ever
supply markup.

That was a while ago, but IIRC, the markup was enabled for that
particular widget unconditionally. The localizer is now free to use the
markup or not use it; the string was translatable as a whole with the
embedded markup. It should be possible to enable markup for any widget,
it is just an option to tick off in the UI designer, but my experience
is that markup is seldom needed in computer UIs, but I may be biased
by the kind of UIs and locales I'm most familiar with.


All makes sense now.

A./



Regards,
Khaled





Re: Aw: Re: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process)

2018-08-23 Thread Asmus Freytag (c) via Unicode

On 8/23/2018 3:28 AM, Jörg Knappen wrote:

Asmus,
I know your style of humor, but to keep it straight:
All known human languages, even Piraha, have pronouns for "I" and "you".


And languages like Japanese tend not to use them, for the most part.

Even if the concepts are known, and can be named, there are deep 
differences across languages concerning the need  or conventions for 
demarcating them with words in any given context.


Replacing words by symbols is not going to fix this - the only way to 
get a 'universal' system of symbolic expression is to invent a new 
language, with its own conventions for use of these symbols in any given 
context.


A./


--Jörg Knappen
*Gesendet:* Montag, 20. August 2018 um 16:20 Uhr
*Von:* "Asmus Freytag via Unicode" 
*An:* unicode@unicode.org
*Betreff:* Re: Thoughts on working with the Emoji Subcommittee (was 
Re: Thoughts on Emoji Selection Process)


What about languages that don't have or don't use personal pronouns? 
Their speakers might find their use odd or awkward.


The same for many other grammatical concepts: they work reasonably 
well if used by someone from a related language, or for linguists 
trained in general concepts, but languages differ so much in what they 
express explicitly that if any native speaker transcribes the features 
that are exposed (and not implied) in their native language it may not 
be what a reader used to a different language is expecting to see.


A./





Re: Unicode 11 Georgian uppercase vs. fonts

2018-07-27 Thread Asmus Freytag (c) via Unicode

If that's the case, then we shouldn't be having this discussion.

Which got started by the ICU folks observing that if they implemented 
the changes in property for library functions it would lead to 
unintelligible text (at least in the short run -- because there was no 
plan to get font support ready in time and deployed) and text that was 
formatted in perhaps unintended ways (most likely permanently -- because 
there was no plan to solve the issues around usage differences).


What you describe as "plan" seems to have been just empty words.

A real plan would have consisted of documentation suggesting how to roll 
out library updates, whether to change/augment CSS styling keywords, what 
types of locale adaptations of case transforms should be implemented, 
how to get OSs to deliver fonts to people, etc., etc.


If such a plan had existed, and been implemented, then we would not have 
an e-mail thread started with "OMG we have a crisis".


A./

On 7/27/2018 6:45 PM, Peter Constable wrote:


Just an observation on these issues: When the Mtavruli proposal was 
first presented to UTC, several UTC members voiced strong reservation 
because of the kind of issues mentioned for case mapping, and in 
particular on database indexing and querying. Several months later, 
various UTC members participated in a teleconference with 
representation from Georgian institutions, including IT people from 
Bank of Georgia and TBC Bank. During that meeting, the representatives 
of the Georgian enterprises (i) demonstrated an understanding of those 
issues and the implications, (ii) gave an indication of support from 
those enterprises and a commitment to update their applications as may 
be required, and (iii) gave indication of intent to develop a plan of 
action for preparing their institutions for this change as well as 
communicating that within Georgian industry and society. It was only 
after that did UTC feel it was viable to proceed with encoding 
Mtavruli characters.


Peter

*From:*Unicode  *On Behalf Of *Asmus 
Freytag via Unicode

*Sent:* Friday, July 27, 2018 7:01 AM
*To:* unicode@unicode.org
*Subject:* Re: Unicode 11 Georgian uppercase vs. fonts

On 7/27/2018 3:42 AM, Michael Everson via Unicode wrote:

Yes and it explains clearly that “effectively caseless Georgian” is 
incorrect. Georgian has case. Georgian uses case differently from other 
scripts. This is an orthographic distinction, not a structural one. In fact as 
it is also stated in the proposal, there are 19th-century texts which do 
titlecase. It’s just that that orthography is no longer in use and that 
behaviour no longer desirable.

"Georgian uses case differently from other scripts"

That's one of the key issues here for developers (and users) of 
libraries. Because it means that any implicit assumptions about the 
applicability of a certain case-transform is now broken.


This goes beyond whether fonts are actually installed now or at the 
end of some transition period, or ever: if functions like ToUpper, 
which used to have no effect on Georgian before, suddenly do - in ways 
that the users of the script do not expect, then your application is 
broken, from one day to the next.
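The effect is easy to demonstrate with any library that picked up the Unicode 11 data, for instance Python 3.7 and later (a sketch, not ICU):

    word = "საქართველო"               # Georgian Mkhedruli
    print(word.upper())               # ᲡᲐᲥᲐᲠᲗᲕᲔᲚᲝ -- Mtavruli, per Unicode 11 mappings
    print(word == word.upper())       # False: code that assumed upper() was a
                                      # no-op for Georgian now breaks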


The current situation prior to the change is perhaps best 
characterized by saying that there was support for some locale 
differences in the way certain characters were mapped, but not in 
whether or not to do a given mapping at all.


If, as has been suggested, the use of case in Georgian is more similar 
to that of smallcaps in other scripts, then, instead of ToUpper doing 
a case transformation for Georgian, what would be needed is something 
like a "ToSmallCaps" function (better name here, because the Georgian 
letters aren't actually "small caps").


That way, the existing "ToUpper" could retain its implicit semantic of 
"uppercase transformation in those scripts where such transformations 
are used in a common way".


This would solve 1/2 of the problem, which is to prevent uppercasing 
where users of Georgian do not expect it. However, it does not work in 
plain text for the other scripts, because there, small caps are not 
encoded, so there's no plain-text solution.


To get back to Markus' original question on how to handle this for 
ICU: it seems more and more that Georgian should be exempted from 
standard library functions and that a new function needs to be added 
that just transforms Georgian and leaves all other scripts alone (or 
one that takes a language/local parameter).


A./





Re: Variation Sequences (and L2-11/059)

2018-07-18 Thread Asmus Freytag (c) via Unicode

On 7/17/2018 8:56 PM, Janusz S. "Bień" wrote:

On Tue, Jul 17 2018 at  8:34 -0700, Asmus Freytag writes:

On 7/16/2018 10:04 PM, Janusz S. Bień via Unicode wrote:

  I understand there is no sufficient demand for the Unicode Consortium
maintaining a supplementary non-ideographic variation database. Hence
for the time being  a kind of Private Use variation database seems to be
the only solution - am I right?

The question comes down to resources, among other things. As well as to whether
there are actual users / implementers waiting for and ready to adopt such a 
database
as solution to their problems.

I hope the resources are sufficient to improve wording of the variation
sequence FAQ. Do we agree that at present users/implementers are rather
misled by it?


Sure, we can go either of two ways: we can state that Unicode has no, 
and will not have any, solution to the issue of such variants for 
non-ideographic scripts. That part is easy.


Or, alternatively we could figure out, what the solution space might be 
(in the right circumstances), including some external resources for 
maintaining a database on an ongoing basis, and a larger well-identified 
community of scholars or archivists that sign up to use and support it.


If a non-zero solution space exists, simply saying that there will never 
be any solution would be equally wrong as the current wording, which 
points at something that is no longer part of the solution space . . . 
(although at one point, people thought it might be).



A strawman proposal could identify these issues and some ways that they might be
addressed and then ask for criteria of what the UTC might deem sufficient.

Perhaps this statement should be put into FAQ, instead of "you should
propose your addition as a variation sequence"?


There are some additions that should be proposed for standardization, 
but the bar is relatively high.



A./


Re: Uppercase ß

2018-05-29 Thread Asmus Freytag (c) via Unicode

On 5/29/2018 2:46 PM, Werner LEMBERG wrote:

I very much dislike the approach that just for the sake of
`simplistic standardization for uppercase' the use if `ẞ' should be
enforced in German.  [...]

Hmm, don't see anyone calling for that in this discussion.

Well, I hear an implicit ”Great, there is now an `ẞ' character!  Let's
use it as the uppercase version of `ß' everywhere so that this nasty
German peculiarity is finally gone.“


The ALL-CAPS "SS" really has little to recommend it, intrinsically. It 
is de-facto a fall-back; one that competed with "SZ" as used in 
telegrams (while they still were a thing). Not being able to know how to 
hyphenate MASSE without knowing the meaning of the word is also not 
something that I consider a "benefit".
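The ambiguity is easy to reproduce with the default Unicode case mappings, e.g. in Python:

    print("Maße".upper())    # MASSE -- the default full uppercase mapping of ß is SS
    print("Masse".upper())   # MASSE -- now indistinguishable from the word for 'mass'
    print("MASSE".lower())   # masse -- lowercasing cannot recover the ß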


Uppercase forms for `ß' have been kicking around in fonts for a long 
time, as was documented around the time that the character was encoded. 
They are possible mainly because running text in ALL CAPS is indeed 
uncommon (and in the time of Fraktur was effectively not viable, because 
the Fraktur capitals don't lend themselves to it). If SS had ever 
occurred in Title-Case, I doubt it would have survived as long, other 
than in the "Swiss solution" of making it the only form, also in lower case.


Getting to do without an uppercase form for a letter that is never 
word-initial was a godsend on typewriters -- adding to the factors that 
made the "SS" solution acceptable. But sign writers, type designers and 
typesetters did not find it so universally attractive - also documented 
exhaustively.


With a changing environment (starting with influence from the Anglo-Saxon 
use of type, and not ending with the way the character is treated in 
relation to phonetics) I've been expecting to see usage evolve, and not 
necessarily driven by software engineers.


A./



Re: Uppercase ß

2018-05-29 Thread Asmus Freytag (c) via Unicode

On 5/29/2018 12:15 PM, Werner LEMBERG wrote:

Overlooked in this discussion is the fact that the revised
orthography of 1996 introduces for the first time a systematic
difference in pronunciation for the vowel preceding SS vs. ẞ (short
vs. long).  As users of the old orthography age out, I would not be
surprised if the SS fallback were to become less acceptable over
time because it would be at odds with how the word is to be
pronounced. I'm also confidently expecting the use of ALL CAPS to
become (somewhat) more prevalent under the continued influence of
English usage.

It's not that simple.

* `ß' is never used in Switzerland; it's always `ss' (and `SS').  Even
   ambiguous cases like `Masse' are always written like that.  This
   means that for Swiss users `ẞ' is even more alien than for most
   German and Austrian users.  In particular, there doesn't exist a
   `unity SS' in Swiss German at all!  For example, the word `Maße' if
   capitalized to `MASSE' is hyphenated as `MA-SSE' in Germany and
   Austria (since `SS' is treated in this case as a unity).  However,
   the word is hyphenated as `MAS-SE' in Switzerland, since `ss', as a
   replacement for `ß', is *not* treated as a unity.


So the Swiss don't have that issue. What do they do for names?



* There are dialectic differences between northern and southern
   Germany (and Austria).  Example: `Geschoß' vs. `Geschoss', which
   means exactly the same – and both orthographies are allowed.  For
   such cases, `GESCHOSS' is a much better uppercase version since it
   covers both dialectic forms.
I don't see the claimed benefit; if you allow two different spellings in
lowercase to track the phonetic difference, then that would rather seem
to support my argument that there is now a tension in the orthography
(for standard German) that may well resolve itself by greater use of the
distinct uppercase form.

Users who will end up "resolving" this would be those who grew up only
with the revised orthography. Older users are used to a different
principle of selecting between SS and ß, and that isn't tied to the
pronunciation of the preceding vowel.



I very much dislike the approach that just for the sake of `simplistic
standardization for uppercase' the use if `ẞ' should be enforced in
German.  It's not the job of a language to fit computer usage.  It's
rather the job of computers to fit language usage.

Hmm, don't see anyone calling for that in this discussion.

A./



 Werner





Re: Unicode characters unification

2018-05-29 Thread Asmus Freytag (c) via Unicode

On 5/29/2018 1:08 AM, Richard Wordingham wrote:

On Mon, 28 May 2018 21:40:49 -0700
Asmus Freytag via Unicode  wrote:


But such exceptions prove the rule, which leads back to where we
started: the default position is that Unicode encodes a character
identity that is not the same as encoding the concept that said
character is used to represent in writing.

And the problem remains that of determining the 'identity'.  It is
rather like distinguishing species - biologists have dozens of
different concepts.

Richard.


Totally. Never said that encoding is a simple algorithmic process. :)

A./



Re: Submissions open for 2020 Emoji

2018-04-19 Thread Asmus Freytag (c) via Unicode

On 4/19/2018 9:36 AM, Mark Davis ☕️ wrote:
The UTC didn't want to burden the doc registry with all the emoji 
proposals.


The question of whether the registry should be divided is independent of 
whether proposals are public or private in nature.


Proposals in private have no place in the context of a public standard.

A./


Mark
//

On Thu, Apr 19, 2018 at 6:22 PM, Asmus Freytag via Unicode wrote:


On 4/19/2018 5:32 AM, Mark Davis ☕️ via Unicode wrote:

> imagine I discover that someone has already proposed the emoji
that I am interested in

In some cases we have contacted people to see if they want to
engage with other proposers. But to handle larger numbers we'd
need a simple, light-weight way to let people know, while
maintaining people's privacy when they want it.


I would tend to think that actual proposals are a matter of public
record. Emoji should not be handled differently than other
proposals for character encoding in that regard.

Why should there be an assumption that these are "proposals in
private" in this case?

A./






Re: Translating the standard

2018-03-13 Thread Asmus Freytag (c) via Unicode

On 3/13/2018 12:55 PM, Philippe Verdy wrote:
It is then a version of the matching standards from Canadian and 
French standard bodies. This does not make a big difference, except 
that those national standards (last editions in 2003) are not kept in 
sync with evolutions of the ISO/IEC standard. So it can be said that 
this was a version for the 2003 version of the ISO/IEC standard, 
supported and sponsored by some of their national members.


There is a way to transpose international standards to national 
standards, but they then pick up a new designation, e.g. ANSI for the US, 
DIN for Germany, or EN for a European Norm.


A./


2018-03-13 19:38 GMT+01:00 Asmus Freytag via Unicode:


On 3/13/2018 11:20 AM, Marcel Schneider via Unicode wrote:

On Mon, 12 Mar 2018 14:55:28 +, Michel Suignard wrote:

Time to correct some facts.
The French version of ISO/IEC 10646 (2003 version) was done in a separate 
effort by the Canadian and French NBs and not within SC2 proper.
...

Then it can be referred to as “French version of ISO/IEC 10646” but I’ve 
got Andrew’s point, too.

Correction: if a project is not carried out by SC2 (the proper
ISO/IEC subcommittee) then it is not a "version" of the ISO/IEC
standard.

A./







Re: 0027, 02BC, 2019, or a new character?

2018-02-21 Thread Asmus Freytag (c) via Unicode

On 2/21/2018 9:23 AM, Philippe Verdy wrote:
2018-02-21 18:10 GMT+01:00 Asmus Freytag via Unicode:


Feeling a bit curmudgeony, are we, today? :-)

Don't know what it means, never heard that word, not found in 
dictionaries. Probably a local US jargon or a typo in your strange word.


Sorry for the typo. Dropped an "l". :-[

curmudgeonly from curmudgeon+ly

The word is attested from the late 1500s in the forms /curmudgeon/ and 
/curmudgen/, and during the 17th century in numerous spelling variants, 
including /cormogeon, cormogion, cormoggian, cormudgeon, curmudgion, 
curmuggion, curmudgin, curr-mudgin, curre-megient/.


Don't think the US existed in the late 1500s...

A./





Re: IDC's versus Egyptian format controls

2018-02-16 Thread Asmus Freytag (c) via Unicode

On 2/16/2018 11:10 AM, Ken Whistler wrote:


It's the "may either" which is not the same as "may also".
A./



On 2/16/2018 11:00 AM, Asmus Freytag via Unicode wrote:

On 2/16/2018 8:00 AM, Richard Wordingham via Unicode wrote:

That doesn't square well with, "An implementation *may* render a valid
Ideographic Description Sequence either by rendering the individual
characters separately or by parsing the Ideographic Description
Sequence and drawing the ideograph so described." (TUS 10.0 p704, in
Section 18.2)


Emphasis on the "may". In point of fact, no widespread layout engine 
or set of fonts does parse IDS'es to turn them into single ideographs 
for display. That would be a highly specialized display.




Should we ask to make the default behavior (visible IDS characters) 
more explicit?


Ask away.

--Ken



I don't mind allowing the other as an option (it's kind of the
reverse of the "show invisibles" mode, which we also allow, but for
which we do have a clear default).






Re: 0027, 02BC, 2019, or a new character?

2018-01-19 Thread Asmus Freytag (c) via Unicode

On 1/19/2018 5:42 AM, Philippe Verdy wrote:
Hmmm that character exists already at 0+0315 (a combining comma 
above right). It would work for the new Kazah orthographic system, 
including for collation purpose.  I don't think IDN rejects this 
combining version.


This is also ineligible for the Root Zone.
A./



2018-01-19 14:37 GMT+01:00 Philippe Verdy:


Maybe IDN could accept a new combining diacritic (a sort of
right-side acute accent). After all, the Kazakh intent is not to
define a new separate character but a modification of a base letter,
to create a single letter in their alphabet.
So: a proposal for COMBINING APOSTROPHE (whose spacing
non-combining version is 02BC), so that SPACE + COMBINING APOSTROPHE
will render exactly like 02BC

2018-01-18 19:51 GMT+01:00 Asmus Freytag via Unicode:

Top level IDN domain names can not contain 02BC, nor 0027 or
2019.

(RFC 6912 gives the rationale and RZ-LGR the implementation,
see MSR-3
)

A./


On 1/18/2018 3:00 AM, Andre Schappo via Unicode wrote:




On 18 Jan 2018, at 08:21, Andre Schappo via Unicode wrote:




On 16 Jan 2018, at 08:00, Richard Wordingham via Unicode wrote:

On Mon, 15 Jan 2018 20:16:21 -0800, James Kass via Unicode wrote:


It will probably be the ASCII apostrophe.  The stated
intent favors
the apostrophe over diacritics or special characters to
ensure that
the language can be input to computers with standard
keyboards.


Typing U+0027 into a word processor takes planning.  Of the
three, it
should obviously be the modifier letter U+02BC, but I think
what gets
stored will be U+0027 or the single quotation mark U+2019.

However, we shouldn't overlook the diacritic mark U+0315
COMBINING COMMA
ABOVE RIGHT.

Richard.


I have just tested twitter hashtags and as one would expect,
U+02BC does not break hashtags. See
twitter.com/andreschappo/status/953903964722024448




...and, just in case
twitter.com/andreschappo/status/953944089896083456



André Schappo









Re: 0027, 02BC, 2019, or a new character?

2018-01-19 Thread Asmus Freytag (c) via Unicode

On 1/19/2018 5:37 AM, Philippe Verdy wrote:
Maybe IDN could accept a new combining diacritic (a sort of 
right-side acute accent). After all, the Kazakh intent is not to define 
a new separate character but a modification of a base letter, to create a 
single letter in their alphabet.
So: a proposal for COMBINING APOSTROPHE (whose spacing non-combining 
version is 02BC), so that SPACE + COMBINING APOSTROPHE will render 
exactly like 02BC.




In the case of TLD IDNs what is at issue is the fact that it "renders 
exactly like" 02BC (which renders exactly like 2019).


You can see the issue when you look at Andre's twitter tags: you can 
create two strings that look the same, but the part that is a hashtag is 
different. That is deemed an unacceptable security risk for TLD IDNs.
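A sketch of the confusability in Python (the label is invented for illustration):

    import unicodedata

    a = "qazaq\u02BCstan"    # U+02BC MODIFIER LETTER APOSTROPHE
    b = "qazaq\u2019stan"    # U+2019 RIGHT SINGLE QUOTATION MARK

    print(a == b)   # False, although the two typically render identically
    # No normalization form unifies them:
    print(unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b))  # False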


If you encoded such a combining character, it would also not be eligible 
for TLD IDNs.

A./

2018-01-18 19:51 GMT+01:00 Asmus Freytag via Unicode:


Top level IDN domain names can not contain 02BC, nor 0027 or 2019.

(RFC 6912 gives the rationale and RZ-LGR the implementation, see
MSR-3 )

A./


On 1/18/2018 3:00 AM, Andre Schappo via Unicode wrote:




On 18 Jan 2018, at 08:21, Andre Schappo via Unicode wrote:




On 16 Jan 2018, at 08:00, Richard Wordingham via Unicode wrote:

On Mon, 15 Jan 2018 20:16:21 -0800, James Kass via Unicode wrote:


It will probably be the ASCII apostrophe. The stated intent favors
the apostrophe over diacritics or special characters to ensure
that
the language can be input to computers with standard keyboards.


Typing U+0027 into a word processor takes planning.  Of the
three, it
should obviously be the modifier letter U+02BC, but I think
what gets
stored will be U+0027 or the single quotation mark U+2019.

However, we shouldn't overlook the diacritic mark U+0315
COMBINING COMMA
ABOVE RIGHT.

Richard.


I have just tested twitter hashtags and as one would expect,
U+02BC does not break hashtags. See
twitter.com/andreschappo/status/953903964722024448




...and, just in case
twitter.com/andreschappo/status/953944089896083456



André Schappo








Re: Emoji for major planets at least?

2018-01-18 Thread Asmus Freytag (c) via Unicode

On 1/18/2018 10:01 AM, John H. Jenkins wrote:
Well, you can go with Venus = white planet, Mercury = grey planet, 
Uranus = greenish planet, Neptune = bluish planet, Jupiter = striped 
planet.


As you say, though, without a context, none of them convey much and 
Venus, at least, would just be a circle.


Plus there's the question of the context in which someone would want 
to send little pictures of the planets. This sounds like it would be 
adding emoji just because.


"Earth" as in "a blue ball in space" is something that reached iconic 
status after the famous photo taken during the early Apollo missions. I 
could definitely see that used in a variety of possible contexts. And 
the recognition value is higher than for many recent emoji.


Saturn, with its rings (even though it's no longer the only one known 
with rings) also is iconic and highly recognizable. I lack imagination 
as to when someone would want to use it in communication, but I have the 
same issue with quite a few recent emoji, some of which are far less 
iconic or recognizable. I think it does lend itself to describe a 
"non-earth" type planet, or even the generic idea of a planet (as 
opposed to a star/sun).


Mars and Venus have tons of connotations, which could be expressed by 
using an emoji (as opposed to the astrological symbol for each), but 
only Mars is reasonably recognizable without lots of pre-established 
context. That red color.


In a detailed enough rendering, Jupiter, as a shaded "ball" with stripes 
and a red dot, would be more recognizable than any of the remaining planets 
(on par with or better than many recent emoji), but I see even less scope for 
using it metaphorically or in extended contexts.


If someone were to make a proposal, I would suggest to them to limit it 
to these four and to provide more of a suggestion as to how these might 
show up in use.


A./


On Jan 18, 2018, at 10:44 AM, Asmus Freytag via Unicode wrote:


On 1/18/2018 6:55 AM, Shriramana Sharma via Unicode wrote:

Hello people.

We have sun, earth and moon emoji (3 for the earth and more for the
moon's phases). But we don't have emoji for the rest of the planets.

We have astrological symbols for all the planets and a few
non-existent imaginary "planets" as well.

Given this, would it be impractical to encode proper emoji characters
for the rest of the planets, at least the major ones whose physical
characteristics are well known and identifiable?

I mean for example identifying Sedna and Quaoar
(https://en.wikipedia.org/wiki/File:EightTNOs.png) is probably not
going to be practical for all those other than astronomy buffs but the
physical shapes of the major planets are known to all high school
students…


Earth = blue planet (with clouds)

Mars = red planet

Saturn = planet with rings

I don't think any of the other ones are identifiable in a 
context-free setting, unless you draw a "big planet with red dot" for 
Jupiter.


Earth would have to be depicted in a way that doesn't focus on 
"hemispheres", or you miss the idea of it as "planet".



A./








Re: ASCII v Unicode

2017-11-03 Thread Asmus Freytag (c) via Unicode

On 11/3/2017 9:12 AM, William_J_G Overington wrote:

GS1-128 barcode technology is being introduced into National Health Service 
hospitals in the United Kingdom.


This is so off-topic and unrelated to the discussion.

A./


http://www.scan4safety.nhs.uk/

As barcode scanners will be in use, a not unrealistic scenario is that 
localizable sentences encoded in GS1-128 barcodes could be used for some 
everyday communication through the language barrier.

For example, a whole sentence, such as, here localized into English,

Would you like a drink of water?

could be encoded as

::781:;

within Application Identifier 97 of a GS1-128 barcode.

Suppose that this system were being implemented.

For localization into English, the sentence.dat text file could contain the 
following line of text for localizing that particular localizable sentence.

If the sentence.dat file and the software to handle it were implemented in 
7-bit ASCII the system would work fine for localization into English.

If many sentence.dat files, one for each language, and the software to handle 
them were implemented in an 8-bit extension of ASCII (such as Latin-1), the 
system would work fine for localization into English and for localization 
into many of the languages of Western Europe and Scandinavia.

If many sentence.dat files, one for each language, and the software to handle 
them were implemented in Unicode using the UTF-16 text file format for each 
sentence.dat file, the system would work fine for localization into many 
languages of the world.
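
A minimal sketch of how such a table-driven lookup might be implemented, 
assuming the pipe-delimited, UTF-16-encoded sentence.dat format described 
above (the function and file name are illustrative only, not part of any 
deployed system):

    # Hypothetical sketch: load a pipe-delimited sentence.dat stored as
    # UTF-16 and look up the localized text for a sentence code.
    def load_sentences(path):
        table = {}
        with open(path, encoding="utf-16") as f:
            for line in f:
                line = line.strip()
                if "|" not in line:
                    continue  # skip blank or malformed lines
                code, text = line.split("|", 1)
                table[code] = text
        return table

    sentences = load_sentences("sentence.dat")
    print(sentences.get("::781:;", "(no localization available)"))

Because the file is decoded as UTF-16, the same loader serves every 
language unchanged; only the file contents differ.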

This seems to me to be a very good example of why Unicode is so much better 
than ASCII.

William Overington

Friday 3 November 2017





Re: Should U+3248 ... U+324F be wide characters?

2017-08-17 Thread Asmus Freytag (c) via Unicode

On 8/17/2017 7:24 AM, Mike FABIAN wrote:

Asmus Freytag via Unicode wrote:


On 8/16/2017 6:26 AM, Mike FABIAN via Unicode wrote:

 EastAsianWidth.txt contains:
 
 3248..324F;A # No [8] CIRCLED NUMBER TEN ON BLACK SQUARE..CIRCLED NUMBER EIGHTY ON BLACK SQUARE
 
 i.e. it classifies the width of the characters at codepoints

 between 3248 and 324F as ambiguous.
 
 Is this really correct? Shouldn’t they be “W”, i.e. wide?
 
 In most fonts these characters seem to be square shaped wide characters.
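
For anyone who wants to check such entries mechanically, the 
EastAsianWidth.txt line format is straightforward to parse; a minimal 
sketch, assuming the layout quoted above (the parser is illustrative, not 
taken from any particular library):

    # Sketch: parse one data line of EastAsianWidth.txt into
    # (first code point, last code point, width class).
    def parse_eaw_line(line):
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            return None  # comment-only or blank line
        codepoints, width = data.split(";")
        first, _, last = codepoints.partition("..")
        return int(first, 16), int(last or first, 16), width.strip()

    print(parse_eaw_line("3248..324F;A # No [8] CIRCLED NUMBER TEN ..."))
    # (12872, 12879, 'A')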


"W" not only implies display width, but also a different treatment in the 
context of line
breaking and vertical layout of text.

"W" characters behave more like Ideographs, for the most part, while "N" are 
treated as
forming words (for the most part).

Most emoji now have "W", for example:

1F600..1F64F;W   # So[80] GRINNING FACE..PERSON WITH FOLDED HANDS

That seems correct because emoji behave more like Ideographs.

Isn’t this the same for “CIRCLED NUMBER TEN ON BLACK SQUARE”?
This seems to me also more like an Ideograph.


"A" means, you get to decide whether to treat these as "W" or "N" based on 
context. If
used in a non ideographic context, they behave like all other symbols (but 
happen to fill
an EM square).


"A" means, you get to decide whether to treat these as "W" or "N" based on 
context.

There's really not strong need to change an "A" towards "W", because "A" doesn't get in 
your way if you decided that "W" works better for you.

Remember that all the EAW properties ares supposed to be "resolved" down to W 
or N. For some, like Na that resolution is deterministic, for A it is context/application 
dependent, but when you finally process your data, only W(ide) or N(arrow) remain after 
resolution.
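
A minimal sketch of that resolution step, with the context decision reduced 
to a single flag (the flag and function names are assumptions of this 
example; the raw property comes from Python's unicodedata module):

    import unicodedata

    # Sketch: resolve an East Asian Width property value down to
    # W(ide) or N(arrow). Only "A" needs a context decision.
    def resolved_width(ch, east_asian_context=False):
        eaw = unicodedata.east_asian_width(ch)  # 'F', 'H', 'W', 'Na', 'A' or 'N'
        if eaw in ("W", "F"):
            return "W"
        if eaw in ("Na", "H", "N"):
            return "N"
        # 'A' (ambiguous): context/application dependent
        return "W" if east_asian_context else "N"

    print(resolved_width("\u3248"))        # 'N' outside an East Asian context
    print(resolved_width("\u3248", True))  # 'W' in an East Asian context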

A./








Re: Should U+3248 ... U+324F be wide characters?

2017-08-17 Thread Asmus Freytag (c) via Unicode

On 8/17/2017 7:47 AM, Philippe Verdy wrote:



2017-08-17 16:24 GMT+02:00 Mike FABIAN via Unicode:


Asmus Freytag via Unicode wrote:
Most emoji now have "W", for example:

1F600..1F64F;W   # So[80] GRINNING FACE..PERSON WITH FOLDED HANDS

That seems correct because emoji behave more like Ideographs.

Isn’t this the same for “CIRCLED NUMBER TEN ON BLACK SQUARE”?
This seems to me also more like an Ideograph.

Not really. They have existed for a very long time without being bound 
to ideographic or sinographic requirements on metrics. Notably, their 
baseline and vertical extension do not follow the sinographic 
em-square layout convention (except when they are rendered with CJK 
fonts, or were encoded in documents with legacy CJK encodings and thus 
rendered with suitable CJK fonts, which are then preferred to Latin fonts 
that won't use the large sinographic metrics).


If they were like emoji, they would actually be larger: I think there 
is a case for defining an emoji variant for them (where they could 
also be colored or have some 3D-like look).


There's an emoji variant for the standard digits.
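
For the record, that variant is requested with U+FE0F (VARIATION 
SELECTOR-16); adding U+20E3 on top gives the familiar keycap:

    # Emoji presentation of a digit via a variation sequence,
    # and the keycap sequence built on top of it.
    digit_emoji = "1\uFE0F"        # DIGIT ONE + VS16: emoji-style "1"
    keycap_one = "1\uFE0F\u20E3"   # + COMBINING ENCLOSING KEYCAP
    print(digit_emoji, keycap_one)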

A./




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Asmus Freytag (c) via Unicode

On 6/1/2017 11:53 AM, Shawn Steele wrote:


But those are IETF definitions.  They don’t have to mean the same 
thing in Unicode - except that people working in this field probably 
expect them to.




That's the thing. And even if Unicode had its own version of RFC 2119, 
one would consider it recommended for Unicode to follow widespread 
industry practice (there's that "r" word again!).


A./


*From:*Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of 
*Asmus Freytag via Unicode

*Sent:* Thursday, June 1, 2017 11:44 AM
*To:* unicode@unicode.org
*Subject:* Re: Feedback on the proposal to change U+FFFD generation 
when decoding ill-formed UTF-8


On 6/1/2017 10:41 AM, Shawn Steele via Unicode wrote:

I think that the (or a) key problem is that the current "best practice" is 
treated as a "SHOULD" in RFC parlance, when what it really needs is a "MAY".

People reading standards tend to treat "SHOULD" and "MUST" as the same 
thing.


It's not that they "tend to", it's in RFC 2119:


SHOULD   This word, or the adjective "RECOMMENDED", mean that there
may exist valid reasons in particular circumstances to ignore a
particular item, but the full implications must be understood and
carefully weighed before choosing a different course.

The clear inference is that while the non-recommended practice is not 
prohibited, you better have some valid reason why you are deviating 
from it (and, reading between the lines, it would not hurt if you 
documented those reasons).



  So, when an implementation deviates, then you get bugs (as we see here).  Given 
that there are very valid engineering reasons why someone might want to choose a 
different behavior for their needs - without harming the intent of the standard at all in 
most cases - I think the current/proposed language is too "strong".


Yes and no. ICU would be perfectly fine deviating from the existing 
recommendation and stating their engineering reasons for doing so. 
That would allow them to close their bug ("by documentation").


What's not OK is to take an existing recommendation and change it to 
something else, just to make bug reports go away for one 
implementation. That's like two sleepers fighting over a blanket 
that's too short. Whenever one is covered, the other is exposed.


If it is discovered that the existing recommendation is not based on 
anything like truly better behavior, there may be a case to change it 
to something that's equivalent to a MAY. Perhaps a list of nearly 
equally capable options.


(If such language is not in the standard already, a strong "an 
implementation MUST NOT depend on the use of a particular strategy for 
replacement of invalid code sequences" clearly ought to be added.)


A./


-Shawn

-Original Message-

From: Alastair Houghton [mailto:alast...@alastairs-place.net]

Sent: Thursday, June 1, 2017 4:05 AM

To: Henri Sivonen

Cc: unicode Unicode Discussion; Shawn Steele

Subject: Re: Feedback on the proposal to change U+FFFD generation when 
decoding ill-formed UTF-8

On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode wrote:

On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode wrote:

* As far as I can tell, there are two (maybe three) sane approaches to this 
problem:

  * Either a "maximal" emission of one U+FFFD for every byte that exists 
outside of a good sequence,

  * or a "minimal" version that presumes the lead byte was counting trail 
bytes correctly even if the resulting sequence was invalid. In that case 
just use one U+FFFD.

  * And (maybe, I haven't heard folks arguing for this one) emit one U+FFFD 
at the first garbage byte and then ignore the input until valid data starts 
showing up again. (So you could have 1 U+FFFD for a string of a hundred 
garbage bytes as long as there weren't any valid sequences within that 
group.)
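
For comparison, a short sketch of what one real decoder does today: 
CPython's UTF-8 decoder with errors="replace" follows the 
one-U+FFFD-per-maximal-subpart reading, at least in recent versions 
(the sample bytes here are illustrative):

    # Sketch: count how many U+FFFD a real decoder emits for a few
    # ill-formed byte sequences.
    samples = [
        b"\xe1\x80A",      # truncated three-byte sequence: expect 1
        b"\xf0\x80A",      # lead byte with an invalid trail: expect 2
        b"\xff\xff\xffA",  # bytes that can never start a sequence: expect 3
    ]
    for raw in samples:
        decoded = raw.decode("utf-8", errors="replace")
        print(raw, "->", decoded, decoded.count("\ufffd"), "U+FFFD")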

I think it's not useful to come up with new rules in the abstract.

The first two aren’t “new” rules; they’re, respectively, the current “Best 
Practice” and the proposed “Best Practice”. The third is one other potentially 
reasonable approach that might make sense, e.g., if the problem you’re worrying 
about is serial data slip or corruption of a compressed or encrypted file 
(where corruption will occur until re-synchronisation happens, and as a result 
you wouldn’t expect to have any knowledge whatever of the number of characters 
represented in the data in question).

All of these approaches are explicitly allowed by the standard at present.  
All three are reasonable, and each has its own pros and cons in a 

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Asmus Freytag (c) via Unicode

On 5/23/2017 10:45 AM, Markus Scherer wrote:
On Tue, May 23, 2017 at 7:05 AM, Asmus Freytag via Unicode wrote:


So, if the proposal for Unicode really was more of a "feels right"
and not a "deviate at your peril" situation (or necessary escape
hatch), then we are better off not making a RECOMMENDATION that
goes against collective practice.


I think the standard is quite clear about this:

Although a UTF-8 conversion process is required to never consume
well-formed subsequences as part of its error handling for
ill-formed subsequences, such a process is not otherwise
constrained in how it deals with any ill-formed subsequence
itself. An ill-formed subsequence consisting of more than one code
unit could be treated as a single error or as multiple errors.


And why add a recommendation that changes this from being completely up to 
the implementation (or groups of implementations) to something where one way 
of doing it now has to justify itself?


If the thread has made one thing clear, it is that there's no consensus in 
the wider community that one approach is obviously better. When it comes 
to ill-formed sequences, all bets are off. Simple as that.


Adding a "recommendation" this late in the game is just bad standards 
policy.


A./




Re: Coloured Punctuation and Annotation

2017-04-10 Thread Asmus Freytag (c) via Unicode

On 4/10/2017 9:30 AM, Peter Constable wrote:

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Asmus Freytag
Sent: Wednesday, April 5, 2017 5:30 PM


There are certainly MSS (in many languages) where some punctuation made of dots 
has some of the dots red and some black.

Agreed, those would be a challenge to reproduce with standard font technology 
and in plain text.

Not at all. This capability has existed in all major OS platforms for some 
years now.


It may be in the platforms, but of the few clients I've tried this with, 
only one reliably supports it.



  It is what has enabled the growth of interest in Unicode emoji, but it is by 
no means limited to Unicode emoji: it can be used for multi-color rendering of 
any text in ways defined within a font. The OpenType spec supports this through 
a few techniques:

- Decomposing a glyph into several glyphs that are layered (z-ordered) with 
colour assignments.
- Glyphs expressed as embedded colour bitmaps.
- Glyphs expressed as embedded SVG.
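
As a concrete illustration of the first (layered-glyph) technique, a sketch 
that uses fontTools to list one glyph's colour layers; the font file and 
glyph names are placeholders, and a COLR version 0 table with a CPAL 
palette is assumed:

    # Sketch: enumerate the colour layers of one glyph in a COLR v0 font.
    from fontTools.ttLib import TTFont

    font = TTFont("SomeColorFont.ttf")   # placeholder file name
    layers = font["COLR"].ColorLayers    # base glyph name -> layer records
    palette = font["CPAL"].palettes[0]   # list of RGBA palette entries

    for layer in layers.get("someGlyph", []):  # placeholder glyph name
        c = palette[layer.colorID]
        print(layer.name, "#%02X%02X%02X" % (c.red, c.green, c.blue))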


Khaled gave a very nice demonstration of that on this list (which 
allowed me to test this).





But for the same reason, they are out of scope for plain text (and therefore a 
bit irrelevant to the current discussion).

I agree, the rendering aspect is completely orthogonal to Unicode plain-text 
encoding.


The problem with multicolored fonts would be integrating them into font 
color selection via styling.


http://www.amirifont.org/fatiha-colored.html

If you select a section of this text, the black ink will invert as you 
select it, but the other colors remain the same; this is different from 
selecting a multicolored image, or from selecting multiple runs 
of fonts in different colors.


I wonder whether high-end tools like InDesign would be able to allow 
styling of individual color layers. For rendering emoji colors via fonts, 
that wouldn't matter, but for the kind of annotated-text example above, it 
could be interesting to be able to tweak these layer colors.


A./




Peter