RE: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

2020-08-17 Thread Shawn Steele via Unicode
IMO, encodings, particularly ones depending on state such as this, may have 
multiple ways to output the same, or similar, sequences.  Which means that 
pretty much any time an encoding transforms data, any previous security or other 
validation-style checks are no longer valid, and any security/validation must be 
performed again.  I've seen numerous mistakes due to people expecting 
encodings to play nicely, particularly if there are different endpoints that 
may use different implementations with slightly different behaviors.
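
To illustrate the hazard concretely, a minimal sketch using Python's built-in
"iso-2022-jp" codec (the exact bytes and decoder behaviour below are
assumptions about that codec, not something stated in this thread):

a = "¥".encode("iso-2022-jp")        # b'\x1b(J\\\x1b(B' -- ends back in ASCII
b = "¥".encode("iso-2022-jp")
joined = a + b                       # ...\x1b(B\x1b(J... adjacent escapes
print(joined.decode("iso-2022-jp"))  # '¥¥' -- Python's decoder accepts this
# A decoder implementing the WHATWG "output flag" rule would instead emit
# U+FFFD at the empty segment between the two escape sequences, so two
# endpoints can disagree about whether the very same bytes are clean.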

-Shawn

-Original Message-
From: Unicode  On Behalf Of Henri Sivonen via 
Unicode
Sent: Sunday, August 16, 2020 11:39 PM
To: Mark Davis ☕️ 
Cc: Unicode Public 
Subject: Re: Generating U+FFFD when there's no content between ISO-2022-JP 
escape sequences

Sorry about the delay. There is now
https://www.unicode.org/L2/L2020/20202-empty-iso-2022-jp.pdf

On Mon, Dec 10, 2018 at 1:14 PM Mark Davis ☕️  wrote:
>
> I tend to agree with your analysis that emitting U+FFFD when there is no 
> content between escapes in "shifting" encodings like ISO-2022-JP is 
> unnecessary, and for consistency between implementations should not be 
> recommended.
>
> Can you file this at http://www.unicode.org/reporting.html so that the 
> committee can look at your proposal with an eye to changing 
> http://www.unicode.org/reports/tr36/?
>
> Mark
>
>
> On Mon, Dec 10, 2018 at 11:10 AM Henri Sivonen via Unicode 
>  wrote:
>>
>> We're about to remove the U+FFFD generation for the case where there 
>> is no content between two ISO-2022-JP escape sequences from the 
>> WHATWG Encoding Standard.
>>
>> Is there anything wrong with my analysis that U+FFFD generation in 
>> that case is not a useful security measure when unnecessary 
>> transitions between the ASCII and Roman states do not generate U+FFFD?
>>
>> On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen  wrote:
>> >
>> > Context: https://github.com/whatwg/encoding/issues/115
>> >
>> > Unicode Security Considerations say:
>> > "3.6.2 Some Output For All Input
>> >
>> > Character encoding conversion must also not simply skip an illegal 
>> > input byte sequence. Instead, it must stop with an error or 
>> > substitute a replacement character (such as U+FFFD ( � ) 
>> > REPLACEMENT CHARACTER) or an escape sequence in the output. (See 
>> > also Section 3.5 Deletion of Code Points.) It is important to do 
>> > this not only for byte sequences that encode characters, but also for 
>> > unrecognized or "empty"
>> > state-change sequences. For example:
>> > [...]
>> > ISO-2022 shift sequences without text characters before the next 
>> > shift sequence. The formal syntaxes for HZ and most CJK ISO-2022 
>> > variants require at least one character in a text segment between 
>> > shift sequences. Security software written to the formal 
>> > specification may not detect malicious text  (for example, "delete" 
>> > with a shift-to-double-byte then an immediate shift-to-ASCII in the 
>> > middle)."
>> > (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
>> >
>> > The WHATWG Encoding Standard bakes this requirement by the means of 
>> > "ISO-2022-JP output flag"
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) into 
>> > its ISO-2022-JP decoder algorithm 
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder).
>> >
>> > encoding_rs (https://github.com/hsivonen/encoding_rs) implements 
>> > the WHATWG spec.
>> >
>> > After Gecko switched to encoding_rs from an implementation that 
>> > didn't implement this U+FFFD generation behavior (uconv), a bug has 
>> > been logged in the context of decoding Japanese email in Thunderbird:
>> > https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
>> >
>> > Ken Lunde also recalls seeing such email:
>> > https://github.com/whatwg/encoding/issues/115#issuecomment-440661403
>> >
>> > The root problem seems to be that the requirement gives ISO-2022-JP 
>> > the unusual and surprising property that concatenating two 
>> > ISO-2022-JP outputs from a conforming encoder can result in a byte 
>> > sequence that is non-conforming as input to an ISO-2022-JP decoder.
>> >
>> > Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP 
>> > escape sequence is immediately followed by another ISO-2022-JP 
>> > escape sequence. Chrome and Safari do, but their implementations of 
>> > ISO-2022-JP 

Re: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

2020-08-17 Thread Harriet Riddle via Unicode
In terms of deployed ISO-2022-JP encoders which don't follow WHATWG behaviour, 
here's Python's (apparently contributed to Python by one Hye-Shik Chang):

>>> "a¥bc~¥d".encode("iso-2022-jp")
b'a\x1b(J\\\x1b(Bbc~\x1b(J\\\x1b(Bd'

This is, so far as I can tell, valid per the RFC (and of course ECMA-35 itself), 
but not per the WHATWG, whose output would be (to use another bytestring 
literal) b'a\x1b(J\\bc\x1b(B~\x1b(J\\d\x1b(B'. The difference is that 
Python's encoder appears to use a preference order of codesets, with ASCII 
coming before JIS-Roman, while the WHATWG logic is to encode the next character 
in the current codeset if possible, and switch to another only if it is not.
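
As a quick check (assuming Python's codec round-trips its own JIS-Roman
mapping), both byte sequences decode to the same string:

>>> b'a\x1b(J\\\x1b(Bbc~\x1b(J\\\x1b(Bd'.decode("iso-2022-jp")
'a¥bc~¥d'
>>> b'a\x1b(J\\bc\x1b(B~\x1b(J\\d\x1b(B'.decode("iso-2022-jp")
'a¥bc~¥d'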

-- Har


From: Unicode  on behalf of Henri Sivonen via 
Unicode 
Sent: 17 August 2020 08:38
To: Mark Davis ☕️ 
Cc: Unicode Public 
Subject: Re: Generating U+FFFD when there's no content between ISO-2022-JP 
escape sequences

Sorry about the delay. There is now
https://www.unicode.org/L2/L2020/20202-empty-iso-2022-jp.pdf

On Mon, Dec 10, 2018 at 1:14 PM Mark Davis ☕️  wrote:
>
> I tend to agree with your analysis that emitting U+FFFD when there is no 
> content between escapes in "shifting" encodings like ISO-2022-JP is 
> unnecessary, and for consistency between implementations should not be 
> recommended.
>
> Can you file this at http://www.unicode.org/reporting.html so that the 
> committee can look at your proposal with an eye to changing 
> http://www.unicode.org/reports/tr36/?
>
> Mark
>
>
> On Mon, Dec 10, 2018 at 11:10 AM Henri Sivonen via Unicode 
>  wrote:
>>
>> We're about to remove the U+FFFD generation for the case where there
>> is no content between two ISO-2022-JP escape sequences from the WHATWG
>> Encoding Standard.
>>
>> Is there anything wrong with my analysis that U+FFFD generation in
>> that case is not a useful security measure when unnecessary
>> transitions between the ASCII and Roman states do not generate U+FFFD?
>>
>> On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen  wrote:
>> >
>> > Context: https://github.com/whatwg/encoding/issues/115
>> >
>> > Unicode Security Considerations say:
>> > "3.6.2 Some Output For All Input
>> >
>> > Character encoding conversion must also not simply skip an illegal
>> > input byte sequence. Instead, it must stop with an error or substitute
>> > a replacement character (such as U+FFFD ( � ) REPLACEMENT CHARACTER)
>> > or an escape sequence in the output. (See also Section 3.5 Deletion of
>> > Code Points.) It is important to do this not only for byte sequences
>> > that encode characters, but also for unrecognized or "empty"
>> > state-change sequences. For example:
>> > [...]
>> > ISO-2022 shift sequences without text characters before the next shift
>> > sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants
>> > require at least one character in a text segment between shift
>> > sequences. Security software written to the formal specification may
>> > not detect malicious text  (for example, "delete" with a
>> > shift-to-double-byte then an immediate shift-to-ASCII in the middle)."
>> > (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
>> >
>> > The WHATWG Encoding Standard bakes this requirement by the means of
>> > "ISO-2022-JP output flag"
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) into its
>> > ISO-2022-JP decoder algorithm
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder).
>> >
>> > encoding_rs (https://github.com/hsivonen/encoding_rs) implements the
>> > WHATWG spec.
>> >
>> > After Gecko switched to encoding_rs from an implementation that didn't
>> > implement this U+FFFD generation behavior (uconv), a bug has been
>> > logged in the context of decoding Japanese email in Thunderbird:
>> > https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
>> >
>> > Ken Lunde also recalls seeing such email:
>> > https://github.com/whatwg/encoding/issues/115#issuecomment-440661403
>> >
>> > The root problem seems to be that the requirement gives ISO-2022-JP
>> > the unusual and surprising property that concatenating two ISO-2022-JP
>> > outputs from a conforming encoder can result in a byte sequence that
>> > is non-conforming as input to an ISO-2022-JP decoder.
>> >
>> > Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape
>> > sequence is immediately followed by another ISO-2022-JP

Re: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

2020-08-17 Thread Henri Sivonen via Unicode
Sorry about the delay. There is now
https://www.unicode.org/L2/L2020/20202-empty-iso-2022-jp.pdf

On Mon, Dec 10, 2018 at 1:14 PM Mark Davis ☕️  wrote:
>
> I tend to agree with your analysis that emitting U+FFFD when there is no 
> content between escapes in "shifting" encodings like ISO-2022-JP is 
> unnecessary, and for consistency between implementations should not be 
> recommended.
>
> Can you file this at http://www.unicode.org/reporting.html so that the 
> committee can look at your proposal with an eye to changing 
> http://www.unicode.org/reports/tr36/?
>
> Mark
>
>
> On Mon, Dec 10, 2018 at 11:10 AM Henri Sivonen via Unicode 
>  wrote:
>>
>> We're about to remove the U+FFFD generation for the case where there
>> is no content between two ISO-2022-JP escape sequences from the WHATWG
>> Encoding Standard.
>>
>> Is there anything wrong with my analysis that U+FFFD generation in
>> that case is not a useful security measure when unnecessary
>> transitions between the ASCII and Roman states do not generate U+FFFD?
>>
>> On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen  wrote:
>> >
>> > Context: https://github.com/whatwg/encoding/issues/115
>> >
>> > Unicode Security Considerations say:
>> > "3.6.2 Some Output For All Input
>> >
>> > Character encoding conversion must also not simply skip an illegal
>> > input byte sequence. Instead, it must stop with an error or substitute
>> > a replacement character (such as U+FFFD ( � ) REPLACEMENT CHARACTER)
>> > or an escape sequence in the output. (See also Section 3.5 Deletion of
>> > Code Points.) It is important to do this not only for byte sequences
>> > that encode characters, but also for unrecognized or "empty"
>> > state-change sequences. For example:
>> > [...]
>> > ISO-2022 shift sequences without text characters before the next shift
>> > sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants
>> > require at least one character in a text segment between shift
>> > sequences. Security software written to the formal specification may
>> > not detect malicious text  (for example, "delete" with a
>> > shift-to-double-byte then an immediate shift-to-ASCII in the middle)."
>> > (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
>> >
>> > The WHATWG Encoding Standard bakes this requirement by the means of
>> > "ISO-2022-JP output flag"
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) into its
>> > ISO-2022-JP decoder algorithm
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder).
>> >
>> > encoding_rs (https://github.com/hsivonen/encoding_rs) implements the
>> > WHATWG spec.
>> >
>> > After Gecko switched to encoding_rs from an implementation that didn't
>> > implement this U+FFFD generation behavior (uconv), a bug has been
>> > logged in the context of decoding Japanese email in Thunderbird:
>> > https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
>> >
>> > Ken Lunde also recalls seeing such email:
>> > https://github.com/whatwg/encoding/issues/115#issuecomment-440661403
>> >
>> > The root problem seems to be that the requirement gives ISO-2022-JP
>> > the unusual and surprising property that concatenating two ISO-2022-JP
>> > outputs from a conforming encoder can result in a byte sequence that
>> > is non-conforming as input to an ISO-2022-JP decoder.
>> >
>> > Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape
>> > sequence is immediately followed by another ISO-2022-JP escape
>> > sequence. Chrome and Safari do, but their implementations of
>> > ISO-2022-JP aren't independent of each other. Moreover, Chrome's
>> > decoder implementations generally are informed by the Encoding
>> > Standard (though the ISO-2022-JP decoder specifically might not be
>> > yet), and I suspect that Safari's implementation (ICU) is either
>> > informed by Unicode Security Considerations or vice versa.
>> >
>> > The example given as rationale in Unicode Security Considerations,
>> > obfuscating the ASCII string "delete", could be accomplished by
>> > alternating between the ASCII and Roman states so that every other
>> > character is in the ASCII state and the rest in the Roman state.
>> >
>> > Is the requirement to generate U+FFFD when there is no content between
>> > ISO-2022-JP escape sequences useful if useless ASCII-to-ASCII
>> > transitions or useless transitions between ASCII and Roman are not
>> > also required to generate U+FFFD? Would it even be feasible (in terms
>> > of interop with legacy encoders) to make useless transitions between
>> > ASCII and Roman generate U+FFFD?
>> >
>> > --
>> > Henri Sivonen
>> > hsivo...@hsivonen.fi
>> > https://hsivonen.fi/
>>
>>
>>
>> --
>> Henri Sivonen
>> hsivo...@hsivonen.fi
>> https://hsivonen.fi/
>>


-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/



RE: Emoji map of Colorado

2020-04-02 Thread Doug Ewell via Unicode
Karl Williamson shared:
 
> https://www.reddit.com/r/Denver/comments/fsmn87/quarantine_boredom_my_emoji_map_of_colorado/?mc_cid=365e908e08_eid=0700c8706b
 
It's too bad this was only made available as an image, not as text, which of 
course it is.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org
 




Emoji map of Colorado

2020-04-01 Thread Karl Williamson via Unicode

https://www.reddit.com/r/Denver/comments/fsmn87/quarantine_boredom_my_emoji_map_of_colorado/?mc_cid=365e908e08_eid=0700c8706b


How is meaning changed by context and typography - in art, emoji and language

2020-04-01 Thread wjgo_10...@btinternet.com via Unicode
I received a circulated email from MoMA, the Museum of Modern Art in New 
York. I am, at my request, on their mailing list.


There is a link to a web page.

https://www.moma.org/magazine/articles/257

There is a video embedded in the web page, 8 minutes.

I watched the video and found it interesting.

There is one part where two identical images each have a different 
title.


I noticed that both titles were in English.

These days it has become almost obligatory, for a proposal for a new 
emoji character to become encoded, for the emoji character to be 
suggested as having multiple possible meanings, possibly linked to 
context, or maybe just anyway.


The beginnings of this phenomenon and the problems of ambiguity of 
meaning of emoji characters was discussed in a talk at the Unicode 
conference in 2015.


https://www.youtube.com/watch?v=9ldSVbXbjl4

There was mention of the possibility of "precise emoji".

Yet these days imprecision of emoji meaning has become widespread. Has 
the possibility of QID emoji brought back the possibility of precise 
emoji? Decoding could be to an image, or to language-localized speech or 
language-localized text, or even all three at once. Yet only if QID 
emoji are allowed to flourish, perhaps after a few careful modifications 
to the original proposal so as to minimize, or at least limit, the 
possibility of encoding chaos.


I have long been fascinated by what I regard as subtle changes of 
meaning that setting a piece of text in different fonts produces, though 
some other people opine that the meaning is unchanged, regardless of the 
font.


Also, can some meanings not be expressed from one language to another? 
If so, is that due to the nature of the languages or the culture where 
the original text was produced, or some of each? Does the general shape 
of the way that a particular script has developed reflect, or influence, 
the original literature written in that script? Do words that rhyme in 
one language produce imagery that does not arise in a language where 
their translations do not rhyme? For example, boaco and erinaco rhyme in 
Esperanto, yet their translations in English, reindeer and hedgehog, do 
not rhyme.


The art works in the MoMA video also reminded me of something that was 
in this mailing list probably in the early 2000s.


The post was about translations linked to an art project.

It was an art project about some orange blocks and people were taking 
photographs of art works where one of the orange blocks was presented in 
some context.


Maybe it was a student project, I don't know.

I have looked on the web and thus far found nothing about it, not even 
the original post in this mailing list thus far.


Since then technology has changed a lot, much more is now possible for 
more people. There are now widespread emoji, there is Google street 
view, and so on.


New art possibilities.

Does anyone else remember the orange blocks please? Maybe an interesting 
stepping stone in the history of art.


William Overington

Tuesday 31 March 2020


Base character plus tag sequences (from RE: Is the binaryness/textness of a data format a property?)

2020-03-23 Thread wjgo_10...@btinternet.com via Unicode


Doug Ewell wrote:

When 137,468 private-use characters aren't enough?
In my opinion, a base character plus tag sequence has the potential to 
be used for many large-scale applications in the future.
A base character plus tag sequence encoding has the advantage over a 
Private Use Area encoding (except for prompt experimental use or for 
some applications) that the encoding can be unique, and thus 
interoperability is possible amongst people generally.
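
An already-encoded instance of this mechanism is the emoji tag sequence of 
UTS #51; a minimal Python sketch:

# The flag of Scotland: base U+1F3F4 WAVING BLACK FLAG, the tag letters
# "gbsct" (U+E0067 U+E0062 U+E0073 U+E0063 U+E0074), then U+E007F CANCEL TAG.
scotland = ("\U0001F3F4"
            + "".join(chr(0xE0000 + ord(c)) for c in "gbsct")
            + "\U000E007F")
print(scotland)   # renders as the Scottish flag where supported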


QID emoji is just the very start of applications, some not even dreamed 
of yet, for which a base character sequence encoding could be used.


Once the restriction that the result of a specific encoding may only be 
a fixed image is removed, new information technology applications will 
be possible within text streams.


There is the QID Emoji Public Review and issues like this can be 
explored there so that they will be before the Unicode Technical 
Committee when it assesses the responses to the public review.


In my response of Monday 2 March 2020 I put forward an idea that could 
allow the idea of QID emoji to proceed yet without the disadvantages.


No comment after that has been published as of the time of sending this 
post.


https://www.unicode.org/review/pri408/

Whatever your view on whether such ideas should be allowed to flourish 
and become mainstream in the future, I opine that it would be good for 
there to be more responses to the public review, so that as wide a range 
of views as possible is before the Unicode Technical Committee when it 
assesses the responses, not just on QID emoji as 
such but on whether the underlying method of encoding a base 
character and tag character sequence for large sets of items should be 
encouraged.


William Overington

Monday 23 March 2020






Re: Is the binaryness/textness of a data format a property?

2020-03-22 Thread Martin J . Dürst via Unicode
On 23/03/2020 03:56, Markus Scherer via Unicode wrote:
> On Sat, Mar 21, 2020 at 12:35 PM Doug Ewell via Unicode 
> wrote:
> 
>> I thought the whole premise of GB18030 was that it was Unicode mapped into
>> a GB2312 framework. What characters exist in GB18030 that don't exist in
>> Unicode, and have they been proposed for Unicode yet, and why was none of
>> the PUA space considered appropriate for that in the meantime?
>>
> 
> My memory of GB18030 is that its code space has 1.6M code points, of which
> 1.1M are a permutation of Unicode. For the rest you would have to go beyond
> the Unicode code space for 1:1 round-trip mappings.

This matches my recollection. What's more, there are no characters 
allocated in the parts of the GB 18030 codespace that don't map to 
Unicode, and there is, as far as I understand, no plan to use that space. 
It's just there because that was the most straightforward way to extend 
GB 2312/GBK.

Regards,   Martin.



Re: Is the binaryness/textness of a data format a property?

2020-03-22 Thread Markus Scherer via Unicode
On Sat, Mar 21, 2020 at 12:35 PM Doug Ewell via Unicode 
wrote:

> I thought the whole premise of GB18030 was that it was Unicode mapped into
> a GB2312 framework. What characters exist in GB18030 that don't exist in
> Unicode, and have they been proposed for Unicode yet, and why was none of
> the PUA space considered appropriate for that in the meantime?
>

My memory of GB18030 is that its code space has 1.6M code points, of which
1.1M are a permutation of Unicode. For the rest you would have to go beyond
the Unicode code space for 1:1 round-trip mappings.
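
(A back-of-the-envelope check of those figures, as a Python sketch; the byte
ranges below are the published GB 18030 structure:)

one_byte  = 0x80                  # 0x00-0x7F: 128
two_byte  = 126 * 190             # lead 0x81-0xFE, trail 0x40-0xFE minus 0x7F
four_byte = 126 * 10 * 126 * 10   # 0x81-0xFE, 0x30-0x39, twice: 1,587,600
print(one_byte + two_byte + four_byte)   # 1,611,668 -- roughly 1.6M
print(0x110000)                          # 1,114,112 -- Unicode, roughly 1.1M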

Just please don't call it UTF-8.

markus


Re: Is the binaryness/textness of a data format a property?

2020-03-21 Thread Richard Wordingham via Unicode
On Sat, 21 Mar 2020 13:33:18 -0600
Doug Ewell via Unicode  wrote:

> Eli Zaretskii wrote:

> > Emacs uses some of that for supporting charsets that cannot be
> > mapped into Unicode.  GB18030 is one example of such charsets.  The
> > internal representation of characters in Emacs is UTF-8, so it uses
> > 5-byte UTF-8 like sequences to represent such characters.  

> When 137,468 private-use characters aren't enough?

But they aren't private use!  I haven't made any agreement with anyone
about using them.

Additionally, just as some people seem to think that stray UTF-16 code
units should be supported (and occasionally declaring UTF-8
implementations of Unicode standard algorithms to be automatically
non-compliant), there is a case for supporting stray UTF-8 code units.
Emacs supports the full range of 8-bit byte values - 128 unified with
ASCII and the other 128 with high bit set.

> What characters exist in GB18030 that don't
> exist in Unicode, and have they been proposed for Unicode yet, and
> why was none of the PUA space considered appropriate for that in the
> meantime?

Doesn't GB18030 appropriate some of the PUA for Tibetan (and quite
possibly other complex scripts)?  I haven't looked up how Emacs
handles this. 

Richard.


RE: Is the binaryness/textness of a data format a property?

2020-03-21 Thread Doug Ewell via Unicode
Eli Zaretskii wrote:

>> When 137,468 private-use characters aren't enough?
>
> Why is that relevant to the issue at hand?

You're right. I did ask what the uses of non-standard UTF-8 were, and you gave 
me an example.

> I don't remember off hand, but last time I looked at GB18030, there
> were a lot of them not in Unicode.

I'd forgotten that there were still about two dozen GB18030 characters mapped, 
more or less officially, into the Unicode PUA. But again, I changed the 
subject. Sorry about that.

--
Doug Ewell | Thornton, CO, US | ewellic.org






Re: Is the binaryness/textness of a data format a property?

2020-03-21 Thread Julian Bradfield via Unicode
On 2020-03-21, Eli Zaretskii via Unicode  wrote:
>> Date: Sat, 21 Mar 2020 11:13:40 -0600
>> From: Doug Ewell via Unicode 
>> 
>> Adam Borowski wrote:
>> 
>> > Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF
>> > or U+11000..U+7FFF (or possibly even up to 2³⁶ or 2⁴²), which has
>> > its uses but is not well-formed Unicode.
>> 
>> I'd be interested in your elaboration on what these uses are.
>
> Emacs uses some of that for supporting charsets that cannot be mapped
> into Unicode.  GB18030 is one example of such charsets.  The internal
> representation of characters in Emacs is UTF-8, so it uses 5-byte
> UTF-8 like sequences to represent such characters.

My own (now >10 year old) Unicode adaptation of XEmacs does the same,
even for charsets that can be mapped into Unicode. To ensure complete
backward compatibility, it distinguishes "legacy" charsets from Unicode,
and only does conversion when requested.



Re: Is the binaryness/textness of a data format a property?

2020-03-21 Thread Eli Zaretskii via Unicode
> From: "Doug Ewell" 
> Cc: 
> Date: Sat, 21 Mar 2020 13:33:18 -0600
> 
> > Emacs uses some of that for supporting charsets that cannot be mapped
> > into Unicode.  GB18030 is one example of such charsets.  The internal
> > representation of characters in Emacs is UTF-8, so it uses 5-byte
> > UTF-8 like sequences to represent such characters.
> 
> When 137,468 private-use characters aren't enough?

Why is that relevant to the issue at hand?

> I thought the whole premise of GB18030 was that it was Unicode mapped into a 
> GB2312 framework. What characters exist in GB18030 that don't exist in 
> Unicode, and have they been proposed for Unicode yet

I don't remember off hand, but last time I looked at GB18030, there
were a lot of them not in Unicode.

> and why was none of the PUA space considered appropriate for that in the 
> meantime?

Because many fonts already use them?  I don't really know why it was
decided to use codepoints above 0x1F; it's just that this is how
Emacs has worked for quite some time.  You asked for examples of usage, and
I provided one.


RE: Is the binaryness/textness of a data format a property?

2020-03-21 Thread Doug Ewell via Unicode
Eli Zaretskii wrote:

>>> Also, UTF-8 can carry more than Unicode -- for example,
>>> U+D800..U+DFFF or U+11000..U+7FFF (or possibly even up to 2³⁶ or
>>> 2⁴²), which has its uses but is not well-formed Unicode.
>>
>> I'd be interested in your elaboration on what these uses are.
>
> Emacs uses some of that for supporting charsets that cannot be mapped
> into Unicode.  GB18030 is one example of such charsets.  The internal
> representation of characters in Emacs is UTF-8, so it uses 5-byte
> UTF-8 like sequences to represent such characters.

When 137,468 private-use characters aren't enough?

I thought the whole premise of GB18030 was that it was Unicode mapped into a 
GB2312 framework. What characters exist in GB18030 that don't exist in Unicode, 
and have they been proposed for Unicode yet, and why was none of the PUA space 
considered appropriate for that in the meantime?

--
Doug Ewell | Thornton, CO, US | ewellic.org





Re: Is the binaryness/textness of a data format a property?

2020-03-21 Thread Eli Zaretskii via Unicode
> Date: Sat, 21 Mar 2020 11:13:40 -0600
> From: Doug Ewell via Unicode 
> 
> Adam Borowski wrote:
> 
> > Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF
> > or U+11000..U+7FFF (or possibly even up to 2³⁶ or 2⁴²), which has
> > its uses but is not well-formed Unicode.
> 
> I'd be interested in your elaboration on what these uses are.

Emacs uses some of that for supporting charsets that cannot be mapped
into Unicode.  GB18030 is one example of such charsets.  The internal
representation of characters in Emacs is UTF-8, so it uses 5-byte
UTF-8 like sequences to represent such characters.


Re: Is the binaryness/textness of a data format a property?

2020-03-21 Thread Doug Ewell via Unicode
Adam Borowski wrote:

> Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF
> or U+11000..U+7FFF (or possibly even up to 2³⁶ or 2⁴²), which has
> its uses but is not well-formed Unicode.

I'd be interested in your elaboration on what these uses are.

--
Doug Ewell | Thornton, CO, US | ewellic.org





Re: Is the binaryness/textness of a data format a property?

2020-03-20 Thread Martin J . Dürst via Unicode
On 20/03/2020 23:41, Adam Borowski via Unicode wrote:

> Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF or
> U+11000..U+7FFF (or possibly even up to 2³⁶ or 2⁴²), which has its uses
> but is not well-formed Unicode.

This would definitely no longer be UTF-8!   Martin.



Re: Is the binaryness/textness of a data format a property?

2020-03-20 Thread Richard Wordingham via Unicode
On Fri, 20 Mar 2020 13:46:25 +0100
Adam Borowski via Unicode  wrote:

> On Fri, Mar 20, 2020 at 12:21:26PM +, Costello, Roger L. via
> Unicode wrote:
> > [Definition] Property: an attribute, quality, or characteristic of
> > something.
> > 
> > JPEG is a binary data format.
> > CSV is a text data format.
> > 
> > Question #1: Is the binaryness/textness of a data format a
> > property? 
> > 
> > Question #2: If the answer to Question #1 is yes, then what is the
> > name of this binaryness/textness property?  

I'd suggest 'texthood' as the correct English term.

> I'm afraid this question is too fuzzy to have a proper answer.
> 
> For example, most Unix-heads will tell you that UTF16LE is a binary
> rather than text format.  Microsoft employees and some members of
> this list will disagree.

Some files change type on changing operating system.  Digital's old RMS
formats included, as basic text, files in which each record (roughly a
line) started with a binary 2-byte length field.  Text records on
magnetic tape typically started with an ASCII length count!

> Then you have Postscript -- nothing but basic ASCII, yet utterly
> unreadable for a (sane) human.

No worse than a hex dump - in fact, a lot more readable.  Indeed, are
you not aware of the concept of a write-only programming language? 

> If you want _my_ definition of a file being _technically_ text, it's:
> * no bytes 0..31 other than newlines and tabs (even form feeds are out
>   nowadays)
> * correctly encoded for the expected charset (and nowadays, if that's
> not UTF-8 Unicode, you're doing it wrong)
> * no invalid characters

Unassigned characters are perfectly reasonable in a text file.  Surely
you aren't saying that a text file using the characters new to Unicode
13.0 should, at present, usually be regarded as a binary file?

> But besides this narrow technical meaning -- is a Word document
> "text"? And if it is, why not Powerpoint?  This all falls apart.

Well, a .docx file isn't text - it's a variety of ZIP file, which is
binary.  Indeed, as word files naturally include pictures, it very much
isn't a text file.  A .doc file is more like an image dump of a file
system.  A .rtf file on the other hand, probably is a text file -
though I've a feeling there are variants that aren't *A*SCII.

Richard.


Re: Is the binaryness/textness of a data format a property?

2020-03-20 Thread Adam Borowski via Unicode
On Fri, Mar 20, 2020 at 07:22:45AM -0700, J Decker via Unicode wrote:
> On Fri, Mar 20, 2020 at 5:48 AM Adam Borowski via Unicode <
> > For example, most Unix-heads will tell you that UTF16LE is a binary rather
> > than text format.  Microsoft employees and some members of this list will
> > disagree.
[...]
> > If you want _my_ definition of a file being _technically_ text, it's:
> > * no bytes 0..31 other than newlines and tabs (even form feeds are out
> >   nowadays)
> > * correctly encoded for the expected charset (and nowadays, if that's not
> >   UTF-8 Unicode, you're doing it wrong)
> > * no invalid characters
> 
> Just a minor note...
> In the case of UTF8, this means no bytes 0xF8-0xFF will ever be used; every
> valid utf8 codeunit has at least 1 bit off.

Yeah, but I allowed for ancient encodings, some of which do use these bytes.
(I do discriminate against UTF16 and shift-state ones, they're too broken.)

Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF or
U+11000..U+7FFF (or possibly even up to 2³⁶ or 2⁴²), which has its uses
but is not well-formed Unicode.
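
For the curious, a Python sketch of that pre-2003 "generalized UTF-8" scheme
(lead bytes 0xF8/0xFC open 5- and 6-byte sequences covering up to 2³¹; the
0xFE/0xFF extensions reaching 2³⁶/2⁴² are omitted; none of this is well-formed
UTF-8 today):

def encode_extended_utf8(cp: int) -> bytes:
    if cp < 0x80:
        return bytes([cp])               # 1-byte form is plain ASCII
    # (length, lead-byte marker) for 2..6-byte forms; payload widths are
    # 11, 16, 21, 26 and 31 bits respectively
    for nbytes, lead in ((2, 0xC0), (3, 0xE0), (4, 0xF0), (5, 0xF8), (6, 0xFC)):
        if cp < (1 << (nbytes * 5 + 1)):
            out = bytearray(nbytes)
            for i in range(nbytes - 1, 0, -1):
                out[i] = 0x80 | (cp & 0x3F)   # continuation bytes
                cp >>= 6
            out[0] = lead | cp
            return bytes(out)
    raise ValueError("would need the 0xFE/0xFF extension")

print(encode_extended_utf8(0xD800).hex())      # 'eda080' -- a lone surrogate
print(encode_extended_utf8(0x7FFFFFFF).hex())  # 'fdbfbfbfbfbf' -- beyond Unicode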

> I wouldn't be so picky about 'no bytes 0-31' because \t, \n, \x1b(ANSI
> codes) are all quite usable...

\t is tab, \n a newline (blah blah blah \r).

As for \e (\x1b), that's higher-level markup.  I do use it -- hey, you can
"apt/dnf install colorized-logs" for my tools -- but that's beyond plain
text.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ in the beginning was the boot and root floppies and they were good.
⢿⡄⠘⠷⠚⠋⠀   --  on #linux-sunxi
⠈⠳⣄


Re: Is the binaryness/textness of a data format a property?

2020-03-20 Thread J Decker via Unicode
On Fri, Mar 20, 2020 at 5:48 AM Adam Borowski via Unicode <
unicode@unicode.org> wrote:

> On Fri, Mar 20, 2020 at 12:21:26PM +, Costello, Roger L. via Unicode
> wrote:
> > [Definition] Property: an attribute, quality, or characteristic of
> something.
> >
> > JPEG is a binary data format.
> > CSV is a text data format.
> >
> > Question #1: Is the binaryness/textness of a data format a property?
> >
> > Question #2: If the answer to Question #1 is yes, then what is the name
> of
> > this binaryness/textness property?
>
> I'm afraid this question is too fuzzy to have a proper answer.
>
> For example, most Unix-heads will tell you that UTF16LE is a binary rather
> than text format.  Microsoft employees and some members of this list will
> disagree.
>
> Then you have Postscript -- nothing but basic ASCII, yet utterly unreadable
> for a (sane) human.
>
> If you want _my_ definition of a file being _technically_ text, it's:
> * no bytes 0..31 other than newlines and tabs (even form feeds are out
>   nowadays)
> * correctly encoded for the expected charset (and nowadays, if that's not
>   UTF-8 Unicode, you're doing it wrong)
> * no invalid characters
>

Just a minor note...
In the case of UTF8, this means no bytes 0xF8-0xFF will ever be used; every
valid utf8 codeunit has at least 1 bit off.
I wouldn't be so picky about 'no bytes 0-31' because \t, \n, \x1b(ANSI
codes) are all quite usable...



>
> But besides this narrow technical meaning -- is a Word document "text"?
> And if it is, why not Powerpoint?  This all falls apart.
>
>
> Meow!
> --
> ⢀⣴⠾⠻⢶⣦⠀
> ⣾⠁⢠⠒⠀⣿⡁ in the beginning was the boot and root floppies and they were good.
> ⢿⡄⠘⠷⠚⠋⠀   --  on #linux-sunxi
> ⠈⠳⣄
>


Re: Is the binaryness/textness of a data format a property?

2020-03-20 Thread Adam Borowski via Unicode
On Fri, Mar 20, 2020 at 12:21:26PM +, Costello, Roger L. via Unicode wrote:
> [Definition] Property: an attribute, quality, or characteristic of something.
> 
> JPEG is a binary data format.
> CSV is a text data format.
> 
> Question #1: Is the binaryness/textness of a data format a property? 
> 
> Question #2: If the answer to Question #1 is yes, then what is the name of
> this binaryness/textness property?

I'm afraid this question is too fuzzy to have a proper answer.

For example, most Unix-heads will tell you that UTF16LE is a binary rather
than text format.  Microsoft employees and some members of this list will
disagree.

Then you have Postscript -- nothing but basic ASCII, yet utterly unreadable
for a (sane) human.

If you want _my_ definition of a file being _technically_ text, it's (see the
sketch after this list):
* no bytes 0..31 other than newlines and tabs (even form feeds are out
  nowadays)
* correctly encoded for the expected charset (and nowadays, if that's not
  UTF-8 Unicode, you're doing it wrong)
* no invalid characters
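
A sketch of those three criteria as a Python predicate; treating "invalid
characters" as the Unicode noncharacters is my own assumption:

def is_technically_text(data: bytes) -> bool:
    # no bytes 0..31 other than newlines (LF, CR) and tabs
    if any(b < 32 and b not in (0x09, 0x0A, 0x0D) for b in data):
        return False
    try:
        text = data.decode("utf-8")   # correctly encoded for the charset
    except UnicodeDecodeError:
        return False
    # no invalid characters: strict UTF-8 decoding already rejects
    # surrogates, so reject the noncharacters as well
    return not any(0xFDD0 <= ord(c) <= 0xFDEF or (ord(c) & 0xFFFE) == 0xFFFE
                   for c in text)

print(is_technically_text(b"hello\tworld\n"))   # True
print(is_technically_text(b"MZ\x90\x00\x03"))   # False -- an EXE header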

But besides this narrow technical meaning -- is a Word document "text"?
And if it is, why not Powerpoint?  This all falls apart.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ in the beginning was the boot and root floppies and they were good.
⢿⡄⠘⠷⠚⠋⠀   --  on #linux-sunxi
⠈⠳⣄


RE: Is the binaryness/textness of a data format a property?

2020-03-20 Thread Dreiheller, Albrecht via Unicode
#1: Yes.
#2: [ my suggestion ]  File type category

A.D.

-Original Message-
From: Unicode  On Behalf Of Costello, Roger L. 
via Unicode
Sent: Friday, 20 March 2020 13:21
To: unicode@unicode.org
Subject: Is the binaryness/textness of a data format a property?

Hello Data Format Experts!

[Definition] Property: an attribute, quality, or characteristic of something.

JPEG is a binary data format.
CSV is a text data format.

Question #1: Is the binaryness/textness of a data format a property? 

Question #2: If the answer to Question #1 is yes, then what is the name of this 
binaryness/textness property?

Question #3: Here is another way of asking Question #2: Please fill in the 
following blanks with the property name (both blanks should be filled with the 
same thing):

For the JPEG data format:  _ = binary.
For the CSV data format:  _ = text. 

/Roger




Is the binaryness/textness of a data format a property?

2020-03-20 Thread Costello, Roger L. via Unicode
Hello Data Format Experts!

[Definition] Property: an attribute, quality, or characteristic of something.

JPEG is a binary data format.
CSV is a text data format.

Question #1: Is the binaryness/textness of a data format a property? 

Question #2: If the answer to Question #1 is yes, then what is the name of this 
binaryness/textness property?

Question #3: Here is another way of asking Question #2: Please fill in the 
following blanks with the property name (both blanks should be filled with the 
same thing):

For the JPEG data format:  _ = binary.
For the CSV data format:  _ = text. 

/Roger



EGYPTIAN HIEROGLYPH MAN WITH A ROLL OF TOILET PAPER

2020-03-11 Thread Karl Williamson via Unicode

On 2/12/20 11:12 AM, Frédéric Grosshans via Unicode wrote:

Dear Unicode list members (CC Michel Suignard),

   the Unicode proposal L2/20-068 
<https://www.unicode.org/L2/L2020/20068-n5128-ext-hieroglyph.pdf>, 
“Revised draft for the encoding of an extended Egyptian Hieroglyphs 
repertoire, Groups A to N” ( 
https://www.unicode.org/L2/L2020/20068-n5128-ext-hieroglyph.pdf ) by 
Michel Suignard contains a very interesting hieroglyph at position 
U+13579 EGYPTIAN HIEROGLYPH A-12-054, which seems to represent a man 
with a laptop, as is obvious in the attached image.




Someone suggested today that this would be the more up-to-date character



Re: UAX #29 and WB4

2020-03-09 Thread Andy Heninger via Unicode
 daniel.buenzli wrote:

I think the behaviour of → rules should be clarified


I wholeheartedly agree.

If I understand correctly if the match [or a "treat-as" rule] spans over
> the [candidate] boundary position candidate that simply turns it into a
> non-boundary. Otherwise you apply the rule on the left of the boundary
> position candidate.


I have considered the extent of a left-side treat-as match to not continue
beyond the candidate boundary position. This comes into play following a
ZWJ, where it may be absorbed into a "treat as" on the left (WB4), while
some other rule triggers on the right side (WB3c). At any rate, this is
what I do in ICU. It gets very confusing, and is tricky to implement.

Reconsidering how ZWJ rules work could also be a help, if we could figure
out how to keep them out of the "treat as" rules, but use explicit no-break
rules on both sides instead.

  -- Andy

On Wed, Mar 4, 2020 at 4:01 PM Mark Davis ☕️ via Unicode <
unicode@unicode.org> wrote:

> One thing we have considered for a while is whether to do a rewrite of the
> rules to simplify the processing (and avoid the "treat as" rules), but it
> would take a fair amount of design work that we haven't had time to do. If
> you (or others) are interested in getting involved, please let us know.
>
> Mark
>
>
> On Wed, Mar 4, 2020 at 11:30 AM Daniel Bünzli via Unicode <
> unicode@unicode.org> wrote:
>
>> On 4 March 2020 at 18:48:09, Daniel Bünzli (daniel.buen...@erratique.ch)
>> wrote:
>>
>> > On 4 March 2020 at 18:01:25, Daniel Bünzli (daniel.buen...@erratique.ch)
>> wrote:
>> >
>> > > Re-reading the text I suspect I should not restart the rules from the
>> first one when a
>> > WB4
>> > > rewrite occurs but only apply the subsequent rules. Is that correct?
>> >
>> > However even if that's correct I don't understand how this test case
>> works:
>> >
>> > ÷ 1F6D1 × 200D × 1F6D1 ÷ # ÷ [0.2] OCTAGONAL SIGN (ExtPict) × [4.0]
>> ZERO WIDTH JOINER (ZWJ_FE)
>> > × [3.3] OCTAGONAL SIGN (ExtPict) ÷ [0.3]
>> >
>> > Here the first two chars get rewritten with WB4 to ExtPic then if only
>> subsequent rules
>> > are applied we end up in WB999 and a break between 200D and 1F6D1.
>>
>> That's nonsense and not the operational model of the algorithm, which IIRC
>> was once clearly stated on this list by Mark Davis (sorry, I failed to dig
>> out the message): take each boundary position candidate and
>> apply the rules in sequence, taking the first one that matches, and then
>> start over with the next candidate.
>>
>> In that case applying the rules between 1F6D1 and 200D leads to WB4, but
>> then that implicitly adds a non-boundary condition -- this is not really
>> evident from the formalism, but see the comment above WB4, which for that
>> boundary position settles the non-boundary condition. Then we start again
>> applying the rules between 200D and the last 1F6D1, and WB3c matches before
>> WB4 kicks in.
>>
>> I think the behaviour of → rules should be clarified: it's not clear on
>> which data you apply it w.r.t. the boundary position candidate. If I
>> understand correctly, if the match spans over the boundary position
>> candidate, that simply turns it into a non-boundary. Otherwise you apply the
>> rule on the left of the boundary position candidate.
>>
>> Regarding the question of my original message it seems at a certain point
>> I knew better:
>>
>>   https://www.unicode.org/mail-arch/unicode-ml/y2016-m11/0151.html
>>
>> Sorry for the noise.
>>
>> Daniel
>>
>> P.S. I still think the UAX29 and UAX14 could benefit from clarifying the
>> operational model of the rules a bit (I also have the impression that the
>> formalism to express all that may not be the right one, but then I don't
>> have something better to propose at the time). Also it would be nicer for
>> implementers if they didn't have to factorize rules themselves (e.g. like
>> in the new LB30 rules of UAX14) so that correctness of implemented rules is
>> easier to assert.
>>
>>
>>
>>


Reminder about reporting bugs, errors, and other feedback

2020-03-07 Thread Rick McGowan via Unicode

Hello everyone...

This is just a little public service reminder that discussions on the 
Unicode mail list are not considered official feedback, and are not 
reviewed by UTC members or staff as a source for bug reports.


If you want to make sure your feedback and/or report gets into the UTC 
process, it is best to submit it through our reporting form, which can 
be found here:


https://www.unicode.org/reporting.html

Cheers,







UAX #29 6.2

2020-03-07 Thread Zack Newman via Unicode
According to 6.2, "thus ignoring Extend is sufficient to disallow breaking
within a grapheme cluster." However, the sequence of Unicode scalar values
(U+0600, U+0020) is considered a single grapheme cluster due to rule GB9b,
but the sequence is parsed into two words according to 4.1.1. While it
would be ideal to not have sequences of Unicode scalar values that can be
parsed into more words than grapheme clusters, I think it would be more
understandable if section 6.2 didn't explicitly state that this isn't
possible.


Re: UAX #29 and WB4

2020-03-04 Thread Mark Davis ☕️ via Unicode
One thing we have considered for a while is whether to do a rewrite of the
rules to simplify the processing (and avoid the "treat as" rules), but it
would take a fair amount of design work that we haven't had time to do. If
you (or others) are interested in getting involved, please let us know.

Mark


On Wed, Mar 4, 2020 at 11:30 AM Daniel Bünzli via Unicode <
unicode@unicode.org> wrote:

> On 4 March 2020 at 18:48:09, Daniel Bünzli (daniel.buen...@erratique.ch)
> wrote:
>
> > On 4 March 2020 at 18:01:25, Daniel Bünzli (daniel.buen...@erratique.ch)
> wrote:
> >
> > > Re-reading the text I suspect I should not restart the rules from the
> first one when a
> > WB4
> > > rewrite occurs but only apply the subsequent rules. Is that correct?
> >
> > However even if that's correct I don't understand how this test case
> works:
> >
> > ÷ 1F6D1 × 200D × 1F6D1 ÷ # ÷ [0.2] OCTAGONAL SIGN (ExtPict) × [4.0] ZERO
> WIDTH JOINER (ZWJ_FE)
> > × [3.3] OCTAGONAL SIGN (ExtPict) ÷ [0.3]
> >
> > Here the first two chars get rewritten with WB4 to ExtPic then if only
> subsequent rules
> > are applied we end up in WB999 and a break between 200D and 1F6D1.
>
> That's nonsense and not the operational model of the algorithm, which IIRC
> was once clearly stated on this list by Mark Davis (sorry, I failed to dig
> out the message): take each boundary position candidate and
> apply the rules in sequence, taking the first one that matches, and then
> start over with the next candidate.
>
> In that case applying the rules between 1F6D1 and 200D leads to WB4, but
> then that implicitly adds a non-boundary condition -- this is not really
> evident from the formalism, but see the comment above WB4, which for that
> boundary position settles the non-boundary condition. Then we start again
> applying the rules between 200D and the last 1F6D1, and WB3c matches before
> WB4 kicks in.
>
> I think the behaviour of → rules should be clarified: it's not clear on
> which data you apply it w.r.t. the boundary position candidate. If I
> understand correctly, if the match spans over the boundary position
> candidate, that simply turns it into a non-boundary. Otherwise you apply the
> rule on the left of the boundary position candidate.
>
> Regarding the question of my original message it seems at a certain point
> I knew better:
>
>   https://www.unicode.org/mail-arch/unicode-ml/y2016-m11/0151.html
>
> Sorry for the noise.
>
> Daniel
>
> P.S. I still think the UAX29 and UAX14 could benefit from clarifying the
> operational model of the rules a bit (I also have the impression that the
> formalism to express all that may not be the right one, but then I don't
> have something better to propose at the time). Also it would be nicer for
> implementers if they didn't have to factorize rules themselves (e.g. like
> in the new LB30 rules of UAX14) so that correctness of implemented rules is
> easier to assert.
>
>
>
>


Re: UAX #29 and WB4

2020-03-04 Thread Daniel Bünzli via Unicode
On 4 March 2020 at 18:48:09, Daniel Bünzli (daniel.buen...@erratique.ch) wrote:

> On 4 March 2020 at 18:01:25, Daniel Bünzli (daniel.buen...@erratique.ch) 
> wrote:
>  
> > Re-reading the text I suspect I should not restart the rules from the first 
> > one when a  
> WB4
> > rewrite occurs but only apply the subsequent rules. Is that correct?
>  
> However even if that's correct I don't understand how this test case works:
>  
> ÷ 1F6D1 × 200D × 1F6D1 ÷ # ÷ [0.2] OCTAGONAL SIGN (ExtPict) × [4.0] ZERO 
> WIDTH JOINER (ZWJ_FE)  
> × [3.3] OCTAGONAL SIGN (ExtPict) ÷ [0.3]
>  
> Here the first two chars get rewritten with WB4 to ExtPic then if only 
> subsequent rules  
> are applied we end up in WB999 and a break between 200D and 1F6D1. 

That's nonsense and not the operational model of the algorithm, which IIRC was 
once clearly stated on this list by Mark Davis (sorry, I failed to dig out the 
message): take each boundary position candidate and apply the rules 
in sequence, taking the first one that matches, and then start over with the 
next candidate.

In that case applying the rules between 1F6D1 and 200D leads to WB4, but then 
that implicitly adds a non-boundary condition -- this is not really evident 
from the formalism, but see the comment above WB4, which for that boundary 
position settles the non-boundary condition. Then we start again applying the rules 
between 200D and the last 1F6D1, and WB3c matches before WB4 kicks in. 

I think the behaviour of → rules should be clarified: it's not clear on which 
data you apply it w.r.t. the boundary position candidate. If I understand 
correctly, if the match spans over the boundary position candidate, that simply 
turns it into a non-boundary. Otherwise you apply the rule on the left of the 
boundary position candidate.
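
To make that operational model concrete, a toy Python sketch covering just the 
1F6D1 × 200D × 1F6D1 test case (only WB3c and a WB4-style ignorable check are 
modelled, in rule order; real implementations such as ICU handle much more):

ZWJ, EXTPICT = "ZWJ", "ExtPict"

def boundary(props, i):
    # Try the rules in order at the candidate between positions i-1 and i;
    # the first rule that matches decides break vs. no-break.
    if props[i - 1] == ZWJ and props[i] == EXTPICT:   # WB3c: ZWJ x ExtPict
        return False
    if props[i] in ("Extend", "Format", ZWJ):         # WB4: absorb ignorables
        return False
    return True                                       # WB999: otherwise break

props = [EXTPICT, ZWJ, EXTPICT]               # U+1F6D1 U+200D U+1F6D1
print([boundary(props, i) for i in (1, 2)])   # [False, False] -- no breaks

Checking WB3c before the WB4 absorption is what yields × between 200D and the 
final 1F6D1, matching the published test.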

Regarding the question of my original message it seems at a certain point I 
knew better: 

  https://www.unicode.org/mail-arch/unicode-ml/y2016-m11/0151.html

Sorry for the noise. 

Daniel

P.S. I still think the UAX29 and UAX14 could benefit from clarifying the 
operational model of the rules a bit (I also have the impression that the 
formalism to express all that may not be the right one, but then I don't have 
something better to propose at the time). Also it would be nicer for 
implementers if they didn't have to factorize rules themselves (e.g. like in 
the new LB30 rules of UAX14) so that correctness of implemented rules is easier 
to assert. 





Re: UAX #29 and WB4

2020-03-04 Thread Daniel Bünzli via Unicode
On 4 March 2020 at 18:01:25, Daniel Bünzli (daniel.buen...@erratique.ch) wrote:

> Re-reading the text I suspect I should not restart the rules from the first 
> one when a WB4  
> rewrite occurs but only apply the subsequent rules. Is that correct?

However even if that's correct I don't understand how this test case works:

÷ 1F6D1 × 200D × 1F6D1 ÷ #  ÷ [0.2] OCTAGONAL SIGN (ExtPict) × [4.0] ZERO WIDTH 
JOINER (ZWJ_FE) × [3.3] OCTAGONAL SIGN (ExtPict) ÷ [0.3]

Here the first two chars get rewritten with WB4 to ExtPict; then, if only 
subsequent rules are applied, we end up in WB999 and a break between 200D and 
1F6D1. The justification in the comment indicates to use WB3c on the ZWJ but 
that one should have been rewritten to ExtPict by WB4. 

Best,

Daniel





UAX #29 and WB4

2020-03-04 Thread Daniel Bünzli via Unicode
Hello, 

My implementation of word break chokes only on the following test case from the 
file [1]: 

÷ 0020 × 0308 ÷ 0020 ÷ #  ÷ [0.2] SPACE (WSegSpace) × [4.0] COMBINING DIAERESIS 
(Extend_FE) ÷ [999.0] SPACE (WSegSpace) ÷ [0.3] 

I find: 

÷ 0020 × 0308 × 0020 ÷

Basically my implementation uses WB4 to rewrite the first two characters to 
WSegSpace and then applies WB3d, resulting in the non-break between 0308 and 
0020.

Re-reading the text I suspect I should not restart the rules from the first one 
when a WB4 rewrite occurs, but only apply the subsequent rules. Is that 
correct?

Best, 

Daniel

[1]: https://unicode.org/Public/13.0.0/ucd/auxiliary/WordBreakTest.txt








Re: UAX #14 for 13.0.0: LB27 first's line is obsolete

2020-03-03 Thread Andy Heninger via Unicode
I agree. The first part of rule LB27

(JL | JV | JT | H2 | H3) × IN

appears to be redundant.

Good catch.

  -- Andy

On Tue, Mar 3, 2020 at 1:53 PM Daniel Bünzli 
wrote:

> Hello,
>
> I think (more precisely my compiler thinks [1]) the first line of LB27 is
> already handled by the new LB22 rule and can be removed.
>
> Best,
>
> Daniel
>
> [1]
> File "uuseg_line_break.ml", line 206, characters 38-40:
>
> 206 |   | (* LB27 *)  _, (JL|JV|JT|H2|H3), (IN|PO) -> no_boundary s
> ^^
> Warning 12: this sub-pattern is unused.
>


UAX #14 for 13.0.0: LB27's first line is obsolete

2020-03-03 Thread Daniel Bünzli via Unicode
Hello, 

I think (more precisely my compiler thinks [1]) the first line of LB27 is 
already handled by the new LB22 rule and can be removed. 

Best, 

Daniel

[1]
File "uuseg_line_break.ml", line 206, characters 38-40:

206 |   | (* LB27 *)  _, (JL|JV|JT|H2|H3), (IN|PO) -> no_boundary s
                                            ^^
Warning 12: this sub-pattern is unused.



Re: Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Hans Åberg via Unicode


> On 21 Feb 2020, at 13:21, Costello, Roger L. via Unicode 
>  wrote:
> 
> There are binary files and there are text files.

In C, when opening a file as binary with the function fopen, the newlines are 
untranslated [1]. If not using this option, the file is informally text, which 
means that internally in the program, one can assume that the newline [2] is 
the character U+000A LINE FEED (LF).

1. https://en.cppreference.com/w/cpp/io/c/fopen
2. https://en.wikipedia.org/wiki/Newline
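
For comparison, the same distinction in Python (a minimal sketch; the file name
is hypothetical):

with open("example.txt", "wb") as f:     # binary mode: bytes, untranslated
    f.write(b"line1\r\nline2\r")
with open("example.txt", "rb") as f:
    print(f.read())                      # b'line1\r\nline2\r'
with open("example.txt", "r") as f:      # text mode: universal newlines
    print(f.read())                      # 'line1\nline2\n'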





RE: Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Doug Ewell via Unicode
Costello, Roger L. wrote:

> Text files may indeed contain binary (i.e., bytes that are not
> interpretable as characters). Namely, text files may contain newlines,
> tabs, and some other invisible things.
>
> Question: "characters" are defined as only the visible things, right?

In addition to this being incorrect, as Ken and Richard (so far) have pointed 
out, this isn't the distinction you are looking for.

All file formats contain data which is relevant to that file format. Zip 
files, executables, JPEGs, MP4s, all contain specific data structured in a 
specific way. If any of them has that structure interrupted by random bytes, 
the format has been broken and the file is corrupt.

It is no different for text data, which is expected to contain certain bytes 
and is normally not expected to be interrupted by a series of ranëH‰UÀHƒÈÿH

Does that help?

--
Doug Ewell | Thornton, CO, US | ewellic.org


Re: Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Ken Whistler via Unicode


On 2/21/2020 7:53 AM, Costello, Roger L. via Unicode wrote:


Text files may indeed contain binary (i.e., bytes that are not 
interpretable as characters). Namely, text files may contain newlines, 
tabs, and some other invisible things.


Question: "characters" are defined as only the visible things, right?

No. You've gone astray right there. Please read Chapter 2 of the Unicode 
Standard, and in particular, Section 2.4, Code Points and Characters:


https://www.unicode.org/versions/Unicode12.0.0/ch02.pdf#G25564

All of those types of characters can occur in Unicode plain text. (With 
the exception of surrogate code points.)



I conclude:

Binary files may contain arbitrary text.


Binary files can contain *whatever*, including text.


Text files may contain binary, but only a restricted set of binary.

The distinction is definitional. A text file contains *only* characters, 
interpretable by a specific character encoding (usually Unicode, these 
days).


But a text file need not be "plain text". An HTML file is an example of 
a text file (it contains only a sequence of characters, whose identity 
and interpretation is all clearly specified by looking them up in the 
Unicode Standard), but it is not *plain* text. It is *rich* text, 
consisting of markup tags interspersed with runs of plain text.


Another distinction that may be leading you astray is the distinction 
between binary file transfer and text file transfer. If you are using 
ftp, for example, you can specify use of binary file transfer, *even if* 
the file you are transferring is actually a text file. That simply means 
that the file transfer will agree to treat the entire file as a binary 
blob and transfer it byte-for-byte intact. A text file transfer, on the 
other hand, may look for "lines" in a text file and may adjust line 
endings to suit the receiving platform conventions.



Do you agree?


No.

--Ken



Re: Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Richard Wordingham via Unicode
On Fri, 21 Feb 2020 15:53:52 +
"Costello, Roger L. via Unicode"  wrote:

> Based on a private correspondence, I now realize that this statement:
> 
> 
> 
> > Text files do not contain binary  
> 
> 
> 
> is not correct.
> 
> 
> 
> Text files may indeed contain binary (i.e., bytes that are not
> interpretable as characters). Namely, text files may contain
> newlines, tabs, and some other invisible things.
> 
> 
> 
> Question: "characters" are defined as only the visible things, right?

No, white space (e.g. spaces, tabs and newlines) is normally considered
to be composed of characters.  And then there are much harder to discern
things, such as zero-width spaces, line-break suppressors such as
U+2060 WORD JOINER, and soft hyphens (interpreted as line-break
opportunities).

Richard.


RE: Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Costello, Roger L. via Unicode
Based on a private correspondence, I now realize that this statement:



> Text files do not contain binary



is not correct.



Text files may indeed contain binary (i.e., bytes that are not interpretable as 
characters). Namely, text files may contain newlines, tabs, and some other 
invisible things.



Question: "characters" are defined as only the visible things, right?



I conclude:



Binary files may contain arbitrary text.

Text files may contain binary, but only a restricted set of binary.



Do you agree?



/Roger


From: Costello, Roger L. 
Sent: Friday, February 21, 2020 7:22 AM
To: unicode@unicode.org
Subject: Why do binary files contain text but text files don't contain binary?

Hi Folks,

There are binary files and there are text files.

Binary files often contain portions that are text. For example, the start of 
Windows executable files is the text MZ.

To the best of my knowledge, text files never contain binary, i.e., bytes that 
cannot be interpreted as characters. (Of course, text files may contain a 
text-encoding of binary, such as base64-encoded text.)

Why the asymmetry?

/Roger


Re: Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread via Unicode

Dear Roger,

because, when Unicode is used in real life (UTF-8 and the like), then

  text ⊂ binary
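
A minimal Python sketch of that inclusion (the sample string is arbitrary):

    text = "café ⊂ 喵"
    blob = text.encode("utf-8")          # every text maps into bytes...
    assert blob.decode("utf-8") == text

    try:
        b"\xff\xfe\xfd".decode("utf-8")  # ...but not every byte sequence is text
    except UnicodeDecodeError as e:
        print("not valid UTF-8:", e)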

John Knightley

On 2020-02-21 20:21, Costello, Roger L. via Unicode wrote:

Hi Folks,

There are binary files and there are text files.

Binary files often contain portions that are text. For example, the
start of Windows executable files is the text MZ.

To the best of my knowledge, text files never contain binary, i.e.,
bytes that cannot be interpreted as characters. (Of course, text files
may contain a text-encoding of binary, such as base64-encoded text.)

Why the asymmetry?

/Roger




Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Costello, Roger L. via Unicode
Hi Folks,

There are binary files and there are text files.

Binary files often contain portions that are text. For example, the start of 
Windows executable files is the text MZ.

To the best of my knowledge, text files never contain binary, i.e., bytes that 
cannot be interpreted as characters. (Of course, text files may contain a 
text-encoding of binary, such as base64-encoded text.)
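
For instance, a short Python sketch of such a text-encoding of binary (the 
values are arbitrary):

    import base64, os

    blob = os.urandom(8)                           # arbitrary binary
    text = base64.b64encode(blob).decode("ascii")  # pure ASCII text
    assert base64.b64decode(text) == blob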

Why the asymmetry?

/Roger


Re: What should or should not be encoded in Unicode? (from Re: Egyptian Hieroglyph Man with a Laptop)

2020-02-15 Thread wjgo_10...@btinternet.com via Unicode


Joel Kalvesmaki asks nine questions, six in the first block and three in 
the second block.
Numbering from 1 through to 9 in the order that they are asked, I do 
not, at present, understand many of the questions, and I can, at 
present, only answer question 7 definitively. Some questions may need an 
answer in two parts: one part about my specific project, and the other 
about what happens if one or more other people also decide to have their 
own encoding space in a similar manner.
I realize that not even understanding the questions at this time may not 
sound very good to some of the people who do understand them, but I am 
not someone who knowingly purports to know what he is talking about when 
he does not. I am a researcher, and now that I am aware of these 
questions I need to find out, so that in the future I can answer such 
questions with a sound background knowledge of the topic.
It might be that I know of some of the matters but am not aware of the 
parlance used to describe them in the post to which I am replying.

So now to my thoughts on some of the questions.
1 to 4. I do not at present understand the question.
5. Perhaps, independent of each other, you bind !123 to a character 
semantically identical to one I've bound to !234. What rules are in 
place to allow interchangeability?
I am not sure this is the best possible answer, but with care the 
problem should not arise in the first place. I am thinking that people 
could perhaps avoid it by using an informal discussion method similar to 
that used when proposing a new alt.* group on the Usenet system that was 
in widespread use before the web was invented.

6. I do not at present understand the question.
7. Or maybe you're not so much concerned about interoperability as you 
are with extending the PUA block beyond its current limits?
No, absolutely not. I have used the Private Use Areas on a number of 
occasions and found them extremely useful to have available. Yet any 
assignment is not unique and, except in very limited, specially 
prearranged circumstances, interoperability is not possible. My research 
project is very much about interoperability with provenance. 
Interoperability with provenance is central to what I am trying to 
achieve.

8. Something like SGML/XML entities?
Until they were mentioned in the post to which I am replying, I had 
never heard of them.
9.  Couldn't you simply capitalize on the rules that already exist for 
entities?
From what I have read about them today, well, I suppose that I could, 
but that is not my approach and I am not going to use them.
My items are not emoji, but emoji are either expressed by an atomic 
character or by a sequence of atomic characters, such sequences being 
decoded upon reception to produce a glyph. My proposed system uses 
sequences of atomic characters such that those sequences could be 
decoded upon reception to produce localized output. A similar yet 
different process. I simply do not want, as a design choice, all that 
angled-bracket stuff; it is just not what I am trying to do.


If anyone on this mailing list understands some or all of what I do 
not, your comments in this thread would be very welcome.
The first three links on my webspace are relevant to my research 
project.

http://www.users.globalnet.co.uk/~ngo/
The website is safe to use. It is hosted on a server run these days by 
Plusnet PLC, a United Kingdom internet service provider. It is not 
hosted on my computer.

William Overington
Saturday 15 February 2020



-- Original Message --
From: "via Unicode" 
To: wjgo_10...@btinternet.com
Cc: unicode@unicode.org
Sent: Saturday, 2020 Feb 15 At 10:11
Subject: Re: What should or should not be encoded in Unicode? (from Re: 
Egyptian Hieroglyph Man with a Laptop)

Hi William,

I don't fully understand your proposed encoding scheme (e.g., Is there a 
namespace each encoding scheme is bound to? How do namespaces get 
encoded? How are syntax strictures encoded?), but even then, presuming 
it's sound, you've said in the message before that this encoding space 
will enhance interoperability. What mechanism is in place to make my 
encoding space interoperable with yours? Perhaps, independent of each 
other, you bind !123 to a character semantically identical to one I've 
bound to !234. What rules are in place to allow interchangeability? What 
about one-to-many or many-to-many or vague or ambiguous mappings across 
encoding schemes, or mappings that we might reasonably contest?


Or maybe you're not so much concerned about interoperability as you are 
with extending the PUA block beyond its current limits? Something 
like SGML/XML entities? Couldn't you simply capitalize on the rules that 
already exist for entities?


Best wishes,

jk
--
Joel Kalvesmaki
Director, Text Alignment Network
http://textalign.net

On 2020-02-14 15:52, wjgo_10...@btinternet.com via Unicode wrote:

Re: What should or should not be encoded in Unicode? (from Re: Egyptian Hieroglyph Man with a Laptop)

2020-02-15 Thread via Unicode

Hi William,

I don't fully understand your proposed encoding scheme (e.g., Is there a 
namespace each encoding scheme is bound to? How do namespaces get 
encoded? How are syntax strictures encoded?), but even then, presuming 
it's sound, you've said in the message before that this encoding space 
will enhance interoperability. What mechanism is in place to make my 
encoding space interoperable with yours? Perhaps, independent of each 
other, you bind !123 to a character semantically identical to one I've 
bound to !234. What rules are in place to allow interchangeability? What 
about one-to-many or many-to-many or vague or ambiguous mappings across 
encoding schemes, or mappings that we might reasonably contest?


Or maybe you're not so much concerned about interoperability as you are 
with extending the PUA block beyond its current limits? Something 
like SGML/XML entities? Couldn't you simply capitalize on the rules that 
already exist for entities?


Best wishes,

jk
--
Joel Kalvesmaki
Director, Text Alignment Network
http://textalign.net

On 2020-02-14 15:52, wjgo_10...@btinternet.com via Unicode wrote:
The solution is to invent my own encoding space. This sits on top of 
Unicode, could be (perhaps?) called markup, but it works!


It may be perilous, because some software may enforce the strict 
official code point limits.


I  have now realized that what I wrote before is ambiguous.

When I wrote "sits on top of Unicode" I was not meaning at some code
points above U+10FFFF in the Unicode map, though I accept that it
could quite reasonably be read as meaning that.

My encoding space sits on top of Unicode in the sense that it uses a
sequence of regular Unicode characters for each code point in my
encoding space.

For example

∫⑦⑧①

or

!781

or

a character sequence of a base character, followed by a tag
exclamation mark followed by three tag digits and a cancel tag.

All three examples above have the same meaning.

∫⑦⑧① is useful as being less likely to occur by accident than !123,
though !123 is easier to use and could be used in a GS1-128 barcode.

The tag sequence has the potential to become incorporated into Unicode
for universal standardization of unambiguous interoperability
everywhere. That is a long term goal for me.

The example above uses a three-digit code number. My encoding space
allows for various numbers of digits, with a minimum of three digits
and a much larger theoretical maximum. The most digits in use at
present in my research project in any one code number is six.

William Overington

Friday 14 February 2020


Re: What should or should not be encoded in Unicode? (from Re: Egyptian Hieroglyph Man with a Laptop)

2020-02-14 Thread wjgo_10...@btinternet.com via Unicode
The solution is to invent my own encoding space. This sits on top of 
Unicode, could be (perhaps?) called markup, but it works!


It may be perilous, because some software may enforce the strict 
official code point limits.


I  have now realized that what I wrote before is ambiguous.

When I wrote "sits on top of Unicode" I was not meaning at some code 
points above U+10FFFF in the Unicode map, though I accept that it could 
quite reasonably be read as meaning that.


My encoding space sits on top of Unicode in the sense that it uses a 
sequence of regular Unicode characters for each code point in my 
encoding space.


For example

∫⑦⑧①

or

!781

or

a character sequence of a base character, followed by a tag exclamation 
mark followed by three tag digits and a cancel tag.


All three examples above have the same meaning.

∫⑦⑧① is useful as being less likely to occur by accident than !123, 
though !123 is easier to use and could be used in a GS1-128 barcode.


The tag sequence has the potential to become incorporated into Unicode 
for universal standardization of unambiguous interoperability 
everywhere. That is a long term goal for me.


The example above uses a three-digit code number. My encoding space 
allows for various numbers of digits, with a minimum of three digits and 
a much larger theoretical maximum. The most digits in use at present in 
my research project in any one code number is six.
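
As a sketch in Python of the tag-sequence form (the base character "a" 
here is just a placeholder; the mapping uses the standard tag characters 
U+E0020..U+E007E and U+E007F CANCEL TAG):

    def to_tags(s: str) -> str:
        """Map an ASCII string to the corresponding Unicode tag characters."""
        return "".join(chr(0xE0000 + ord(c)) for c in s)

    CANCEL_TAG = "\U000E007F"

    # base character + tag exclamation mark + three tag digits + cancel tag
    seq = "a" + to_tags("!781") + CANCEL_TAG

    print([f"U+{ord(c):04X}" for c in seq])
    # ['U+0061', 'U+E0021', 'U+E0037', 'U+E0038', 'U+E0031', 'U+E007F']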


William Overington

Friday 14 February 2020




Re: What should or should not be encoded in Unicode? (from Re: Egyptian Hieroglyph Man with a Laptop)

2020-02-14 Thread Hans Åberg via Unicode

> On 13 Feb 2020, at 16:41, wjgo_10...@btinternet.com via Unicode 
>  wrote:
> 
> Yet a Private Use Area encoding at a particular code point is not unique. 
> Thus, except with care amongst people who are aware of the particular 
> encoding, there is no interoperability, such as with regular Unicode encoded 
> characters.
> 
> However faced with a need for interoperability for my research project, I 
> have found a solution making use of the Glyph Substitution capability of an 
> OpenType font.
> 
> The solution is to invent my own encoding space. This sits on top of Unicode, 
> could be (perhaps?) called markup, but it works!

It may be perilous, because some software may enforce the strict official code 
point limits.



Re: Egyptian Hieroglyph Man with a Laptop

2020-02-14 Thread Adam Borowski via Unicode
On Thu, Feb 13, 2020 at 09:15:18PM +, Richard Wordingham via Unicode wrote:
> On Thu, 13 Feb 2020 20:15:07 +
> Shawn Steele via Unicode  wrote:
> 
> > I confess that even though I know nothing about Hieroglyphs, that I
> > find it fascinating that such a thoroughly dead script might still be
> > living in some way, even if it's only a little bit.
> 
> Plenty of people have learnt how to write their name in hieroglyphs.
> However, it is rare enough that my initials suffice to label my milk at
> work.
> 
> What's more striking is the implication that people are still
> exchanging messages in Middle Egyptian.

I don't think non-Egyptologist recipients are even aware what language that
is, or even that it's an actual meaningful message rather than a
hieroglyph-looking doodle.  It's like maker's marks done by/for illiterate people
(such as most artisans in the past) -- as long as it's a distinct symbol,
it does its job.

For example, I end my work emails with "ᛗᛖᛟᚹ" and everyone so far assumed
it's either my initials or at most some greeting.


喵!
-- 
⢀⣴⠾⠻⢶⣦⠀ Latin:   meow 4 characters, 4 columns,  4 bytes
⣾⠁⢠⠒⠀⣿⡁ Greek:   μεου 4 characters, 4 columns,  8 bytes
⢿⡄⠘⠷⠚⠋  Runes:   ᛗᛖᛟᚹ 4 characters, 4 columns, 12 bytes
⠈⠳⣄ Chinese: 喵   1 character,  2 columns,  3 bytes <-- best!


Aw: RE: Egyptian Hieroglyph Man with a Laptop

2020-02-14 Thread Marius Spix via Unicode
That glyph is encoded at position U+1F5B3 OLD PERSONAL COMPUTER; see 
http://users.teilar.gr/~g1951d/Aegyptus.pdf

Sent: Thursday, 13 February 2020 at 07:58
From: "うみほたる via Unicode" 
To: unicode@unicode.org
Subject: RE: Egyptian Hieroglyph Man with a Laptop
The early versions of the font Aegyptus (http://users.teilar.gr/~g1951d/) have 
the glyph as one of the "Dingbats", distinguished from general characters.
The attached image is from the PDF file for Aegyptus.ttf version 3.17 (2012).



Re: Egyptian Hieroglyph Man with a Laptop

2020-02-13 Thread via Unicode




"Strange" has several meanings, not all positive. Perhaps the term 
outlier is less ambiguous. One definition is unfamiliar: some outliers 
become widespread in use over time and become familiar, so we no longer 
consider them strange, but as they are still different they are still 
outliers. CJK is a living script, so new characters come and go; not all 
become widespread in their use.


"Egyptologist" is certainly an outlier, an certainly strange to me. One 
question is what do "Egyptologist" think of it.


John

On 2020-02-14 08:13, Ken Whistler via Unicode wrote:

Well, no, in this case "strange" means strange, as Ken Lunde notes.
I'm just pointing to his list, because it pulls together quite a few
Han characters that *also* have dubious cases for encoding.

Or you could turn the argument around, I suppose, and note that just
because the hieroglyph for "Egyptologist" is strange, that doesn't
necessarily mean that the case for encoding it is dubious. ;-)

--Ken

On 2/13/2020 3:47 PM, j...@koremail.com wrote:
An interesting comparison: if strange means dubious, then the name 
kStrange should be changed or some of the content removed, because many 
of the characters in the set are not dubious in the least.






Re: Egyptian Hieroglyph Man with a Laptop

2020-02-13 Thread Ken Whistler via Unicode
Well, no, in this case "strange" means strange, as Ken Lunde notes. I'm 
just pointing to his list, because it pulls together quite a few Han 
characters that *also* have dubious cases for encoding.


Or you could turn the argument around, I suppose, and note that just 
because the hieroglyph for "Egyptologist" is strange, that doesn't 
necessarily mean that the case for encoding it is dubious. ;-)


--Ken

On 2/13/2020 3:47 PM, j...@koremail.com wrote:
An interesting comparison: if strange means dubious, then the name 
kStrange should be changed or some of the content removed, because many 
of the characters in the set are not dubious in the least.




Re: Egyptian Hieroglyph Man with a Laptop

2020-02-13 Thread via Unicode

Dear Ken

An interesting comparison: if strange means dubious, then the name 
kStrange should be changed or some of the content removed, because many 
of the characters in the set are not dubious in the least.


Regards
John

On 2020-02-14 04:08, Ken Whistler via Unicode wrote:

You want "dubious"?!

You should see the hundreds of strange characters already encoded in
the CJK *Unified* Ideographs blocks, as recently documented in great
detail by Ken Lunde:

https://www.unicode.org/L2/L2020/20059-unihan-kstrange-update.pdf

Compared to many of those, a hieroglyph of a man (or woman) holding a
laptop is positively orthodox!

--Ken

On 2/13/2020 11:47 AM, Phake Nick via Unicode wrote:
Those characters could also be put into another block for the same 
script similar to how dubious characters in CJK are included by 
placing them into "CJK Compatibility Ideographs" for round trip 
compatibility with source encoding.




Re: Egyptian Hieroglyph Man with a Laptop

2020-02-13 Thread Richard Wordingham via Unicode
On Thu, 13 Feb 2020 20:15:07 +
Shawn Steele via Unicode  wrote:

> I confess that even though I know nothing about Hieroglyphs, that I
> find it fascinating that such a thoroughly dead script might still be
> living in some way, even if it's only a little bit.

Plenty of people have learnt how to write their name in hieroglyphs.
However, it is rare enough that my initials suffice to label my milk at
work.

What's more striking is the implication that people are still
exchanging messages in Middle Egyptian.

Richard.


Re: Egyptian Hieroglyph Man with a Laptop

2020-02-13 Thread Asmus Freytag via Unicode

  
  
On 2/12/2020 3:26 PM, Shawn Steele via Unicode wrote:

>> From the point of view of Unicode, it is simpler: If the character is in use or have had use, it should be included somehow.
>
> That bar, to me, seems too low.  Many things are only used briefly or in a private context that doesn't really require encoding.

The term "use" clearly should be understood as "used in active public interchange".

From that point on, it gets tricky. Generally, standardizing something presupposes a community with shared, active conventions of usage. However, sometimes, what the community would like is to represent faithfully somebody's private convention, or some convention that's fallen out of use.

Such scenarios may require exceptions to the general statement, but the distinction between truly ephemeral use and use that, while limited in time, should be digitally archivable in plain text is and always should be a matter of judgment.

> The hieroglyphs discussion is interesting because it presents them as living (in at least some sense) even though they're a historical script.  Apparently modern Egyptologists are coopting them for their own needs.  There are lots of emoji for professional fields.  In this case since hieroglyphs are pictorial, it seems they've blurred the lines between the script and emoji.  Given their field, I'd probably do the same thing.

Focusing on the community of scholars (and any other current users) rather than the historical community of original users seems rather the appropriate thing to do. Whenever a modern community uses a historic script, new conventions will emerge. These may even include conventions around transcribing existing documents (because the historic communities had no conventions around digitizing their canon).

> I'm not opposed to the character if Egyptologists use it amongst themselves, though it does make me wonder if it belongs in this set?  Are there other "modern" hieroglyphs?  (Other than the errors, etc mentioned earlier, but rather glyphs that have been invented for modern use.)

I think the proposed location is totally fine. Trying to fine-tune a judgement about characters by placing them in specific ways is a fool's game. If needed, distinctions can be expressed via character properties.

A./

> -Shawn

RE: Egyptian Hieroglyph Man with a Laptop

2020-02-13 Thread Shawn Steele via Unicode
I'm not opposed to a sub-block for "Modern Hieroglyphs"  

I confess that even though I know nothing about Hieroglyphs, that I find it 
fascinating that such a thoroughly dead script might still be living in some 
way, even if it's only a little bit.

-Shawn

-Original Message-
From: Unicode  On Behalf Of Ken Whistler via 
Unicode
Sent: Thursday, February 13, 2020 12:08 PM
To: Phake Nick 
Cc: unicode@unicode.org
Subject: Re: Egyptian Hieroglyph Man with a Laptop

You want "dubious"?!

You should see the hundreds of strange characters already encoded in the CJK 
*Unified* Ideographs blocks, as recently documented in great detail by Ken 
Lunde:

https://www.unicode.org/L2/L2020/20059-unihan-kstrange-update.pdf

Compared to many of those, a hieroglyph of a man (or woman) holding a laptop is 
positively orthodox!

--Ken

On 2/13/2020 11:47 AM, Phake Nick via Unicode wrote:
> Those characters could also be put into another block for the same 
> script similar to how dubious characters in CJK are included by 
> placing them into "CJK Compatibility Ideographs" for round trip 
> compatibility with source encoding.



Re: Egyptian Hieroglyph Man with a Laptop

2020-02-13 Thread Ken Whistler via Unicode

You want "dubious"?!

You should see the hundreds of strange characters already encoded in the 
CJK *Unified* Ideographs blocks, as recently documented in great detail 
by Ken Lunde:


https://www.unicode.org/L2/L2020/20059-unihan-kstrange-update.pdf

Compared to many of those, a hieroglyph of a man (or woman) holding a 
laptop is positively orthodox!


--Ken

On 2/13/2020 11:47 AM, Phake Nick via Unicode wrote:
Those characters could also be put into another block for the same 
script similar to how dubious characters in CJK are included by 
placing them into "CJK Compatibility Ideographs" for round trip 
compatibility with source encoding.


Re: Egyptian Hieroglyph Man with a Laptop

2020-02-13 Thread Phake Nick via Unicode
Those characters could also be put into another block for the same script
similar to how dubious characters in CJK are included by placing them into
"CJK Compatibility Ideographs" for round trip compatibility with source
encoding.

On Fri, 14 Feb 2020 at 03:35, Richard Wordingham via Unicode wrote:

> On Thu, 13 Feb 2020 10:18:40 +0100
> Hans Åberg via Unicode  wrote:
>
> > > On 13 Feb 2020, at 00:26, Shawn Steele 
> > > wrote:
> > >> From the point of view of Unicode, it is simpler: If the character
> > >> is in use or have had use, it should be included somehow.
> > >
> > > That bar, to me, seems too low.  Many things are only used briefly
> > > or in a private context that doesn't really require encoding.
> >
> > That is a private use area for more special use.
>
> Writing the plural ('Egyptologists') by writing the plural strokes below
> the glyph could be difficult if the renderer won't include them in the
> same script run.
>
> Richard.
>
>


Re: Egyptian Hieroglyph Man with a Laptop

2020-02-13 Thread Richard Wordingham via Unicode
On Thu, 13 Feb 2020 10:18:40 +0100
Hans Åberg via Unicode  wrote:

> > On 13 Feb 2020, at 00:26, Shawn Steele 
> > wrote: 
> >> From the point of view of Unicode, it is simpler: If the character
> >> is in use or have had use, it should be included somehow.  
> > 
> > That bar, to me, seems too low.  Many things are only used briefly
> > or in a private context that doesn't really require encoding.  
> 
> That is a private use area for more special use.

Writing the plural ('Egyptologists') by writing the plural strokes below
the glyph could be difficult if the renderer won't include them in the
same script run.

Richard.



What should or should not be encoded in Unicode? (from Re: Egyptian Hieroglyph Man with a Laptop)

2020-02-13 Thread wjgo_10...@btinternet.com via Unicode
Hans Åberg >>> From the point of view of Unicode, it is simpler: If the 
character is in use or have had use, it should be included somehow.


Shawn Steele >> That bar, to me, seems too low.  Many things are only 
used briefly or in a private context that doesn't really require 
encoding.


Hans Åberg > That is a private use area for more special use.

I have used the Private Use Area, quite a lot over many years.

I have a licence for a fontmaking program, FontCreator. A good feature 
of the Windows operating system is that all installed fonts can be used 
in most installed programs. Private Use Area code points are official 
Unicode code points. These three factors together allow me to design and 
produce TrueType fonts for new symbols each encoded at a Private Use 
Area code point (a different code point for each such novel symbol), 
install the fonts, and use them in various programs, including a desktop 
publishing program and thereby make PDF (Portable Document Format) 
documents that include both ordinary text and the novel symbols. These 
PDF documents are then suitable for placing on the web and for Legal 
Deposit with The British Library.


Yet a Private Use Area encoding at a particular code point is not 
unique. Thus, except with care amongst people who are aware of the 
particular encoding, there is no interoperability, such as with regular 
Unicode encoded characters.


However faced with a need for interoperability for my research project, 
I have found a solution making use of the Glyph Substitution capability 
of an OpenType font.


The solution is to invent my own encoding space. This sits on top of 
Unicode, could be (perhaps?) called markup, but it works!


I am hoping that at some future time the results of my research will 
become encoded as an International Standard, and that my encoding space 
will then after that become integrated into Unicode, thus achieving 
fully standardized unique interoperable encoding as part of Unicode. 
Quite a dream, but the way to achieve such a fully standardized unique 
interoperable encoding as part of Unicode is from a technological point 
of view, quite straightforward. There are details of this in the 
Accumulated Feedback on Public Review Issue #408.


https://www.unicode.org/review/pri408/

Yet having my encoding space in this manner is just something that I 
have done on my own initiative. Anybody can have his or her own encoding 
space if he or she so chooses. With a little care and consideration for 
others these encodings need not clash one with another and all could 
even coexist in one document.


Having my own encoding space has enabled me to make progress with my 
research project.


William Overington

Thursday 13 February 2020





Re: Egyptian Hieroglyph Man with a Laptop

2020-02-13 Thread Frédéric Grosshans via Unicode

On 12/02/2020 at 23:30, Michel Suignard wrote:


Interesting that a single character is creating so much feedback, but 
it is not the first time.


Extrapolating from my own case, I guess it’s because hieroglyphs have a 
strong cultural significance — especially to people following Unicode 
encoding — but that very few are qualified enough to offer a judgement, 
except maybe for this character.



It is true that the glyph in question was not in the base 
Hieroglyphica glyph set (that is why I referenced it as an 
'extension'). Its presence though raises an interesting point 
concerning abstraction of Egyptian hieroglyphs in general. All 
Egyptian hieroglyph proposals imply some abstraction from the 
original evidence found on stone, wood, papyrus. At some point you 
have to decide some level where you feel confident that you created 
enough glyphs to allow meaningful interaction among Egyptologists. 
Because the set represents an extinct system you probably have to be a 
bit liberal in allowing some visual variants (because we can never be 
completely sure two similar looking signs are 100% equivalent in all 
their possible functions in the writing system and are never used in 
contrast).


This is clearly a difficult problem to tackle, with both extinct and 
logographic scripts, and hieroglyphs are both. It is obvious to me (and 
probably to anyone following Unicode encoding) that the work you have 
been doing over the last few years is a very difficult one. By the way, 
you explain this approach very well on page 6, when you take the 
“disunification” of *U+14828 N-19-016 and the already encoded 
U+1321A N037A (which would be N-19-017).


These abstract collections have started to appear in the first part of 
the nineteenth century (Champollion starting in 1822). Interestingly 
these collections have started to be useful on their own even if in 
some case the main use of  parts is self-referencing, either because 
the glyph is a known mistake, or a ghost (character for which 
attestation is now firmly disputed). For example, it would be very 
difficult to create a new set not including the full Gardiner set, 
even if some of the characters are not necessarily justified. To a 
large degree, Hieroglyphica (and its related collection JSesh) has 
obtained that status as well. The IFAO (Institut Français 
d’Archéologie Orientale) set is another one, although there is no 
modern font representing all of it (although many of the IFAO glyphs 
should not be encoded separately).


I see this as a variant of the “round-trip compatibility” principle of 
Unicode adapted to ancient scripts, where the role of “legacy standards” 
is often taken by old scholarly literature.



There is obviously no doubt that the character in question is a modern 
invention and not based on historical evidence. But interestingly 
enough it has started to be used as a pictogram with some content 
value, describing in fact an Egyptologist. It may not belong to that 
block, but it actually describes a use case and has been used as a 
symbol in some technical publication.


I think the main problem I see with this character is that it seems to 
have been sneaked into the main proposal. The text of the proposal seems 
to imply that the characters proposed were either in use in ancient 
Egypt or correspond to abstractions used by modern (= Champollion and 
later) Egyptologists and intended to reflect them.


This character does not fit in this picture, but that does not mean it 
does not belong to the hieroglyphic block: I think modern use of 
hieroglyphs (like e.g. the ones described in Hieroglyphs For Your Eyes 
Only: Samuel K. Lothrop and His Use of Ancient Egyptian as Cipher, by 
Pierre Meyrat (http://www.mesoweb.com/articles/meyrat/Meyrat2014.pdf, 
2014)) should use the standard Unicode encoding. There is a precedent in 
encoding modern characters in an extinct script with the encoding of 
Tolkienian characters U+16F1 to U+16F3 in the Runic block.


But I feel the encoding of such a character needs at the very least to 
be explicitly discussed in the text of the proposal, e.g. by giving 
evidence of its modern use.


Concerning:

> The question is then: was this well known among people reading 
> hieroglyphs who checked this proposal? If not, it is very difficult to 
> trust other hieroglyphs, especially if the first explanation is the 
> right one: some trap characters could actually look like real ones. 
> Except of course if we accept some hieroglyphs for compatibility 
> purposes, but this is not mentioned as a valid reason in any proposal 
> yet.

>> In my opinion, this is an invalid character, which should not be 
>> included in Unicode.

> I agree.

You are allowed to have your own opinion, but I can tell you I have 
spent a lot of time checking attestation from many sources for the 
proposed repertoire. It won’t be perfect, but perfection (or a closer 
reach) would probably cost decades in study while preventing current 
research from having a communication platform. I 

Re: Egyptian Hieroglyph Man with a Laptop

2020-02-13 Thread Hans Åberg via Unicode


> On 13 Feb 2020, at 00:26, Shawn Steele  wrote:
> 
>> From the point of view of Unicode, it is simpler: If the character is in use 
>> or have had use, it should be included somehow.
> 
> That bar, to me, seems too low.  Many things are only used briefly or in a 
> private context that doesn't really require encoding.

That is a private use area for more special use.





RE: Egyptian Hieroglyph Man with a Laptop

2020-02-12 Thread Shawn Steele via Unicode
> From the point of view of Unicode, it is simpler: If the character is in use 
> or have had use, it should be included somehow.

That bar, to me, seems too low.  Many things are only used briefly or in a 
private context that doesn't really require encoding.

The hieroglyphs discussion is interesting because it presents them as living 
(in at least some sense) even though they're a historical script.  Apparently 
modern Egyptologists are coopting them for their own needs.  There are lots of 
emoji for professional fields.  In this case since hieroglyphs are pictorial, 
it seems they've blurred the lines between the script and emoji.  Given their 
field, I'd probably do the same thing.

I'm not opposed to the character if Egyptologists use it amongst themselves, 
though it does make me wonder if it belongs in this set?  Are there other 
"modern" hieroglyphs?  (Other than the errors, etc mentioned earlier, but 
rather glyphs that have been invented for modern use).

-Shawn 




Re: Egyptian Hieroglyph Man with a Laptop

2020-02-12 Thread Hans Åberg via Unicode


> On 12 Feb 2020, at 23:30, Michel Suignard via Unicode  
> wrote:
> 
> These abstract collections have started to appear in the first part of the 
> nineteenth century (Champollion starting in 1822). Interestingly these 
> collections have started to be useful on their own even if in some case the 
> main use of  parts is self-referencing, either because the glyph is a known 
> mistake, or a ghost (character for which attestation is now firmly disputed). 
> For example, it would be very difficult to create a new set not including the 
> full Gardiner set, even if some of the characters are not necessarily 
> justified. To a large degree, Hieroglyphica (and its related collection 
> JSesh) has obtained that status as well. The IFAO (Institut Français 
> d’Archéologie Orientale) set is another one, although there is no modern 
> font representing all of it (although many of the IFAO glyphs should not be 
> encoded separately).
> 
> There is obviously no doubt that the character in question is a 
> modern invention and not based on historical evidence. But interestingly 
> enough it has started to be used as a pictogram with some content value, 
> describing in fact an Egyptologist. It may not belong to that block, but it 
> actually describes a use case and has been used as a symbol in some technical 
> publication.

From the point of view of Unicode, it is simpler: If the character is in use 
or have had use, it should be included somehow.





RE: Egyptian Hieroglyph Man with a Laptop

2020-02-12 Thread Michel Suignard via Unicode
>> In my opinion, this is an invalid character, which should not be 
>> included in Unicode.

> I agree.
>
>   Frédéric

> On Thu, 12 Feb 2020 19:12:14 +0100
> Frédéric Grosshans via Unicode  wrote:
>
>> Dear Unicode list members (CC Michel Suignard),
>>
>> the Unicode proposal L2/20-068
>> <https://www.unicode.org/L2/L2020/20068-n5128-ext-hieroglyph.pdf>,
>> “Revised draft for the encoding of an extended Egyptian Hieroglyphs
>> repertoire, Groups A to N” (
>> https://www.unicode.org/L2/L2020/20068-n5128-ext-hieroglyph.pdf ) by
>> Michel Suignard contains a very interesting hieroglyph at position
>> *U+13579 EGYPTIAN HIEROGLYPH A-12-054, which seems to represent a man
>> with a laptop, as is obvious in the attached image.
>>
>> I am curious about the source of this hieroglyph: in the table
>> accompanying the document, its sources are said to be “Hieroglyphica
>> extension (various sources)” with number A58C and “Hornung & Schenkel
>> (2007, last modified in 2015)”, but with no number (A;), which seems
>> unique in the table. It leads me to think this glyph only exists in
>> some modern font, either as a joke, or for some computer related
>> modern use. Can anyone confirm or refute this intuition?
>>
>>  Frédéric


Re: Egyptian Hieroglyph Man with a Laptop

2020-02-12 Thread Markus Scherer via Unicode
On Wed, Feb 12, 2020 at 11:37 AM Marius Spix via Unicode <
unicode@unicode.org> wrote:

> In my opinion, this is an invalid character, which should not be
> included in Unicode.
>

Please remember that feedback that you want the committee to look at needs
to go through http://www.unicode.org/reporting.html

Best regards,
markus


Re: Egyptian Hieroglyph Man with a Laptop

2020-02-12 Thread Joe Becker via Unicode



I assume this glyph was created to honor Cleo Huggins, the designer of 
Sonata at Adobe, who decades ago created a similar hieroglyph of a 
*woman* in front of her computer.


Joe






Re: Egyptian Hieroglyph Man with a Laptop

2020-02-12 Thread Frédéric Grosshans via Unicode

On 12/02/2020 at 20:38, Marius Spix wrote:

That is a pretty interesting finding. This glyph was not part of
http://www.unicode.org/L2/L2018/18165-n4944-hieroglyphs.pdf


It is, as *U+1355A EGYPTIAN HIEROGLYPH A-12-051



but has been first seen in
http://www.unicode.org/L2/L2019/19220-n5063-hieroglyphs.pdf

The only "evidence" for this glyph I could find, is a stock photo,
which is clearly made in the 21st century.
https://www.alamy.com/stock-photo-egyptian-hieroglyphics-with-notebook-digital-illustration-57472465.html
I don’t even think it could qualify, since I think the woman in this 
picture would correspond to another hieroglyph, from the B series 
(B-04), not an A-12.


I know that some font creators include so-called trap characters,
similar to the trap streets which are often found in maps to catch copyright
violations. But it is also possible that someone wanted to smuggle
an Easter egg into Unicode, or just test if the quality assurance works.


The question is then: was this well known among people reading 
hieroglyphs who checked this proposal? If not, it is very difficult to 
trust other hieroglyphs, especially if the first explanation is the right 
one: some trap characters could actually look like real ones. Except of 
course if we accept some hieroglyphs for compatibility purposes, but this 
is not mentioned as a valid reason in any proposal yet.



In my opinion, this is an invalid character, which should not be
included in Unicode.


I agree.

  Frédéric



On Thu, 12 Feb 2020 19:12:14 +0100
Frédéric Grosshans via Unicode  wrote:


Dear Unicode list members (CC Michel Suignard),

    the Unicode proposal L2/20-068
<https://www.unicode.org/L2/L2020/20068-n5128-ext-hieroglyph.pdf>,
“Revised draft for the encoding of an extended Egyptian Hieroglyphs
repertoire, Groups A to N” (
https://www.unicode.org/L2/L2020/20068-n5128-ext-hieroglyph.pdf ) by
Michel Suignard contains a very interesting hieroglyph at position
*U+13579 EGYPTIAN HIEROGLYPH A-12-054, which seems to represent a man
with a laptop, as is obvious in the attached image.

    I am curious about the source of this hieroglyph: in the table
accompanying the document, its sources are said to be “Hieroglyphica
extension (various sources)” with number A58C and “Hornung & Schenkel
(2007, last modified in 2015)”, but with no number (A;), which seems
unique in the table. It leads me to think this glyph only exists in
some modern font, either as a joke, or for some computer related
modern use. Can anyone confirm or refute this intuition?

     Frédéric






Re: Egyptian Hieroglyph Man with a Laptop

2020-02-12 Thread Marius Spix via Unicode
That is a pretty interesting finding. This glyph was not part of
http://www.unicode.org/L2/L2018/18165-n4944-hieroglyphs.pdf
but has been first seen in
http://www.unicode.org/L2/L2019/19220-n5063-hieroglyphs.pdf

The only "evidence" for this glyph I could find, is a stock photo,
which is clearly made in the 21st century.
https://www.alamy.com/stock-photo-egyptian-hieroglyphics-with-notebook-digital-illustration-57472465.html

I know that some font creators include so-called trap characters,
similar to the trap streets which are often found in maps to catch copyright
violations. But it is also possible that someone wanted to smuggle
an Easter egg into Unicode, or just test if the quality assurance works.

In my opinion, this is an invalid character, which should not be
included in Unicode.


On Thu, 12 Feb 2020 19:12:14 +0100
Frédéric Grosshans via Unicode  wrote:

> Dear Unicode list members (CC Michel Suignard),
> 
>    the Unicode proposal L2/20-068 
> <https://www.unicode.org/L2/L2020/20068-n5128-ext-hieroglyph.pdf>, 
> “Revised draft for the encoding of an extended Egyptian Hieroglyphs 
> repertoire, Groups A to N” ( 
> https://www.unicode.org/L2/L2020/20068-n5128-ext-hieroglyph.pdf ) by 
> Michel Suignard contains a very interesting hieroglyph at position 
> *U+13579 EGYPTIAN HIEROGLYPH A-12-054, which seems to represent a man 
> with a laptop, as is obvious in the attached image.
> 
>    I am curious about the source of this hieroglyph: in the table 
> accompanying the document, its sources are said to be “Hieroglyphica 
> extension (various sources)” with number A58C and “Hornung & Schenkel 
> (2007, last modified in 2015)”, but with no number (A;), which seems 
> unique in the table. It leads me to think this glyph only exists in
> some modern font, either as a joke, or for some computer related
> modern use. Can anyone confirm or refute this intuition?
> 
>     Frédéric
> 
> 




Egyptian Hieroglyph Man with a Laptop

2020-02-12 Thread Frédéric Grosshans via Unicode

Dear Unicode list members (CC Michel Suignard),

  the Unicode proposal L2/20-068 
, 
“Revised draft for the encoding of an extended Egyptian Hieroglyphs 
repertoire, Groups A to N” ( 
https://www.unicode.org/L2/L2020/20068-n5128-ext-hieroglyph.pdf ) by 
Michel Suignard contains a very interesting hieroglyph at position 
*U+13579 EGYPTIAN HIEROGLYPH A-12-054, which seems to represent a man 
with a laptop, as is obvious in the attached image.


  I am curious about the source of this hieroglyph: in the table 
accompanying the document, its sources are said to be “Hieroglyphica 
extension (various sources)” with number A58C and “Hornung & Schenkel 
(2007, last modified in 2015)”, but with no number (A;), which seems 
unique in the table. It leads me to think this glyph only exists in some 
modern font, either as a joke, or for some computer related modern use. 
Can anyone confirm or refute this intuition?


   Frédéric




RE: Could U+E0001 LANGUAGE TAG become undeprecated please? There is a good reason why I ask

2020-02-12 Thread Sławomir Osipiuk via Unicode
On Wed, Feb 12, 2020 at 11:28 AM wjgo_10...@btinternet.com via Unicode 
 wrote:
>
> I am reminded of the teletext system (with brand names such as Ceefax and 
> Oracle) in the United KIngdom, which was a broadcasting technology introduced 
> in the 1970s and which became very much a part of British culture during the 
> 1980s and 1990s. A digital signal of a special purpose 7-bit character set 
> was broadcast in the vertical blanking interval of a 625 line analogue 
> television signal.
[...]
> It seems to me that there could be, in the future, a type of thing that sends 
> out a continuous signal over a wire of, say, a temperature reading at its 
> location, all formatted in several languages. So, no passwords, no input from 
> an end user, just a continuous feeding into The Internet of Things its 
> output, with the numerical value in the messages changed as the temperature 
> changes. This would allow the digits to be expressed in the digits used in 
> the particular script of the particular language used in an individual 
> message.

Teletext had a data rate of 7 kilobits/s (less than 1 kilobyte/s), was cleverly 
grafted onto a system never designed for it, and the terminals to display it 
couldn't handle modern markup. Language tags, or something very like them, 
would make sense for very low-rate transmissions like Teletext (or the similar 
Line 21 closed captions in NTSC). It's too late for them, though.

The proposal is for "Internet of Things". In 2020, 1 kbps transmissions are 
laughably slow, unless you're talking to the Voyager space probes. Receiving 
equipment, even at the lowest end, has more than enough processing power to 
interpret a proper markup language. If for some reason you really do want to 
minimize data rate, you're better off with data compression rather than saving 
bytes by using Unicode language tags instead of XML. The receiving equipment 
can handle a decompression step at basically no cost (that wasn't true in the 
1970s), and markup languages compress very well.
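
A rough Python sketch supporting that point (the XML payload is invented 
for the example):

    import zlib

    msg = ("<readings>"
           + '<temp lang="en" unit="C">21.5</temp>' * 50
           + "</readings>").encode("utf-8")

    packed = zlib.compress(msg, 9)
    print(len(msg), "->", len(packed), "bytes")  # repetitive markup compresses very well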

The particular circumstances that would encourage Unicode tag characters don't 
exist today: razor-thin data rates and minuscule receiver processing power. With 
the resources we have now, anything done by tag characters can be done BETTER 
with proper encapsulating protocols and markup.

With all that said, there is no Unicode Police that will come banging on your 
door if you make a system that uses the tag characters. If you, or anyone, 
thinks it's the best solution for a particular project, then do it. Deprecation 
just means, "There are better ways of doing this. Seriously, please look 
around." And I think that message is still valid.

(This reply may read as overly critical, but I'm very much enjoying this 
discussion.)

Sławomir Osipiuk





RE: Could U+E0001 LANGUAGE TAG become undeprecated please? There is a good reason why I ask

2020-02-12 Thread wjgo_10...@btinternet.com via Unicode


Hi

At the time, I thought that my post yesterday concluded the thread. 
However, later something occurred to me as a result of something in the 
post by Sławomir Osipiuk.


The gentleman wrote as follows:

Sending multiples of the same message in different languages is really 
only applicable to broadcast/multicast scenarios, where you have a 
transmission going out live to multiple recipients who have different 
language demands. I can't immediately think of any examples where this 
is done with plain-text only, though I'd be glad to learn about them, 
if they exist.

Whilst I do not know of anywhere that this is presently done, I 
realized that this would be a practical proposition for some of the 
things in the Internet of Things.
I am reminded of the teletext system (with brand names such as Ceefax 
and Oracle) in the United Kingdom, which was a broadcasting technology 
introduced in the 1970s and which became very much a part of British 
culture during the 1980s and 1990s. A digital signal of a special 
purpose 7-bit character set was broadcast in the vertical blanking 
interval of a 625-line analogue television signal: basically in some 
lines that were not used for the colour picture during the time allowed 
for the scan to go back to the top of the picture once it reached the 
lower edge. So this digital information service got a free ride in the 
picture signal going out to receivers all over the country.

The information was organised into pages, and an end user could go to 
"text" and then wait for a selected page to come round again in the 
continuous cyclic broadcasting of pages. Pages could be arranged by the 
broadcaster so that, say, the news headlines page came around maybe four 
times in each, say, 20-second cycle and some pages only once.

It was very effective, as the special purpose 7-bit character set, while 
being basically ASCII, had control characters that were stateful and 
each displayed as a space, yet some of them switched the colour of the 
following text until a new control character for a colour was received, 
if indeed one was received, or until the end of the 40-character line of 
the display. Each line started with white text, though if the first 
character of the line switched to a colour, the end user would not see 
any white text. The control code set also included switching to chunky 
graphics mode.

There was also a facility to use the system for subtitles to the 
television programme: optional subtitles, so that end users could have 
them on if desired yet other users were not thereby forced to have 
subtitles. It was good, as various participants in a discussion, whether 
news or drama, could each have a colour for their speaking, such as 
green, yellow, cyan, white. No return link was needed to send 
information from the end user to the central broadcasting computer.
A system with the same format of display was a viewdata system (brand 
name Prestel) but that was very different from teletext and used a 
two-way telephone line connection. In a viewdata system, the end user 
selected a page from a menu then a message requesting that page was sent 
to the central computer and just that page was sent to the end user. A 
fee for a page was often charged and the system never really took off. 
Teletext thrived because economy of scale brought the cost of 
teletext-capable electronics down and it was installed using a set of 
for-the-purpose integrated circuits during manufacture of most colour 
television sets in that era, and once installed then it was a free 
add-on with no ongoing cost apart from the ordinary television licence.
It seems to me that there could be, in the future, a type of thing that 
sends out a continuous signal over a wire of, say, a temperature reading 
at its location, all formatted in several languages. So, no passwords, 
no input from an end user, just a continuous feeding into The Internet 
of Things its output, with the numerical value in the messages changed 
as the temperature changes. This would allow the digits to be expressed 
in the digits used in the particular script of the particular language 
used in an individual  message.

William Overington
Wednesday 12 February 2020




Re: Could U+E0001 LANGUAGE TAG become undeprecated please? There is a good reason why I ask

2020-02-11 Thread wjgo_10...@btinternet.com via Unicode

Hi

Thank you to everybody who replied to this thread, both online and 
offline.


Sławomir Osipiuk wrote:

As for "concatenation of such plain text sequences" where each 
sequence is in a different language, ...


Actually I was meaning the concatenation of a number of messages, one 
from each of a number of "things", where each message includes text in 
several languages.
by simple concatenation of the number of reports. That is, if there are 
seven sensors, the final report has seven uses of the language code for 
English, seven for French, seven for German, seven for Polish, and so 
on.


Mark E. Shoulson wrote:

So at least this particular application would be a solution to a 
problem that's already been solved.


Well, maybe it is now a solution that is out there and maybe some day a 
problem will arise for which this would be a solution worth considering. 
So for now it drifts into the archives.


Best regards,

William Overington

Tuesday 11 February 2020




Re: Could U+E0001 LANGUAGE TAG become undeprecated please? There is a good reason why I ask

2020-02-10 Thread Mark E. Shoulson via Unicode

On 2/10/20 6:14 PM, Sławomir Osipiuk via Unicode wrote:

As for "concatenation of such plain text sequences" where each sequence is in a 
different language, I must again ask: Is there a system that actually does this, that 
does not have a higher-level protocol that can carry metadata about the natural language 
of the text sequences?
Indeed, it seems to me that concatenating such sequences *is* in itself 
a higher-level protocol.  After all, it isn't  "plain text" anymore when 
you have to suppress printing out some of it.  And we already have other 
higher-level protocols that can do the job about as efficiently.  So at 
least this particular application would be a solution to a problem 
that's already been solved.


~mark



RE: Could U+E0001 LANGUAGE TAG become undeprecated please? There is a good reason why I ask

2020-02-10 Thread Sławomir Osipiuk via Unicode
The examples given don't convince me that "higher-level protocols" would not be 
sufficient.

There are very few messages being sent in the "Internet of Things" that are 
truly plain-text. Even those that use a text base (as opposed to binary data) 
are still in some kind of structured computer language, be it HTML, XML, JSON, 
etc. The intended natural language can be specified using that structure.

Sending multiples of the same message in different languages is really only 
applicable to broadcast/multicast scenarios, where you have a transmission 
going out live to multiple recipients who have different language demands. I 
can't immediately think of any examples where this is done with plain-text 
only, though I'd be glad to learn about them, if they exist. 

For any peer-to-peer or client-server interaction, as in your password example, 
it makes more sense to have the recipient request a specific language (e.g. 
using HTTP's "Accept-Language" header) and the sender to send its message in 
that language automatically.

As for "concatenation of such plain text sequences" where each sequence is in a 
different language, I must again ask: Is there a system that actually does 
this, that does not have a higher-level protocol that can carry metadata about 
the natural language of the text sequences?

Basically, I doubt Unicode language tags would be useful here because there 
simply is no Internet-based system that transmits human-readable text, in 
multiple natural languages, in such a rudimentary way, with no encapsulating 
protocol or metadata. And I doubt there will be; it seems like such a strange 
design choice in this day and age. Though I'd be glad to be corrected if 
someone has an example.

Sławomir Osipiuk





Re: Could U+E0001 LANGUAGE TAG become undeprecated please? There is a good reason why I ask

2020-02-10 Thread Steffen Nurpmeso via Unicode
wjgo_10...@btinternet.com via Unicode wrote in
<141cecf1.23e.1702ea529c1.webtop@btinternet.com>:
 |Could U+E0001 LANGUAGE TAG become undeprecated please? There is a good 
 |reason why I ask
 |
 |There is a German song, Lorelei, and I searched to find an English 
 |translation.

Regarding Rhine and this thing of yours, there is also the German
joke from the middle of the 1950s, i think, with "Tünnes und
Schäl".

  Tünnes und Schäl stehen auf der Rheinbrücke.
  Da fällt Tünnes die Brille in den Fluß und er sagt
  "Da schau, jetzt ist mir die Brille in die Mosel gefallen",
  worauf Schäl sagt, "Mensch, Tünnes, dat is doch de Ring!",
  und Tünnes antwortet "Da kannste mal sehen wie schlecht ich ohne
  Brille sehen kann!"

  Tuennes und Schael stand on the Rhine bridge.
  Then Tuennes glasses fall into the river, and he says
  "Look, now i lost my glasses to the Moselle",
  whereupon Schael says "Crumbs!, Tuennes, that is the Rhine!",
  and Tuennes responds "There you can say how bad i can see
  without glasses!"

P.S.: i cannot speak "Kölsch" aka Cologne dialect.
P.P.S.: i think i got you wrong.

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Could U+E0001 LANGUAGE TAG become undeprecated please? There is a good reason why I ask

2020-02-10 Thread wjgo_10...@btinternet.com via Unicode

Hi

Could U+E0001 LANGUAGE TAG become undeprecated please? There is a good 
reason why I ask


There is a German song, Lorelei, and I searched to find an English 
translation.


I found the following video.

https://www.youtube.com/watch?v=lJ3JhxOUbw0

The video is an instrumental version and is particularly interesting in 
that there are lyrics displayed in four languages, with two versions of 
the translation in English.


Being a native speaker of English and living in England I first watched 
the video viewing just the version labelled "British". Later I played 
the video again and viewed just the version labelled "U.S.".


Remembering that I had some time ago heard a version in Esperanto, I 
searched and found the two following videos.


https://www.youtube.com/watch?v=reUpdGgdBsA

https://www.youtube.com/watch?v=7dHhTXDmP0k

They may be of the same recording. The first has in its notes the text 
of the lyrics.


The song in Esperanto has the rather expressive Esperanto word belega in 
it. This single word, an adjective, is composed from the Esperanto word 
bela, which means beautiful, augmented with the Esperanto word-building 
component -eg-, which modifies the word to which it is attached so as to 
indicate greatness. So the word belega expresses in one three-syllable 
Esperanto word the concept that is in English "greatly beautiful".


http://esperanto.davidgsimpson.com/eo-affixes.html

Thinking of the first video to which I linked, it occurred to me that if 
a plain text message were sent containing two or more versions of the 
same text (probably a short message in practice), each in a different 
language and each preceded by a tag sequence giving its language, then 
software at the receiving end could be set to a chosen language and 
display only the text in that language.


Thinking around this idea I thought that this could be very useful in 
The Internet of Things for machine to human communication, whereby, if, 
say, an end user (human) wants to dialogue with a device (thing), the 
technique could be used to send the message


Please enter the password

from the thing in a number of languages. The decoding software in the 
end user's computer could use the first message in the list as the 
default if the sequence sent by the thing does not have a version for 
the particular language set by the end user in his or her computer.
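
As a sketch of that receiving-end logic in Python (assuming each version 
is prefixed by U+E0001 plus tag characters U+E0020..U+E007E spelling out 
a BCP 47 code, with the first version serving as the default; the helper 
names are only illustrative):

    def split_tagged(text):
        """Split plain text into (language, content) pairs at each U+E0001."""
        versions = []
        for chunk in text.split("\U000E0001"):
            if not chunk:
                continue
            lang, content = [], []
            for ch in chunk:
                # Tag characters are ASCII shifted up by 0xE0000.
                if 0xE0020 <= ord(ch) <= 0xE007E and not content:
                    lang.append(chr(ord(ch) - 0xE0000))
                elif ord(ch) != 0xE007F:      # ignore U+E007F CANCEL TAG
                    content.append(ch)
            versions.append(("".join(lang), "".join(content)))
        return versions

    def pick(text, wanted):
        """Return the version in the wanted language, else the first version."""
        versions = split_tagged(text)
        for lang, content in versions:
            if lang == wanted:
                return content
        return versions[0][1]

    msg = ("\U000E0001\U000E0065\U000E006EPlease enter the password"
           "\U000E0001\U000E0064\U000E0065Bitte das Passwort eingeben")
    print(pick(msg, "de"))    # Bitte das Passwort eingeben
    print(pick(msg, "fr"))    # no French version: Please enter the password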


The list of languages supported by a particular thing would not be 
specified by a universal standard, but could perhaps have English, 
French, German and one or more others depending upon the location and 
application of the thing. Any language expressible in Unicode could be 
included in the list.


Support for Unicode characters beyond plane 0 is much more widely 
available in software these days.


I know that people have been urged to use a higher-level protocol for 
indicating the language in documents, but please consider the case where 
one wants to assemble a status report automatically by combining reports 
from each of a number of mutually independent sensors on the Internet of 
Things, each of relatively small size, located in a variety of physical 
locations perhaps miles apart. In such a case the concatenation of such 
plain text sequences would be straightforward.


Such an undeprecating of U+E0001 LANGUAGE TAG would, in my opinion, 
contribute to the development of The Internet of Things.


William Overington

Monday 10 February 2020



Re: Combining Marks and Variation Selectors

2020-02-02 Thread Asmus Freytag via Unicode

  
  
On 2/2/2020 5:22 PM, Richard Wordingham via Unicode wrote:

> On Sun, 2 Feb 2020 16:20:07 -0800
> Eric Muller via Unicode  wrote:
>
> > That would imply some coordination among variation sequences on
> > different code points, right?
> >
> > E.g. <0B48> ≡ <0B47, 0B56>, so a variation sequence on 0B56 (Mn,
> > ccc=0) would imply the existence of a variation sequence on 0B48 with
> > the same variation selector, and the same effect.
>
> That particular case oughtn't to be impossible, as in NFD everything in
> sight has ccc=0.  However TUS 12.0 Section 23.4 does contain an
> additional prohibition against meaningfully applying a variation
> selector to a 'canonical decomposable character'. (Scare quotes because
> 'ly' seems to be missing from the phrase.)
>
> Richard.

So, let's look at what that would look like with some variation selector:

<0B48, Fxxx> ≡ <0B47, 0B56, Fxxx>

If the variant in the shape of 0B48 is well described by a variation on
the contribution due to 0B56 in the decomposed sequence, then this might
make sense. But if the variant would be better described as a variation
in the 0B47 component, then it would be a prime example of poor
"pseudo-encoding": where some random sequence is assigned to a shape (in
this case) without being properly analyzable into constituent characters
with their own identity.

Which would it be in this example?

And this example only works, of course, because with ccc=0, 0B56
cannot be reordered.
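
Concretely, the composition side of this is easy to check (a sketch with
Python's unicodedata, using VS1, U+FE00, as a stand-in for Fxxx; no such
variation sequence is actually defined):

    import unicodedata

    # NFC composes <0B47, 0B56> into 0B48, so a trailing selector ends
    # up following 0B48 rather than the 0B56 it was meant to modify.
    seq = "\u0B47\u0B56\uFE00"                        # <0B47, 0B56, VS1>
    nfc = unicodedata.normalize("NFC", seq)
    print(nfc == "\u0B48\uFE00")                      # True: <0B48, VS1>
    print(unicodedata.normalize("NFD", nfc) == seq)   # True: round-trips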
The prohibition as worded may perhaps be slightly more broad than
necessary, but I can understand that the UTC didn't want to parse it
more finely in the absence of any good examples that could be used to
better understand what the actual limitations should be. Better safe
than sorry, and all that.

A./

On 2/2/2020 11:43 AM, Mark Davis ☕️ via Unicode wrote:

> I don't think there is a technical reason for disallowing variation
> selectors after any starters (ccc=000); the normalization algorithm
> doesn't care about the general category of characters.
>
> Mark



Re: Combining Marks and Variation Selectors

2020-02-02 Thread Richard Wordingham via Unicode
On Sun, 2 Feb 2020 16:20:07 -0800
Eric Muller via Unicode  wrote:

> That would imply some coordination among variation sequences on
> different code points, right?
> 
> E.g. <0B48> ≡ <0B47, 0B56>, so a variation sequence on 0B56 (Mn,
> ccc=0) would imply the existence of a variation sequence on 0B48 with
> the same variation selector, and the same effect.

That particular case oughtn't to be impossible, as in NFD everything in
sight has ccc=0.  However TUS 12.0 Section 23.4 does contain an
additional prohibition against meaningfully applying a variation
selector to a 'canonical decomposable character'. (Scare quotes because
'ly' seems to be missing from the phrase.)

Richard.

> On 2/2/2020 11:43 AM, Mark Davis ☕️ via Unicode wrote:
> I don't think there is a technical reason for disallowing variation
> selectors after any starters (ccc=000); the normalization algorithm
> doesn't care about the general category of characters.
> 
> Mark



Re: Combining Marks and Variation Selectors

2020-02-02 Thread Eric Muller via Unicode

  
  
That would imply some coordination among variation sequences on
different code points, right?

E.g. <0B48> ≡ <0B47, 0B56>, so a variation sequence on 0B56 (Mn,
ccc=0) would imply the existence of a variation sequence on 0B48 with
the same variation selector, and the same effect.

Eric.

On 2/2/2020 11:43 AM, Mark Davis ☕️ via Unicode wrote:

> I don't think there is a technical reason for disallowing variation
> selectors after any starters (ccc=000); the normalization algorithm
> doesn't care about the general category of characters.
>
> Mark
>
> On Sun, Feb 2, 2020 at 10:09 AM Richard Wordingham via Unicode <
> unicode@unicode.org> wrote:
>
> > On Sun, 2 Feb 2020 07:51:56 -0800
> > Ken Whistler via Unicode <unicode@unicode.org> wrote:
> >
> > > What it comes down to is avoidance of conundrums involving canonical
> > > reordering for normalization. The effect of variation selectors is
> > > defined in terms of an immediate adjacency. If you allowed variation
> > > selectors to be defined for combining marks of ccc!=0, then
> > > normalization of sequences could, in principle, move the two apart.
> > > That would make implementation of the intended rendering much more
> > > difficult.
> >
> > I can understand that for non-starters.  However, a lot of non-spacing
> > combining marks are starters (i.e. ccc=0), so they would not be a
> > problem.   is an unbreakable block in canonical
> > equivalence-preserving changes.  Is this restriction therefore just a
> > holdover from when canonical equivalence could be corrected?
> >
> > Richard.


Re: Combining Marks and Variation Selectors

2020-02-02 Thread Mark Davis ☕️ via Unicode
I don't think there is a technical reason for disallowing variation
selectors after any starters (ccc=000); the normalization algorithm doesn't
care about the general category of characters.

Mark


On Sun, Feb 2, 2020 at 10:09 AM Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> On Sun, 2 Feb 2020 07:51:56 -0800
> Ken Whistler via Unicode  wrote:
>
> > What it comes down to is avoidance of conundrums involving canonical
> > reordering for normalization. The effect of variation selectors is
> > defined in terms of an immediate adjacency. If you allowed variation
> > selectors to be defined for combining marks of ccc!=0, then
> > normalization of sequences could, in principle, move the two apart.
> > That would make implementation of the intended rendering much more
> > difficult.
>
> I can understand that for non-starters.  However, a lot of non-spacing
> combining marks are starters (i.e. ccc=0), so they would not be a
> problem.   is an unbreakable block in
> canonical equivalence-preserving changes.  Is this restriction therefore
> just a holdover from when canonical equivalence could be corrected?
>
> Richard.
>


Re: Combining Marks and Variation Selectors

2020-02-02 Thread Richard Wordingham via Unicode
On Sun, 2 Feb 2020 07:51:56 -0800
Ken Whistler via Unicode  wrote:

> What it comes down to is avoidance of conundrums involving canonical 
> reordering for normalization. The effect of variation selectors is 
> defined in terms of an immediate adjacency. If you allowed variation 
> selectors to be defined for combining marks of ccc!=0, then 
> normalization of sequences could, in principle, move the two apart.
> That would make implementation of the intended rendering much more
> difficult.

I can understand that for non-starters.  However, a lot of non-spacing
combining marks are starters (i.e. ccc=0), so they would not be a
problem.   is an unbreakable block in
canonical equivalence-preserving changes.  Is this restriction therefore
just a holdover from when canonical equivalence could be corrected?

Richard.


Re: Combining Marks and Variation Selectors

2020-02-02 Thread Ken Whistler via Unicode

Richard,

What it comes down to is avoidance of conundrums involving canonical 
reordering for normalization. The effect of variation selectors is 
defined in terms of an immediate adjacency. If you allowed variation 
selectors to be defined for combining marks of ccc!=0, then 
normalization of sequences could, in principle, move the two apart. That 
would make implementation of the intended rendering much more difficult.


That is basically why the UTC, from the start, ruled out using variation 
selectors to try to make graphic distinctions between different styles 
of acute accent marks explicit, for example.
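
The acute accent case shows the conundrum concretely; a sketch with
Python's unicodedata, pretending U+FE00 selected a style of U+0301 (no
such variation sequence is defined):

    import unicodedata

    # NFC composes <e, U+0301> into U+00E9; the would-be selector is
    # left following the precomposed letter, not the accent it targeted.
    seq = "e\u0301\uFE00"
    print(unicodedata.normalize("NFC", seq) == "\u00E9\uFE00")   # True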


--Ken

On 2/1/2020 7:30 PM, Richard Wordingham via Unicode wrote:

Ah, I missed that change from Version 5.0, where the restriction was,
'The base character in a variation sequence is never a combining
character or a decomposable character'.  I now need to rephrase the
question.  Why are marks other than spacing marks prohibited?



Re: Combining Marks and Variation Selectors

2020-02-01 Thread Richard Wordingham via Unicode
On Sat, 1 Feb 2020 17:59:57 -0800
Roozbeh Pournader via Unicode  wrote:

> They are actually allowed on combining marks of ccc=0. We even define
> one such variation sequence for Myanmar, IIRC.
> 
> On Sat, Feb 1, 2020, 2:12 PM Richard Wordingham via Unicode <
> unicode@unicode.org> wrote:  
> 
> > Why are variation selectors not allowed for combining marks?  I can
> > see a reason for them not being allowed on characters with non-zero
> > canonical combining classes, but not for them being prohibited for
> > combining marks that are starters, i.e. have ccc=0.

Ah, I missed that change from Version 5.0, where the restriction was,
'The base character in a variation sequence is never a combining
character or a decomposable character'.  I now need to rephrase the
question.  Why are marks other than spacing marks prohibited?

Richard. 



Re: Combining Marks and Variation Selectors

2020-02-01 Thread Roozbeh Pournader via Unicode
They are actually allowed on combining marks of ccc=0. We even define one
such variation sequence for Myanmar, IIRC.

On Sat, Feb 1, 2020, 2:12 PM Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> Why are variation selectors not allowed for combining marks?  I can see
> a reason for them not being allowed on characters with non-zero
> canonical combining classes, but not for them being prohibited for
> combining marks that are starters, i.e. have ccc=0.
>
> Richard.
>


Combining Marks and Variation Selectors

2020-02-01 Thread Richard Wordingham via Unicode
Why are variation selectors not allowed for combining marks?  I can see
a reason for them not being allowed on characters with non-zero
canonical combining classes, but not for them being prohibited for
combining marks that are starters, i.e. have ccc=0.

Richard.


Re: Adding Experimental Control Characters for Tai Tham

2020-01-29 Thread Ken Whistler via Unicode

Richard,

Given that those particular two variation selectors have already been given 
very specific semantics for emoji sequences, and would now be expected 
to occur *only* in emoji sequences:


https://www.unicode.org/reports/tr51/#def_text_presentation_selector

usurping them to do something unrelated would probably not be a good idea.

For experimentation purposes, VS13 and VS14 would be safer.

--Ken

On 1/25/2020 10:41 AM, Richard Wordingham via Unicode wrote:

How inappropriate would it be to usurp a pair of variation selectors
for this purpose?  For mnemonic purposes, I would suggest usurping

FE0E VARIATION SELECTOR-15 for *1A8E TAI THAM SIGN INITIAL
FE0F VARIATION SELECTOR-16 for *1A8F TAI THAM SIGN FINAL


Adding Experimental Control Characters for Tai Tham

2020-01-25 Thread Richard Wordingham via Unicode
This topic is very similar to the recent topic "How to make custom
combining diacritical marks for arabic letters?".

There is a suggestion that the encoding of Tai Tham syllables be
changed
(https://www.unicode.org/L2/L2019/19365-tai-tham-structure.pdf, by
Martin Hosken), and there is a strong desire to experiment with it.
However, unless it is to proscribe good rendering, it needs at least
two extra 'control' characters, which have been suggested as:

1A8E TAI THAM SIGN INITIAL
1A8F TAI THAM SIGN FINAL

These would follow a subscript character.  In simple cases, they
would indicate whether the subscript is part of the onset or part of
the coda of a syllable.

The idea that has been floated is that the experimentation be done by
changing the renderer, which is invoked by various applications.

However, there is the problem of script runs - these characters are not
yet in the Tai Tham script, and most applications lack a mechanism
for assigning PUA characters to a script.

However, there is a set of inherited characters which in a Tai Tham
context have not yet been assigned any meaning - the variation
selectors.  I have experimented with them, and at least in the older
versions of the HarfBuzz renderer (near Version 1.2.7), they do not
cause any problems with the implementation of the USE - no dotted
characters arise, and they can interact in shaping as suggested by a
font.

How inappropriate would it be to usurp a pair of variation selectors
for this purpose?  For mnemonic purposes, I would suggest usurping

FE0E VARIATION SELECTOR-15 for *1A8E TAI THAM SIGN INITIAL
FE0F VARIATION SELECTOR-16 for *1A8F TAI THAM SIGN FINAL

I can think of the following relevant factors:

(a) It is a maxim of English law that a person intends the reasonable
foreseeable consequences of his actions.  By allowing grapheme cluster
boundaries between script changes, the UTC can hardly complain
loudly about inherited characters being usurped.

(b) Most subscript consonants are defined by SAKOT plus a base
consonant, and therefore the suggested control characters have the
nature of variation sequences.  The effect of these characters is,
though, mostly on how other characters are positioned relative to them,
rather than directly on the subscript characters themselves.

(c) There are 7 subscript consonants that are represented by single
characters:

U+1A55 TAI THAM CONSONANT SIGN MEDIAL RA
This seems not to need marking for position relative to the nucleus.
If it did, the marking up of logical order ᩉᩕ᩠ᩅ᩠᩶ᨿ /huai/ 'brook' as
semi-visual order  would not be so simple, as SIGN FINAL should not
apply to the leftmost character, MEDIAL RA.

U+1A56 TAI THAM CONSONANT SIGN MEDIAL LA
This will have to be excluded from the experiment.  It is very rare as
a final consonant, and I suspect its exclusion will have no effect on
the experiment.

U+1A57 TAI THAM CONSONANT SIGN LA TANG LAI
This appears to be restricted to a single word, so its exclusion should
not matter at all.

U+1A5B TAI THAM CONSONANT SIGN HIGH RATHA OR LOW PA
Bizarrely, L2-19/365 treats this as a consonant modifier!  As the USE
does not require consonant modifiers to be applied to the base
consonant, this ought to have no adverse effects.  The combination
 frequently acts as a single consonant trespassing
on the territory of HIGH RATHA, but my suggestion that the sequence be
encoded as a precomposed character was rejected.

As far as I can tell, U+1A5B is always part of the phonetic onset.  The
only case where one might need these control characters would be an
implausible contraction *ᩁᩢ᩠ᨭᩛᩣ /rat tʰaː/ logical order , parallel to
the Lao contraction ᨣᩢ᩠ᩅᩣ /kʰan waː/ 'if' logical order , undisambiguated
semi-visual order , which for Lao is rendered differently to ᨣ᩠ᩅᩢᩣ
/kʰwaːk/ logical order .  Now, the disambiguated semi-visual order
encoding for *ᩁᩢ᩠ᨭᩛᩣ is .  This is consistent with the USE if SIGN FINAL
is a variation selector, but is a seemingly needless flaw in L2-19/365
Section 5.1.1.

U+1A5C TAI THAM CONSONANT SIGN MA
This character seems only to occur immediately following
akshara-initial MA, so I think there are no issues.

U+1A5D TAI THAM CONSONANT SIGN BA
This sign is of very limited occurrence in Northern Thai.  In Lao, it
can occur as the subscript of a base consonant acting as a mater
lectionis, but I cannot see any scope for needing to mark the role of
the mark for proper rendering. 

U+1A5E TAI THAM CONSONANT SIGN SA
As this is a non-spacing mark principally used as a coda consonant, it
seems unlikely that we would need to mark the role at the experimental
stage.

(d) This scheme does not address the representation of the sequences
 and .  The best ideas I have are the totally hacky sequences  and
.

Richard.



Stop words for CLDR

2020-01-23 Thread Marius Spix via Unicode
I wonder if there is any interest in adding stop words to CLDR? Stop
words are ignored by natural language processing algorithms, with use
cases like search engines, word clouds and text classification.

There are already existing collections of stop words, like [1] or [2],
which could be used, but I believe that Unicode CLDR would be the best
place for such lists.
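
For instance, with the NLTK lists from [2] (a sketch; assumes nltk is
installed and the stopwords corpus has been downloaded):

    from nltk.corpus import stopwords

    # Drop English stop words from a token list.
    words = "the quick brown fox jumps over the lazy dog".split()
    content = [w for w in words if w not in stopwords.words("english")]
    print(content)   # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']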

Regards,

Marius Spix

[1] https://pypi.org/project/stop-words/
[2]
https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/stopwords.zip


Re: [unihan] Unihan variants information

2020-01-17 Thread jenkins via Unicode
Very impressive! Thank you for this.

> On Jan 17, 2020, at 6:03 AM, Michel Mariani via Unihan  
> wrote:
> 
> FYI, the "Unihan Variants" utility has been recently added to the open-source 
> application Unicopedia Plus .
> It provides both the linear and structured information planned about one 
> year ago.
> I think that the graph view available in SVG format can be especially useful 
> to spot possible inconsistencies between variant properties...
> HTH,
> 
>   --Michel MARIANI
> 
> 
> 
>> I've developed an open-source, multi-platform desktop application called 
>> Unicode Plus , which is a set 
>> of utilities related to Unicode, Unihan and emoji.
>> 
>> The basic Unihan-related utilities are almost completed, and now I would 
>> like to add more useful information about the Unihan variants:
>> 
>> 1. First option: "Linear Information"
>> 
>> - A linear list of all the variants *related* to one given Unihan character 
>> would be displayed, similar to what can be found in Apple's Character Viewer 
>> (or Palette), or in the "Unihan Variant Dictionary" application.
>> 
>> - Two sources of data could be merged:
>> 
>>  1. The information provided by the "Variants table for Unicode" data 
>> file UniVariants.txt by Prof. Kōichi Yasuoka.
>>  2. The information extracted from the relevant Unihan DB tag 
>> properties: kSemanticVariant, kSimplifiedVariant, 
>> kSpecializedSemanticVariant, kTraditionalVariant, kZVariant.
>> 
>> - Discarding self-variants, assuming that Z-variants are somehow 
>> symmetrical, and possibly merging the different types of variant tags, would 
>> result in independent sets of *related* Unihan characters. Accessing the 
>> info would then simply imply testing which set a given character belongs to, 
>> and omitting the character itself for display.
>> 
>> - This kind of information is most certainly user-friendly, however it lacks 
>> structural information about the relationships between the different 
>> variants.
>> 
>> 2. Second option: "Structured Information"
>> 
>> - This is probably more ambitious and challenging: ideally, the information 
>> could be displayed graphically as a diagram of characters joined by arrowed 
>> links, indicating the type of variant. It would support one-to-one, 
>> one-to-many and many-to-one relationships...
>> 
>> 
>> Any ideas, comments, suggestions are most welcome...
>> 
>> -- Michel MARIANI
> 
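
A minimal sketch of the set-membership idea in the first option above
(union-find over the variant pairs in a local Unihan_Variants.txt;
assumes the standard tab-separated Unihan format, stripping the
"<source" suffixes that kSemanticVariant values may carry):

    from collections import defaultdict

    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    with open("Unihan_Variants.txt", encoding="utf-8") as f:
        for line in f:
            if line.startswith("#") or "\t" not in line:
                continue
            char, _prop, values = line.rstrip("\n").split("\t", 2)
            for v in values.split():
                cp = v.split("<")[0]        # drop "<source" annotations
                if cp != char:              # discard self-variants
                    union(char, cp)

    groups = defaultdict(set)
    for c in list(parent):
        groups[find(c)].add(c)

    print(groups[find("U+4E0E")])           # every character related to U+4E0E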



Re: how to make custom combining diacritical marks for arabic letters?

2020-01-15 Thread dinar qurbanov via Unicode
"What are the combining marks supposed to look like?"

as you can see in http://tmf.org.ru/arabic.html , i have tested
reversed fatkha. also i have ideas to make reversed kasra, different
reversed dhammas, and vertical variants of them all, and maybe totally
other diacritics, like caron, circumflex.

some of those ideas are already available in unicode. see
https://en.wikipedia.org/wiki/Arabic_script_in_Unicode#Compact_table
line U+065x . i see there are reversed and inverted dammas, small v
and inverted small v, and others, but probably they are not enough for
me.

i have just read https://en.wikipedia.org/wiki/Arabic_diacritics and i
have seen and remembered that there are different levels of arabic
diacritics. consonant modifiers, "ijam", are more close to main
part/line , and diacritics for short vowels, haraka, are further. also
there is tashdid that is usually between them by its distance... i
would like to be able to make several more symbols to extend short
vowels.


"Are they your creation or do you have samples of usage?"

i have an idea to use arabic script for tatar language (it is turkic
language), and that is also usable for other languages, with using
harakas instead of full/long/"main line" vowel letters. this would
make writing shorter with possibility of omitting some of the
vowels...

there are only 3 short vowels in arabic language and 3 long vowels.
long vowels are written with main line like consonants, shorts with
diacritics.

i have checked in https://en.wikipedia.org/wiki/Uyghur_language and
then in https://en.wikipedia.org/wiki/Arabic_script#Special_letters ,
and as i know and as i see languages with arabic script use "whole"
letters to represent their additional vowels, for example, ۆ‎ , ې in
uyghur language, these are made with using diacritic, but the "ijam"
diacritic, consonant modifier. logically, short vowel diacritics still
can be put above or below them, though that has no usage in that
languages, and it probably works in unicode (ie probably the consonant
modifiers and the 3 short vowels do not intersect/cross, if put
together).

how many vowels i need for tatar language: аоуыи, their "thin" pairs
әөүэи, their "russian" pairs аоуы, and 2 "russian" vowels "е" and "э".
so, i need 16 diacritics to put them above or below consonant letters.

this my idea is not used anywhere, only in a short handwriting
example. it is here http://qdb.narod.ru/tattyazmagif/qaradaft07.gif .
so, yes, this is my creation, a constructed script, and it is not
developed completely, but just a sketch. so, i would like to use
private use area for that.


2020-01-14 20:02 GMT+03:00, Lorna Evans :
> What are the combining marks supposed to look like? Are they your
> creation or do you have samples of usage? It is true that you will not
> likely get combining marks to work if either they or the base character
> are PUA. Adding the complexity of RTL makes the issue worse.
>
> Lorna
>
> On 1/10/2020 12:30 PM, dinar qurbanov via Unicode wrote:
>> hello.
>>
>> you can browse to replies that are not quoted below from
>> https://unicode.org/mail-arch/unicode-ml/y2018-m05/0039.html .
>>
>> where can i write some bug reports or feature requests in order to get
>> custom diacritic marks automatically positioned at right place above
>> and below arabic letters, and also without having to put beginning /
>> middle / end forms of arabic letters manually, but using just "simple"
>> arabic letter unicode codes. and, where should i submit bug reports
>> for what, what is responsible for what.
>>
>> seems users of unicode should be able to use private use area like
>> this, to develop their own arabic and other diacritics, not only latin
>> / greek / cyrillic... though i am even not tried to make
>> latin/cyrillic/greek custom diacritics yet... i used custom latin and
>> cyrillic scripts, but i need not to develop custom diacritics, because
>> there are plenty of ready diacritics to use with them.
>>
>>
>> 2018-05-19 13:22 GMT+03:00, dinar qurbanov :
>>> this is a test i made that time: http://tmf.org.ru/arabic.html . look
>>> at second line. my custom mark is located too left on the most left
>>> "B", and is located too right on the middle (that is of middle form of
>>> B) and on the most right "B" (that is of starter form of B). it should
>>> be located right above the below dot.
>>>
>>> - this was the problem that i could not solve.
>>>
>>> also there are problems that i could solve by using 1) rtl override
>>> mark; 2) and using start, middle, end, separate B characters instead
>>> of using simple arabic B, that would be easier. (you ca

Re: New Unicode Working Group: Message Formatting

2020-01-14 Thread Philippe Verdy via Unicode
People's names are NOT transliterated freely. It's up to each person to
document his romanized name; it should not be invented by automatic
processes. And frequently the romanized name (the officialized one) does not
match the original name in another script: this is very frequent for Chinese
people, as well as for trademarks.
There are also common but informal names, not always official but commonly
used in the press/media, and their orthography varies across
countries/languages. If these people are "well known" (notably historic
personalities, or artists), they may have their page in some Wikipedia and
Wikidata.

There's no need to "translate" them; you'll use a database query to
retrieve names (including the preferred/most frequent one, the official
one). In some countries several orthographies may be used (e.g. for streets
named after people: these names are not translatable, except if locally
the streets are multilingual; this is not a database of people's names but a
geographic database for other purposes, and even if these entries originate
from people they are still geographic names *derived* from people's names).

For this you'll still use placeholders in the messages, and the value of the
placeholder may be queried in the relevant database for the relevant target
language; inflected forms for these names (e.g. genitives) may be found but
are not easily derived. If these are geographic names, they may be
transliterated, but there are competing standards for transliteration of
toponyms, so you'll also need to tune your application to select the
romanization system relevant for the target language (the international
standards are language-neutral, but not relevant for specific countries
that have their own officialized terminology, or for the United Nations,
which needs to cite them in several official working languages), if the
geographic database does not already contain an officialized/preferred
romanization (there are also needs for transliteration from Latin to other
scripts).

Anyway, proper names are to be treated specially; there's nothing that can
be used in a message format API to select what will be the effective
replacement value of a placeholder. But the replacement may, or may not,
specify alternate forms for correct formatting when multiple forms are
possible (genitives, capitalisation, elisions and contextual mutations)
for the same selected name coming from an external database.

MessageFormat API and translator tools should not have to manage the
external databases, which will be "translated" separately with enough forms
relevant for their presentation and composition in larger messages.

Why does this group exist now in CLDR? Most probably because there are
already difficulties managing translations in existing CLDR data (which is
focused on a small part of what is translatable). CLDR is concerned with
only a few geographic items: countries, some subnational regions,
continents, and some cities used for timezones.

But the main problem is the proliferation of variant forms in CLDR, added
only for a few languages that need them, with no evident fallback to the
common form used in most other languages that don't need that distinction,
or not the same kind of distinctions (e.g. plural forms, grammatical gender
or personal gender not always matching together, politeness/formal forms).

Once again I suggest you start contributing to a translation project and
experiment with it before continuing. Look at Wikimedia wikis (translation
templates, the Translate extension, and the companion Translatewiki.net
wiki), Transifex, Google Translator, ResourceBundle and the formatting API
in Java, .po/.pot for Gettext in many open source projects, Facebook's
translation tool, internationalization APIs in Windows, iOS, MacOS, and the
ICU library, which is the de facto base for CLDR...


Le mar. 14 janv. 2020 à 16:11, wjgo_10...@btinternet.com via Unicode <
unicode@unicode.org> a écrit :

> The reply from Mr Verdy has indeed been helpful, as indeed has also been
> an offlist private reply from someone who has, thus far, not been a
> participant in this thread.
>
>
> Mr Verdy wrote:
>
>
> > You seem to have never seen how translation packages work and are used
> in common projects (not just CLDR, but you could find them as well in
> Wikimedia projects, or translation packages for lot of open source
> packages).
>
> What seems to be the case to Mr Verdy is in fact the actual situation.
>
> I do not satisfy the second of the two conditions of the invitation to
> join the working group. I am, in fact, retired and I have never worked in
> the i18n/l10n industry. Also, from the explanations it is not as close to
> my research interests as I had thought, and indeed hoped. I just do what I
> can on my research project from time to time using a home computer, a
> personal webspace hosted by an internet service provider, some budget
> software, m

Re: how to make custom combining diacritical marks for arabic letters?

2020-01-14 Thread Lorna Evans via Unicode
What are the combining marks supposed to look like? Are they your 
creation or do you have samples of usage? It is true that you will not 
likely get combining marks to work if either they or the base character 
are PUA. Adding the complexity of RTL makes the issue worse.


Lorna

On 1/10/2020 12:30 PM, dinar qurbanov via Unicode wrote:

hello.

you can browse to replies that are not quoted below from
https://unicode.org/mail-arch/unicode-ml/y2018-m05/0039.html .

where can i write some bug reports or feature requests in order to get
custom diacritic marks automatically positioned at right place above
and below arabic letters, and also without having to put beginning /
middle / end forms of arabic letters manually, but using just "simple"
arabic letter unicode codes. and, where should i submit bug reports
for what, what is responsible for what.

seems users of unicode should be able to use private use area like
this, to develop their own arabic and other diacritics, not only latin
/ greek / cyrillic... though i am even not tried to make
latin/cyrillic/greek custom diacritics yet... i used custom latin and
cyrillic scripts, but i need not to develop custom diacritics, because
there are plenty of ready diacritics to use with them.


2018-05-19 13:22 GMT+03:00, dinar qurbanov :

this is a test i made that time: http://tmf.org.ru/arabic.html . look
at second line. my custom mark is located too left on the most left
"B", and is located too right on the middle (that is of middle form of
B) and on the most right "B" (that is of starter form of B). it should
be located right above the below dot.

- this was the problem that i could not solve.

also there are problems that i could solve by using 1) rtl override
mark; 2) and using start, middle, end, separate B characters instead
of using simple arabic B, that would be easier. (you can see in the
example that that characters are used). (using different forms of
letter can also be achieved by using php or javascript, etc).




2018-05-17 22:12 GMT+03:00 Richard Wordingham via Unicode
:

On Thu, 17 May 2018 09:49:55 +0300
dinar qurbanov via Unicode  wrote:


how to make custom combining diacritical marks for arabic letters?
should only font drivers and programs support it, or should also
unicode support it, for example, have special area for them?

as i know, private use area can be used to make combining diacritical
marks for latin script without problems.

but when i tried, several years ago, to make that for arabic script,
with fontforge, i had to use right to left override mark, and manually
insert beginning, middle, ending forms of arabic letters, and even
then, my custom marks were not located very properly above letters.

I'm offering suggestions, but I don't think that they will work.

The one thing that may help you is that these marks cannot appear in
plain text.  There are a number of things you need to do:

1) Persuade the renderer to treat your character as being a run in a
single script.  You might be able to do this by:

a) Not having any lookups for the Arabic script.

b) Using RLM to persuade the renderer that you have a right-to-left run.

It is just possible that this may fail with OpenType fonts but work
with Graphite or AAT fonts.  If it works, you will then have to
implement all the Arabic shaping yourself.

2) If OpenType fonts will treat the data as a single script run, you
will need to ensure that there is an OpenType substitution feature that
the renderer will support.  Fortunately, many modern text applications
will allow you to force the ccmp feature to be enabled - I have used
such feature forcing with OpenType in LibreOffice and also in HTML,
which renders accordingly in all the modern browsers I have tested - MS
Edge on Windows 10, Firefox and, on iPhones, Safari.  While the ccmp
feature is enabled for the PUA in Firefox, it is disabled in MS Edge on
Windows 10.

3) I believe AAT will soon be available for products using the HarfBuzz
layout engine, so it is likely to become available on Firefox and
LibreOffice.  If AAT looks like a solution, you may need to research the
attitudes of Chrome and OpenOffice, for I believe they have chosen not
to support Graphite.

A totally different solution would be to recompile your application so
that it believes that your diacritics are in the Arabic script.

Richard.


Re: New Unicode Working Group: Message Formatting

2020-01-14 Thread Nelson H. F. Beebe via Unicode
William, this is off the Unicode list.

See

http://mathreader.livejournal.com/9239.html 

for a list of 207 variants of Chebyshev's name.

---
- Nelson H. F. BeebeTel: +1 801 581 5254  -
- University of UtahFAX: +1 801 581 4148  -
- Department of Mathematics, 110 LCBInternet e-mail: be...@math.utah.edu  -
- 155 S 1400 E RM 233   be...@acm.org  be...@computer.org -
- Salt Lake City, UT 84112-0090, USAURL: http://www.math.utah.edu/~beebe/ -
---


Re: New Unicode Working Group: Message Formatting

2020-01-14 Thread wjgo_10...@btinternet.com via Unicode


The reply from Mr Verdy has indeed been helpful, as has an offlist 
private reply from someone who has, thus far, not been a participant in 
this thread.


Mr Verdy wrote:

You seem to have never seen how translation packages work and are used 
in common projects (not just CLDR, but you could find them as well in 
Wikimedia projects, or translation packages for lot of open source 
packages).

What seems to be the case to Mr Verdy is in fact the actual situation.


I do not satisfy the second of the two conditions of the invitation to 
join the working group. I am, in fact, retired and I have never worked 
in the i18n/l10n industry. Also, from the explanations it is not as 
close to my research interests as I had thought, and indeed hoped. I 
just do what I can on my research project from time to time using a home 
computer, a personal webspace hosted by an internet service provider, 
some budget software, mainly High-Logic FontCreator, and Serif PagePlus 
desktop publishing package, together with the software bundled with 
Windows 10. Older people are often advised to try to keep the mind 
active, so my research activity at least does that. If the research 
itself has benefits more generally in making progress in the application 
of information technology then that is an additional benefit.


One thing of which you might like to take account, and specifically 
"build out" in computer formatting, is a tendency that can occur in some 
computer systems software, and also occurred in everyday transactions 
before computers became widespread, namely of not allowing a person to 
be recorded or listed with more than two initials before his or her 
surname, to the extent that some people even have a practice of not 
using more than two initials even when the document, such as a letter, 
or a form, before them specifically uses three or more initials. Common 
explanations are that "It's for the computer" and "Two initials is 
enough to identify someone" and "Someone could have many names". Yet the 
second is not true, and the first is only because somewhere along the 
line someone has decided that that is how it is to be done: the third is 
true, but the fact that that is the person's name on his or her birth 
certificate is the legal fact of the matter and so needs to be properly 
accommodated in systems recording names.


Also, the United Kingdom and United States format of a given name, one 
or more additional given names, then a surname is not suitable for some 
other cultures. I remember some registration forms for college courses 
that would ask for surname and forenames, with a panel for each, 
together with a printed note on every such form: "If your name cannot be 
expressed in that format, please write your whole name in the box 
labelled 'surname'".


However, with localization there are other issues. I seem to remember 
reading somewhere that people whose name is correctly expressed in a 
script other than Latin script often have a transliterated "Romanized 
form" of their name as well, for use on travel documents. So will your 
format system include provision for this, such as by allowing both forms 
to be linked together in a document please?


Another feature is that I have known people from various countries who 
have chosen to be known in everyday workplace situations by an English 
first name rather than their official given name, while using their 
original surname, perhaps transliterated. So it would be good if the 
name format accounts for that too please, in a manner that does not give 
the possible impression of that use being for some questionable purpose. 
Maybe a new term such as ChosenSocialName could be used for that please.
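
A record shape along these lines might look as follows (a sketch; all
field names, including a snake_case rendering of ChosenSocialName, are
invented for illustration):

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class PersonName:
        full_name: str                             # name in its native script
        romanized: Optional[str] = None            # form used on travel documents
        chosen_social_name: Optional[str] = None   # everyday workplace name, if any
        initials: list = field(default_factory=list)  # any number, never capped at two

    n = PersonName("Пафнутий Львович Чебышёв", romanized="Chebyshev")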


An interesting facet of transliteration is that the name of a famous 
mathematician, properly written using Cyrillic characters, was 
transliterated into English as Chebyshev, whereas the polynomials named 
after him are each designated by including the letter T. The 
transliteration of the name of the mathematician into German starts with 
a T rather than the C used in English. There was a short thread in this 
mailing list, around the year 2000 though not necessarily in the year 
2000 itself, that explored this topic, but I have not been able to 
locate it.


William Overington

Tuesday 14 January 2020



Re: Geological symbols

2020-01-14 Thread Hans Åberg via Unicode
For rendering, you might have a look at ConTeXt, because I recall it has an 
option whereby Unicode super- and sub-scripts can be displayed over each other 
without extra processing.


> On 14 Jan 2020, at 06:44, via Unicode  wrote:
> 
> Thanks for your reply. I think actually LaTeX is not a good option for our 
> purpose, because we want to create and disseminate datasets which are easy to 
> use and do not require any software or special font installation. Thus, we’ll 
> live with the little bit uglier version.
> Anyway, thanks!
> Thomas
>  




AW: Geological symbols

2020-01-13 Thread via Unicode
Thanks for your reply. I think actually LaTeX is not a good option for our 
purpose, because we want to create and disseminate datasets which are easy to 
use and do not require any software or special font installation. Thus, we’ll 
live with the little bit uglier version.
Anyway, thanks!
Thomas
 
Von: "Jörg Knappen"  
Gesendet: Dienstag, 14. Januar 2020 00:11
An: tho...@monmap.mn
Cc: unicode@unicode.org
Betreff: Aw: Geological symbols
 
Hello Thomas,
 
Unicode delegates this (combined superscripts and subscripts) to higher level 
markup languages or Rich Text Editors.
 
I don't know how widespread the use of LateX is among geologists, but notation 
like this is a perfect use case for LaTeX.
 
--Jörg Knappen
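
For instance, a minimal LaTeX sketch of the example (both indices attach
to the same letter, so they stack vertically with no intervening space):

    \documentclass{article}
    \begin{document}
    % Q with subscript 1 directly below the superscript range 1-2
    $Q_1^{1\textrm{--}2}$
    \end{document}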
  
  
Sent: Monday, 13 January 2020 at 12:20
From: "Thomas Spehs (MonMap) via Unicode" 
To: unicode@unicode.org
Subject: Geological symbols
Hi, I would like to ask if there is any way to create geological “symbols” with 
Unicode such as: Q₁¹ˉ², but with the two “1”s over each other, without a space. 
Thanks!


Re: New Unicode Working Group: Message Formatting

2020-01-13 Thread Steven R. Loomis via Unicode


> On Jan. 11, 2020, at 11:37 a.m., wjgo_10...@btinternet.com via Unicode 
>  wrote:
> 
> A person in England, …

As noted in the blog, the scope of this working group is a syntax for "adapting 
programs”. It is not intended for individual communication between two persons.

> Where does the translation of the text take place please, and by whom or by 
> which computer?

The question of when and how the message translation takes place is also out of 
scope for the Working Group. Mr. Verdy has given a great summary introduction 
to the process in a separate reply.


--
Steven R. Loomis | @srl295 | git.io/srl295


Re: Geological symbols

2020-01-13 Thread Philippe Verdy via Unicode
It is possible with some other markup languages, including HTML, by using
ruby notation and other interlinear notations for creating special vertical
layouts inside a horizontal line.

There are difficulties, however, caused by line wraps, which may occur
before the vertical layout, or even inside it for each stacked item, and
by the need to manage the line height for the whole line. Finally you could
end up with the same problems as those found in mathematical formulas...
and in composing Egyptian hieroglyphs or Visible Speech, for which a markup
language has to be defined (with a convention, similar to an orthographic
or typographic convention) in addition to the core characters that are used
to build up the composition, and possibly some extra styling (to adjust the
size of individual items, or to align them properly in the stack and fit
them cleanly in the composition area, e.g. an ideographic square). Final
difficulties are added by bidirectionality.

Not all texts are purely linear (unidimensional), and a linear
representation is difficult to interpret without adding the markup syntax
inside the source text and sometimes even adding extra symbols (or
punctuation) in the linear composition, which would not be needed in a true
bidimensional layout. Unicode does not encode characters for the second
dimension and the layout, so it's up to markup languages (or orthographic
conventions) to define the extra semantics and/or layout. A font alone
cannot guess without these conventions, and even if these conventions are
used, the assumptions made could sometimes infer the incorrect layout.




Le lun. 13 janv. 2020 à 17:16, Oren Watson via Unicode 
a écrit :

> This is not possible in unicode plaintext as far as I can tell, since
> Unicode doesn't allow overstriking arbitrary characters over each other the
> way more advanced layout systems, e.g. LaTeX do. It is however possible to
> engineer a font to arrange those characters like that by using aggressive
> kerning.
>
>
> On Mon, Jan 13, 2020 at 10:14 AM Thomas Spehs (MonMap) via Unicode <
> unicode@unicode.org> wrote:
>
>> Hi, I would like to ask if there is any way to create geological
>> “symbols” with Unicode such as: Q₁¹ˉ², but with the two “1”s over each
>> other, without a space. Thanks!
>>
>


Re: New Unicode Working Group: Message Formatting

2020-01-13 Thread wjgo_10...@btinternet.com via Unicode

I notice that in the web page

https://github.com/unicode-org/message-format-wg/issues/3

there is a request to add more features.

One of those requested features is as follows


Inflections (genders, articles, delensions, etc.)


So I am wondering quite what formats will be covered by the project and 
how those formats can be applied, in various contexts, not necessarily 
only those initially considered.


William Overington

Monday 13 January 2020



Re: Geological symbols

2020-01-13 Thread Oren Watson via Unicode
This is not possible in Unicode plain text as far as I can tell, since
Unicode doesn't allow overstriking arbitrary characters over each other the
way more advanced layout systems, e.g. LaTeX, do. It is however possible to
engineer a font to arrange those characters like that by using aggressive
kerning.


On Mon, Jan 13, 2020 at 10:14 AM Thomas Spehs (MonMap) via Unicode <
unicode@unicode.org> wrote:

> Hi, I would like to ask if there is any way to create geological “symbols”
> with Unicode such as: Q₁¹ˉ², but with the two “1”s over each other,
> without a space. Thanks!
>


Re: New Unicode Working Group: Message Formatting

2020-01-11 Thread Philippe Verdy via Unicode
You seem to have never seen how translation packages work and are used in
common projects (not just CLDR; you could find them as well in Wikimedia
projects, or in translation packages for lots of open source packages).
The purpose is to allow translating the UI of these applications into the
user's demanded language. Internally the application can use whatever
representation it needs: it may be in any language or could be just an
identifier; here this does not matter, as they are independent of the final
translation rendered. In CLDR, identifiers are used (more or less based on
simplified English, sometimes abbreviations or conventional codes). In
typical .po(t) packages the identifiers are the strings in the source
language from which the software was built and its strings extracted, to be
replaced by calling an API.
Various projects do not always use English as the source of their
translation and even if this is the source, the strings themselves are not
always the unique identifiers used.
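
A sketch of that pattern with Python's standard-library gettext (assumes
a compiled catalog exists at locale/de/LC_MESSAGES/app.mo; the domain and
paths here are invented):

    import gettext

    t = gettext.translation("app", localedir="locale", languages=["de"])
    _ = t.gettext   # maps the source-language string to its translation

    print(_("The package will arrive at {time} on {date}.")
          .format(time="14:00", date="2020-01-20"))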

If you send your package and need to print it, of course you'll print the
label in a chosen language. Nothing forbids the label from displaying both
languages, i.e. two copies of the message translated into two languages
(English and German in your example; just look at the printed notices you
find in your purchase packages: the booklets frequently include multiple
copies, one per language, often a dozen for products imported from China to
Europe; even food is frequently labeled in several languages for
international brands).

If needed, product descriptions or source and delivery addresses will be
accessible via an online web app by printing a barcode or QR code on the
label (they will be converted to a URI): a URI by itself has no language;
it's also an identifier, allowing retrieval of the texts in multiple
languages or in the language of the user's choice.

So your question makes no sense with the example you give.

Le sam. 11 janv. 2020 à 21:21, wjgo_10...@btinternet.com via Unicode <
unicode@unicode.org> a écrit :

> A person in England, who knows no German, wants to send the parcel to a
> person in Germany, who knows no English.
>
> The person in England wants to send a message about the delivery to the
> person in Germany..
>
> > English: “The package will arrive at {time} on {date}.”
>
> The person want to send the message by email.
>
> > German: “Das Paket wird am {date} um {time} geliefert.”
>
> Where does the translation of the text take place please, and by whom or
> by which computer?
>
> During the actual  transmission from the computer in England to the
> computer in Germany, is the text of the string in English, or German, or
> in a language-independent form please?
>
> 
>
> If the parcel were being sent from France to Germany by a person who
> knows only French, during the transmission of the message about the
> parcel, is the text of the string in French, or English, or German, or
> in a language-independent form please?
>
> William Overington
>
> Saturday 11 January 2020
>
>


Re: New Unicode Working Group: Message Formatting

2020-01-11 Thread wjgo_10...@btinternet.com via Unicode
A person in England, who knows no German, wants to send the parcel to a 
person in Germany, who knows no English.


The person in England wants to send a message about the delivery to the 
person in Germany.



English: “The package will arrive at {time} on {date}.”


The person wants to send the message by email.


German: “Das Paket wird am {date} um {time} geliefert.”


Where does the translation of the text take place please, and by whom or 
by which computer?


During the actual transmission from the computer in England to the 
computer in Germany, is the text of the string in English, or German, or 
in a language-independent form please?




If the parcel were being sent from France to Germany by a person who 
knows only French, during the transmission of the message about the 
parcel, is the text of the string in French, or English, or German, or 
in a language-independent form please?


William Overington

Saturday 11 January 2020



Re: New Unicode Working Group: Message Formatting

2020-01-10 Thread James Kass via Unicode
Yes, thank you, that answers the question.  Format rather than 
repertoire.  Please note, though, that the example given of a 
localizable message string is also an example of a localized sentence.


On 2020-01-10 11:17 PM, Steven R. Loomis wrote:

James,

A localizable message string is one similar to those given in the example:
English: “The package will arrive at {time} on {date}.”
German: “Das Paket wird am {date} um {time} geliefert.”

The message string may contain any number of complete sentences, including zero 
( “Arrival: {time}” ).

The Message Format Working Group is to define the *format* of the strings, not 
their *repertoire*. That is, should the string be “Arrival: %s” or “Arrival: 
${date}” or “Arrival: {0}”?


Does that answer your question?

--
Steven R. Loomis | @srl295 | git.io/srl295
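
In code terms, the question is which placeholder syntax a catalog entry
uses; a minimal sketch with named placeholders (the message identifiers
are invented):

    CATALOG = {
        "en": {"package.arrival": "The package will arrive at {time} on {date}."},
        "de": {"package.arrival": "Das Paket wird am {date} um {time} geliefert."},
    }

    def format_message(locale, msg_id, **args):
        # The same arguments fill differently ordered placeholders per locale.
        return CATALOG[locale][msg_id].format(**args)

    print(format_message("de", "package.arrival", time="14:00", date="2020-01-20"))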




On Jan. 10, 2020, at 2:48 p.m., James Kass via Unicode 
 wrote:


On 2020-01-10 9:55 PM, announceme...@unicode.org wrote:

But until now we have not had a syntax for localizable message strings 
standardized by Unicode.

What is the difference between “localizable message strings” and “localized 
sentences”?  Asking for a friend.




