Is there an IBM group mark symbol?

2015-01-30 Thread Ken Shirriff
I'm writing about the IBM 1401 and there's one character from its character
set that I couldn't find in Unicode: the group mark. The group mark is
three horizontal lines with a vertical line through it (see attached
image). This character is used in various books and publications, so it's a
real symbol that is used in text. Would it make sense for me to submit a
proposal to add this character?

Group mark image (from
https://en.wikipedia.org/wiki/IBM_1401#Character_and_op_codes):


Thank you,
Ken
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Is there an IBM group mark symbol?

2015-01-30 Thread Roozbeh Pournader
There may be something like it in the math symbol sets, but if there
isn't, please feel free to submit a proposal.




Re: UAX 29 questions

2015-01-30 Thread Philippe Verdy
2015-01-30 9:32 GMT+01:00 Mark Davis ☕️ m...@macchiato.com:

 2. Also, the following 2 rules are not equivalent:

 a) Any  × (Format | Extend)
 b) X (Extend | Format)* → X


That's what I replied in my first message, but using an "as if" that was
not clear enough. My second reply reformulated it by making the right side
explicit (the substitution occurring in the next rules, which you view as
a shortcut).

Your first argument about convolution is not well justified for WB56 and
WB57, which remain clear when rewritten with ALetter and Hebrew_Letter
separated.

I also note that the case handled for Hebrew apostrophes/quotes exists in
the Latin script too (English alone has it), in the context of word
breaking only. It does not apply to line breaking or to syllable breaking
for hyphenation, which are other types of breakers.

The rule about Format and Extend is still kept separate in WB56 and listed
first only because it correctly preserves canonical equivalence for
extenders, which include all combining characters with a non-zero combining
class. It also embodies the golden rule of not breaking in the middle of
default grapheme clusters, which covers joiners such as CGJ and ZWJ in any
breaking algorithm (except code point breakers for some conforming UTFs
such as UTF-16).

WB57 is evidently subject to tailoring. It just provides a default
behavior where the single quote/apostrophe is handled as an elision mark,
most often used at the end of words and glued to the next word without
space separation.

WB57 also handles the case where the single quote is followed by spaces or
other punctuation, and is then not an orthographic elision mark but
punctuation marking the end of a quotation.

One problem is that the SingleQuote class used in WB57 is possibly too
large: only a smaller number of single-quote-like characters act as an
elision mark (apostrophe).

The other problem with WB57 is that it assumes that elision marked by
apostrophes occurs only at the end of words (not true even for English).
This is where per-language tailoring is not only possible but most probably
recommended.

Such tailoring will also affect the behavior of WB56 (notably in English,
French, Italian... where the apostrophe is lexicalized and its usage is
regulated by the standard grammar).



But I wonder whether tailoring of WB57 is not also needed for Hebrew. I see
WB57 only as an initial default tailoring for the script itself, not for
the actual language (which may also be Yiddish). Tailoring could also cover
the usual transcriptions of foreign words, or common but informal
abbreviations/contractions (the apostrophe is highly preferred to the dot
for abbreviating/contracting in the middle of a word, notably when the
abbreviated part is not even pronounced but completely elided).

It also seems that Swedish may use the colon in the middle of a word,
without space separation, instead of an apostrophe.

Other languages may prefer other signs for elisions (including a hyphen,
which does not break words but only syllables, as candidate break positions
in long lines), notably if there is confusion with quote-like letters.

Another common notation (found in French typography) uses superscripts for
the final letters when letters are dropped in the middle of a word, but
this is in fact just a written abbreviation. It totally replaces the
abbreviation dot, which is normally never used in the middle of a word and
is completely eliminated in acronyms. This is not really an elision: the
abbreviated word with superscript is still read in full, without the
elision, so the apostrophe cannot be used.


Re: Is there an IBM group mark symbol?

2015-01-30 Thread Frédéric Grosshans

On 30/01/2015 17:55, Ken Shirriff wrote:
I'm writing about the IBM 1401 and there's one character from its 
character set that I couldn't find in Unicode: the group mark. The 
group mark is three horizontal lines with a vertical line through it 
(see attached image). This character is used in various books and 
publications, so it's a real symbol that is used in text. Would it 
make sense for me to submit a proposal to add this character?


In May 2007, Ken Whistler answered a slightly more general question on
old IBM characters:


http://unicode.org/mail-arch/unicode-ml/y2007-m05/0373.html

The group mark was the most problematic, and his answer was:

 * You can see it as a glyph variant of ␝ U+241D SYMBOL FOR GROUP SEPARATOR
 * You can have a symbol of the same appearance by combining ≡⃒ U+2261
   IDENTICAL TO, U+20D2 COMBINING LONG VERTICAL LINE OVERLAY.

However, neither solution seems really practical, and I didn’t find any
corresponding symbol (including among the variants in
http://unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt ). A
proposal might help add it to the standard.
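
For anyone who wants to inspect the two candidates mentioned above, here is
a minimal Python sketch. It only uses the standard unicodedata module; the
constant and function names (GROUP_SEPARATOR_SYMBOL, GROUP_MARK_APPROX,
describe) are mine and purely illustrative. It lists the code points and
checks that the approximating sequence stays two code points under NFC, so
it remains a combining sequence and not a single character.

```python
import unicodedata

# The two candidates from the list above (names are illustrative only).
GROUP_SEPARATOR_SYMBOL = "\u241D"   # SYMBOL FOR GROUP SEPARATOR
GROUP_MARK_APPROX = "\u2261\u20D2"  # IDENTICAL TO + combining vertical overlay


def describe(text):
    """Return (code point, character name, combining class) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in text
    ]


for label, text in [("symbol", GROUP_SEPARATOR_SYMBOL),
                    ("sequence", GROUP_MARK_APPROX)]:
    print(label, text)
    for row in describe(text):
        print("   ", *row)

# Normalizing to NFC does not fuse the pair into one code point.
nfc = unicodedata.normalize("NFC", GROUP_MARK_APPROX)
print("code points after NFC:", len(nfc))
```

How the sequence actually renders still depends on the font, which is why
neither option is a real substitute for an encoded character.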


Frédéric





Re: Is there an IBM group mark symbol?

2015-01-30 Thread Jean-François Colson


On 30/01/15 18:30, Jean-François Colson wrote:
On 30/01/15 17:55, Ken Shirriff wrote:
I'm writing about the IBM 1401 and there's one character from its 
character set that I couldn't find in Unicode: the group mark. The 
group mark is three horizontal lines with a vertical line through it 
(see attached image). This character is used in various books and 
publications, so it's a real symbol that is used in text. Would it 
make sense for me to submit a proposal to add this character?


Why not?
In the meantime, you could approximate it with U+2261 IDENTICAL TO
followed by U+20D2 COMBINING LONG VERTICAL LINE OVERLAY: ≡⃒

Here is what that looks like in FreeMono: http://colson.eu/≡⃒.png
and in DejaVu Sans Mono: http://colson.eu/≡⃒..png







Re: UAX 29 questions

2015-01-30 Thread Mark Davis ☕️
I apologize in advance: I'm running low on time and didn't go through
all the messages on this thread carefully, so I may not be fully
appreciating people's positions. I'm just making some quick points about 2
items that caught my eye.


1. There are certainly times where two rules in sequence may overlap, just
for simplicity.

X Y* × Z
Y × Z* W

The first rule could trigger on X Y Z W, even though the second would also
trigger on it. This may or may not be sloppiness; sometimes it simply
makes the second rule too convoluted to also exclude triggering on
everything that could possibly trigger earlier.
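
To make the overlap concrete, here is a minimal Python sketch of ordered,
first-match-wins rule evaluation. It is not how UAX #29 is specified or
implemented: it assumes each class is a single letter, so a text is a
string such as "XYZW", and it encodes each rule as a pair of regular
expressions for the left and right context of a boundary. The names RULES,
matching_rules and first_match are hypothetical.

```python
import re

# Each rule: (name, regex for the text left of the boundary,
#             regex for the text right of the boundary).
# Classes are single letters, so "XYZW" is a sequence of four classes.
RULES = [
    ("rule 1: X Y* × Z", re.compile(r"XY*$"), re.compile(r"Z")),
    ("rule 2: Y × Z* W", re.compile(r"Y$"), re.compile(r"Z*W")),
]


def matching_rules(classes, pos):
    """Names of all rules whose contexts match around boundary `pos`."""
    left, right = classes[:pos], classes[pos:]
    return [name for name, left_re, right_re in RULES
            if left_re.search(left) and right_re.match(right)]


def first_match(classes, pos):
    """The rule that decides the boundary: the first one that matches."""
    hits = matching_rules(classes, pos)
    return hits[0] if hits else None


# Boundary between Y and Z in X Y Z W: both rules match, rule 1 decides.
print(matching_rules("XYZW", 2))
print(first_match("XYZW", 2))
# Without the leading X only rule 2 applies.
print(first_match("YZW", 1))
```

Because evaluation stops at the first hit, the second rule never needs to
exclude the cases already covered by the first one.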

That being said, if there are simplifications in the rules that would make
them clearer, I'd suggest submitting a proposal for that. The UTC is meeting
next week, and could consider it either then or at subsequent meetings.

Note: the HTML files in http://unicode.org/Public/UNIDATA/auxiliary/ have a
number of sample cases (which are also used in the test files). Hovering
over boundaries in those sample cases shows which rule is triggered, such
as in
http://unicode.org/Public/UNIDATA/auxiliary/GraphemeBreakTest.html#samples

We're always open to additional samples that are illustrative of how the
rules work. As I thought about your message, it became clear to me that it
would be useful to have a complete enough set of sample cases that each
rule is triggered by at least one case, if you or anyone else is interested
in helping to add those.


2. Also, the following 2 rules are not equivalent:

a) Any  × (Format | Extend)
b) X (Extend | Format)* → X

(b) implies (a), but not the reverse. The difference is on the right side
of characters. Rule (b) affects every subsequent rule, and can be viewed as
a shorthand. After it, we can just say:

A B × C D

And that has the effect of saying:

A (Extend | Format)* B (Extend | Format)* × C (Extend | Format)* D

See also http://unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules
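
As a rough illustration of that shorthand, here is a minimal Python sketch.
It is a simplification under stated assumptions, not the UAX #29 algorithm:
it works on a list of class names, lets X be any class (the real rule
excludes some classes such as line ends), and the names IGNORABLE, collapse
and no_break_positions are made up for this example. It first applies
X (Extend | Format)* → X, then matches a rule such as A B × C D on the
collapsed sequence and maps the result back to the original positions.

```python
IGNORABLE = {"Extend", "Format"}


def collapse(classes):
    """Apply X (Extend | Format)* -> X.

    Returns (kept, starts): the classes that survive, and for each one the
    index in the original sequence where it starts. An ignorable class at
    the very start of the text has no X to attach to, so it is kept.
    """
    kept, starts = [], []
    for i, cls in enumerate(classes):
        if cls in IGNORABLE and kept:
            continue  # absorbed into the preceding X
        kept.append(cls)
        starts.append(i)
    return kept, starts


def no_break_positions(classes, rule_left, rule_right):
    """Boundaries (indices into `classes`) where a break is forbidden."""
    kept, starts = collapse(classes)
    # Boundaries inside an absorbed run: this is the part that gives (a).
    forbidden = set(range(1, len(classes))) - set(starts)
    n_left, n_right = len(rule_left), len(rule_right)
    for k in range(n_left, len(kept) - n_right + 1):
        if kept[k - n_left:k] == rule_left and kept[k:k + n_right] == rule_right:
            forbidden.add(starts[k])  # boundary before the right context
    return sorted(forbidden)


text = ["A", "Extend", "B", "Format", "Format", "C", "D"]
print(collapse(text)[0])
print(no_break_positions(text, ["A", "B"], ["C", "D"]))
# A letter followed by an Extend character: no break between them.
print(no_break_positions(["ALetter", "Extend", "Other"], ["A"], ["B"]))
```

In this sketch the no-break positions inside each absorbed run fall out of
the collapsing step itself, which is one way to see why (b) implies (a).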

However, it may not be clear that (b) implies (a); that might be what you
are getting at. If so, then we could add an explicit statement to that
effect.



Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*

On Thu, Jan 29, 2015 at 7:52 PM, Karl Williamson pub...@khwilliamson.com
wrote:

 On 01/25/2015 05:14 AM, Philippe Verdy wrote:

 This is not a contradiction.


 At the very least it is too sloppy for a standard.  Once there is a match
 in the list of rules, later rules shouldn't have to be looked at.  I'll
 submit a formal feedback form.

 But there is another issue as well.  I do not see how the specified rules
 when applied to the sequence of code points:

 U+0041 U+200D U+0020

 cause the ZWJ, an Extend, to not break with the A, an ALetter.

 Rule WB4 is

 Ignore Format and Extend characters, except when they appear at the
 beginning of a region of text.

 Not clearly stated, but it appears to me that the ZWJ must be considered
 here to be the beginning of a region of text, as we are looking at the
 boundary between it and the A.  No rule specifically mentions ALetter
 followed by an Extend, so by the default rule, WB14

 Otherwise, break everywhere (including around ideographs)

 this should be a word break position.  But that is absurd, as the Extend
 is supposed to extend what precedes it.  If I add a rule

 Don't break before Extend or Format
 × (Extend | Format)

 my implementation passes all tests.  I added this rule before WB4.



 combine the two rules and they are equivalent to these two alternate
 rules:
 WB56 can be read as these two:

   (WB56a) ALetter  ×  (MidLetter | MidNumLet | Single_Quote) (ALetter |
 Hebrew_Letter)

   (WB56b) Hebrew_Letter  ×  (MidLetter | MidNumLet | Single_Quote)
 (ALetter | Hebrew_Letter)


 Then add :

(WB57) Hebrew_Letter ×  Single_Quote

 it just removes the condition of a letter following the quote  in WB56b.
 So that WB56b and WB57 can be read as equivalent to these two:

   (WB56c) Hebrew_Letter  ×  (MidLetter | MidNumLet) (ALetter |
 Hebrew_Letter)

   (WB57) Hebrew_Letter × Single_Quote

 But you cannot merge any of these two last rules in a single rule for
 WB56.


 2015-01-25 7:26 GMT+01:00 Karl Williamson pub...@khwilliamson.com:

 I vaguely recall asking something like this before, but if so, I
 didn't save the answers, and a search of the archives didn't turn up
 anything.

 Some of the rules in UAX #29 don't make sense to me.

 For example, rule WB7a
Hebrew_Letter ×   Single_Quote

 seems to say that a Hebrew_Letter followed by a Single Quote
 shouldn't break.  (And Rule WB4 says that actually there can be
 Extend and Format characters between the two and those should be
 ignored).

 But the earlier rule, WB6

   (ALetter | Hebrew_Letter)  ×   (MidLetter | MidNumLet |
 Single_Quote) (ALetter | Hebrew_Letter)

 seems to me to say (among other things) that a Hebrew Letter