date:20130915

Re: Origin of Ellipsis (was: RE: Empty set)

2013-09-15 Thread Stephan Stiller


Doug wrote me:

You're not confusing "code point" with "code unit," are you?

Thanks for the note.

I think what you say is that I thought (or meant to write) "by first 
representing the sequence of scalar values in an encoding form and then 
counting [code points typecast from] code _units_". I think you are 
right, but there are some points of confusion, see below. Somehow I 
thought of "surrogate pair" as "pair of (surrogate) code points" instead 
of "pair of (surrogate) code units". I guess that additional level of 
indirection would make my interpretation (b) unlikely ... I think my 
statement is still technically correct because counting code points for 
UTF-16 and code units for UTF-16 leads to the same count.


What's confusing is a term like "high-surrogate code point" (see 
glossary). If surrogate code points are not encoded, then they 
practically don't exist in the ontology of Unicode terms, aside from 
being holes in the scalar value range, if thought of as a subrange of 
the integers.


In detail: The glossary defines "surrogate code point" as: "A Unicode 
code point in the range U+D800..U+DFFF. Reserved _for use_ by UTF-16, 
where _a pair of surrogate code units_ (a high surrogate followed by a 
low surrogate) “stand in” for a supplementary code point." This 
definition doesn't say much; it says the code _points_ are "for _use_ 
by UTF-16", but then UTF-16 uses surrogate code units, not surrogate 
code points. C1 in TUS §3.2 says: "The high-surrogate and low-surrogate 
code _points_ _are designated for_ surrogate code _units_ in the UTF-16 
character encoding form." But the actual definitions used for UTF-16 
don't seem to conceptually _derive_ "surrogate code unit" from 
"surrogate code point". => ??
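
The relationship between a supplementary code point and its pair of surrogate code units can be illustrated with a short Python sketch. This is not from the thread: the helper name to_surrogate_pair is hypothetical, and the arithmetic is the usual UTF-16 one. It also shows that a lone surrogate code point, not being a scalar value, is rejected by Python's UTF-8 encoder.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into two UTF-16 surrogate code units.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000               # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder.
encoded = "\U0001F600".encode("utf-16-be")
units = (int.from_bytes(encoded[:2], "big"), int.from_bytes(encoded[2:], "big"))

# A lone surrogate code point is not a scalar value, so UTF-8 refuses it.
try:
    "\ud800".encode("utf-8")
    lone_surrogate_encodable = True
except UnicodeEncodeError:
    lone_surrogate_encodable = False
```

The numbers D800..DFFF appear here only as 16-bit code unit values in the encoded form; the string being encoded never contains them as code points.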


Still, I don't understand why people keep talking about code points. For 
me conceptually (albeit not historically) everything starts with scalar 
values (which are index values for certain abstract things). Scalar 
values are then encoded by encoding forms (and then serialized in 
encoding schemes). Why does everyone talk about the more generic "code 
point" instead of "scalar value", when non-scalar-value code points 
aren't used? (Because we're not using surrogate code point pairs, we're 
instead using surrogate code unit pairs.) Anyways, I understand that 
KenW and Mark Davis have pointed to earlier debates on this in an 
earlier thread.


Stephan

Re: Origin of Ellipsis (was: RE: Empty set)

2013-09-15 Thread Stephan Stiller


Stephan Stiller wrote:

From the link it isn't entirely clear whether they
(a) count scalar values of NFC or
(b) count code points of NFC.

Are they not the same thing, except for surrogates?
Conceptually no, but numerically yes – you are right in that regard, and 
I wasn't precise in my description of (b). I suppose if you read their 
description literally (they say they use UTF-8 internally), it follows 
that they're forbidding surrogates, because these are invalid in UTF-8. 
(Is this what they're doing? I guess the answer wouldn't matter for 
someone who only produces Tweets properly composed of a sequence of 
scalar values.)


Then, when they write that "Twitter also counts the number of codepoints 
in the text rather than UTF-8 bytes", it makes me wonder whether they're 
maybe handling the data in UTF-16 in the relevant procedure that checks 
for length. The elementary unit of abstract "text" is for me the scalar 
value. When they write "code point", that means they've just implicitly 
typecast from "scalar value" to "code point", and the question is how 
the typecasting was performed: by directly interpreting the scalar 
values as numbers of type "code point" or by first representing the 
sequence of scalar values in an encoding form and then counting code 
points? My assumption would naturally be the former, which would also be 
consistent with vulgar :-) (popular) use of these terms – but I had to 
read Twitter's description a couple of times to make sense of it.
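
The different ways of counting can be compared in a minimal Python sketch. It is not Twitter's implementation; the function name counts and the sample string are assumptions for illustration. For well-formed text, Python's len() on a str gives the number of scalar values.

```python
import unicodedata


def counts(text: str) -> dict[str, int]:
    """Count a string three ways after NFC normalization (illustrative only)."""
    nfc = unicodedata.normalize("NFC", text)
    return {
        # One per scalar value (assuming no lone surrogates in the input).
        "scalar_values": len(nfc),
        # UTF-16 code units: two bytes each in the utf-16-le encoding.
        "utf16_code_units": len(nfc.encode("utf-16-le")) // 2,
        # UTF-8 code units, i.e. bytes.
        "utf8_bytes": len(nfc.encode("utf-8")),
    }


# "cafe" + COMBINING ACUTE ACCENT, a space, and U+1F600 (outside the BMP).
sample = "cafe\u0301 \U0001F600"
result = counts(sample)
print(result)
```

The scalar value count and the UTF-16 code unit count differ only when the text contains supplementary characters, each of which occupies two code units.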


Stephan

Re: Origin of Ellipsis (was: RE: Empty set)

2013-09-15 Thread Ilya Zakharevich

On Sun, Sep 15, 2013 at 09:21:47PM +0200, Philippe Verdy wrote:
> If there's something to do now (given it is no longer used in CJK
> contexts), it's to strongly recommend that fonts map them to exactly the
> same glyph as the one obtained by aligning three periods in a row without
> any additional space or kerning.

… unless preceded or followed by a period… .

AND: your “with no additional kerning” would be better read as
“exactly the same kerning as the kerning for the sequence of dots, 
which must be tuned to follow the typographic tradition of the
script/language”.

Ilya

Re: Origin of Ellipsis (was: RE: Empty set)

2013-09-15 Thread Doug Ewell


Addison Phillips wrote:


Not if the limit is counted in characters and not in bytes. Twitter,
for example, counts code points in the NFC representation of a tweet.


You're right. I take that back, about Twitter at least.

Stephan Stiller wrote:


From the link it isn't entirely clear whether they
(a) count scalar values of NFC or
(b) count code points of NFC.


Are they not the same thing, except for surrogates?

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell

Re: Origin of Ellipsis (was: RE: Empty set)

2013-09-15 Thread CE Whitehead



From: Doug Ewell 


Date: Sun, 15 Sep 2013 14:04:05 -0600



> Andre Schappo wrote:



>> U+2026 is useful for microblogs when one is looking to save characters



> Not if the microblog is in UTF-8, as almost all are.



Why not just type: 
. . . 
(I suppose this fails too, as the ellipsis can now break at line breaks.)
(In HTML code it works, of course: 

<pre>.&nbsp;.&nbsp;.</pre>

Note that the pre tags are just to prevent the nbsp's from getting converted to 
spaces.)

Best,

--C. E. Whitehead
cewcat...@hotmail.com
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell

Re: Origin of Ellipsis (was: RE: Empty set)

2013-09-15 Thread Phillips, Addison

Actually, that's my bad: I meant to type scalar value.

Stephan Stiller  wrote:

On 9/15/2013 3:07 PM, Phillips, Addison wrote:
Not if the limit is counted in characters and not in bytes. Twitter, for 
example, counts code points in the NFC representation of a tweet.
"character", "code point" – these are confusing words :-)

From the link it isn't entirely clear whether they
(a) count scalar values of NFC or
(b) count code points of NFC.

That's why I think it's bad to write "code point" when "scalar value" is 
intended.

Stephan

Re: Origin of Ellipsis (was: RE: Empty set)

2013-09-15 Thread Stephan Stiller


On 9/15/2013 3:07 PM, Phillips, Addison wrote:
Not if the limit is counted in characters and not in bytes. Twitter, 
for example, counts code points in the NFC representation of a tweet.

"character", "code point" – these are confusing words :-)

From the link it isn't entirely clear whether they
(a) count scalar values of NFC /or/
(b) count code points of NFC.

That's why I think it's bad to write "code point" when "scalar value" is 
intended.


Stephan

Re: Origin of Ellipsis (was: RE: Empty set)

2013-09-15 Thread Phillips, Addison

Not if the limit is counted in characters and not in bytes. Twitter, for 
example, counts code points in the NFC representation of a tweet.

Doug Ewell  wrote:

Andre Schappo wrote:

> U+2026 is useful for microblogs when one is looking to save characters

Not if the microblog is in UTF-8, as almost all are.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell

Re: Origin of Ellipsis

2013-09-15 Thread Stephan Stiller


On 9/15/2013 1:04 PM, Doug Ewell wrote:

André Schappo wrote:

U+2026 is useful for microblogs when one is looking to save characters

Not if the microblog is in UTF-8, as almost all are.


That's an astute observation, but André was talking about input limits
https://dev.twitter.com/docs/counting-characters ,
not backend/database space.
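
The two points can be made concrete with a small Python sketch (not from the thread; the helper name utf8_len is hypothetical). It compares U+2026 with three U+002E characters, counted both as characters and as UTF-8 bytes.

```python
def utf8_len(s: str) -> int:
    """Number of bytes in the UTF-8 encoding of s (illustrative helper)."""
    return len(s.encode("utf-8"))


ellipsis = "\u2026"   # HORIZONTAL ELLIPSIS
three_dots = "..."    # three U+002E FULL STOP

print("ellipsis:  ", len(ellipsis), "character(s),", utf8_len(ellipsis), "byte(s)")
print("three dots:", len(three_dots), "character(s),", utf8_len(three_dots), "byte(s)")
```

Against a limit counted in characters, the single ellipsis saves two; in UTF-8 storage, both forms take three bytes.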

Stephan

Re: Origin of Ellipsis (was: RE: Empty set)

2013-09-15 Thread Doug Ewell


Andre Schappo wrote:


U+2026 is useful for microblogs when one is looking to save characters


Not if the microblog is in UTF-8, as almost all are.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell

Re: Origin of Ellipsis (was: RE: Empty set)

2013-09-15 Thread Philippe Verdy

Do you mean saving two characters for posting to Twitter? Well, maybe, but
Twitter clearly does not promote correct typography, nor even correct
orthography. It is clearly not a good model for publishing.

But given the history of this character, I just wonder why it was not
mapped along with the East Asian compatibility punctuation, where it should
always have been. Many fonts have ignored this history and the intent
of compatibility with legacy CJK codepages. So not only did they use
incorrect metrics for use with other scripts, but they also did not honor
the metrics of these CJK scripts. This is now a character which we should
not use at all, as it does not even work as intended in any context (except
for those similar to tweets).

If there's something to do now (given it is no longer used in CJK
contexts), it's to strongly recommend that fonts map them to exactly the
same glyph as the one obtained by aligning three periods in a row without
any additional space or kerning. And maybe demand that renderers ignore
these font mappings and systematically replace it with three separate
periods so that they can properly apply correct justification and glyph
metrics, with at least two branches depending on the previous glyph (CJK or
not, and possibly: if CJK, half-width or fullwidth; otherwise look at the font
metrics of the previous glyph to see if it's monospaced or not and, if not,
replace it with 3 standard periods).

Those users who want more spacing between the dots of an ellipsis should
have to use explicit spacing in their encoded texts. And those who want
less spacing should use ligature control such as ZWJ between standard
periods as well. This character should clearly be deprecated for all
uses except CJK contexts, and should probably even be dropped from mappings
in most fonts (except CJK or monospaced fonts).

2013/9/15 Andre Schappo 

>
>  On 13 Sep 2013, at 20:02, Whistler, Ken wrote:
>
>
> The *interesting* question, in my opinion, is why folks feel impelled to
> use
> U+2026 to render a baseline ellipsis in Latin typography at all, rather
> than
> just using U+002E ad libitum...
>
> --Ken
>
>
>  U+2026 is useful for microblogs when one is looking to save characters
>
>  André
>
>
>

Re: Origin of Ellipsis (was: RE: Empty set)

2013-09-15 Thread Andre Schappo


On 13 Sep 2013, at 20:02, Whistler, Ken wrote:

The *interesting* question, in my opinion, is why folks feel impelled to use
U+2026 to render a baseline ellipsis in Latin typography at all, rather than
just using U+002E ad libitum...

--Ken

U+2026 is useful for microblogs when one is looking to save characters

André

Re: Origin of Ellipsis and double spacing after a sentence.

2013-09-15 Thread Stephan Stiller


On 9/14/2013 6:24 AM, Michael Everson wrote:

It facilitates comment by those who are reviewing the text.
If you add proofreaders' marks to an especially difficult manuscript, 
maybe. I've rarely seen annotated papers with comments that would not 
have fit into the margins, and there's still the back (oh no! in that 
case you'll need to remember to hand-photocopy such a page, if you need 
to photocopy the annotations and corrections for some reason). In the 
majority of cases they would have fit comfortably. For the small number 
of cases where they wouldn't, everyone keep in mind that "space for 
comments" isn't the only factor: being able to go back and forth easily 
to refer to and remind oneself of other portions of the text can become a 
nuisance if what feels like a short paper is printed on too large a pile 
of pages.


On 9/14/2013 11:11 AM, Jim Allan wrote:
See http://www.heracliteanriver.com/?p=324 which claims with numerous 
examples that Michael Everson is totally wrong.
I have laid out my opinions (of varying strength) about typographic 
matters, but calling someone "totally wrong" to me demonstrates more 
emotion than there should be; the linked-to article is brilliant, but 
its use of the word "lie" (too easily understood as ascribing malicious 
intent, as opposed to the mindless propagation of false information) 
distracts from its excellent factual information and the good intuition 
and opinions of the author. And I'm not sure about those "couple dozen 
different types of spaces" that "Unicode implements" according to the 
article (I thought there's just about two dozen).


On 9/14/2013 11:44 AM, Michael Everson wrote:

It's what I was taught.
Probably my favorite non-argument, and even as an excuse it's still 
ultra-lame.


On 9/14/2013 12:04 PM, Asmus Freytag wrote:
But reviewing hardcopy is on its way out, so even this issue will 
disappear...
Except now we need to wait for it to dissipate from university thesis 
requirements. I can't resist pointing the list to what Peter Wilson 
wrote in the manual to his "memoir" document class for LaTeX. I see its 
latest version here

http://www.tex.ac.uk/ctan/macros/latex/contrib/memoir/memman.pdf .
My experience resonates with his comments at the beginning of sec 3.3.2 
("Double spacing") and the chapter frontmatter and section 21.4 
("Comments") within his ch 21.


On 9/14/2013 12:19 PM, Michael Everson wrote:

And as a book designer and publisher, I think that having large spaces after a 
full stop is both unnecessary and vulgar.

On 9/14/2013 11:18 PM, Michael Everson wrote:

This does not change my view. Unnecessary and vulgar.
Maybe – maybe not. What is "vulgar" intended to convey? Where is the 
rationale for either view? The blog article has excellent reasoning, for 
example.


On 9/14/2013 1:09 PM, Philippe Verdy wrote:

the formation of infamous vertical "rivers" across lines of text
Obviously larger inter-sentence spacing gives the reader more hints at 
the text's discourse structure except where a sentence ends at the end 
of a line. It seems hard to believe that the supposedly "ugly" or 
"vulgar" look of holes or typographic rivers distracts enough to 
negatively outweigh double sentence spacing. (So I disagree with the 
article's implications here.) Can anyone /prove/ to me that rivers 
actually matter unless you're bored or tired enough to seek meaning in 
pattern search on a randomly typeset page? In any case, I think it's 
important to keep oneself lucid and unemotional about what's presently 
done and then make decisions.


On 9/14/2013 1:09 PM, Philippe Verdy wrote:
These questions are not just about "esthetics", but about preserving 
the average blackness of lines to guide the eye for easier and faster 
reading, and to make sure that important punctuation will be easily 
distinguished (because it guides the "rhythm" with which the text 
should be read aloud; imagine you're reading the text to a 
public with a clear voice, for better understanding: this is not an 
obvious practice, and good readers are rare who can convey to their 
audience the substance of the text with emotion and strength as it 
could have been intended by the author, better exhibiting his choice 
of words).
With respect to your wide knowledge, we're entering the world of 
speculation here. People who know about the typographic variation seen 
across the world's languages and typographic cultures (locales) should 
know that a lot of factors matter for the legibility of a text.


On 9/14/2013 6:37 PM, Asmus Freytag wrote:

On 9/14/2013 1:24 PM, Philippe Verdy wrote:
Lots of paper hardcopies are used every day in every organisation, 
and notably in those working on legal texts.
Lawyers also think that WRITING IN ALL UPPERCASE SOMEHOW MAKES PEOPLE 
BE ABLE TO READ THINGS BETTER. Dunno, I'd stick with typographers and 
book designers...
Lawyers also waste plenty of paper with the multiplication of documents 
whose precise wording tends to matter onl
