Re: Origin of Ellipsis (was: RE: Empty set)
Doug wrote me:
> You're not confusing "code point" with "code unit," are you?

Thanks for the note. I think what you say is that I thought (or meant to write) "by first representing the sequence of scalar values in an encoding form and then counting [code points typecast from] code _units_". I think you are right, but there are some points of confusion, see below.

Somehow I thought of "surrogate pair" as "pair of (surrogate) code points" instead of "pair of (surrogate) code units". I guess that additional level of indirection would make my interpretation (b) unlikely ... I think my statement is still technically correct because counting code points for UTF-16 and code units for UTF-16 leads to the same count.

What's confusing is a term like "high-surrogate code point" (see glossary). If surrogate code points are not encoded, then they practically don't exist in the ontology of Unicode terms, aside from being holes in the scalar value range, if thought of as a subrange of the integers.

In detail: The glossary defines "surrogate code point" as:
"A Unicode code point in the range U+D800..U+DFFF. Reserved _for use_ by UTF-16, where _a pair of surrogate code units_ (a high surrogate followed by a low surrogate) “stand in” for a supplementary code point."
This definition doesn't say much; it says the code _points_ are "for _use_ by UTF-16", but then UTF-16 uses surrogate code units, not surrogate code points.

C1 in TUS §3.2 says:
"The high-surrogate and low-surrogate code _points_ _are designated for_ surrogate code _units_ in the UTF-16 character encoding form."
But the actual definitions used for UTF-16 don't seem to conceptually _derive_ "surrogate code unit" from "surrogate code point".
=> ??

Still, I don't understand why people keep talking about code points. For me conceptually (albeit not historically) everything starts with scalar values (which are index values for certain abstract things). Scalar values are then encoded by encoding forms (and then serialized in encoding schemes). Why does everyone talk about the more generic "code point" instead of "scalar value", when non-scalar-value code points aren't used? (Because we're not using surrogate code point pairs, we're instead using surrogate code unit pairs.)

Anyways, I understand that KenW and Mark Davis have pointed to earlier debates on this in an earlier thread.

Stephan
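For readers following the terminology, the difference between counting scalar values and counting UTF-16 code units can be made concrete. This is a minimal Python sketch, not anything taken from the Unicode Standard's text: the helper names `is_scalar_value`, `count_scalar_values` and `count_utf16_code_units` are made up for illustration, and the sketch assumes Python 3, where a `str` is a sequence of code points.

```python
def is_scalar_value(cp: int) -> bool:
    """True for any code point outside the surrogate range U+D800..U+DFFF."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def count_scalar_values(text: str) -> int:
    """Count the elements of a Python str that are scalar values."""
    return sum(1 for ch in text if is_scalar_value(ord(ch)))


def count_utf16_code_units(text: str) -> int:
    """Encode to UTF-16 (little-endian, no BOM) and count 16-bit code units."""
    return len(text.encode("utf-16-le")) // 2


# LATIN SMALL LETTER A, HORIZONTAL ELLIPSIS, MUSICAL SYMBOL G CLEF
sample = "a\u2026\U0001D11E"

print(count_scalar_values(sample))     # scalar values in the sample
print(count_utf16_code_units(sample))  # larger: U+1D11E needs a surrogate pair
```

The two counts agree for text confined to the BMP and diverge as soon as a supplementary character appears, since each one is represented by two surrogate code units in UTF-16.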
Re: Origin of Ellipsis (was: RE: Empty set)
> Stephan Stiller wrote:
>> From the link it isn't entirely clear whether they (a) count scalar values of NFC or (b) count code points of NFC.
>
> Are they not the same thing, except for surrogates?

Conceptually no, but numerically yes – you are right in that regard, and I wasn't precise in my description of (b).

I suppose if you read their description literally (they say they use UTF-8 internally), it follows that they're forbidding surrogates, because these are invalid in UTF-8. (Is this what they're doing? I guess the answer wouldn't matter for someone who only produces Tweets properly composed of a sequence of scalar values.) Then, when they write that "Twitter also counts the number of codepoints in the text rather than UTF-8 bytes", it makes me wonder whether they're maybe handling the data in UTF-16 in the relevant procedure that checks for length.

The elementary unit of abstract "text" is for me the scalar value. When they write "code point", that means they've just implicitly typecast from "scalar value" to "code point", and the question is how the typecasting was performed: by directly interpreting the scalar values as numbers of type "code point" or by first representing the sequence of scalar values in an encoding form and then counting code points? My assumption would naturally be the former, which would also be consistent with vulgar :-) (popular) use of these terms – but I had to read Twitter's description a couple of times to make sense of it.

Stephan
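The counting rule being discussed (normalize to NFC, then count code points) can be sketched in a few lines of Python. This is only an illustration of the rule as described in this thread, not Twitter's actual implementation; the function name `count_nfc_code_points` is hypothetical.

```python
import unicodedata


def count_nfc_code_points(text: str) -> int:
    """Normalize to NFC, then count code points (len of a Python 3 str)."""
    return len(unicodedata.normalize("NFC", text))


decomposed = "cafe\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

print(len(decomposed))                    # code points before normalization
print(count_nfc_code_points(decomposed))  # one fewer: NFC composes U+00E9
print(len(unicodedata.normalize("NFC", decomposed).encode("utf-8")))  # UTF-8 bytes
```

For well-formed input that contains no lone surrogates, the result is the same whether one calls the counted items code points or scalar values, which is the "numerically yes" part of the reply above.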
Re: Origin of Ellipsis (was: RE: Empty set)
On Sun, Sep 15, 2013 at 09:21:47PM +0200, Philippe Verdy wrote:
> If there's something to do now (given it is no longer used in CJK
> contexts), it's to strongly recommend that fonts map them to exactly the
> same glyph as the one obtained by aligning three periods in a row without
> any additional space or kerning.

… unless preceded or followed by a period… .

AND: your “with no additional kerning” should better be read as “exactly the same kerning as the kerning for the sequence of dots, which must be tuned to follow the typographic tradition of the script/language”.

Ilya
Re: Origin of Ellipsis (was: RE: Empty set)
Addison Phillips wrote:
> Not if the limit is counted in characters and not in bytes. Twitter, for example, counts code points in the NFC representation of a tweet.

You're right. I take that back, about Twitter at least.

Stephan Stiller wrote:
> From the link it isn't entirely clear whether they (a) count scalar values of NFC or (b) count code points of NFC.

Are they not the same thing, except for surrogates?

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell
Re: Origin of Ellipsis (was: RE: Empty set)
From: Doug Ewell
Date: Sun, 15 Sep 2013 14:04:05 -0600

> Andre Schappo wrote:
>> U+2026 is useful for microblogs when one is looking to save characters
> Not if the microblog is in UTF-8, as almost all are.

Why not just type: . . . (I suppose this fails too, as now the ellipsis can break at line breaks).
(In HTML code it works of course: . . Note that the pre tags are just to prevent the nbsp s from getting converted to spaces.)

Best,
--C. E. Whitehead
cewcat...@hotmail.com

> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell
Re: Origin of Ellipsis (was: RE: Empty set)
Actually, that's my bad: I meant to type scalar value.

Stephan Stiller wrote:
> On 9/15/2013 3:07 PM, Phillips, Addison wrote:
>> Not if the limit is counted in characters and not in bytes. Twitter, for example, counts code points in the NFC representation of a tweet.
>
> "character", "code point" – these are confusing words :-)
>
> From the link it isn't entirely clear whether they (a) count scalar values of NFC or (b) count code points of NFC.
>
> That's why I think it's bad to write "code point" when "scalar value" is intended.
>
> Stephan
Re: Origin of Ellipsis (was: RE: Empty set)
On 9/15/2013 3:07 PM, Phillips, Addison wrote:
> Not if the limit is counted in characters and not in bytes. Twitter, for example, counts code points in the NFC representation of a tweet.

"character", "code point" – these are confusing words :-)

From the link it isn't entirely clear whether they (a) count scalar values of NFC /or/ (b) count code points of NFC.

That's why I think it's bad to write "code point" when "scalar value" is intended.

Stephan
Re: Origin of Ellipsis (was: RE: Empty set)
Not if the limit is counted in characters and not in bytes. Twitter, for example, counts code points in the NFC representation of a tweet.

Doug Ewell wrote:
> Andre Schappo wrote:
>> U+2026 is useful for microblogs when one is looking to save characters
>
> Not if the microblog is in UTF-8, as almost all are.
>
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell
Re: Origin of Ellipsis
On 9/15/2013 1:04 PM, Doug Ewell wrote:
> André Schappo wrote:
>> U+2026 is useful for microblogs when one is looking to save characters
>
> Not if the microblog is in UTF-8, as almost all are.

That's an astute observation, but André was talking about input limits https://dev.twitter.com/docs/counting-characters , not backend/database space.

Stephan
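Both remarks in this exchange can be checked with a short Python sketch. It only measures the two spellings of an ellipsis under the two metrics mentioned (counted characters versus UTF-8 storage) and makes no claim about how any particular microblog enforces its limit; the names `code_point_count` and `utf8_byte_count` are invented for the example.

```python
ELLIPSIS = "\u2026"    # HORIZONTAL ELLIPSIS
THREE_PERIODS = "..."  # FULL STOP (U+002E) three times


def code_point_count(text: str) -> int:
    """Number of code points in a Python 3 str."""
    return len(text)


def utf8_byte_count(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


for label, s in (("U+2026", ELLIPSIS), ("three periods", THREE_PERIODS)):
    print(label, code_point_count(s), utf8_byte_count(s))
```

Measured in UTF-8 bytes, the two spellings cost the same (U+2026 encodes as E2 80 A6), so nothing is saved in storage. Measured in code points, U+2026 counts once and three periods count three times, which is where the saving against an input limit comes from.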
Re: Origin of Ellipsis (was: RE: Empty set)
Andre Schappo wrote:
> U+2026 is useful for microblogs when one is looking to save characters

Not if the microblog is in UTF-8, as almost all are.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell
Re: Origin of Ellipsis (was: RE: Empty set)
Do you mean saving two characters for posting to Twitter? Well, maybe, but Twitter clearly does not promote correct typography, or even correct orthography. It is clearly not a good model for publishing.

But given the history of this character, I just wonder why it was not mapped along with East Asian compatibility punctuation, where it should always have been. And many fonts have ignored this history and the intent of compatibility with legacy CJK codepages. So not only did they use incorrect metrics for use with other scripts, but they also did not honor the metrics of these CJK scripts.

This is now a character which we should not use at all, as it does not even work as intended in any context (except for those similar to tweets).

If there's something to do now (given it is no longer used in CJK contexts), it's to strongly recommend that fonts map it to exactly the same glyph as the one obtained by aligning three periods in a row without any additional space or kerning. And maybe demand that renderers ignore these font mappings and systematically replace it with three separate periods so that they can properly apply correct justification and glyph metrics, with at least two branches depending on the previous glyph (CJK or not, and possibly: if CJK, half-width or fullwidth; otherwise look at the font metrics of the previous glyph to see whether it's monospaced, and if not, replace it with 3 standard periods).

Those users who want more spacing between the dots of an ellipsis should have to use explicit spacing in their encoded texts. And those who want less spacing should use ligature control, such as ZWJ between standard periods, as well.

This character should clearly be deprecated for all uses except CJK contexts, and should probably even be dropped from mappings in most fonts (except CJK or monospaced fonts).

2013/9/15 Andre Schappo

> On 13 Sep 2013, at 20:02, Whistler, Ken wrote:
>
>> The *interesting* question, in my opinion, is why folks feel impelled to use
>> U+2026 to render a baseline ellipsis in Latin typography at all, rather than
>> just using U+002E ad libitum...
>>
>> --Ken
>
> U+2026 is useful for microblogs when one is looking to save characters
>
> André
Re: Origin of Ellipsis (was: RE: Empty set)
On 13 Sep 2013, at 20:02, Whistler, Ken wrote:
> The *interesting* question, in my opinion, is why folks feel impelled to use
> U+2026 to render a baseline ellipsis in Latin typography at all, rather than
> just using U+002E ad libitum...
>
> --Ken

U+2026 is useful for microblogs when one is looking to save characters

André
Re: Origin of Ellipsis and double spacing after a sentence.
On 9/14/2013 6:24 AM, Michael Everson wrote:
> It facilitates comment by those who are reviewing the text.

If you add proofreaders' marks to an especially difficult manuscript, maybe. I've hardly ever seen annotated papers with comments that would not have fit into the margins, and there's still the back (oh no! in that case you'll need to remember to hand-photocopy such a page, if you need to photocopy the annotations and corrections for some reason). In the majority of cases they would have fit comfortably. For the small number of cases where they wouldn't, everyone should keep in mind that "space for comments" isn't the only factor: being able to go back and forth easily to refer to and remind oneself of other portions of the text can become a nuisance if what feels like a short paper is printed on too large a pile of pages.

On 9/14/2013 11:11 AM, Jim Allan wrote:
> See http://www.heracliteanriver.com/?p=324 which claims with numerous examples that Michael Everson is totally wrong.

I have laid out my opinions (of varying strength) about typographic matters, but calling someone "totally wrong" to me demonstrates more emotion than there should be; the linked-to article is brilliant, but its use of the word "lie" (too easily understood as ascribing malicious intent, as opposed to the mindless propagation of false information) distracts from its excellent factual information and the good intuition and opinions of the author. And I'm not sure about those "couple dozen different types of spaces" that "Unicode implements" according to the article (I thought there's just about two dozen).

On 9/14/2013 11:44 AM, Michael Everson wrote:
> It's what I was taught.

Probably my favorite non-argument, and even as an excuse it's still ultra-lame.

On 9/14/2013 12:04 PM, Asmus Freytag wrote:
> But reviewing hardcopy is on its way out, so even this issue will disappear...

Except now we need to wait for it to dissipate from university thesis requirements.
I can't resist pointing the list to what Peter Wilson wrote in the manual to his "memoir" document class for LaTeX. I see its latest version here http://www.tex.ac.uk/ctan/macros/latex/contrib/memoir/memman.pdf . My experience resonates with his comments at the beginning of sec 3.3.2 ("Double spacing") and the chapter frontmatter and section 21.4 ("Comments") within his ch 21.

On 9/14/2013 12:19 PM, Michael Everson wrote:
> And as a book designer and publisher, I think that having large spaces after a full stop is both unnecessary and vulgar.

On 9/14/2013 11:18 PM, Michael Everson wrote:
> This does not change my view. Unnecessary and vulgar.

Maybe – maybe not. What is "vulgar" intended to convey? Where is the rationale for either view? The blog article has excellent reasoning, for example.

On 9/14/2013 1:09 PM, Philippe Verdy wrote:
> the formation of infamous vertical "rivers" across lines of text

Obviously larger inter-sentence spacing gives the reader more hints at the text's discourse structure except where a sentence ends at the end of a line. It seems hard to believe that the supposedly "ugly" or "vulgar" look of holes or typographic rivers distracts enough to outweigh the benefit of double sentence spacing. (So I disagree with the article's implications here.) Can anyone /prove/ to me that rivers actually matter unless you're bored or tired enough to seek meaning in pattern search on a randomly typeset page?

In any case, I think it's important to keep oneself lucid and unemotional about what's presently done and then make decisions.
On 9/14/2013 1:09 PM, Philippe Verdy wrote:
> These questions are not just about "esthetic", but about preserving the average blackness of lines to guide the eye for easier and faster reading, and to make sure that important punctuation will be easily distinguished (because they guide the "rhythm" with which the text should be clearly read by speech (imagine you're reading the text to a public with clear voice, for better understanding: this is not an evident practice, good readers are rare that can translate to their auditory the substance of the text with emotion and strength as it could have been intended by the author, better exhibiting his choice of words).

With respect to your wide knowledge, we're entering the world of speculation here. People who know about the typographic variation seen across the world's languages and typographic cultures (locales) should know that a lot of factors matter for the legibility of a text.

On 9/14/2013 6:37 PM, Asmus Freytag wrote:
> On 9/14/2013 1:24 PM, Philippe Verdy wrote:
>> Lots of paper hardcopies are used everyday in every organisation, and notably in those working on legal texts.
>
> Lawyers also think that WRITING IN ALL UPPERCASE SOMEHOW MAKES PEOPLE BE ABLE TO READ THINGS BETTER. Dunno, I'd stick with typographers and book designers...

Lawyers also waste plenty of paper with the multiplication of documents whose precise wording tends to matter onl