Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On Thu, 19 Jul 2018 20:34:26 +0200, Christian Gollwitzer wrote:

> On 19.07.2018 at 14:50, Gregory Ewing wrote:
>> Chris Angelico wrote:
>>> On Thu, Jul 19, 2018 at 4:41 PM, Gregory Ewing wrote:
>>>> (Google doesn't seem to think so -- it asks me whether I meant
>>>> "assist shop". Although it does offer to translate it into Czech...)
>>>
>>> Into or from?? I'm thoroughly confused now!
>>
>> Hard to tell. This is what the link said:
>>
>> assistshop - Czech translation - bab.la English-Czech dictionary
>> https://en.bab.la/dictionary/english-czech/assistshop Translation for
>> 'assistshop' in the free English-Czech dictionary and many other Czech
>> translations.
>
> Well that link tries to translate "assistshop" into the Czech word
> "prodavač" which is the usual word for a person in a shop who consults
> the customers and sells the goods to them; I don't know if "assist shop"
> in English comes close, as I don't understand it (I'm a native German
> speaker)

In English, that would be "shop assistant". "Assist shop" would be grammatically incorrect: it should be written as "assist the shop", meaning "help the shop".

Relevant: https://www.theatlantic.com/technology/archive/2018/01/the-shallowness-of-google-translate/551570/

--
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing it everywhere." -- Jon Ronson
--
https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On 19.07.2018 at 14:50, Gregory Ewing wrote:
> Chris Angelico wrote:
>> On Thu, Jul 19, 2018 at 4:41 PM, Gregory Ewing wrote:
>>> (Google doesn't seem to think so -- it asks me whether I meant
>>> "assist shop". Although it does offer to translate it into Czech...)
>>
>> Into or from?? I'm thoroughly confused now!
>
> Hard to tell. This is what the link said:
>
> assistshop - Czech translation - bab.la English-Czech dictionary
> https://en.bab.la/dictionary/english-czech/assistshop Translation for
> 'assistshop' in the free English-Czech dictionary and many other Czech
> translations.

Well that link tries to translate "assistshop" into the Czech word "prodavač" which is the usual word for a person in a shop who consults the customers and sells the goods to them; I don't know if "assist shop" in English comes close, as I don't understand it (I'm a native German speaker)

Christian
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
Chris Angelico wrote:
> On Thu, Jul 19, 2018 at 4:41 PM, Gregory Ewing wrote:
>> (Google doesn't seem to think so -- it asks me whether I meant
>> "assist shop". Although it does offer to translate it into Czech...)
>
> Into or from?? I'm thoroughly confused now!

Hard to tell. This is what the link said:

assistshop - Czech translation - bab.la English-Czech dictionary
https://en.bab.la/dictionary/english-czech/assistshop
Translation for 'assistshop' in the free English-Czech dictionary and many other Czech translations.

--
Greg
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
It's also thoroughly time to give this thread a well-deserved rest. RIP

Abdur-Rahmaan Janhangeer
https://github.com/Abdur-rahmaanJ

> Into or from?? I'm thoroughly confused now!
>
> ChrisA
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On Thu, Jul 19, 2018 at 4:41 PM, Gregory Ewing wrote: > Stefan Ram wrote: >> >> »assistshop«, > > > Is that a word? > > (Google doesn't seem to think so -- it asks me whether > I meant "assist shop". Although it does offer to translate > it into Czech...) > Into or from?? I'm thoroughly confused now! ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
Stefan Ram wrote:
> »assistshop«,

Is that a word?

(Google doesn't seem to think so -- it asks me whether I meant "assist shop". Although it does offer to translate it into Czech...)

--
Greg
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
Stefan Ram wrote:
> Gregory Ewing writes:
>> That's debatable. I've never thought of it that way and I'm fairly
>> certain I don't pronounce it that way. My tongue does not do the same
>> thing when I say "ch" as it does when I say "tsh".
>
> archives   ˈɑɚ kɑɪvz (n)
> bachelor   ˈbæʧ lɚ (n)
> machine    məˈʃin
> cash       kæʃ
> dachshund  ˈdɑks ˌhʊnt

I'm talking specifically about the "ch" sound in "bachelor", "change", etc. It sounds and feels like a single sound to me.

--
Greg
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
MRAB wrote:
> "ch" usually represents 2 phonemes, basically the sounds of "t"
> followed by "sh";

That's debatable. I've never thought of it that way and I'm fairly certain I don't pronounce it that way. My tongue does not do the same thing when I say "ch" as it does when I say "tsh".

--
Greg
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On 18-07-18 10:07, Marko Rauhamaa wrote: >> Sure there were some surprises or gotcha's, but the result was still >> better than doing it in python2 and they were easier to deal with than >> in python2. > BTW, in those needs, even Python2 has Unicode strings and unicodedata at > your disposal. Sure, just as there are byte strings at your disposal in python3. I also don't think using u'...' in python2 is less ugly than using b'...' in python3. -- Antoon. -- https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
Antoon Pardon : > On 17-07-18 14:22, Marko Rauhamaa wrote: >> If you assume that NFC normalizes every letter to a single codepoint >> (and carefully use NFC everywhere), you are right. But equally likely >> you may inadvertently be setting yourself up for a surprise. > > You are moving the goal post. I didn't claim there were no surprises. > I only claim that in the end combining regular expressions and working > with multiple languages ended up being far easier with python3 strings > than with python2 strings. Fair enough. > Sure there were some surprises or gotcha's, but the result was still > better than doing it in python2 and they were easier to deal with than > in python2. BTW, in those needs, even Python2 has Unicode strings and unicodedata at your disposal. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On 17-07-18 14:22, Marko Rauhamaa wrote: > Antoon Pardon : > >> On 17-07-18 10:27, Marko Rauhamaa wrote: >>> Also, Python2's strings do as good a job at delivering codepoints as >>> Python3. >> No they don't. The programs that I work on, need to be able to treat >> at least german, french, dutch and english text. My experience is that >> in python3 it is way easier to do things right. Especially if you are >> working with regular expressions. > If you assume that NFC normalizes every letter to a single codepoint > (and carefully use NFC everywhere), you are right. But equally likely > you may inadvertently be setting yourself up for a surprise. You are moving the goal post. I didn't claim there were no surprises. I only claim that in the end combining regular expressions and working with multiple languages ended up being far easier with python3 strings than with python2 strings. Sure there were some surprises or gotcha's, but the result was still better than doing it in python2 and they were easier to deal with than in python2. -- Antoon. -- https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On 17/07/18 19:16, Marko Rauhamaa wrote:
> MRAB:
>> "ch" usually represents 2 phonemes, basically the sounds of "t"
>> followed by "sh";
>
> Traditionally, that sound is considered a single phoneme:
>
>    <https://en.wikipedia.org/wiki/Affricate_consonant>
>
> Can you hear the difference in these expressions:
>
>    high chairs
>    height shares
>    height chairs
>
> Try them on an English-speaking person. In a restaurant, ask for a
> "height share" and see if they bring you a high chair.
>
> The English "tr" sound can also be considered a single affricate phoneme:
>
>    <https://en.wikipedia.org/wiki/Voiceless_postalveolar_affricate>
>
> Is there a difference between these expressions:
>
>    rye train
>    right rain
>    right train
>
> Marko

I do not see what this has to do with the Python programming language, neither do I care. Please take this offline, as you've already been asked to do by a moderator, Tim Golden.

--
My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language.

Mark Lawrence
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On 17/07/18 19:16, Marko Rauhamaa wrote:
> MRAB:
>> "ch" usually represents 2 phonemes, basically the sounds of "t"
>> followed by "sh";
>
> Traditionally, that sound is considered a single phoneme:
>
>    <https://en.wikipedia.org/wiki/Affricate_consonant>

To quote the introduction of that article, "It is often difficult to decide if a stop and fricative form a single phoneme or a consonant pair." I'm afraid your bold assertion is more than a bit arguable.

> Can you hear the difference in these expressions:
>
>    high chairs
>    height shares
>    height chairs

Yes, but then I'm a trained singer.

> Try them on an English-speaking person. In a restaurant, ask for a
> "height share" and see if they bring you a high chair.

That's a different effect. Listeners will often subconsciously make small "corrections" to what they hear to bring it into context. It is particularly noticeable in experiments where one person repeats what another says while they are still speaking -- effectively simultaneous translation without the translation part :-) The person repeating will correct small mistakes in what was originally said without ever noticing the error. (Google is being annoying and not supplying me with the information, but I know there have been papers on this.)

> The English "tr" sound can also be considered a single affricate phoneme:
>
>    <https://en.wikipedia.org/wiki/Voiceless_postalveolar_affricate>
>
> Is there a difference between these expressions:
>
>    rye train
>    right rain
>    right train

Again, yes. Very much so this time.

--
Rhodri James *-* Kynesim Ltd
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
MRAB:
> "ch" usually represents 2 phonemes, basically the sounds of "t"
> followed by "sh";

Traditionally, that sound is considered a single phoneme:

   <https://en.wikipedia.org/wiki/Affricate_consonant>

Can you hear the difference in these expressions:

   high chairs
   height shares
   height chairs

Try them on an English-speaking person. In a restaurant, ask for a "height share" and see if they bring you a high chair.

The English "tr" sound can also be considered a single affricate phoneme:

   <https://en.wikipedia.org/wiki/Voiceless_postalveolar_affricate>

Is there a difference between these expressions:

   rye train
   right rain
   right train

Marko
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On 2018-07-17 03:25, Tim Chase wrote:
> On 2018-07-17 01:08, Steven D'Aprano wrote:
>> In English, I think most people would prefer to use a different term
>> for whatever "sh" and "ch" represent than "character".
>
> The term you may be reaching for is "consonant cluster"?
>
> https://en.wikipedia.org/wiki/Consonant_cluster

They are digraphs, 2 characters that are treated as a single unit.

As it says in the first paragraph: "a consonant cluster, consonant sequence or consonant compound is a group of consonants which have no intervening vowel."

"sh" is a single phoneme (sound) that happens to be written in English with 2 letters. "ch" usually represents 2 phonemes, basically the sounds of "t" followed by "sh"; other times it's "k" (e.g. in "echo"); occasionally it's "sh" (e.g. in "champagne").
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
Antoon Pardon : > On 17-07-18 10:27, Marko Rauhamaa wrote: >> Also, Python2's strings do as good a job at delivering codepoints as >> Python3. > > No they don't. The programs that I work on, need to be able to treat > at least german, french, dutch and english text. My experience is that > in python3 it is way easier to do things right. Especially if you are > working with regular expressions. If you assume that NFC normalizes every letter to a single codepoint (and carefully use NFC everywhere), you are right. But equally likely you may inadvertently be setting yourself up for a surprise. Marko -- https://mail.python.org/mailman/listinfo/python-list
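The caveat about NFC above can be checked with a minimal sketch, assuming Python 3 and the standard `unicodedata` module. The helper name `nfc_length` and the two sample strings are chosen here for illustration and do not come from the thread.

```python
import unicodedata


def nfc_length(text: str) -> int:
    """Number of code points left after NFC normalisation."""
    return len(unicodedata.normalize("NFC", text))


# "e" + COMBINING ACUTE ACCENT: a precomposed code point (U+00E9) exists.
E_ACUTE = "e\u0301"
# "q" + COMBINING TILDE: Unicode defines no precomposed code point for it.
Q_TILDE = "q\u0303"

for label, text in (("e + acute", E_ACUTE), ("q + tilde", Q_TILDE)):
    print(label, "code points before:", len(text), "after NFC:", nfc_length(text))
```

The first pair collapses to one code point, the second stays at two, so code that assumes "one letter == one code point after NFC" holds for common Western European letters but not in general.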
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On 17-07-18 10:27, Marko Rauhamaa wrote: > Steven D'Aprano : >> On Mon, 16 Jul 2018 21:48:42 -0400, Richard Damon wrote: >>> Who says there needs to be one. A good engineer will use the >>> definition that is most appropriate to the task at hand. Some things >>> need very solid definitions, and some things don’t. >> The the problem is solved: we have a perfectly good de facto definition >> of character: it is a synonym for "code point", and every single one of >> Marko's objections disappears. > I admit it. Python3 is the perfect medium for your codepoint delivery > needs. > > What you don't seem to understand about my objections is that no > programmer needs codepoints per se. Also, Python2's strings do as good a > job at delivering codepoints as Python3. No they don't. The programs that I work on, need to be able to treat at least german, french, dutch and english text. My experience is that in python3 it is way easier to do things right. Especially if you are working with regular expressions. -- Antoon. -- https://mail.python.org/mailman/listinfo/python-list
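The regular-expression point above can be illustrated with a small sketch, assuming Python 3 and only the standard `re` and `unicodedata` modules. The sample text and the `words` helper are invented for illustration; they are not taken from Antoon's programs.

```python
import re
import unicodedata

SAMPLE = "Größe café naïve"  # hypothetical sample text, not from the thread


def words(text: str, ascii_only: bool = False) -> list:
    """Split text into runs of word characters.

    The text is NFC-normalised first, so letters that have a precomposed
    form are matched as a single code point.
    """
    flags = re.ASCII if ascii_only else 0
    return re.findall(r"\w+", unicodedata.normalize("NFC", text), flags)


print(words(SAMPLE))                   # Unicode-aware \w keeps each word whole
print(words(SAMPLE, ascii_only=True))  # ASCII-only \w breaks at ö, ß, é, ï
```

With `str` patterns in Python 3, `\w` is Unicode-aware by default; the `re.ASCII` flag shows roughly what an ASCII-only notion of "word character" would do to the same text.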
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
> On Jul 17, 2018, at 3:44 AM, Steven D'Aprano wrote:
>
> On Mon, 16 Jul 2018 21:48:42 -0400, Richard Damon wrote:
>
>>> On Jul 16, 2018, at 9:21 PM, Steven D'Aprano wrote:
>>>
>>> On Mon, 16 Jul 2018 19:02:36 -0400, Richard Damon wrote:
>>>> You are defining a variable/fixed width codepoint set. Many others
>>>> want to deal with CHARACTER sets.
>>>
>>> Good luck coming up with a universal, objective, language-neutral,
>>> consistent definition for a character.
>>>
>> Who says there needs to be one. A good engineer will use the definition
>> that is most appropriate to the task at hand. Some things need very
>> solid definitions, and some things don’t.
>
> Then the problem is solved: we have a perfectly good de facto definition
> of character: it is a synonym for "code point", and every single one of
> Marko's objections disappears.
>
Which is a ‘changed’ definition! Do you agree that the concept of variable width encoding vastly predates the creation of Unicode? Can you also find any use of the word codepoint that predates the development of Unicode? Code points and code words are an invention of the Unicode consortium, and as such should really only be used in talking about IT and not some other encodings. I believe that Unicode also created the idea of storing composed characters as a series of codepoints instead of it being done in the input routine and the character set needing to define a character code for every needed composed character.

>
>> This goes back to my original point, where I said some people consider
>> UTF-32 as a variable width encoding. For very many things, practically,
>> the ‘codepoint’ isn’t the important thing,
>
> Ah, is this another one of those "let's pick a definition that nobody
> else uses, and state it as a fact" like UTF-32 being variable width?
>
> If by "very many things", you mean "not very many things", I agree with
> you. In my experience, dealing with code points is "good enough",
> especially if you use Western European alphabets, and even more so if
> you're willing to do a normalization step before processing text.
>
AH, that is the rub, you only deal with the parts of Unicode that are simple and regular. This is EXACTLY the issue that you blame people who want to use ASCII or Codepages to solve, just the next step in the evolution. One problem with normalization is that for Western European characters it tends to be able to convert every ‘Character’ to a code point, but in some corner cases, especially for other languages, it can’t. I am not just talking about digraphs like ch that have been mentioned, but the real composed characters with a base glyph with marks above/below/embedded on it. Unicode represents many of them with a code point, but nowhere near all of them. If you actually read the Unicode documents, they do talk about Characters, and admit that they aren’t necessarily codepoints, so if you actually want to talk about a CHARACTER set, Unicode, even UTF-32, needs to sometimes be treated as variable width.

> But of course other people's experience may vary. I'm interested in
> learning about the library you use to process graphemes in your software.
>
>
>> so the fact that every UTF-32
>> code point takes the same number of bytes or code words isn’t that
>> important. They are dealing with something that needs to be rendered and
>> preserving larger units, like the grapheme is important.
>
> If you're writing a text widget or a shell, you need to worry about
> rendering glyphs. Everyone else just delegates to their text widget, GUI
> framework, or shell.
>
But someone needs to write that text widget, or it might not do exactly what you want, say wrapping the text around obstacles already placed on the screen/page. And try using that text widget to find the ‘middle’ (as shown) of a text string, (other than iterating with multiple calls to it to try and find it). Unicode made the processing of Codepoints simpler, but made the processing of actual rendered text much more complicated if you want to handle everything right.

>>>> This doesn’t mean that UTF-32 is an awful system, just that it isn’t
>>>> the magical cure that some were hoping for.
>>>
>>> Nobody ever claimed it was, except for the people railing that since it
>>> isn't a magical system we ought to go back to the Good Old Days of
>>> code page hell, or even further back when everyone just used ASCII.
>>>
>> Sometimes ASCII is good enough, especially on a small machine with
>> limited resources.
>
> I doubt that there are many general purpose computers with resources
> *that* limited. Even MicroPython supports Unicode, and that runs on
> embedded devices with memory measured in kilobytes. 8K is considered the
> smallest amount of memory usable with MicroPython, although 128K is more
> realistic as the *practical* lower limit.
>
> In the mid 1980s, I was using computers with 128K of RAM, and they were
> still
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
Chris Angelico:
> On Tue, Jul 17, 2018 at 6:27 PM, Marko Rauhamaa wrote:
>> Of course, UTF-8 doesn't relieve you from Unicode problems. But it has
>> one big advantage: it can usually deal with non-Unicode data without any
>> extra considerations while Python3's strings make you have to take
>> elaborate measures to handle those special cases. Why, even print() must
>> be guarded against UnicodeEncodeError when the printed string is not in
>> the programmer's control.
>
> What is this "non-Unicode data" that UTF-8 can handle? Do you mean
> arbitrary byte sequences? Because no, it cannot; properly-formed UTF-8
> sequences MUST comply with the precise requirements of the format.

I was being imprecise: byte strings carrying UTF-8 can handle bad UTF-8 with equal ease. And that's a real, practical advantage.

> Can you give an example of how Python 3's print function can raise
> UnicodeEncodeError when given a Python 3 string?

    >>> print("\ud810")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-8' codec can't encode character '\ud810' in position 0: surrogates not allowed

Marko
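One way to guard output against the failure shown above is sketched below, assuming Python 3. The helper name `safe_text` is hypothetical, and `backslashreplace` is only one of several standard error handlers that could be used; this is an illustration, not what anyone in the thread proposed.

```python
def safe_text(text: str) -> str:
    """Return a copy of text that a UTF-8 stream can always encode.

    Code points that UTF-8 refuses (lone surrogates) are replaced by
    their backslash escape instead of raising UnicodeEncodeError.
    """
    return text.encode("utf-8", errors="backslashreplace").decode("utf-8")


lone_surrogate = "\ud810"  # the string from the message above

try:
    lone_surrogate.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict encoding fails:", exc.reason)

print(safe_text(lone_surrogate))  # the escape text is printed, nothing is raised
```

The cost is that the output is no longer the original string, so this suits logging and diagnostics better than data that must round-trip.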
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
Chris Angelico : > On Tue, Jul 17, 2018 at 7:03 PM, Marko Rauhamaa wrote: >> What I'd need is for the tty to tell me what column the cursor is >> visually. Or better yet, the tty would have to tell me where the column >> would be *after* I emit the next grapheme cluster. > > Are you prepared for the possibility that emitting characters won't > change what column you're in? Absolutely. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On Tue, Jul 17, 2018 at 7:03 PM, Marko Rauhamaa wrote: > Chris Angelico : > >> On Tue, Jul 17, 2018 at 6:27 PM, Marko Rauhamaa wrote: >>> For me, the issue is where do I produce a line break in my text output? >>> Currently, I'm just counting codepoints to estimate the width of the >>> output. >> >> Well, that's just flat out wrong, then. Counting graphemes isn't going >> to make it any better. Grab a well-known library like Pango and let it >> do your measurements for you, *in pixels*. Or better still, just poke >> your text to a dedicated text-display widget and let it display it >> correctly. > > What I'd need is for the tty to tell me what column the cursor is > visually. Or better yet, the tty would have to tell me where the column > would be *after* I emit the next grapheme cluster. Are you prepared for the possibility that emitting characters won't change what column you're in? Start a new line, then emit one Arabic character. What column are you in? Now emit three more Arabic characters, completing the word. What column? Now emit a U+0020 SPACE. What column? Now emit some Latin characters, followed by more Arabic. Where are you? ChrisA -- https://mail.python.org/mailman/listinfo/python-list
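The column question above comes from bidirectional text. As a rough sketch, assuming Python 3 and the standard `unicodedata` module, the bidirectional class of each code point can be inspected. The sample string and the helper `bidi_classes` are made up here, and the code only reports classes; it does not implement the Unicode bidirectional algorithm.

```python
import unicodedata


def bidi_classes(text: str) -> list:
    """Bidirectional class of every code point in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


# Four Arabic letters, a space, then three Latin letters (hypothetical sample).
sample = "\u0633\u0644\u0627\u0645 abc"
print(bidi_classes(sample))
```

Arabic letters report class `AL` (right-to-left), the space `WS`, and the Latin letters `L`, which is why the visual cursor position after each emitted character depends on what follows it.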
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
Chris Angelico:
> On Tue, Jul 17, 2018 at 6:27 PM, Marko Rauhamaa wrote:
>> For me, the issue is where do I produce a line break in my text output?
>> Currently, I'm just counting codepoints to estimate the width of the
>> output.
>
> Well, that's just flat out wrong, then. Counting graphemes isn't going
> to make it any better. Grab a well-known library like Pango and let it
> do your measurements for you, *in pixels*. Or better still, just poke
> your text to a dedicated text-display widget and let it display it
> correctly.

What I'd need is for the tty to tell me what column the cursor is visually. Or better yet, the tty would have to tell me where the column would be *after* I emit the next grapheme cluster.

The tty *does* know that but I don't know if there is an interface to query it. This doesn't seem to be working properly:

    sys.stdout.write("a\u0300\u001b[6n\n")

(and would be a tricky interface even if it did)

Marko
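For comparison with plain code point counting, here is a heuristic sketch, assuming Python 3 and the standard `unicodedata` module. The function name `estimated_columns` and its rules (combining marks take no column, East Asian wide and fullwidth characters take two) are assumptions made for illustration. It ignores bidirectional text, tabs, control characters and terminal differences, so it remains an estimate of the kind Chris argues against relying on.

```python
import unicodedata


def estimated_columns(text: str) -> int:
    """Rough estimate of terminal columns used by text (heuristic only)."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # assumption: combining marks occupy no column
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # assumption: wide/fullwidth characters take two columns
        else:
            width += 1
    return width


print(estimated_columns("a\u0300"), len("a\u0300"))
```

For the string from the message above, `"a\u0300"`, the estimate is one column while `len()` reports two code points.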
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On Tue, Jul 17, 2018 at 6:27 PM, Marko Rauhamaa wrote:
> It is essential for people to understand that the very same issues that
> plague UTF-8 plague UTF-32 as well. Using UTF in both highlights that
> fact.

What wonderful nonsense. I suppose that the same issues plague Elon Musk as plague the musk sticks in the sweets aisle in the supermarket - they do use the same letters, after all.

>> If by "very many things", you mean "not very many things", I agree
>> with you. In my experience, dealing with code points is "good enough",
>> especially if you use Western European alphabets, and even more so if
>> you're willing to do a normalization step before processing text.
>
> Of course, UTF-8 doesn't relieve you from Unicode problems. But it has
> one big advantage: it can usually deal with non-Unicode data without any
> extra considerations while Python3's strings make you have to take
> elaborate measures to handle those special cases. Why, even print() must
> be guarded against UnicodeEncodeError when the printed string is not in
> the programmer's control.

What is this "non-Unicode data" that UTF-8 can handle? Do you mean arbitrary byte sequences? Because no, it cannot; properly-formed UTF-8 sequences MUST comply with the precise requirements of the format.

Can you give an example of how Python 3's print function can raise UnicodeEncodeError when given a Python 3 string?

ChrisA
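For readers following the "non-Unicode data" exchange, a minimal sketch, assuming Python 3: the standard `surrogateescape` error handler lets arbitrary bytes pass through a `str` and back unchanged. The byte string below is an invented example (a Latin-1 encoded file name), not something from the thread.

```python
raw = b"caf\xe9.txt"  # Latin-1 bytes; not valid UTF-8 (hypothetical example)

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict decoding fails:", exc.reason)

# Undecodable bytes become lone surrogates in the range U+DC80..U+DCFF ...
text = raw.decode("utf-8", errors="surrogateescape")
# ... and are turned back into the original bytes on encoding.
roundtrip = text.encode("utf-8", errors="surrogateescape")

print(ascii(text), roundtrip == raw)
```

The resulting `str` contains a lone surrogate, which is exactly the kind of string that a strict UTF-8 encoder refuses to write.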
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On Tue, Jul 17, 2018 at 6:27 PM, Marko Rauhamaa wrote: >> But of course other people's experience may vary. I'm interested in >> learning about the library you use to process graphemes in your software. > > For me, the issue is where do I produce a line break in my text output? > Currently, I'm just counting codepoints to estimate the width of the > output. Well, that's just flat out wrong, then. Counting graphemes isn't going to make it any better. Grab a well-known library like Pango and let it do your measurements for you, *in pixels*. Or better still, just poke your text to a dedicated text-display widget and let it display it correctly. Back in the early 2000s, I built a program that displayed text in a monospaced font, and it was riddled with assumptions that "one byte == one character == N pixels of width" (for some value of N that changed only when you change font). It was easier to throw it out completely and start over than to try to "bolt on" true Unicode support. The replacement program uses GTK and Pango to do all its display work, and while it still has a lot of complexities (because it has to handle colour codes, highlighting, point-to-word, and such, all of which get very complicated when you mix LTR and RTL text), at least it can 100% dependably say "wrap to this point". For the convenience of the human using it, it specifies a wrap width in characters, but in the fine print, the wrap width is defined as "the width of that many of the letter 'n' in the chosen font". At no point do I ever count bytes, code units, code points, grapheme clusters, or blue-faced baboons, to try to pretend that I know the width of the string. All of them are wrong for the wrapping of text. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
Steven D'Aprano:
> On Tue, 17 Jul 2018 09:52:13 +0300, Marko Rauhamaa wrote:
>
>> Both Python2 and Python3 provide two forms of string, one containing
>> 8-bit integers and another one containing 21-bit integers.
>
> Why do you insist on making counter-factual statements as facts? Don't
> you have a Python REPL you can try these outrageous claims out before
> making them?
>
> [...]
>
> Python strings are sequences of abstract characters.

which -- by your definition -- are codepoints -- which by any definition -- are integers.

Marko
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
Steven D'Aprano:
> On Mon, 16 Jul 2018 21:48:42 -0400, Richard Damon wrote:
>> Who says there needs to be one. A good engineer will use the
>> definition that is most appropriate to the task at hand. Some things
>> need very solid definitions, and some things don’t.
>
> Then the problem is solved: we have a perfectly good de facto definition
> of character: it is a synonym for "code point", and every single one of
> Marko's objections disappears.

I admit it. Python3 is the perfect medium for your codepoint delivery needs.

What you don't seem to understand about my objections is that no programmer needs codepoints per se. Also, Python2's strings do as good a job at delivering codepoints as Python3. Simultaneously, Python2's strings are a better fit for the Unix system and network programming model.

>> This goes back to my original point, where I said some people
>> consider UTF-32 as a variable width encoding. For very many things,
>> practically, the ‘codepoint’ isn’t the important thing,
>
> Ah, is this another one of those "let's pick a definition that nobody
> else uses, and state it as a fact" like UTF-32 being variable width?

   Each 32-bit value in UTF-32 represents one Unicode code point and is
   exactly equal to that code point's numerical value.

   <https://en.wikipedia.org/wiki/UTF-32>

That is called bijection. Even more, it's a homomorphism. A homomorphism is a very high degree of sameness.

It is essential for people to understand that the very same issues that plague UTF-8 plague UTF-32 as well. Using UTF in both highlights that fact.

> If by "very many things", you mean "not very many things", I agree
> with you. In my experience, dealing with code points is "good enough",
> especially if you use Western European alphabets, and even more so if
> you're willing to do a normalization step before processing text.

Of course, UTF-8 doesn't relieve you from Unicode problems. But it has one big advantage: it can usually deal with non-Unicode data without any extra considerations while Python3's strings make you have to take elaborate measures to handle those special cases. Why, even print() must be guarded against UnicodeEncodeError when the printed string is not in the programmer's control.

> But of course other people's experience may vary. I'm interested in
> learning about the library you use to process graphemes in your software.

For me, the issue is where do I produce a line break in my text output? Currently, I'm just counting codepoints to estimate the width of the output.

Marko
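The UTF-32 statement quoted above can be checked with a short sketch, assuming Python 3 and the standard `struct` module. The helper `utf32_units` and the sample string are invented for illustration.

```python
import struct


def utf32_units(text: str) -> list:
    """32-bit code units of text encoded as UTF-32 (little-endian, no BOM)."""
    data = text.encode("utf-32-le")  # explicit byte order, so no BOM is added
    return list(struct.unpack("<%dI" % (len(data) // 4), data))


sample = "a\u00e9\u20ac\U0001F600"  # 1-, 2-, 3- and 4-byte characters in UTF-8

print(utf32_units(sample) == [ord(ch) for ch in sample])
print(len(sample.encode("utf-8")), "bytes in UTF-8")
print(len(utf32_units("a\u0300")), "units for one displayed letter")
```

Each UTF-32 unit equals the code point's number, yet a letter written as a base character plus a combining mark still takes two units. That is the sense in which the grapheme-level issues are shared by UTF-8 and UTF-32.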
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On Tue, 17 Jul 2018 10:51:38 +0300, Marko Rauhamaa wrote:

> in which Python3's honor is defended in a good many of the discussions
> in this newsgroup: anger, condescension, ridicule, name-calling.

You call it defending Python 3's honour. I call it responding to people who insist on spreading misinformation and falsehoods even when given the correct details.

Some people have their self-image wrapped up in being able to portray themselves as a maverick who, almost alone, sees through the "lies" to see "the truth". Others prefer reality instead, and get upset when false facts are repeated, over and over again, as truth.

If instead you want to discuss actual concrete areas where Python's text/bytes divide hurts, you'll find that there are plenty of people who agree. Especially if they have to write string-handling code that needs to run under both 2 and 3. Been there, done that, don't want to do it again.

The Python 3 redesign was done to fix certain common, hard-to-diagnose problems in string handling caused by Python 2's violation of the Zen "in the face of ambiguity, refuse the temptation to guess".

(Python 2 guesses what encoding you probably mean when it comes to strings and bytes, and when it gets it right it is convenient, but when it gets it wrong, it is badly wrong, and hard to diagnose and fix.)

It is impossible to improve the text handling experience for every single programmer writing every single kind of program under every single set of circumstances. Like any semantic change, there are going to be winners and losers, and the core devs' position is that if the losers have concrete and backwards-compatible suggestions for improving their experience (e.g. re-adding % support for byte strings) they will consider them, but going back to the Python 2 misdesign is off the table.

--
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing it everywhere." -- Jon Ronson
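The "refuse the temptation to guess" point above can be seen in a minimal sketch, assuming Python 3. The helper name `join_name` and its default encoding are hypothetical, chosen only to show the explicit decoding step.

```python
def join_name(prefix: bytes, suffix: str, encoding: str = "utf-8") -> str:
    """Combine bytes and text by decoding explicitly.

    The encoding is the caller's decision; nothing is guessed.
    """
    return prefix.decode(encoding) + suffix


try:
    b"abc" + "def"  # mixing bytes and str directly
except TypeError as exc:
    print("Python 3 refuses to guess:", exc)

print(join_name(b"abc", "def"))
```

Python 2 would have decoded the byte string implicitly as ASCII here, which works for this input and fails only once non-ASCII bytes appear.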
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On Tue, 17 Jul 2018 15:20:16 +0900, INADA Naoki wrote (replying to Marko):

> I still don't understand what your original point is. I think UTF-8 vs
> UTF-32 is totally different from Python 2 vs 3.
>
> For example, strings in Rust and Swift (2010s languages!) are *valid*
> UTF-8. There is a strong separation between byte array and string, even
> though they use UTF-8. They look similar to Python 3, not Python 2.
>
> And Python can use UTF-8 for internal encoding in the future. AFAIK,
> PyPy is trying it now. After they succeed, I want to try porting it to
> CPython after we remove the legacy Unicode APIs. (ref PEP 393)

I'm not sure about PyPy, but I'm fairly certain that MicroPython uses UTF-8.

I would be very interested to see the results of using UTF-8 in CPython. At the least, it would remove the need to keep a separate UTF-8 representation in the string object, as they do now. It might even be more compact, although a naive implementation would lose the ability to do constant time indexing into strings. That might be a tradeoff worth keeping, if indexing remained sufficiently fast.

--
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing it everywhere." -- Jon Ronson
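To make the indexing tradeoff above concrete, here is a naive sketch in Python 3 of looking up the n-th code point in UTF-8 encoded bytes. The function `utf8_index` is invented for illustration and assumes valid UTF-8 input; it is not how CPython, PyPy or MicroPython implement strings.

```python
def utf8_index(buf: bytes, index: int) -> str:
    """Return the code point at position index of UTF-8 encoded buf.

    Assumes buf is valid UTF-8. The scan is linear in len(buf), unlike
    indexing into a fixed-width representation.
    """
    # A code point starts at every byte that is not a continuation byte
    # (continuation bytes have the bit pattern 10xxxxxx).
    starts = [i for i, byte in enumerate(buf) if (byte & 0xC0) != 0x80]
    if not 0 <= index < len(starts):
        raise IndexError("code point index out of range")
    begin = starts[index]
    end = starts[index + 1] if index + 1 < len(starts) else len(buf)
    return buf[begin:end].decode("utf-8")


encoded = "a\u00e9\u20ac\U0001F600z".encode("utf-8")
print(len(encoded), "bytes,", "last code point:", utf8_index(encoded, 4))
```

A real implementation would presumably cache offsets or keep an index to avoid rescanning on every lookup, which is the "sufficiently fast" question raised in the message.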
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On Tue, 17 Jul 2018 09:52:13 +0300, Marko Rauhamaa wrote:

> Both Python2 and Python3 provide two forms of string, one containing
> 8-bit integers and another one containing 21-bit integers.

Why do you insist on making counter-factual statements as facts? Don't you have a Python REPL you can try these outrageous claims out before making them?

    py> b'abcd'[2] + 1  # bytes are sequences of integers
    100
    py> 'abcd'[2] + 1  # strings are not sequences of integers
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: Can't convert 'int' object to str implicitly

Python strings are sequences of abstract characters.

--
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing it everywhere." -- Jon Ronson

--
https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On Tue, 17 Jul 2018 08:26:45 +0300, Marko Rauhamaa wrote: > Steven D'Aprano : >> On Mon, 16 Jul 2018 22:51:32 +0300, Marko Rauhamaa wrote: >>> UTF-8 bytes can only represent the first 128 code points of Unicode. >> >> This is DailyWTF material. Perhaps you want to rethink your wording and >> maybe even learn a bit more about Unicode and the UTF encodings before >> making such statements. >> >> The idea that UTF-8 bytes cannot represent the whole of Unicode is not >> even wrong. Of course a *single* byte cannot, but a single byte is not >> "UTF-8 bytes". > > So I hope that by now you have understood my point and been able to > decide if you agree with it or not. If your point was not what you wrote, then no, I'm sorry, my crystal ball unexpectedly broke down (why it didn't foresee its own failure I'll never know...). I can't tell what you are thinking, only what you write. Sometimes I can guess (like my earlier guess that you meant grapheme, rather than glyph) but in this case, if you mean something other than "UTF-8 bytes can only represent the first 128 code points of Unicode" I'm flummoxed. -- Steven D'Aprano "Ever since I learned about confirmation bias, I've been seeing it everywhere." -- Jon Ronson -- https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
INADA Naoki : >> I won't comment on Rust and Swift because I don't know them. > ... >> I won't comment on Go, either. > > Hmm, do you say Python 3 is "cult-like" without survey other popular, > programming languages? You can talk about Python3 independently of other programming languages. Python3 is not a cult. It's a programming language. What is cult-like is the manner in which Python3's honor is defended in a good many of the discussions in this newsgroup: anger, condescension, ridicule, name-calling. > I can't agree that it's cult-like behavior. I think it's practical > design decision. If Python3 works for you, I'm happy for you. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On Mon, 16 Jul 2018 21:25:20 -0500, Tim Chase wrote: > On 2018-07-17 01:08, Steven D'Aprano wrote: >> In English, I think most people would prefer to use a different term >> for whatever "sh" and "ch" represent than "character". > > The term you may be reaching for is "consonant cluster"? > > https://en.wikipedia.org/wiki/Consonant_cluster Thanks! -- Steven D'Aprano "Ever since I learned about confirmation bias, I've been seeing it everywhere." -- Jon Ronson -- https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On Mon, 16 Jul 2018 21:48:42 -0400, Richard Damon wrote:

>> On Jul 16, 2018, at 9:21 PM, Steven D'Aprano wrote:
>>
>>> On Mon, 16 Jul 2018 19:02:36 -0400, Richard Damon wrote:
>>>
>>> You are defining a variable/fixed width codepoint set. Many others
>>> want to deal with CHARACTER sets.
>>
>> Good luck coming up with a universal, objective, language-neutral,
>> consistent definition for a character.
>>
> Who says there needs to be one? A good engineer will use the definition
> that is most appropriate to the task at hand. Some things need very
> solid definitions, and some things don’t.

Then the problem is solved: we have a perfectly good de facto definition of character: it is a synonym for "code point", and every single one of Marko's objections disappears.

> This goes back to my original point, where I said some people consider
> UTF-32 as a variable width encoding. For very many things, practically,
> the ‘codepoint’ isn’t the important thing,

Ah, is this another one of those "let's pick a definition that nobody else uses, and state it as a fact" like UTF-32 being variable width?

If by "very many things", you mean "not very many things", I agree with you. In my experience, dealing with code points is "good enough", especially if you use Western European alphabets, and even more so if you're willing to do a normalization step before processing text.

But of course other people's experience may vary. I'm interested in learning about the library you use to process graphemes in your software.

> so the fact that every UTF-32
> code point takes the same number of bytes or code words isn’t that
> important. They are dealing with something that needs to be rendered and
> preserving larger units, like the grapheme is important.

If you're writing a text widget or a shell, you need to worry about rendering glyphs. Everyone else just delegates to their text widget, GUI framework, or shell.

>>> This doesn’t mean that UTF-32 is an awful system, just that it isn’t >>> the magical cure that some were hoping for. >> >> Nobody ever claimed it was, except for the people railing that since it >> isn't a magically system we ought to go back to the Good Old Days of >> code page hell, or even further back when everyone just used ASCII. >> > Sometimes ASCII is good enough, especially on a small machine with > limited resources. I doubt that there are many general purpose computers with resources *that* limited. Even MicroPython supports Unicode, and that runs on embedded devices with memory measured in kilobytes. 8K is considered the smallest amount of memory usable with MicroPython, although 128K is more realistic as the *practical* lower limit. In the mid 1980s, I was using computers with 128K of RAM, and they were still able to deal with more than just ASCII. I think the "limited resources" argument is bogus. -- Steven D'Aprano "Ever since I learned about confirmation bias, I've been seeing it everywhere." -- Jon Ronson -- https://mail.python.org/mailman/listinfo/python-list
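The "normalization step" mentioned in the message above can be sketched with the standard library's `unicodedata.normalize`. This is only an illustration with made-up sample strings: NFC composes a base letter and a combining mark into a single code point where Unicode defines a precomposed form, and leaves the sequence alone where it does not.

```python
import unicodedata

composed = "\u00e2"      # LATIN SMALL LETTER A WITH CIRCUMFLEX, one code point
decomposed = "a\u0302"   # 'a' followed by COMBINING CIRCUMFLEX ACCENT, two code points

# Same rendered text, different code point sequences, so they compare unequal.
same_before = composed == decomposed

# After NFC normalization both spellings are the single precomposed code point.
nfc = unicodedata.normalize("NFC", decomposed)
same_after = nfc == composed

# Normalization is not a cure-all: 'q' + circumflex has no precomposed form,
# so it stays two code points even in NFC.
still_two = unicodedata.normalize("NFC", "q\u0302")

print(same_before, same_after, len(nfc), len(still_two))
```

This is why working at the code point level is often "good enough" after normalizing, and also why it is not a complete answer for text that relies on combining sequences.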
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
> I won't comment on Rust and Swift because I don't know them. ... > I won't comment on Go, either. Hmm, do you say Python 3 is "cult-like" without survey other popular, programming languages? There are many popular languages which separate bytes and unicode string explicitly and string is not byte-transparent; C#, Java, ECMAScript, (including families like TypeScript), Rust, Swift, Julia, and more. I can't agree that it's cult-like behavior. I think it's practical design decision. Regards, -- INADA Naoki -- https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On 7/16/2018 10:25 PM, Tim Chase wrote: On 2018-07-17 01:08, Steven D'Aprano wrote: In English, I think most people would prefer to use a different term for whatever "sh" and "ch" represent than "character". The term you may be reaching for is "consonant cluster"? https://en.wikipedia.org/wiki/Consonant_cluster Sibilant (soft) ch (as opposed to hard aspirated chi as in Greek letter khi (visually like X)) and sh are single consonants, single phonemes in spoken language. In less parsimonious writing systems than Latin, they are often represented by single characters. When transliterated into Latin characters, both decorated c and s and ch and sh are used. 'str', as in string or street is a consonant cluster. It might be represented by a single ligature, but I would not expect any phoneme-based writing system to consider the result to be a single character. (Given that the sound of X (hard chi) mutated into 'ks', the latter is not impossible.) -- Terry Jan Reedy -- https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
INADA Naoki : > On Tue, Jul 17, 2018 at 2:31 PM Marko Rauhamaa wrote: >> So I hope that by now you have understood my point and been able to >> decide if you agree with it or not. > > I still don't understand what's your original point. > I think UTF-8 vs UTF-32 is totally different from Python 2 vs 3. > > For example, string in Rust and Swift (2010s languages!) are *valid* > UTF-8. There are strong separation between byte array and string, even > they use UTF-8. They looks similar to Python 3, not Python 2. I won't comment on Rust and Swift because I don't know them. > And Python can use UTF-8 for internal encoding in the future. AFAIK, > PyPy tries it now. After they succeeded, I want to try port it to > CPython after we removed legacy Unicode APIs. (ref PEP 393) How CPython3 implements str objects internally is not what I'm talking about. It's the programmer's model in any compliant Python3 implementation. Both Python2 and Python3 provide two forms of string, one containing 8-bit integers and another one containing 21-bit integers. Python3 made the situation worse in a minor way and a major way. The minor way is the uglification of the byte string notation. The major way is the wholesale preference or mandating of Unicode strings in numerous standard-library interfaces. > So "UTF-8 is better than UTF-32" is totally different problem from > "Python 2 is better than 3". Unix programming is smoothest when the programmer can operate on bytes. Bytes are the mother tongue of Unix, and programming languages should not try to present a different model to the programmer. > Is your point "accepting invalid UTF-8 implicitly by default is better > than explicit 'surrogateescape' error handler" like Go? > (It's 2010s languages with UTF-8 based string too, but accept invalid > UTF-8). I won't comment on Go, either. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On 7/16/2018 7:02 PM, Richard Damon wrote:

>> On Jul 16, 2018, at 3:28 PM, Terry Reedy wrote:
>>
>> If one is using a broader definition than usual, it is clearer to say so.

This is the core of what I wrote. Do you disagree?

> You are defining a variable/fixed width codepoint set.

No, I did not define anything. I said, I believe accurately, that this is the, or at least one, common understanding of 'variable/fixed width encoding'.

To repeat, if one is writing to be understood, rather than to create an effect, and one uses a word or phrase in a non-standard fashion (which I myself do occasionally), then it is clearer to say what one is doing (which I also try to do).

--
Terry Jan Reedy

--
https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On Tue, Jul 17, 2018 at 2:31 PM Marko Rauhamaa wrote: > > Steven D'Aprano : > > On Mon, 16 Jul 2018 22:51:32 +0300, Marko Rauhamaa wrote: > >> UTF-8 bytes can only represent the first 128 code points of Unicode. > > > > This is DailyWTF material. Perhaps you want to rethink your wording > > and maybe even learn a bit more about Unicode and the UTF encodings > > before making such statements. > > > > The idea that UTF-8 bytes cannot represent the whole of Unicode is not > > even wrong. Of course a *single* byte cannot, but a single byte is not > > "UTF-8 bytes". > > So I hope that by now you have understood my point and been able to > decide if you agree with it or not. > > > Marko I still don't understand what's your original point. I think UTF-8 vs UTF-32 is totally different from Python 2 vs 3. For example, string in Rust and Swift (2010s languages!) are *valid* UTF-8. There are strong separation between byte array and string, even they use UTF-8. They looks similar to Python 3, not Python 2. And Python can use UTF-8 for internal encoding in the future. AFAIK, PyPy tries it now. After they succeeded, I want to try port it to CPython after we removed legacy Unicode APIs. (ref PEP 393) So "UTF-8 is better than UTF-32" is totally different problem from "Python 2 is better than 3". Is your point "accepting invalid UTF-8 implicitly by default is better than explicit 'surrogateescape' error handler" like Go? (It's 2010s languages with UTF-8 based string too, but accept invalid UTF-8). Regards, -- INADA Naoki -- https://mail.python.org/mailman/listinfo/python-list
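For readers unfamiliar with the 'surrogateescape' error handler mentioned above, here is a small sketch of what it does in Python 3. The file name bytes are invented for the example; the point is only that undecodable bytes survive a decode/encode round trip instead of being rejected or silently altered.

```python
# Bytes that are not valid UTF-8: 0xE9 is "é" in Latin-1 (sample data).
raw = b"caf\xe9.txt"

# A strict decode refuses the input.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With surrogateescape, each undecodable byte 0xNN is smuggled through as
# the lone surrogate code point U+DCNN.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
roundtrip = name.encode("utf-8", errors="surrogateescape")

print(strict_ok, ascii(name), roundtrip == raw)
```

The resulting `str` is not well-formed Unicode text, so the handler has to be requested explicitly, which is the contrast with Go's implicit acceptance that the question above is getting at.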
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
Steven D'Aprano : > On Mon, 16 Jul 2018 22:51:32 +0300, Marko Rauhamaa wrote: >> UTF-8 bytes can only represent the first 128 code points of Unicode. > > This is DailyWTF material. Perhaps you want to rethink your wording > and maybe even learn a bit more about Unicode and the UTF encodings > before making such statements. > > The idea that UTF-8 bytes cannot represent the whole of Unicode is not > even wrong. Of course a *single* byte cannot, but a single byte is not > "UTF-8 bytes". So I hope that by now you have understood my point and been able to decide if you agree with it or not. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On 2018-07-17 01:21, Steven D'Aprano wrote: > > This doesn’t mean that UTF-32 is an awful system, just that it > > isn’t the magical cure that some were hoping for. > > Nobody ever claimed it was, except for the people railing that > since it isn't a magically system we ought to go back to the Good > Old Days of code page hell, or even further back when everyone just > used ASCII. But even ed(1) on most systems is 8-bit clean so even there you're not limited to ASCII. I can't say I miss code-pages in the least. -tkc -- https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On 2018-07-17 01:08, Steven D'Aprano wrote: > In English, I think most people would prefer to use a different > term for whatever "sh" and "ch" represent than "character". The term you may be reaching for is "consonant cluster"? https://en.wikipedia.org/wiki/Consonant_cluster -tkc -- https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
> On Jul 16, 2018, at 9:21 PM, Steven D'Aprano wrote:
>
>> On Mon, 16 Jul 2018 19:02:36 -0400, Richard Damon wrote:
>>
>> You are defining a variable/fixed width codepoint set. Many others want
>> to deal with CHARACTER sets.
>
> Good luck coming up with a universal, objective, language-neutral,
> consistent definition for a character.

Who says there needs to be one? A good engineer will use the definition that is most appropriate to the task at hand. Some things need very solid definitions, and some things don’t.

This goes back to my original point, where I said some people consider UTF-32 as a variable width encoding. For very many things, practically, the ‘codepoint’ isn’t the important thing, so the fact that every UTF-32 code point takes the same number of bytes or code words isn’t that important. They are dealing with something that needs to be rendered, and preserving larger units, like the grapheme, is important.

>> This doesn’t mean that UTF-32 is an awful system, just that it isn’t the
>> magical cure that some were hoping for.
>
> Nobody ever claimed it was, except for the people railing that since it
> isn't a magical system we ought to go back to the Good Old Days of code
> page hell, or even further back when everyone just used ASCII.

Sometimes ASCII is good enough, especially on a small machine with limited resources. Sometimes you do need to use a ‘Code Page’ because of limited resources (and that unit will only be able to talk a single language because of that too). Sometimes you have the luxury of being able to use a somewhat complete Unicode implementation. Sometimes you are never going to be displaying anything, and you can mostly just treat everything as a bag of bytes. You use the tool that is right for the job.

--
https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On Mon, 16 Jul 2018 22:51:32 +0300, Marko Rauhamaa wrote: > All UTF-8. No unicode strings. That just means you are re-implementing the bits of Unicode you care about (which may be "nothing at all") as UTF-8. If your application is nothing but middleware squirting bytes from one layer to another layer, that might be all you need care about. But then you're not processing text in your application, and why should your experience in not-processing-text be given any weight over the experiences of those who do process text? And later, in another post: > UTF-8 bytes can only represent the first 128 code points of Unicode. This is DailyWTF material. Perhaps you want to rethink your wording and maybe even learn a bit more about Unicode and the UTF encodings before making such statements. The idea that UTF-8 bytes cannot represent the whole of Unicode is not even wrong. Of course a *single* byte cannot, but a single byte is not "UTF-8 bytes". -- Steven D'Aprano "Ever since I learned about confirmation bias, I've been seeing it everywhere." -- Jon Ronson -- https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On Mon, 16 Jul 2018 15:28:51 -0400, Terry Reedy wrote: > On 7/16/2018 1:11 PM, Richard Damon wrote: > >> Many consider that UTF-32 is a variable-width encoding because of the >> combining characters. It can take multiple ‘codepoints’ to define what >> should be a single ‘character’ for display. > > I hope you realize that this is not the standard meaning of > 'variable-width encoding', which is 'variable number of bytes for a > codepoint'. A minor correction Terry: it is the number of code units, not bytes. UTF-8 uses 1-byte code units, and from 1 to 4 code units per code point; UTF-16 uses 2-byte code units (a 16-bit word), and 1 or 2 words per code point; UTF-32 uses 4-byte code units (a 32-bit word), and only ever a single code unit for every code point. -- Steven D'Aprano "Ever since I learned about confirmation bias, I've been seeing it everywhere." -- Jon Ronson -- https://mail.python.org/mailman/listinfo/python-list
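The code unit counts described above can be checked from Python. This is a rough sketch: the helper name `code_units` and the sample characters are chosen for the example, and the little-endian codec variants are used only so that no byte order mark inflates the counts.

```python
def code_units(text, encoding, unit_size):
    """Number of code units `text` occupies in `encoding` (unit_size in bytes)."""
    return len(text.encode(encoding)) // unit_size

# One code point each: U+0061, U+00E9, U+20AC, U+1F600.
samples = ["a", "é", "€", "😀"]

# Map each character to (UTF-8 units, UTF-16 units, UTF-32 units).
table = {
    ch: (
        code_units(ch, "utf-8", 1),
        code_units(ch, "utf-16-le", 2),
        code_units(ch, "utf-32-le", 4),
    )
    for ch in samples
}

for ch, counts in table.items():
    print(f"U+{ord(ch):04X}", counts)
```

Only the last column is constant, which is the standard sense in which UTF-32 is fixed width and the other two are variable width.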
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On Mon, 16 Jul 2018 19:02:36 -0400, Richard Damon wrote:

> You are defining a variable/fixed width codepoint set. Many others want
> to deal with CHARACTER sets.

Good luck coming up with a universal, objective, language-neutral, consistent definition for a character.

> This doesn’t mean that UTF-32 is an awful system, just that it isn’t the
> magical cure that some were hoping for.

Nobody ever claimed it was, except for the people railing that since it isn't a magical system we ought to go back to the Good Old Days of code page hell, or even further back when everyone just used ASCII.

--
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing it everywhere." -- Jon Ronson

--
https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On Tue, 17 Jul 2018 06:15:25 +1000, Chris Angelico wrote: > On Tue, Jul 17, 2018 at 4:55 AM, Steven D'Aprano > wrote: >> There is nothing special about diacritics such that we ought to treat >> some combinations like "Ch" (two code points = one character) as "fixed >> width" while others like "â" (two code points = one character) as >> "variable width". > > When you reverse a word, do you treat "ch" and "sh" as one character or > two? In English, "ch" is always two letters of the alphabet. In Welsh and Czech, they can be one or two letters. (I think they will be two letters only in loan words, but I'm not certain about that.) Whether that makes them one or two characters depends on how you define "character". Good luck with finding a universal, objective, unambiguous definition. > I'm of the opinion that they're single characters, and thus this > should be "dalokosh": > > https://wiki.teamfortress.com/wiki/Dalokohs_Bar > > (It's the Russian for "chocolate" - "шоколад" - transliterated to > English/Latin - "šokolad" or "shokolad" - and then reversed.) In English, I think most people would prefer to use a different term for whatever "sh" and "ch" represent than "character". But you make a good point that even in English, we sometimes want to treat two letter combinations as a single unit. -- Steven D'Aprano "Ever since I learned about confirmation bias, I've been seeing it everywhere." -- Jon Ronson -- https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
> On Jul 16, 2018, at 3:28 PM, Terry Reedy wrote: > >> On 7/16/2018 1:11 PM, Richard Damon wrote: >> >> Many consider that UTF-32 is a variable-width encoding because of the >> combining characters. It can take multiple ‘codepoints’ to define what >> should be a single ‘character’ for display. > > I hope you realize that this is not the standard meaning of 'variable-width > encoding', which is 'variable number of bytes for a codepoint'. UTF-16 and > UTF-8 are variable width. If one expands the definition enough, Ascii is > 'variable width' because 'fi' is two bytes, or more realistically, because <= > and >= are two bytes instead of one (as they can be in Unicode!). > > If one is using a broader definition than usual, it is clearer to say so. > > -- > Terry Jan Reedy > You are defining a variable/fixed width codepoint set. Many others want to deal with CHARACTER sets. The Unicode consortium agrees that a code point is not necessarily a character (which is one reason they came up with the term). When actually trying to do work with text strings, the fact that some codepoints are combining codes that need to ‘stick’ to their mate becomes important. One of the claimed advantages of fixed width character set encodings is that you aren’t supposed to need to worry about breaking strings in two, but that doesn’t work in Unicode, you need to make sure you aren’t breaking a combining sequence. Even worse, Unicode really needs arbitrary look back to render substrings because it uses shift codes for things like left-to-right/right-to-left rendering control. This doesn’t mean that UTF-32 is an awful system, just that it isn’t the magical cure that some were hoping for. -- https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On Tue, Jul 17, 2018 at 7:02 AM, Ethan Furman wrote:
> On 07/16/2018 01:15 PM, Chris Angelico wrote:
>>
>> On Tue, Jul 17, 2018 at 4:55 AM, Steven D'Aprano wrote:
>
>>> There is nothing special about diacritics such that we ought to treat
>>> some combinations like "Ch" (two code points = one character) as "fixed
>>> width" while others like "â" (two code points = one character) as
>>> "variable width".
>>
>> When you reverse a word, do you treat "ch" and "sh" as one character
>> or two? I'm of the opinion that they're single characters, and thus
>> this should be "dalokosh":
>
> Depends on the language: in Spanish, "ch" is its own letter (at least it
> was when I grew up), so any word containing it should still contain it when
> reversed: "chica" would be "acich".

Yeah. In Russian, "sh" is the single character "ш". I'm of the opinion that, even after being transliterated into English phonetics, that should be treated as a unit. ISO-9 uses "š" rather than "sh", which is an improvement in character correspondence, but your average English speaker is more likely to be able to pronounce "dalokosh" correctly than to figure out "dalokoš". In the same way, I created a magic item in a D&D campaign called "Yasham Burda", even though the more correct spelling would be "Yaşam Burda" or even "Yasam Burda", for the benefit of my monolingual players. But I'd still treat the "sh" as one character.

Ain't transliteration fun?

ChrisA

--
https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
Ethan Furman:
> Depends on the language: in Spanish, "ch" is its own letter (at least
> it was when I grew up), so any word containing it should still contain
> it when reversed: "chica" would be "acich".

The Royal Academy broke "ch" and "ll" up into separate letters a decade or so back. It had become accepted practice in dictionaries way before that.

In Finnish, "v" and "w" are still orthographic variants of the same letter. In practice, Finns don't have a problem with computers insisting they are separate letters.

While the Royal Academy of the Spanish Language has now accepted that "ñ" is an accented "n", no Finn would think that "ä" is an accented "a" any more than an English-speaker would think that "G" is an accented "C" (which it originally was).

Marko

--
https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On Tue, Jul 17, 2018 at 6:54 AM, Marko Rauhamaa wrote:
> Chris Angelico:
>> Challenge: Reverse a string in UTF-8.
>
> Counter-challenge: Reverse a Unicode string:
>
>    >>> s = "a\u0304e"
>    >>> s
>    'āe'
>    >>> L = list(s)
>    >>> L.reverse()
>    >>> "".join(L)
>    'ēa'
>
>> Challenge: Center text in UTF-8.
>
> Counter-challenge: Center a Unicode string:
>
>    >>> t = s * 3
>    >>> t
>    'āeāeāe'
>    >>> t.center(9)
>    'āeāeāe'
>
>> Challenge: Given a (non-initial) character in a buffer of UTF-8 bytes,
>> find the immediately preceding character.
>
> The counter-challenge is left as an exercise for the reader.
>
>> All of these are fundamentally difficult by nature, but if you index
>> by code points, you eliminate one level of difficulty; indexing by
>> bytes retains all the existing difficulty and adds another layer.
>
> Oh, sorry. I thought you were suggesting Unicode strings would make the
> challenges somehow easy.

So now that you've actually read my entire post, you'll see that there are fundamental difficulties, but that UTF-8 introduces more. Great. Now go ahead and reply to my post, knowing my actual point. Congratulations on posting something of no value.

ChrisA

--
https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On 07/16/2018 01:15 PM, Chris Angelico wrote:
> On Tue, Jul 17, 2018 at 4:55 AM, Steven D'Aprano wrote:
>> There is nothing special about diacritics such that we ought to treat
>> some combinations like "Ch" (two code points = one character) as "fixed
>> width" while others like "â" (two code points = one character) as
>> "variable width".
>
> When you reverse a word, do you treat "ch" and "sh" as one character
> or two? I'm of the opinion that they're single characters, and thus
> this should be "dalokosh":

Depends on the language: in Spanish, "ch" is its own letter (at least it was when I grew up), so any word containing it should still contain it when reversed: "chica" would be "acich".

--
~Ethan~

--
https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
Chris Angelico:
> Challenge: Reverse a string in UTF-8.

Counter-challenge: Reverse a Unicode string:

    >>> s = "a\u0304e"
    >>> s
    'āe'
    >>> L = list(s)
    >>> L.reverse()
    >>> "".join(L)
    'ēa'

> Challenge: Center text in UTF-8.

Counter-challenge: Center a Unicode string:

    >>> t = s * 3
    >>> t
    'āeāeāe'
    >>> t.center(9)
    'āeāeāe'

> Challenge: Given a (non-initial) character in a buffer of UTF-8 bytes,
> find the immediately preceding character.

The counter-challenge is left as an exercise for the reader.

> All of these are fundamentally difficult by nature, but if you index
> by code points, you eliminate one level of difficulty; indexing by
> bytes retains all the existing difficulty and adds another layer.

Oh, sorry. I thought you were suggesting Unicode strings would make the challenges somehow easy.

Marko

--
https://mail.python.org/mailman/listinfo/python-list
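The reversal counter-challenge above can be handled at the code point level with a little extra work. The following is a simplified sketch, not a full implementation of Unicode grapheme cluster segmentation: the helper names `clusters` and `reverse_clusters` are invented here, and the only rule applied is "a combining mark sticks to the code point before it" (so it ignores things like zero-width joiners and Hangul jamo).

```python
import unicodedata

def clusters(text):
    """Group each base code point with the combining marks that follow it."""
    groups = []
    for ch in text:
        # combining() returns a non-zero class for combining marks.
        if groups and unicodedata.combining(ch):
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups

def reverse_clusters(text):
    """Reverse the order of clusters while keeping each cluster intact."""
    return "".join(reversed(clusters(text)))

s = "a\u0304e"                  # 'a' + COMBINING MACRON + 'e'
naive = s[::-1]                 # the macron ends up on the 'e'
careful = reverse_clusters(s)   # the macron stays on the 'a'

print(ascii(naive), ascii(careful))
```

The same grouping step would be the starting point for a visually correct `center`, although real display width also depends on the font and terminal.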
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On Tue, Jul 17, 2018 at 4:55 AM, Steven D'Aprano wrote: > There is nothing special about diacritics such that we ought to treat > some combinations like "Ch" (two code points = one character) as "fixed > width" while others like "â" (two code points = one character) as > "variable width". When you reverse a word, do you treat "ch" and "sh" as one character or two? I'm of the opinion that they're single characters, and thus this should be "dalokosh": https://wiki.teamfortress.com/wiki/Dalokohs_Bar (It's the Russian for "chocolate" - "шоколад" - transliterated to English/Latin - "šokolad" or "shokolad" - and then reversed.) But that's an extremely difficult thing to explain to your average gamer... ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On Tue, Jul 17, 2018 at 5:51 AM, Marko Rauhamaa wrote: > Steven D'Aprano : >> Under that standard definition, UTF-8 and UTF-16 are variable-width, >> and UTF-32 is fixed-width. >> >> But I'll accept that UTF-32 is variable-width if Marko accepts that >> ASCII is too. > > If that makes you happy, fine. The point is, UTF-32 has no advantages > over UTF-8. And I'm referring to the text abstraction as seen by the > programmer. It has nothing to do with the layout of bytes inside > CPython. > > I use UTF-8 in my C programs and sense no disadvantage. I have never > felt a need for wchar_t. Similarly, I had a small Python2 program that > quizzed me about Hebrew vocabulary with Finnish translations and > Esperanto pronunciation instructions. All UTF-8. No unicode strings. (I > *have* converted that to Python3 just to be on the bleeding edge, but it > didn't give me any advantages over Python2.) Challenge: Reverse a string in UTF-8. Challenge: Center text in UTF-8. Challenge: Given a (non-initial) character in a buffer of UTF-8 bytes, find the immediately preceding character. All of these are fundamentally difficult by nature, but if you index by code points, you eliminate one level of difficulty; indexing by bytes retains all the existing difficulty and adds another layer. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
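To make the third challenge concrete, here is one way it might be done on raw UTF-8 bytes. It is a sketch under stated assumptions: the function name `previous_char_start` is invented for the example, the buffer is assumed to be valid UTF-8, and `index` is assumed to sit on a code point boundary. It relies on the fact that UTF-8 continuation bytes always have the bit pattern `10xxxxxx`.

```python
def previous_char_start(buf, index):
    """Return the byte offset where the code point before buf[index] starts."""
    if index <= 0:
        raise ValueError("no preceding character")
    i = index - 1
    # Step backwards over continuation bytes (0b10xxxxxx) to the lead byte.
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i

# Sample buffer: code points taking 1, 2, 3 and 4 bytes respectively.
buf = "aé€😀".encode("utf-8")

# Walk backwards from the end, collecting the start offset of each code point.
starts = []
pos = len(buf)
while pos > 0:
    pos = previous_char_start(buf, pos)
    starts.append(pos)

last = buf[starts[0]:].decode("utf-8")
print(starts, last)
```

This finds the preceding code point, which is the extra layer that byte indexing adds; whether that code point is a whole "character" is the separate difficulty that exists in either representation.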
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On 16/07/18 20:51, Marko Rauhamaa wrote: I use UTF-8 in my C programs and sense no disadvantage. I have never felt a need for wchar_t. That's not a good comparison, though, because wchar_t in C really doesn't give you much (if any) advantage over rolling your own UTF-8 support, even when that means making sure you don't split characters across buffers. -- Rhodri James *-* Kynesim Ltd -- https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
Steven D'Aprano : > Under that standard definition, UTF-8 and UTF-16 are variable-width, > and UTF-32 is fixed-width. > > But I'll accept that UTF-32 is variable-width if Marko accepts that > ASCII is too. If that makes you happy, fine. The point is, UTF-32 has no advantages over UTF-8. And I'm referring to the text abstraction as seen by the programmer. It has nothing to do with the layout of bytes inside CPython. I use UTF-8 in my C programs and sense no disadvantage. I have never felt a need for wchar_t. Similarly, I had a small Python2 program that quizzed me about Hebrew vocabulary with Finnish translations and Esperanto pronunciation instructions. All UTF-8. No unicode strings. (I *have* converted that to Python3 just to be on the bleeding edge, but it didn't give me any advantages over Python2.) Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On 7/16/2018 1:11 PM, Richard Damon wrote: Many consider that UTF-32 is a variable-width encoding because of the combining characters. It can take multiple ‘codepoints’ to define what should be a single ‘character’ for display. I hope you realize that this is not the standard meaning of 'variable-width encoding', which is 'variable number of bytes for a codepoint'. UTF-16 and UTF-8 are variable width. If one expands the definition enough, Ascii is 'variable width' because 'fi' is two bytes, or more realistically, because <= and >= are two bytes instead of one (as they can be in Unicode!). If one is using a broader definition than usual, it is clearer to say so. -- Terry Jan Reedy -- https://mail.python.org/mailman/listinfo/python-list
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On Mon, 16 Jul 2018 14:22:27 -0400, Richard Damon wrote:

[...]
> But I am not talking about those sorts of characters or ligatures,

So what? I am. You don't get to say "only non-standard definitions I approve of count". There is the industry standard definition of what it means to be a fixed- or variable-width encoding, which we can all agree on, or we can have a free-for-all where I reject your non-standard meaning and you reject mine and nobody can understand anything that anyone else says.

You (generic "you", not necessarily you personally) don't get to demand that I must accept your redefinition, while simultaneously refusing to return the favour. If you try, I will simply dismiss what you say as nonsense on stilts: you (still generic you) clearly don't know what variable-width means and are trying to shift the terms of the debate by redefining terms so that black means white and white means purple.

> but
> ‘characters’ that are built up of combining diacritical marks (like
> accents) and a base character. Unicode defines many code points for the
> more common of these, but many others have none.

I am aware how Unicode works, and it doesn't change a thing. Fixed/variable width is NOT defined in terms of "characters", but if it were, ASCII would be variable width too.

Limiting the definition to only diacritics is just a feeble attempt to wiggle out of the logical consequences of your (generic your) position.

There is nothing special about diacritics such that we ought to treat some combinations like "Ch" (two code points = one character) as "fixed width" while treating others like "â" (two code points = one character) as "variable width". To do so is just special pleading.

And the thing about special pleading is that we're not obliged to accept it. Plead as much as you like, the answer is still no.

--
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing it everywhere." -- Jon Ronson
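For readers who want to see the "â" versus "Ch" comparison concretely, here is a small sketch using the standard `unicodedata` module. The variable names are chosen for illustration only. It shows that "â" exists both as one precomposed code point and as a base letter plus a combining mark, that "Ch" has no precomposed form, and that UTF-32 spends exactly four bytes on every code point in all of these cases.

```python
import unicodedata

composed = "\u00e2"      # LATIN SMALL LETTER A WITH CIRCUMFLEX, one code point
decomposed = "a\u0302"   # "a" followed by COMBINING CIRCUMFLEX ACCENT
digraph = "Ch"           # two code points, no precomposed form in Unicode

# Code point counts: what len() reports for a Python 3 str.
counts = (len(composed), len(decomposed), len(digraph))

# The two spellings of "â" differ as strings but are canonically equivalent.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
same_after_nfd = unicodedata.normalize("NFD", composed) == decomposed

# Normalization leaves "Ch" as two code points.
digraph_nfc = unicodedata.normalize("NFC", digraph)

# UTF-32 is fixed width per code point: always 4 bytes each.
utf32_sizes = tuple(len(s.encode("utf-32-le")) for s in (composed, decomposed, digraph))
```

Whichever way one counts "characters", the code unit to code point ratio for UTF-32 stays at one, which is the sense of "fixed width" being argued for in the message above.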
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On Tue, Jul 17, 2018 at 4:22 AM, Richard Damon wrote:
>
> But I am not talking about those sorts of characters or ligatures, but
> ‘characters’ that are built up of combining diacritical marks (like
> accents) and a base character. Unicode defines many code points for the more
> common of these, but many others have none.
>

So, you're talking about "grapheme clusters". Those can be arbitrarily large and complex. Trolls revel in the ability to adorn base characters with ridiculous numbers of "dripping" marks, making it hard to type their names.

Since the amount of information in one grapheme cluster is (as far as I know) potentially infinite, it's fundamentally impossible to create a fixed-size encoding that can represent them. If I'm wrong about the possibilities being infinite, then they are certainly very extensive, as there are MANY combining characters available (the only question is whether you can use the same characters multiple times, in which case there are infinite possibilities, or if not, in which case the possibilities are base_characters*2^combining_characters aka "virtually infinite").

http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

This is a display feature, not an input feature, and certainly not a string representation feature.

ChrisA
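As a rough illustration of why a grapheme cluster cannot fit in a fixed-size slot, the sketch below stacks combining marks on one base letter. The function `rough_clusters` is a hypothetical helper written for this example, and it is only an approximation: it attaches any code point in a Unicode "Mark" category to the preceding code point, which is far simpler than the full boundary rules in UAX #29 linked above.

```python
import unicodedata

def rough_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Approximation only: the real grapheme cluster rules (UAX #29) also
    cover Hangul syllables, emoji sequences, and more.
    """
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me all start with "M".
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# One base letter with 50 COMBINING ACUTE ACCENT marks, then a plain "b".
decorated = "a" + "\u0301" * 50 + "b"
clusters = rough_clusters(decorated)

code_points = len(decorated)         # counts every mark separately
cluster_count = len(clusters)        # what a reader would call "characters"
largest = max(len(c) for c in clusters)
```

Nothing in the string type stops the repeat count from being 500 or 5000, so the size of a single cluster has no useful upper bound.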
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
> On Jul 16, 2018, at 1:36 PM, Steven D'Aprano wrote:
>
> On Mon, 16 Jul 2018 13:11:23 -0400, Richard Damon wrote:
>
>>> On Jul 16, 2018, at 12:51 PM, Steven D'Aprano wrote:
>>>
>>>> On Mon, 16 Jul 2018 00:28:39 +0300, Marko Rauhamaa wrote:
>>>>
>>>> if your new system used Python3's UTF-32 strings as a foundation, that
>>>> would be an equally naïve misstep. You'd need to reach a notch higher
>>>> and use glyphs or other "semiotic atoms" as building blocks. UTF-32,
>>>> after all, is a variable-width encoding.
>>>
>>> Python's strings aren't UTF-32. They are sequences of abstract code
>>> points.
>>>
>>> UTF-32 is not a variable-width encoding.
>>>
>>> --
>>> Steven D'Aprano
>>
>> Many consider that UTF-32 is a variable-width encoding because of the
>> combining characters. It can take multiple ‘codepoints’ to define what
>> should be a single ‘character’ for display.
>
> Ah, well if we're going to start making up our own definitions of terms,
> then ASCII is a variable-width encoding too.
>
> "Ch" (a single letter of the alphabet in a number of European languages,
> including Welsh and Czech) requires two code points in ASCII. Even in
> English, "qu" could be considered a two-byte "character" (grapheme), and
> for ASCII users, (c) is a THREE code point character for what ought to be
> a single character ©.
>
> The standard definition of variable- and fixed-width encodings refers to
> how many *code units* are required to make up a single *code point*.
>
> Under that standard definition, UTF-8 and UTF-16 are variable-width, and
> UTF-32 is fixed-width.
>
> But I'll accept that UTF-32 is variable-width if Marko accepts that ASCII
> is too.
>
> --
> Steven D'Aprano

But I am not talking about those sorts of characters or ligatures, but ‘characters’ that are built up of combining diacritical marks (like accents) and a base character. Unicode defines many code points for the more common of these, but many others have none.
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
On Mon, 16 Jul 2018 13:11:23 -0400, Richard Damon wrote:

>> On Jul 16, 2018, at 12:51 PM, Steven D'Aprano wrote:
>>
>>> On Mon, 16 Jul 2018 00:28:39 +0300, Marko Rauhamaa wrote:
>>>
>>> if your new system used Python3's UTF-32 strings as a foundation, that
>>> would be an equally naïve misstep. You'd need to reach a notch higher
>>> and use glyphs or other "semiotic atoms" as building blocks. UTF-32,
>>> after all, is a variable-width encoding.
>>
>> Python's strings aren't UTF-32. They are sequences of abstract code
>> points.
>>
>> UTF-32 is not a variable-width encoding.
>>
>> --
>> Steven D'Aprano
>
> Many consider that UTF-32 is a variable-width encoding because of the
> combining characters. It can take multiple ‘codepoints’ to define what
> should be a single ‘character’ for display.

Ah, well if we're going to start making up our own definitions of terms, then ASCII is a variable-width encoding too.

"Ch" (a single letter of the alphabet in a number of European languages, including Welsh and Czech) requires two code points in ASCII. Even in English, "qu" could be considered a two-byte "character" (grapheme), and for ASCII users, (c) is a THREE code point character for what ought to be a single character ©.

The standard definition of variable- and fixed-width encodings refers to how many *code units* are required to make up a single *code point*.

Under that standard definition, UTF-8 and UTF-16 are variable-width, and UTF-32 is fixed-width.

But I'll accept that UTF-32 is variable-width if Marko accepts that ASCII is too.

--
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing it everywhere." -- Jon Ronson
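The code-unit definition above is easy to check from a Python 3 prompt. The following is a small sketch, with the helper `code_units` and the sample characters picked purely for illustration. It counts how many code units each encoding needs for a single code point, using the "-le" codec names so that no byte order mark is added to the output.

```python
def code_units(ch, encoding, unit_size):
    """Number of code units `encoding` uses for the single code point `ch`."""
    return len(ch.encode(encoding)) // unit_size

# One code point each: ASCII letter, Latin-1 letter, BMP symbol, astral emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

widths = {
    f"U+{ord(ch):04X}": (
        code_units(ch, "utf-8", 1),       # 8-bit code units
        code_units(ch, "utf-16-le", 2),   # 16-bit code units
        code_units(ch, "utf-32-le", 4),   # 32-bit code units
    )
    for ch in samples
}

for name, (u8, u16, u32) in widths.items():
    print(f"{name}: UTF-8={u8} UTF-16={u16} UTF-32={u32}")
```

The UTF-8 and UTF-16 columns vary from row to row while the UTF-32 column is always 1, which is exactly the variable-width versus fixed-width distinction in the standard sense.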
Re: Glyphs and graphemes [was Re: Cult-like behaviour]
> On Jul 16, 2018, at 12:51 PM, Steven D'Aprano wrote:
>
>> On Mon, 16 Jul 2018 00:28:39 +0300, Marko Rauhamaa wrote:
>>
>> if your new system used Python3's UTF-32 strings as a foundation, that
>> would be an equally naïve misstep. You'd need to reach a notch higher
>> and use glyphs or other "semiotic atoms" as building blocks. UTF-32,
>> after all, is a variable-width encoding.
>
> Python's strings aren't UTF-32. They are sequences of abstract code
> points.
>
> UTF-32 is not a variable-width encoding.
>
> --
> Steven D'Aprano

Many consider that UTF-32 is a variable-width encoding because of the combining characters. It can take multiple ‘codepoints’ to define what should be a single ‘character’ for display.
Glyphs and graphemes [was Re: Cult-like behaviour]
On Mon, 16 Jul 2018 00:28:39 +0300, Marko Rauhamaa wrote:

> if your new system used Python3's UTF-32 strings as a foundation, that
> would be an equally naïve misstep. You'd need to reach a notch higher
> and use glyphs or other "semiotic atoms" as building blocks. UTF-32,
> after all, is a variable-width encoding.

Python's strings aren't UTF-32. They are sequences of abstract code points.

UTF-32 is not a variable-width encoding.

I don't know what *you* mean by "semiotic atoms", (possibly you mean graphemes?) but "glyphs" are the visual images of characters, and there's a virtual infinity of those for each character, differing in type-face, size, and style (roman, italic, bold, reverse-oblique, etc).

There is no evidence aside from your say-so that a programming language "need" support "glyphs" as a native data type, or even graphemes. For starters, such a system would be exceedingly complex: graphemes are both language and context dependent. English, for example, has around 250 distinct graphemes:

https://books.google.com.au/books? id=QrBQAmfXYooC=PT238=PT238=250 +graphemes=bl=abiymnQ5pq=eq3k06BkuGfpuGC6wKqPkCR_8Bw=en=X=HAdyUbfULpCnqwGRi4DYAg_esc=y

Certainly it would be utterly impractical for a programming language designer, knowing nothing but a few half-remembered jargon terms, to try to design a native string type that matched the grapheme rules for the hundreds of human languages around the world. Or even just for English.

Let third-party libraries blaze that trail first.

By no means is Unicode the last word in text processing. It might not even be the last word in native string types for programming languages. But it is a true international standard which provides a universal character set and a selection of useful algorithms able to be used as powerful building blocks for text-processing libraries.

Honestly Marko, your argument strikes me as akin to somebody who insists that because Python's float data type doesn't support full CAS (computer algebra system) and theorem prover, it's useless and a step backwards and we should abandon IEEE-754 float semantics and let users implement their own floating point maths using nothing but fixed 1-byte integers. A float, after all, is nothing but 8 bytes.

--
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing it everywhere." -- Jon Ronson
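To make the "sequences of abstract code points" claim at the top of this message concrete, here is a brief sketch for Python 3. The sample string and variable names are arbitrary choices for this example. Length and indexing work in code points, and a byte encoding only appears when one is requested explicitly.

```python
s = "a\U0001F600"   # "a" followed by U+1F600, a code point outside the BMP

# The str API counts and indexes code points, not bytes or code units.
length = len(s)
second = s[1]
code_point_values = [ord(c) for c in s]

# Encodings are a separate, explicit step, and each gives a different size.
encoded_sizes = {
    "utf-8": len(s.encode("utf-8")),
    "utf-16-le": len(s.encode("utf-16-le")),
    "utf-32-le": len(s.encode("utf-32-le")),
}
```

The same two-element string yields three different byte counts depending on the encoding chosen, while `len(s)` and `s[1]` do not change.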