Re: Grapheme clusters, a.k.a.real characters

2017-07-21 Thread Steve D'Aprano
On Fri, 21 Jul 2017 06:05 pm, Chris Angelico wrote: >> But emoji sequences will often require four code points, three of which will >> be in the supplementary planes. >> >> http://unicode.org/emoji/charts/emoji-zwj-sequences.html > > "Often"? I doubt that; a lot of emoji don't require that many.

Re: Grapheme clusters, a.k.a.real characters

2017-07-21 Thread Chris Angelico
On Fri, Jul 21, 2017 at 4:34 PM, Steve D'Aprano wrote: > On Fri, 21 Jul 2017 01:43 pm, Chris Angelico wrote: > >> Strings with all code >> points on the BMP and no combining characters are still able to be >> represented as they are today, again with the empty

Re: Grapheme clusters, a.k.a.real characters

2017-07-21 Thread Steve D'Aprano
On Fri, 21 Jul 2017 01:43 pm, Chris Angelico wrote: > Strings with all code > points on the BMP and no combining characters are still able to be > represented as they are today, again with the empty secondary array. I presume that since the problem we're trying to solve here is that certain

Re: Grapheme clusters, a.k.a.real characters

2017-07-20 Thread Chris Angelico
On Fri, Jul 21, 2017 at 1:20 PM, Steve D'Aprano wrote: > On Fri, 21 Jul 2017 04:05 am, Marko Rauhamaa wrote: > >> If any string code point is 1114112 or greater > > By definition, no Unicode code point can ever have an ordinal value greater > than > 0x10FFFF =

Re: Grapheme clusters, a.k.a.real characters

2017-07-20 Thread Steve D'Aprano
On Fri, 21 Jul 2017 04:05 am, Marko Rauhamaa wrote: > If any string code point is 1114112 or greater By definition, no Unicode code point can ever have an ordinal value greater than 0x10FFFF = 1114111. So I don't know what you're talking about, but it isn't Unicode. If you want to invent your
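
The code point limit in the message above can be checked from Python itself. This is only a minimal sketch: `MAX_CODE_POINT` and `is_valid_code_point` are illustrative names that do not appear in the thread, and it assumes CPython 3.3 or later, where `sys.maxunicode` is always 0x10FFFF.

```python
import sys

# Largest Unicode code point: 0x10FFFF == 1114111 (illustrative constant name).
MAX_CODE_POINT = 0x10FFFF


def is_valid_code_point(n: int) -> bool:
    """Return True if n lies inside the Unicode code space 0..0x10FFFF."""
    return 0 <= n <= MAX_CODE_POINT


# chr() refuses anything past the end of the code space.
try:
    chr(MAX_CODE_POINT + 1)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

print(sys.maxunicode == MAX_CODE_POINT, out_of_range_rejected)
```

So 1114112 is the first integer that cannot be a code point, which is why a value "1114112 or greater" would need a representation outside Unicode.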

Re: Grapheme clusters, a.k.a.real characters

2017-07-20 Thread Marko Rauhamaa
Chris Angelico : > Actually, the implementation I detailed was far SIMPLER than I thought > it would be; I started writing that post trying to prove that it was > impossible, but it turns out it isn't actually impossible. Just highly > impractical. The existing str

Re: Grapheme clusters, a.k.a.real characters

2017-07-20 Thread Chris Angelico
On Fri, Jul 21, 2017 at 2:10 AM, Random832 wrote: > On Thu, Jul 20, 2017, at 01:15, Steven D'Aprano wrote: >> I haven't really been paying attention to Marko's suggestion in detail, >> but if we're talking about a whole new data type, how about a list of >> nodes, where

Re: Grapheme clusters, a.k.a.real characters

2017-07-20 Thread Chris Angelico
On Fri, Jul 21, 2017 at 2:46 AM, Rhodri James wrote: > On 20/07/17 16:18, Rustom Mody wrote: >> >> So coming to the point: >> Its not whether Einstein or Mencken¹ is right but rather that Mencken >> applies to >> 1 whereas Einstein applies to 3 >> >> And (IMHO) text should

Re: Grapheme clusters, a.k.a.real characters

2017-07-20 Thread Rhodri James
On 20/07/17 16:18, Rustom Mody wrote: So coming to the point: Its not whether Einstein or Mencken¹ is right but rather that Mencken applies to 1 whereas Einstein applies to 3 And (IMHO) text should be squarely classed in 3 not 1 The gmas of this world have made shopping lists, written (and

Re: Grapheme clusters, a.k.a.real characters

2017-07-20 Thread Random832
On Thu, Jul 20, 2017, at 01:15, Steven D'Aprano wrote: > I haven't really been paying attention to Marko's suggestion in detail, > but if we're talking about a whole new data type, how about a list of > nodes, where each node's data is a decomposed string object guaranteed to > be either: How

Re: Grapheme clusters, a.k.a.real characters

2017-07-20 Thread Rustom Mody
On Thursday, July 20, 2017 at 3:21:52 AM UTC+5:30, Rick Johnson wrote: > On Tuesday, July 18, 2017 at 10:07:41 PM UTC-5, Steve D'Aprano wrote: > > On Wed, 19 Jul 2017 12:10 am, Rustom Mody wrote: > > [...] > > > > Einstein: If you can't explain something to a six-year- > > > old, you really

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Steven D'Aprano
On Thu, 20 Jul 2017 12:40:08 +1000, Chris Angelico wrote: > On Thu, Jul 20, 2017 at 12:12 PM, Steve D'Aprano > wrote: >> On Thu, 20 Jul 2017 08:12 am, Gregory Ewing wrote: >> >>> Chris Angelico wrote: >> [snip overly complex and complicated string implementation] >>

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Chris Angelico
On Thu, Jul 20, 2017 at 12:12 PM, Steve D'Aprano wrote: > On Thu, 20 Jul 2017 08:12 am, Gregory Ewing wrote: > >> Chris Angelico wrote: > [snip overly complex and complicated string implementation] > An accurate description, but in my own defense, I had misunderstood

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Steve D'Aprano
On Thu, 20 Jul 2017 08:12 am, Gregory Ewing wrote: > Chris Angelico wrote: [snip overly complex and complicated string implementation] > +1. We should totally do this just to troll the RUE! You're an evil, wicked man, and I love it. -- Steve “Cheer up,” they said, “things could be worse.”

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Steve D'Aprano
On Thu, 20 Jul 2017 01:30 am, Random832 wrote: > On Tue, Jul 18, 2017, at 22:49, Steve D'Aprano wrote: >> > What about Emoji? >> > U+1F469 WOMAN is two columns wide on its own. >> > U+1F4BB PERSONAL COMPUTER is two columns wide on its own. >> > U+200D ZERO WIDTH JOINER is zero columns wide on its

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Steve D'Aprano
On Thu, 20 Jul 2017 04:34 am, Mikhail V wrote: > It is also pretty obvious that these Caps makes it harder to read in general. > (more obvious that excessive diacritics, like in French) No it isn't. -- Steve “Cheer up,” they said, “things could be worse.” So I cheered up, and sure enough,

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Ben Finney
Random832 writes: > On Tue, Jul 18, 2017, at 19:21, Gregory Ewing wrote: > > Random832 wrote: > > > What about Emoji? > > > U+1F469 WOMAN is two columns wide on its own. > > > U+1F4BB PERSONAL COMPUTER is two columns wide on its own. > > Emoji comes from Japanese 絵文字 -

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Rick Johnson
On Wednesday, July 19, 2017 at 5:29:23 AM UTC-5, Rhodri James wrote: > when Acorn were developing their version of extended ASCII > in the late 80s, they asked three different University > lecturers in Welsh what extra characters they needed, and > got three different answers. And who would have

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Rick Johnson
On Wednesday, July 19, 2017 at 1:57:47 AM UTC-5, Steven D'Aprano wrote: > On Wed, 19 Jul 2017 17:51:49 +1200, Gregory Ewing wrote: > > > Chris Angelico wrote: > >> Once you NFC or NFD normalize both strings, identical strings will > >> generally have identical codepoints... You should then be

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Rick Johnson
On Tuesday, July 18, 2017 at 10:37:18 PM UTC-5, Steve D'Aprano wrote: > On Wed, 19 Jul 2017 10:34 am, Mikhail V wrote: > > > Ok, in this narrow context I can also agree. > > But in slightly wider context that phrase may sound almost like: > > "neither geometrical shape is better than the other as

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Rick Johnson
On Tuesday, July 18, 2017 at 7:35:13 PM UTC-5, Mikhail V wrote: > ChrisA wrote: > >On Wed, Jul 19, 2017 at 6:05 AM, Mikhail V wrote: > >> On 2017-07-18, Steve D'Aprano wrote: > > > > _Neither system is right or wrong, or better than the > > > > other._ > > > > > >

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Rick Johnson
On Tuesday, July 18, 2017 at 10:24:54 PM UTC-5, Steve D'Aprano wrote: > On Wed, 19 Jul 2017 10:08 am, Ben Finney wrote: > > > Gregory Ewing writes: > > > > > The term "emoji" is becoming rather strained these days. > > > The idea of "woman" and "personal computer"

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Gregory Ewing
Chris Angelico wrote: * Strings with all codepoints < 256 are represented as they currently are (one byte per char). There are no combining characters in the first 256 codepoints anyway. * Strings with all codepoints < 65536 and no combining characters, ditto (two bytes per char). * Strings with

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Gregory Ewing
Grant Edwards wrote: Maybe it was a mistaken spelling of 'fortuned'? Most likely. Interestingly, several sites claimed to be able to tell me things about it. One of them tried to find poetry related to it (didn't find any, though). Another one offered to show me how to pronounce it, and it

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Rick Johnson
On Tuesday, July 18, 2017 at 10:07:41 PM UTC-5, Steve D'Aprano wrote: > On Wed, 19 Jul 2017 12:10 am, Rustom Mody wrote: [...] > > Einstein: If you can't explain something to a six-year- > > old, you really don't understand it yourself. > > > > [...] > > Think about it: it simply is nonsense. If

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Terry Reedy
On 7/19/2017 4:28 AM, Steven D'Aprano wrote: On Tue, 18 Jul 2017 10:11:39 -0400, Random832 wrote: On Fri, Jul 14, 2017, at 04:15, Marko Rauhamaa wrote: Consider, for example, a Python source code editor where you want to limit the length of the line based on the number of characters more

Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Mikhail V
Steven D'Aprano wrote: >On Wed, 19 Jul 2017 10:34 am, Mikhail V wrote: >> Ok, in this narrow context I can also agree. >> But in slightly wider context that phrase may sound almost like: >> "neither geometrical shape is better than the other as a basis >> for a wheel. If you have polygonal

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread MRAB
On 2017-07-19 09:29, Marko Rauhamaa wrote: Gregory Ewing : Marko Rauhamaa wrote: * a final "v" receives a superfluous "e" ("love") It's not superfluous there, it's preventing "love" from looking like it should rhyme with "of". I'm pretty sure that wasn't the

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Thomas Jollans
On 19/07/17 04:19, Rustom Mody wrote: > On Wednesday, July 19, 2017 at 3:00:21 AM UTC+5:30, Marko Rauhamaa wrote: >> Chris Angelico : >> >>> Let me give you one concrete example: the letter "ö". In English, it >>> is (very occasionally) used to indicate diaeresis, where a pair of >>> letters is

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Chris Angelico
On Thu, Jul 20, 2017 at 1:45 AM, Marko Rauhamaa wrote: > So let's assume we will expand str to accommodate the requirements of > grapheme clusters. > > All existing code would still produce only traditional strings. The only > way to introduce the new "super code points" is by

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Marko Rauhamaa
Chris Angelico : > Now, this is a performance question, and it's not unreasonable to talk > about semantics first and let performance wait for later. But when you > consider how many ASCII-only strings Python uses internally (the names > of basically every global function and

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Random832
On Tue, Jul 18, 2017, at 22:49, Steve D'Aprano wrote: > > What about Emoji? > > U+1F469 WOMAN is two columns wide on its own. > > U+1F4BB PERSONAL COMPUTER is two columns wide on its own. > > U+200D ZERO WIDTH JOINER is zero columns wide on its own. > > > What about them? In a monospaced font,
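
The three code points named in the message above can be inspected with the standard `unicodedata` module. This is a rough sketch, not anything posted in the thread; the variable names are made up for illustration, and the East Asian Width values reported depend on the Unicode database bundled with the running Python.

```python
import unicodedata

woman = "\U0001F469"     # U+1F469 WOMAN
zwj = "\u200D"           # U+200D ZERO WIDTH JOINER
computer = "\U0001F4BB"  # U+1F4BB PERSONAL COMPUTER

# One ZWJ sequence: a renderer with emoji support may draw it as a single
# glyph, but Python's str still sees three separate code points.
sequence = woman + zwj + computer
code_point_count = len(sequence)

for ch in sequence:
    # east_asian_width() is a per-code-point property; it says nothing about
    # how wide the joined sequence ends up when rendered.
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.east_asian_width(ch),
    )
```

Summing per-code-point widths therefore overestimates the columns needed whenever a terminal or font renders the sequence as one glyph, which is the difficulty the question points at.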

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Random832
On Tue, Jul 18, 2017, at 19:21, Gregory Ewing wrote: > Random832 wrote: > > What about Emoji? > > U+1F469 WOMAN is two columns wide on its own. > > U+1F4BB PERSONAL COMPUTER is two columns wide on its own. > > The term "emoji" is becoming rather strained these days. > The idea of "woman" and

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Chris Angelico
On Wed, Jul 19, 2017 at 11:42 PM, Marko Rauhamaa wrote: > Chris Angelico : > >> Perhaps we don't have the same understanding of "constant time". Or >> are you saying that you actually store and represent this as those >> arbitrary-precision integers? Every

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Marko Rauhamaa
Grant Edwards : > On 2017-07-19, Gregory Ewing wrote: >> Grant Edwards wrote: >>>vacuum, continuum, squush, fortuuned >> >> Fortuuned? Where did you find that? > > It was in the scowl-7.1 wordlist I had laying around: > >

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Grant Edwards
On 2017-07-19, Gregory Ewing wrote: > Grant Edwards wrote: >>vacuum, continuum, squush, fortuuned > > Fortuuned? Where did you find that? It was in the scowl-7.1 wordlist I had laying around: http://wordlist.aspell.net/ However, the scowl website now claims

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Marko Rauhamaa
Chris Angelico : > Perhaps we don't have the same understanding of "constant time". Or > are you saying that you actually store and represent this as those > arbitrary-precision integers? Every character of every string has to > be a multiprecision integer? Yes, although feel

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Chris Angelico
On Wed, Jul 19, 2017 at 10:13 PM, Marko Rauhamaa wrote: > Chris Angelico : > >> On Wed, Jul 19, 2017 at 7:53 PM, Marko Rauhamaa wrote: >>> Here's a proposal: >>> >>>* introduce a built-in (predefined) class Text >>> >>>* conceptually,

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Marko Rauhamaa
Chris Angelico : > On Wed, Jul 19, 2017 at 7:53 PM, Marko Rauhamaa wrote: >> Here's a proposal: >> >>* introduce a built-in (predefined) class Text >> >>* conceptually, a Text object is a sequence of "real" characters >> >>* you can access each

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Chris Angelico
On Wed, Jul 19, 2017 at 7:53 PM, Marko Rauhamaa wrote: > Here's a proposal: > >* introduce a built-in (predefined) class Text > >* conceptually, a Text object is a sequence of "real" characters > >* you can access each "real" character by its position in O(1) > >

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Rhodri James
On 19/07/17 09:17, Steven D'Aprano wrote: On Tue, 18 Jul 2017 16:37:37 +0100, Rhodri James wrote: (For the record, one of my grandmothers would have been baffled by this conversation, and the other one would have had definite opinions on whether accents were distinct characters or not,

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Marko Rauhamaa
Chris Angelico : > To be quite honest, I wouldn't care about that possibility. If I could > design regex semantics purely from an idealistic POV, I would say that > [xyzã], regardless of its encoding, will match any of the four > characters "x", "y", "z", "ã". > > Earlier I

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Steven D'Aprano
On Tue, 18 Jul 2017 10:11:39 -0400, Random832 wrote: > On Fri, Jul 14, 2017, at 04:15, Marko Rauhamaa wrote: >> Consider, for example, a Python source code >> editor where you want to limit the length of the line based on the >> number of characters more typically than based on the number of

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Steven D'Aprano
On Tue, 18 Jul 2017 16:37:37 +0100, Rhodri James wrote: > (For the record, one of my grandmothers would have been baffled by this > conversation, and the other one would have had definite opinions on > whether accents were distinct characters or not, followed by a > digression into whether "ŵ"

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Marko Rauhamaa
Gregory Ewing : > Marko Rauhamaa wrote: >> * a final "v" receives a superfluous "e" ("love") > > It's not superfluous there, it's preventing "love" from looking like > it should rhyme with "of". I'm pretty sure that wasn't the original motivation. If I had to guess,

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Marko Rauhamaa
Gregory Ewing : > Marko Rauhamaa wrote: >>> * the final consonant of a single-syllable word is doubled only if the >>> consonant is "k", "l" or "s" ("kick", "kill", "kiss") >> >> ... or "f" ("stiff") or "z" ("buzz") > > or sometimes "r" ("burr"), or "t" ("butt").

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Gregory Ewing
Marko Rauhamaa wrote: For all we know, someone somewhere might be cooking up a language that depends on "q̈". It makes perfectly good sense to me. It's the second derivative of q with respect to time. -- Greg -- https://mail.python.org/mailman/listinfo/python-list

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Gregory Ewing
Marko Rauhamaa wrote: * "v" is never doubled ("shovel") Except for all the words that Grant listed before. * a final "v" receives a superfluous "e" ("love") It's not superfluous there, it's preventing "love" from looking like it should rhyme with "of". (Of course you just have to know

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Chris Angelico
On Wed, Jul 19, 2017 at 4:49 PM, Steven D'Aprano wrote: > The *really* tricky part is if you receive a string from the user > intended as a regular expression. If they provide > > [xyzã] > > as part of a regex, and you receive ã in denormalized form > > U+0061 LATIN SMALL
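
The situation quoted above, a character class whose last member arrives in decomposed form, can be reproduced with the standard `re` and `unicodedata` modules. This is a sketch under assumptions: the helper `nfc` and the variable names are illustrative, and normalizing a whole pattern string is only shown as one possible approach, not as what anyone in the thread recommended for general regexes.

```python
import re
import unicodedata


def nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", text)


# The class as a user might supply it, with the last member spelled as
# U+0061 LATIN SMALL LETTER A followed by U+0303 COMBINING TILDE.
raw_pattern = "^[xyza\u0303]$"
subject = "\u00e3"  # precomposed U+00E3 LATIN SMALL LETTER A WITH TILDE

# Unnormalized, the class holds five code points (x, y, z, a, U+0303),
# and none of them is U+00E3.
raw_match = re.match(raw_pattern, subject)

# After NFC, the pair a + U+0303 inside the pattern composes to U+00E3.
normalized_match = re.match(nfc(raw_pattern), nfc(subject))

print(raw_match, normalized_match)
```

Normalizing both sides works here only because a precomposed form of the letter exists; it does not help for combinations that have no single code point.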

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Gregory Ewing
Marko Rauhamaa wrote: * the final consonant of a single-syllable word is doubled only if the consonant is "k", "l" or "s" ("kick", "kill", "kiss") ... or "f" ("stiff") or "z" ("buzz") or sometimes "r" ("burr"), or "t" ("butt"). -- Greg --

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Gregory Ewing
Grant Edwards wrote: vacuum, continuum, squush, fortuuned Fortuuned? Where did you find that? Google gives me a bizarre set of results, none of which appear to be an English dictionary definition. -- Greg -- https://mail.python.org/mailman/listinfo/python-list

Re: Grapheme clusters, a.k.a.real characters

2017-07-19 Thread Steven D'Aprano
On Wed, 19 Jul 2017 17:51:49 +1200, Gregory Ewing wrote: > Chris Angelico wrote: >> Once you NFC or NFD normalize both strings, identical strings will >> generally have identical codepoints... You should then be able to use >> normal regular expressions to match correctly. > > Except that if you

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Gregory Ewing
Chris Angelico wrote: Once you NFC or NFD normalize both strings, identical strings will generally have identical codepoints... You should then be able to use normal regular expressions to match correctly. Except that if you want to match a set of characters, you can't reliably use [...], you

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Steve D'Aprano
On Mon, 17 Jul 2017 04:12 am, Ben Finney wrote: > Steven D'Aprano writes: > >> On Sun, 16 Jul 2017 12:33:10 +1000, Ben Finney wrote: >> >> > And yet the ASCII and Unicode standard says code point 0x0A (U+000A >> > LINE FEED) is a character, by definition. >> [...] >> > > Is

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Steve D'Aprano
On Wed, 19 Jul 2017 10:34 am, Mikhail V wrote: > Ok, in this narrow context I can also agree. > But in slightly wider context that phrase may sound almost like: > "neither geometrical shape is better than the other as a basis > for a wheel. If you have polygonal wheels, they are still called

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Steve D'Aprano
On Wed, 19 Jul 2017 10:08 am, Ben Finney wrote: > Gregory Ewing writes: > >> The term "emoji" is becoming rather strained these days. >> The idea of "woman" and "personal computer" being emotions >> is an interesting one... > > I think of “emoji” as “not actually a

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Steve D'Aprano
On Wed, 19 Jul 2017 12:10 am, Rustom Mody wrote: > On Monday, July 17, 2017 at 10:14:00 PM UTC+5:30, Rhodri James wrote: >> On 17/07/17 05:10, Rustom Mody wrote: >> > Hint1: Ask your grandmother whether unicode's notion of character makes >> > sense. Ask 10 gmas from 10 language-L's >> > Hint2:

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Steve D'Aprano
On Wed, 19 Jul 2017 12:29 am, Random832 wrote: > On Sun, Jul 16, 2017, at 01:37, Steven D'Aprano wrote: >> In a *well-designed* *bug-free* monospaced font, all code points should >> be either zero-width or one column wide. Or two columns, if the font >> supports East Asian fullwidth characters. >

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Steve D'Aprano
On Tue, 18 Jul 2017 11:59 pm, Chris Angelico wrote: >> (I don't think any native English words use a double-V or double-U, but the >> possibility exists.) > > vacuum. Nice. Also continuum and residuum. For double V, we have savvy, skivvy, flivver (an old slang term for cars). -- Steve

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Rustom Mody
On Wednesday, July 19, 2017 at 3:00:21 AM UTC+5:30, Marko Rauhamaa wrote: > Chris Angelico : > > > Let me give you one concrete example: the letter "ö". In English, it > > is (very occasionally) used to indicate diaeresis, where a pair of > > letters is not a double letter - for example,

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Steve D'Aprano
On Wed, 19 Jul 2017 12:09 am, Random832 wrote: > On Fri, Jul 14, 2017, at 08:33, Chris Angelico wrote: >> What do you mean about regular expressions? You can use REs with >> normalized strings. And if you have any valid definition of "real >> character", you can use it equally on an

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Chris Angelico
On Wed, Jul 19, 2017 at 10:34 AM, Mikhail V wrote: > Ok, in this narrow context I can also agree. > But in slightly wider context that phrase may sound almost like: > "neither geometrical shape is better than the other as a basis > for a wheel. If you have polygonal wheels,

Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Mikhail V
ChrisA wrote: >On Wed, Jul 19, 2017 at 6:05 AM, Mikhail V wrote: >> On 2017-07-18, Steve D'Aprano wrote: >> >>> That's neither better nor worse than the system used by English and French, >>> where letters with diacritics are not distinct letters, but guides to >>>

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Ben Finney
Gregory Ewing writes: > The term "emoji" is becoming rather strained these days. > The idea of "woman" and "personal computer" being emotions > is an interesting one... I think of “emoji” as “not actually a character in any system anyone would use for writing

Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Mikhail V
Marko Rauhamaa wrote: >What did you think of my concrete examples, then? (Say, finding >"Alvárez" with the regular expression "Alv[aá]rez".) I think that should match both "Alvarez" and "Alvárez" ...? But firstly, I feel like I need to _guess_ what ideas you are presenting. Unless I open up Vim

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Gregory Ewing
Random832 wrote: What about Emoji? U+1F469 WOMAN is two columns wide on its own. U+1F4BB PERSONAL COMPUTER is two columns wide on its own. The term "emoji" is becoming rather strained these days. The idea of "woman" and "personal computer" being emotions is an interesting one... -- Greg --

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Gregory Ewing
Steve D'Aprano wrote: (I don't think any native English words use a double-V or double-U, but the possibility exists.) vacuum savvy (Vacuum is arguably Latin, but we've been using it for long enough that it's at least as English as most of the other words we use.) -- Greg --

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Anders Wegge Keller
På Tue, 18 Jul 2017 11:27:03 -0400 Dennis Lee Bieber skrev: > Probably would have to go to words predating the Roman occupation > (which probably means a dialect closer to Welsh or other Gaelic). > Everything later is an import (anglo-saxon being germanic tribes

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Marko Rauhamaa
Chris Angelico : > Let me give you one concrete example: the letter "ö". In English, it > is (very occasionally) used to indicate diaeresis, where a pair of > letters is not a double letter - for example, "coöperate". (You can > also hyphenate, "co-operate".) In German, it is

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Chris Angelico
On Wed, Jul 19, 2017 at 6:05 AM, Mikhail V wrote: > On 2017-07-18, Steve D'Aprano wrote: > >> That's neither better nor worse than the system used by English and French, >> where letters with diacritics are not distinct letters, but guides to

Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Mikhail V
On 2017-07-18, Steve D'Aprano wrote: > That's neither better nor worse than the system used by English and French, > where letters with diacritics are not distinct letters, but guides to > pronunciation. >_Neither system is right or wrong, or better than the

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Chris Angelico
On Wed, Jul 19, 2017 at 4:56 AM, Marko Rauhamaa wrote: > Chris Angelico : >> What I *think* you're asking for is for square brackets in a regex to >> count combining characters with their preceding base character. > > Yes. My example tries to match a single

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Marko Rauhamaa
Chris Angelico : > On Wed, Jul 19, 2017 at 4:31 AM, Marko Rauhamaa wrote: >> Chris Angelico : >> >>> On Wed, Jul 19, 2017 at 3:01 AM, Marko Rauhamaa wrote: Yes. Also, not every letter can be normalized to a single

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Chris Angelico
On Wed, Jul 19, 2017 at 4:31 AM, Marko Rauhamaa wrote: > Chris Angelico : > >> On Wed, Jul 19, 2017 at 3:01 AM, Marko Rauhamaa wrote: >>> Yes. Also, not every letter can be normalized to a single codepoint so >>> NFC is not a way out. For

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Marko Rauhamaa
Chris Angelico : > On Wed, Jul 19, 2017 at 3:01 AM, Marko Rauhamaa wrote: >> Yes. Also, not every letter can be normalized to a single codepoint so >> NFC is not a way out. For example, >> >> re.match("^[q̈]$", "q̈") >> >> returns None regardless of

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Chris Angelico
On Wed, Jul 19, 2017 at 3:01 AM, Marko Rauhamaa wrote: > Chris Angelico : > >> what you're more likely to want is "match the letter á", and you don't >> care whether it's represented as U+0061 U+0301 or as U+00E1. That's >> where Unicode normalization comes in.

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Grant Edwards
On 2017-07-18, Anders Wegge Keller wrote: > På Tue, 18 Jul 2017 23:59:33 +1000 > Chris Angelico skrev: >> On Tue, Jul 18, 2017 at 11:11 PM, Steve D'Aprano > > >>> (I don't think any native English words use a double-V or double-U, but >>> the possibility

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Rhodri James
On 18/07/17 17:03, Marko Rauhamaa wrote: Random832: As for double-v, a quick search through /usr/share/dict/words reveals "civvies", "divvy", "revved/revving", "savvy" and "skivvy", and various conjugations thereof. All following, more or less, the rule of using a

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Marko Rauhamaa
Marko Rauhamaa : > * the final consonant of a single-syllable word is doubled only if the >consonant is "k", "l" or "s" ("kick", "kill", "kiss") ... or "f" ("stiff") or "z" ("buzz") Marko -- https://mail.python.org/mailman/listinfo/python-list

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Marko Rauhamaa
Chris Angelico : > what you're more likely to want is "match the letter á", and you don't > care whether it's represented as U+0061 U+0301 or as U+00E1. That's > where Unicode normalization comes in. Yes. Also, not every letter can be normalized to a single codepoint so NFC is
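
Both halves of the exchange above can be checked with the standard `unicodedata` and `re` modules. This is a minimal sketch with illustrative variable names: the first part shows NFC folding the two spellings of "á" together, the second shows a letter (q with a diaeresis) that NFC cannot reduce to one code point.

```python
import re
import unicodedata

composed = "\u00e1"     # á as the single code point U+00E1
decomposed = "a\u0301"  # U+0061 followed by U+0301 COMBINING ACUTE ACCENT

# NFC maps both spellings to the same one-code-point string.
same_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

# There is no precomposed "q with diaeresis", so NFC leaves two code points.
q_diaeresis = "q\u0308"
nfc_q = unicodedata.normalize("NFC", q_diaeresis)

# A character class matches exactly one code point, so [q + U+0308] can
# never span the whole two-code-point string between ^ and $.
class_match = re.match("^[q\u0308]$", nfc_q)

print(same_after_nfc, len(nfc_q), class_match)
```

One possible workaround, stated here as an assumption rather than something proposed in the message, is to write the pattern as a sequence (`^q\u0308$`) instead of a class, at the cost of losing the set syntax.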

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Chris Angelico
On Wed, Jul 19, 2017 at 1:40 AM, Rhodri James wrote: > On 18/07/17 16:27, Dennis Lee Bieber wrote: >> >> On Tue, 18 Jul 2017 10:38:48 -0400, Random832 >> declaimed the following: >> >>> Define "native" then. My interpretation of "native English

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Chris Angelico
On Wed, Jul 19, 2017 at 12:09 AM, Random832 wrote: > On Fri, Jul 14, 2017, at 08:33, Chris Angelico wrote: >> What do you mean about regular expressions? You can use REs with >> normalized strings. And if you have any valid definition of "real >> character", you can use it

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Marko Rauhamaa
Random832 : > As for double-v, a quick search through /usr/share/dict/words reveals > "civvies", "divvy", "revved/revving", "savvy" and "skivvy", and > various conjugations thereof. All following, more or less, the rule of > using a double consonant after a short vowel in

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Rhodri James
On 18/07/17 16:27, Dennis Lee Bieber wrote: On Tue, 18 Jul 2017 10:38:48 -0400, Random832 declaimed the following: Define "native" then. My interpretation of "native English words" is "anything you wouldn't have to put in italics to use in a sentence". Which would also

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Rhodri James
On 18/07/17 15:10, Rustom Mody wrote: On Monday, July 17, 2017 at 10:14:00 PM UTC+5:30, Rhodri James wrote: On 17/07/17 05:10, Rustom Mody wrote: Hint1: Ask your grandmother whether unicode's notion of character makes sense. Ask 10 gmas from 10 language-L's Hint2: When in doubt gma usually is

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Grant Edwards
On 2017-07-18, Steve D'Aprano wrote: > (I don't think any native English words use a double-V or double-U, but the > possibility exists.) double-v: flivver, navvy, bivvy, bevvy, trivvet, divvy, skivvy, skivvies, etc. and various gerund and past tense verbs:

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Random832
On Tue, Jul 18, 2017, at 10:23, Anders Wegge Keller wrote: > På Tue, 18 Jul 2017 23:59:33 +1000 > Chris Angelico skrev: > > On Tue, Jul 18, 2017 at 11:11 PM, Steve D'Aprano > >> (I don't think any native English words use a double-V or double-U, but > >> the possibility exists.)

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Random832
On Sun, Jul 16, 2017, at 01:37, Steven D'Aprano wrote: > In a *well-designed* *bug-free* monospaced font, all code points should > be either zero-width or one column wide. Or two columns, if the font > supports East Asian fullwidth characters. What about Emoji? U+1F469 WOMAN is two columns wide

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Anders Wegge Keller
På Tue, 18 Jul 2017 23:59:33 +1000 Chris Angelico skrev: > On Tue, Jul 18, 2017 at 11:11 PM, Steve D'Aprano >> (I don't think any native English words use a double-V or double-U, but >> the possibility exists.) > vacuum. That's latin. -- //Wegge --

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Rustom Mody
On Monday, July 17, 2017 at 10:14:00 PM UTC+5:30, Rhodri James wrote: > On 17/07/17 05:10, Rustom Mody wrote: > > Hint1: Ask your grandmother whether unicode's notion of character makes > > sense. > > Ask 10 gmas from 10 language-L's > > Hint2: When in doubt gma usually is right > > "For every

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Random832
On Fri, Jul 14, 2017, at 04:15, Marko Rauhamaa wrote: > Consider, for example, a Python source code > editor where you want to limit the length of the line based on the > number of characters more typically than based on the number of pixels. Even there you need to go based on the width in

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Random832
On Fri, Jul 14, 2017, at 08:33, Chris Angelico wrote: > What do you mean about regular expressions? You can use REs with > normalized strings. And if you have any valid definition of "real > character", you can use it equally on an NFC-normalized or > NFD-normalized string than any other. They're

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Chris Angelico
On Tue, Jul 18, 2017 at 11:11 PM, Steve D'Aprano wrote: > On Tue, 18 Jul 2017 08:01 am, Mikhail V wrote: > >> And just in case still its not clear: this is not >> solved by adding dirt around the letter: if there is >> enough significance of the phoneme distinction

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Steve D'Aprano
On Tue, 18 Jul 2017 08:01 am, Mikhail V wrote: > And just in case still its not clear: this is not > solved by adding dirt around the letter: if there is > enough significance of the phoneme distinction then > one should add a distinct letter for a syntax in question. It isn't "dirt", any more

Re: Grapheme clusters, a.k.a.real characters

2017-07-18 Thread Marko Rauhamaa
Mikhail V : > And just in case still its not clear: this is not solved by adding > dirt around the letter: if there is enough significance of the phoneme > distinction then one should add a distinct letter for a syntax in > question. The letters of Finnish are:

Re: Grapheme clusters, a.k.a.real characters

2017-07-17 Thread Gregory Ewing
Steve D'Aprano wrote: I don't think that it is even a given that "atomic units of language" exist. To quote a Hindi speaker earlier in this thread, की is a letter, and yet it can be decomposed into की = क + ई, so it isn't "atomic". If letters aren't atomic, then what are? They're like

Grapheme clusters, a.k.a.real characters

2017-07-17 Thread Mikhail V
ChrisA wrote: >Yep! Nobody would take any notice of the fact that you just put dots >on all those letters. It's not like it's going to make any difference >to anything. We're not dealing with matters of life and death here. >Oh wait.

Re: Grapheme clusters, a.k.a.real characters

2017-07-17 Thread Rhodri James
On 17/07/17 05:10, Rustom Mody wrote: Hint1: Ask your grandmother whether unicode's notion of character makes sense. Ask 10 gmas from 10 language-L's Hint2: When in doubt gma usually is right "For every complex problem there is an answer that is clear, simple and wrong." (H.L. Mencken).

Re: Grapheme clusters, a.k.a.real characters

2017-07-17 Thread Chris Angelico
On Tue, Jul 18, 2017 at 1:36 AM, Steve D'Aprano wrote: > On Mon, 17 Jul 2017 02:10 pm, Rustom Mody wrote: >> Hint1: Ask your grandmother whether unicode's notion of character makes >> sense. > > What on earth makes you think that my grandmother is a valid judge of

Re: Grapheme clusters, a.k.a.real characters

2017-07-17 Thread Steve D'Aprano
On Mon, 17 Jul 2017 02:10 pm, Rustom Mody wrote: >> Please don't feed the trolls. > > Its usually called 'joke' Steven! Did the word fall out of your dictionary > in the last upgrade? > Rick was no more trolling than Marko Funny you say that. I often think Marko is trolling, but if he is, he
