date:20170715

Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Chris Angelico

On Sun, Jul 16, 2017 at 3:37 PM, Steven D'Aprano  wrote:
> On Sun, 16 Jul 2017 11:32:16 +1000, Chris Angelico wrote:
>
>> Exactly. That's my point. Even in a monospaced font, U+200B is a
>> character, yet it is by rule a zero-width character. So even in a
>> monospaced font, some characters must vary in width.
>
> In a *well-designed* *bug-free* monospaced font, all code points should
> be either zero-width or one column wide. Or two columns, if the font
> supports East Asian fullwidth characters.
>
> In practice, no single font is going to cover the entire range of
> Unicode. So one might hope for a *well-designed* *bug-free* FAMILY of
> monospaced fonts which, between them, cover the entire range, and agree
> on the width of a column.

Hmm, I'm not sure about that. A font can be monospaced for the most
part, yet respect multiple different "width groups" (eg East Asian
characters all get one width, while Latin-family characters all get a
different width). However, even in the idealized form you describe,
you still have to cope with zero-width characters (do they get zero or
do they get one column?), and characters that join together (Arabic
and Korean (Hangul)).

I think the Liberation Sans Mono font (family??) does a pretty good
job of making most text columnate well (for instance, the narrow
spaces (thin, half, third, etc) all expand to a full space), while not
getting too het up about everything being exactly the same number of
pixels. If monospacing is, as you say, a compromise, at least Lib Sans
Mono has picked a good compromise.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Chris Angelico

On Sun, Jul 16, 2017 at 2:25 PM, Rick Johnson
 wrote:
> But the two "realms" and two "character types" are but only a
> small sample of the syntactical complexity of Python
> strings. For we haven't even discussed the many types of
> string literals that Python defines. Some include:
>
> (1) "Normal Strings"
> (2) r"Raw Strings"
> (3) b"Byte Strings"
> (4) u"Unicode Strings"
> (5) ru"Raw Unicode"
> (6) ur'Unicode "that is _raw_"'
> (7) f"Format literals"
> ...
>
> Whew!

There are only two types of *string objects* in Python: Unicode
strings and byte strings. All the above are just ways of encoding
those in your source code. That's all. (And f-strings aren't really
strings, but expressions.)
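
A small sketch of that point, assuming Python 3.6 or later (needed for the f-string); the dictionary keys and the `name` variable are only illustrative:

```python
# Every literal prefix yields either a str (Unicode) or a bytes object.
name = "world"

samples = {
    "normal": "text",
    "raw": r"C:\temp",
    "unicode": u"text",
    "bytes": b"text",
    "raw bytes": rb"\x00",
    "f-string": f"hello {name}",  # evaluated as an expression at run time
}

types_seen = {type(value) for value in samples.values()}
print(types_seen)
```

Only `str` and `bytes` show up in the set, however many prefixes are used.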

There is only one type of *integer object* in Python, yet there are
many forms of literal:

* decimal - 1234
* octal - 0o2322
* hexadecimal - 0x4d2
* binary - 0b10011010010
* the above, with separation - 1_234, 0b100_1101_0010, etc
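
The same check for the integer forms listed above, as a quick sketch (the underscore separators assume Python 3.6 or later):

```python
# Six spellings in source code, one int value at run time.
values = [1234, 0o2322, 0x4d2, 0b10011010010, 1_234, 0b100_1101_0010]

print(all(type(v) is int for v in values))  # True
print(set(values))                          # {1234}
```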

None of this has anything to do with the current discussion.
*ANYTHING*. Please do not introduce red herrings.

> Chris was arguing that zero width spaces should not be
> counted as characters when the `len()` function is applied
> to the string, for which i disagree on the basis of
> consistency. My first reaction is: "Why would you inject a
> char into a string -- even a zero-width char! -- and then
> expect that the char should not affect the length of the
> string as returned by `len`?"

Did you read my emails? I was never arguing that.

> Being that strings (on the highest level) are merely linear
> arrays of chars, such an assumption defies all logic.
> Furthermore, the length of a string (in chars) and the
> "perceived" length of a string (when rendered on a screen,
> or printed on paper), are in no way relevant to one another.

"chars" meaning what? We still don't have any definition of
"character" here. In Python, strings are arrays of code points.

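For illustration, a minimal sketch of what "arrays of code points" means in practice (the sample string is made up for the example):

```python
text = "a\u200bb"   # 'a', ZERO WIDTH SPACE, 'b'

print(len(text))                            # 3
print([f"U+{ord(c):04X}" for c in text])    # ['U+0061', 'U+200B', 'U+0062']
print(len("\u200b" * 3))                    # 3
```

`len()` counts code points, whether or not they occupy any space when rendered.
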
> [1] Of course, even in the realms of ASCII, there are chars
> that cannot be inserted by the programmer _simply_ by
> pressing a single key on the keyboard. But most of these
> chars were useless anyways. So we will ignore this small
> detail for now. One point to mention is that Unicode
> greatly increased the number of useless chars.

Define "useless".

ChrisA

Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Steven D'Aprano

On Sun, 16 Jul 2017 12:33:10 +1000, Ben Finney wrote:

> And yet the ASCII and Unicode standard says code point 0x0A (U+000A LINE
> FEED) is a character, by definition.
[...]
> > Is an acute accent a character?
> 
> Yes, according to Unicode. ‘´’ (U+0301 ACUTE ACCENT) is a character.

Do you have references for those claims?

Because I'm pretty sure that Unicode is very, very careful to never use 
the word "character" in a formal or normative manner, only as an informal 
term for "the kinds of things that regular folk consider letters or 
characters or similar".

And I don't think regular folks would know what a line feed was if it 
jumped out of their computer and bit them :-) They would know what an 
accent is, and I doubt they would consider an accent not on a base letter 
to be a character. (I know I don't.)

-- 
Steve

Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Chris Angelico

On Sun, Jul 16, 2017 at 2:33 PM, Rustom Mody  wrote:
> On Sunday, July 16, 2017 at 4:09:16 AM UTC+5:30, Mikhail V wrote:
>> On Sat, 15 Jul 2017 05:50 pm, Marko Rauhamaa wrote:
>> > Random access to code points is as uninteresting as random access to
>> > UTF-8 bytes.
>> > I might want random access to the "Grapheme clusters, a.k.a.real
>> > characters".
>>
>> What _real_ characters are you referring to?
>> If your data has "á" (U00E1), then it is one real character,
>> if you have "a" (U0061) and "ˊ" (U02CA) then it is _two_
>> real characters. So in both cases you have access to code points =
>> real characters.
>
> Right now in an adjacent mailing list (debian) I see someone signed off with a
>
> grüß
>
> I guess the third character is a u with some ‘dirt’
> What's the fourth?

It's a "sharp S".

Tell me, is "å" an a with some 'dirt', or is it a separate character?
Is "i" an ı with some dirt, or a separate letter? Oh wait, you
probably think that "i" is a letter, and "ı" is the same letter but
with some dirt missing. What about "p"? Is that just "d" written the
wrong way up? At what point does something merit being called a
different letter?

ALL of these are unique characters. If you look up the alphabetization
rules for Norwegian, Turkish, and English, you'll see that "å" is not
"a", that "ı" is not "i", and that "p" is not "d".

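As a rough illustration, Python's `unicodedata` module shows that each of these is its own code point with its own name. Whether a code point happens to have a canonical decomposition is an encoding detail; it does not settle how a given language alphabetizes the letter.

```python
import unicodedata

# å, ı, ß and p, written as escapes so the source stays ASCII.
for ch in "\u00e5\u0131\u00dfp":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# å has a canonical decomposition (a + COMBINING RING ABOVE) ...
print(unicodedata.normalize("NFD", "\u00e5") == "a\u030a")   # True
# ... while dotless ı and sharp s do not decompose at all.
print(unicodedata.normalize("NFD", "\u0131") == "\u0131")    # True
print(unicodedata.normalize("NFD", "\u00df") == "\u00df")    # True
```
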
ChrisA

Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Steven D'Aprano

On Sun, 16 Jul 2017 11:32:16 +1000, Chris Angelico wrote:

> On Sun, Jul 16, 2017 at 11:20 AM, Rick Johnson
>  wrote:
>> On Saturday, July 15, 2017 at 7:29:14 PM UTC-5, Chris Angelico wrote:
>>> [...] Also, that doesn't deal with U+200B or U+180E, which have
>>> well-defined widths *smaller* than typical Latin letters. (200B is a
>>> zero-width space. Is it a character?)
>>
>> Of *COURSE* it's a character.
>>
>> Would you also consider 0 not to be a number?
>>
>> Sheesh!
> 
> Exactly. That's my point. Even in a monospaced font, U+200B is a
> character, yet it is by rule a zero-width character. So even in a
> monospaced font, some characters must vary in width.

In a *well-designed* *bug-free* monospaced font, all code points should 
be either zero-width or one column wide. Or two columns, if the font 
supports East Asian fullwidth characters.

In practice, no single font is going to cover the entire range of 
Unicode. So one might hope for a *well-designed* *bug-free* FAMILY of 
monospaced fonts which, between them, cover the entire range, and agree 
on the width of a column.

But even in this best of all possible situations, you can't make everyone 
happy, because there exist *thin spaces* which should render as a 
fraction of the width of a regular space. But a monospaced font can't do 
that: it either makes the thin space zero-width, or a full column.

Monospace is by its very nature a compromise on the "natural" width of 
the characters. A sometimes *useful* compromise, but it cannot solve all 
problems.
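
A rough sketch of the zero/one/two column rule, using only the standard `unicodedata` module. The functions `column_width` and `text_width` are hypothetical helpers written for this example, and the rule is a heuristic rather than anything a particular font or terminal is known to implement:

```python
import unicodedata


def column_width(ch):
    """Rough column count for one code point (heuristic, not a standard)."""
    # Nonspacing marks, enclosing marks and format characters take no column.
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # East Asian Wide and Fullwidth characters take two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    # Everything else, including the thin space, gets a full column.
    return 1


def text_width(text):
    return sum(column_width(ch) for ch in text)


print(text_width("abc"))            # 3
print(text_width("\u5fd8\u6389"))   # 4 (two CJK ideographs)
print(column_width("\u200b"))       # 0 (ZERO WIDTH SPACE)
print(column_width("\u2009"))       # 1 (THIN SPACE)
```

The thin space comes out as a full column here, which is exactly the compromise described above.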

-- 
Steve

Re: "Edit with IDLE" doesn't work any more ?

2017-07-15 Thread Rick Johnson

On Friday, April 28, 2017 at 8:23:43 AM UTC-5, Peter Otten wrote:
> Stefan Ram wrote:
> 
> > Peter Otten <__pete...@web.de> writes:
> >>one of the modules in Python's standard library IDLE will try to run with
> >>your module rather than the one it actually needs. Common candidates are
> >>code.py or string.py, but there are many more.
> > 
> >   I know this from Java:
> > 
> >   When you write a program
> > 
> > ... main( final String[] args ) ...
> > 
> >   and then create a file »String.class« in the program's
> >   directory, the program usually will not work anymore.
> > 
> >   However, in Java one can use an absolute path as in,
> > 
> > ... main( final java.lang.String[] args ) ...
> > 
> >   , in which case the program will still work in the
> >   presence of such a »String.class« file.
> > 
> >   I wonder whether Python also might have such a kind
> >   of robust "absolute addressing" of a module.
> 
> While I would welcome such a "reverse netloc" scheme or at least a "std" 
> toplevel package that guarantees imports from the standard library I fear 
> the pain is not yet big enough ;)

The pain will only get more intense with time. This is an issue that Python3 
should have solved when it broke so much backwards compatibility. Better to 
break it all at once; than again, and again, and again.
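
Until something like that exists, one way to spot the problem is to ask the import system where a module would come from. A minimal sketch, assuming a conventional CPython layout; `shadows_stdlib` is a hypothetical helper, not part of IDLE or the standard library, and it is only meaningful for names that belong to the standard library:

```python
import importlib.util
import os
import sysconfig


def shadows_stdlib(module_name):
    """Return True if `module_name` resolves to a file outside the stdlib directory."""
    spec = importlib.util.find_spec(module_name)
    if spec is None or not spec.has_location:
        return False  # missing, built-in or frozen: no file on disk involved
    stdlib_dir = os.path.normcase(os.path.realpath(sysconfig.get_paths()["stdlib"]))
    origin = os.path.normcase(os.path.realpath(spec.origin))
    return os.path.commonpath([stdlib_dir, origin]) != stdlib_dir


for name in ("code", "string"):
    print(name, "shadowed" if shadows_stdlib(name) else "ok")
```

Run from the directory that holds the script IDLE is asked to open, this reports "shadowed" when a local code.py or string.py is picked up first.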

Re: is @ operator popular now?

2017-07-15 Thread oyster

Sorry, I mean "PEP 465 - A dedicated infix operator for matrix
multiplication" on
https://docs.python.org/3/whatsnew/3.5.html#whatsnew-pep-465

2017-07-15 20:05 GMT+08:00 Matt Wheeler :
> On Sat, 15 Jul 2017, 12:35 oyster,  wrote:
>>
>> as the title says. has @ been used in projects?
>
>
> Strictly speaking, @ is not an operator.
> It delimits a decorator statement (in python statements and operations are
> not the same thing).
> However, to answer the question you actually asked, yes, all the time.
>
> For specific examples, see:
> pytest's fixtures
> contextlib.contextmanager (makes creating context managers much simpler in
> most cases)
> @property @classmethod etc. etc. (I sometimes see these used a bit too
> freely, when a plain attribute or a function at the module level would be
> more appropriate)
>
> --
>
> --
> Matt Wheeler
> http://funkyh.at

Decorating coroutines

2017-07-15 Thread Michele Simionato

I have just released version 4.1.1 of the decorator module. The new feature is 
that it is possible to decorate coroutines. Here is an example of how
to define a decorator `log_start_stop` that can be used to trace coroutines:

$ cat x.py
import time
import logging
from asyncio import get_event_loop, sleep, wait
from decorator import decorator


@decorator
async def log_start_stop(coro, *args, **kwargs):
    logging.info('Starting %s%s', coro.__name__, args)
    t0 = time.time()
    await coro(*args, **kwargs)
    dt = time.time() - t0
    logging.info('Ending %s%s after %d seconds', coro.__name__, args, dt)


@log_start_stop
async def task(n):  # a do nothing task
    for i in range(n):
        await sleep(1)

if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    tasks = [task(3), task(2), task(1)]
    get_event_loop().run_until_complete(wait(tasks))

This will print something like this:

~$ python3 x.py
INFO:root:Starting task(1,)
INFO:root:Starting task(3,)
INFO:root:Starting task(2,)
INFO:root:Ending task(1,) after 1 seconds
INFO:root:Ending task(2,) after 2 seconds
INFO:root:Ending task(3,) after 3 seconds
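
For comparison, here is a sketch of a similar tracer written with only the standard library. It is not how the decorator module is implemented; it assumes Python 3.7+ for `asyncio.run`, uses much shorter sleeps, and, unlike the example above, passes the coroutine's return value through:

```python
import asyncio
import functools
import logging
import time


def log_start_stop(coro_func):
    """Hand-written stand-in for the decorator-module version above."""
    @functools.wraps(coro_func)
    async def wrapper(*args, **kwargs):
        logging.info('Starting %s%s', coro_func.__name__, args)
        t0 = time.time()
        result = await coro_func(*args, **kwargs)
        dt = time.time() - t0
        logging.info('Ending %s%s after %d seconds', coro_func.__name__, args, dt)
        return result
    return wrapper


@log_start_stop
async def task(n):  # a do nothing task
    for _ in range(n):
        await asyncio.sleep(0.01)
    return n


async def main():
    # gather() returns the results in the order the awaitables were passed.
    return await asyncio.gather(task(3), task(2), task(1))


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    print(asyncio.run(main()))
```

The wrapper is itself an `async def`, so the decorated `task` is still a coroutine function and can be awaited as before.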

The trouble is that at work I am forced to maintain compatibility with Python 
2.7, so I do not have significant code using coroutines. If there are people 
out there who use a lot of coroutines and would like to decorate them, I 
invite you to try out the decorator module and give me some feedback if you 
find errors or strange behaviors. I am not aware of any issues, but one is 
never sure with new features.

Thanks for your help,

 Michele Simionato

Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Rick Johnson

On Saturday, July 15, 2017 at 9:33:49 PM UTC-5, Ben Finney wrote:
> MRAB  writes:

[...]

> > Is linefeed a character? You might call it a "control
> > character", but it's not really a _character_, it's
> > control/format _code_.
> 
> And yet the ASCII and Unicode standard says code point 0x0A
> (U+000A LINE FEED) is a character, by definition.  Rather
> than saying “no, it's not a character”, I think a more
> accurate statement would be: a linefeed *is* a character in
> ASCII, but that doesn't mean every other standard must
> agree.  Indeed it may be better to say: a line feed is a
> character and is also a control code.
> 
> > Is an acute accent a character?
> 
> Yes, according to Unicode. ‘´’ (U+0301 ACUTE ACCENT) is a
> character.
> 
> > No, it's a diacritic mark that's added to a character.
> 
> Lose the “no”, and I agree.

So you would be happy for a string containing a single
character that was _decorated_ with a single accent mark
(say, for instance, U+00E3 (Latin Small Letter A with
tilde)) to return a length value of 2? Really?
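
For reference, a short sketch of what Python actually reports for that example, using the standard `unicodedata` module:

```python
import unicodedata

single = "\u00e3"                                   # LATIN SMALL LETTER A WITH TILDE
decomposed = unicodedata.normalize("NFD", single)   # 'a' + COMBINING TILDE

print(len(single))                                          # 1
print(len(decomposed))                                      # 2
print(single == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == single)   # True
```

The length depends on which of the two encodings of the same visible letter the string holds.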

> It's entirely reasonable for a concept to fit in multiple
> categories simultaneously.

Reasonable? Perhaps...

Practical? No way!


Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Rustom Mody

On Sunday, July 16, 2017 at 4:09:16 AM UTC+5:30, Mikhail V wrote:
> On Sat, 15 Jul 2017 05:50 pm, Marko Rauhamaa wrote:
> > Random access to code points is as uninteresting as random access to
> > UTF-8 bytes.
> > I might want random access to the "Grapheme clusters, a.k.a.real
> > characters".
> 
> What _real_ characters are you referring to?
> If your data has "á" (U00E1), then it is one real character,
> if you have "a" (U0061) and "ˊ" (U02CA) then it is _two_
> real characters. So in both cases you have access to code points =
> real characters.

Right now in an adjacent mailing list (debian) I see someone signed off with a

grüß

I guess the third character is a u with some ‘dirt’
What's the fourth?

> 
> For metaphysical discussion -  in _my_ definition there

s/metaphysical/linguistic

> is no such "real" character as "á", since it is the "a" glyph with some dirt,
> so according to my definition, it should be two separate characters,
> both semantically and technically seen.
> 
> And, in my definition, the whole Unicode is a huge junkyard, to start with.
> 
> But opinions may vary, and in case you prefer or forced to write "á",
> then it can be impractical to store it as two characters, regardless of
> encoding.

Heck even in the English that I learnt in school we had
ægis, homœopath etc
And just now looking up:
https://en.wikipedia.org/wiki/List_of_words_that_may_be_spelled_with_a_ligature
I see economics is œconomics!!

Seriously, the word "ligature", like the word "grapheme", is misleading.
It's not a graphical or typographic notion; it's an atom of the language's lexicon.

No Hindi speaker seeing
क + ई = की
calls the last anything but a letter
And the vowel sign ी is never a first-class vowel
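
A small sketch of how that letter looks to Python, written with escapes so the code points are explicit:

```python
import unicodedata

ki = "\u0915\u0940"   # की: one letter to a Hindi reader

print(len(ki))        # 2
for ch in ki:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
```

The second code point is DEVANAGARI VOWEL SIGN II, general category Mc (a spacing combining mark). The standard library does not segment text into grapheme clusters; as far as I know the third-party `regex` module's `\X` pattern is one option for that.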

Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Rick Johnson

On Saturday, July 15, 2017 at 8:54:40 PM UTC-5, MRAB wrote:
> You need to be careful about the terminology.

You are correct. I admit I was a little loose with my
terms there.

> Is linefeed a character? 

Since LineFeed is the same as NewLine, then yes, IMO,
linefeed is a character.

> You might call [linefeed] a "control character", but it's
> not really a _character_, it's control/format _code_.

True. 

Allow me to try to define some concrete terms that we can
use.

In the old days, long before i was born, and even long
before i downloaded my first compiler (ah the memories!),
the concept of strings was so much simpler. Yep, back in
those days all you had was, basically, two discrete
subcomponents of a string: the "actual chars" and the "virtual
chars".

(Disambiguation)

The "actual chars"[1] are any chars that a programmer could
insert by pressing a single key on the keyboard, such as:
"1", "2", "3", "a", "b", "c", "!", "@", "#", etc.

The "virtual chars" -- or the "control codes" as you put it
(the ones that start with a "\") -- are the chars
that represent "structural elements" of the string (e.g. \n,
\t, etc.). But in reality, the implementation of strings
has complicated the idea of "virtual chars as solely structural
elements" of the display, by including such absurdities as:

(1) Sounds ("\a")
(2) Virtual interactions such as: BackSpace ("\b"),
CarriageReturn ("\r") and FormFeed ("\f")

intermixed with control codes that constitute _actual_
structural elements such as:

(1) LineFeed or NewLine ("\n")
(2) HorizontalTab ("\t")
(3) VerticalTab ("\v")

And a few other non-structural codes that allow embedding
delimiters or hex or octals. 

And furthermore, two distinct "realms", if i may, in which
a string can exist: the "virtual character realm" and the
"display realm".

(Disambiguation)

The "virtual character realm" is sort of like an operating
room where a doctor (aka: programmer) performs operations on
the patient (aka: string), or if you like, a castle where a
mad scientist builds a Unicode monster from a hodgepodge
of body parts he stole from local grave yards and is later
lynched by a mob of angry peasants for his perceived sins
against nature. But i digress...

Whereas the "display realm" is sort of like an awards
ceremony for celebrities, except here, strings take the
place of strung-out celebs and characters are dressed in the
over-hyped rags (aka: font) of an overpaid fashion designer.

But the two "realms" and two "character types" are but only a
small sample of the syntactical complexity of Python
strings. For we haven't even discussed the many types of
string literals that Python defines. Some include:

(1) "Normal Strings"
(2) r"Raw Strings"
(3) b"Byte Strings"
(4) u"Unicode Strings"
(5) ru"Raw Unicode"
(6) ur'Unicode "that is _raw_"'
(7) f"Format literals"
...

Whew!

IMO, I think the reason why the implementation of strings has
been such a tough nut to crack (Python3000 notwithstanding),
is due very much to what i call a "syntactical circus". 

> Is an acute accent a character? No, it's a diacritic mark
> that's added to a character.

And i agree. 

Chris was arguing that zero width spaces should not be
counted as characters when the `len()` function is applied
to the string, for which i disagree on the basis of
consistency. My first reaction is: "Why would you inject a
char into a string -- even a zero-width char! -- and then
expect that the char should not affect the length of the
string as returned by `len`?"

Being that strings (on the highest level) are merely linear
arrays of chars, such an assumption defies all logic.
Furthermore, the length of a string (in chars) and the
"perceived" length of a string (when rendered on a screen,
or printed on paper), are in no way relevant to one another.

When we, as programmers, are manipulating strings (slicing,
munging, concatenating, etc.) our only concern should be
that _every_ char is accessible, indexable, quantifiable and
will maintain its order. And whether or not a char will be
visible, when rendered on a screen or paper, is irrelevant to
these "programmer centric" operations. Rendering is the
domain of graphic designers, not software developers.

> When you're working with Unicode strings, you're not
> working with strings of characters as such, but with
> strings of 'codepoints', some of which are characters,
> others combining marks, yet others format codes, and so on.

Which is unfortunate for the programmer, who would like to
get things done without a viscous implementation mucking up
the gears.

[1] Of course, even in the realms of ASCII, there are chars
that cannot be inserted by the programmer _simply_ by
pressing a single key on the keyboard. But most of these
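chars were useless anyways. So we will ignore this small
detail for now. One point to mention is that Unicode
greatly increased the number of useless chars.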

Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Ben Finney

MRAB  writes:

> You need to be careful about the terminology.

Definitely agreed.
>
> Is linefeed a character? You might call it a "control character", but
> it's not really a _character_, it's control/format _code_.

And yet the ASCII and Unicode standard says code point 0x0A (U+000A LINE
FEED) is a character, by definition.

Rather than saying “no, it's not a character”, I think a more accurate
statement would be: a linefeed *is* a character in ASCII, but that
doesn't mean every other standard must agree.

Indeed it may be better to say: a line feed is a character and is also a
control code.

> Is an acute accent a character?

Yes, according to Unicode. ‘´’ (U+0301 ACUTE ACCENT) is a character.

> No, it's a diacritic mark that's added to a character.

Lose the “no”, and I agree.

The acute accent is a character and *also* is a diacritic mark that is
added to a character. Unicode categorises U+0301 as a character in the
categories “symbol” and “modifier”.
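
The names and general categories can be checked against the Unicode database that ships with Python. A brief sketch; note that in that database the name ACUTE ACCENT and the category Sk ("Symbol, modifier") belong to U+00B4, while U+0301 is COMBINING ACUTE ACCENT with category Mn ("Mark, nonspacing"):

```python
import unicodedata

for ch in ("\u00b4", "\u0301"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# LINE FEED is a control code point: category Cc.
print(unicodedata.category("\n"))   # Cc
```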

Note that those are not exclusive. It's entirely reasonable for a
concept to fit in multiple categories simultaneously.

What is being revealed in this discussion is the folly of insisting on
exclusive categories for everything, and that terms must have exactly
one meaning.

You are correct that we need to be clear which definition is being used.
But we cannot thereby say that other, different, definitions are
*necessarily* wrong. That is an extra claim that would need to be
demonstrated, and the mere fact of the difference is not sufficient.

-- 
 \  “It's dangerous to be right when the government is wrong.” |
  `\   —Francois Marie Arouet Voltaire |
_o__)  |
Ben Finney


Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Rick Johnson

On Saturday, July 15, 2017 at 7:55:46 PM UTC-5, Steve D'Aprano wrote:
> On Sun, 16 Jul 2017 12:31 am, Rick Johnson wrote:
> 
> > I never hear Chinese or eastern Europeans
> > bellyaching
> 
> Do you speak much to Chinese and Eastern Europeans who
> don't speak or write English? How would you know what they
> say?
> 
> "All toupées are bad. I've never seen a good one that looked real."
> 
> http://rationalwiki.org/wiki/Toupee_fallacy

A good retort!

But not airtight, i'm afraid.

Here, allow me to explain...

The implication of the Toupee Fallacy is that one cannot
ever discover a "good toupee", since "good toupees" would be
indistinguishable from _real_ hair. Which is true, however,
the Toupee Fallacy also applies inversely...

What i mean is that your implicit implication that i am
unable to discover "good toupees", and therefore unable to
quantify them,  also applies to your inability to prove that
"Good Toupees" even exist. Sure, we can _assume_ that "Good
Toupees" exist, but such a conjecture would never be
_scientific_. Therefore, the Toupee Fallacy is invalid as a
weapon of debate because it relies on the unproved premise
that "Good Toupees" even exist.

Isn't that ironic? 

Dontcha think?

[1] Save that the experimenter yanked on the hair of every
person she encountered, which of course is not polite, so we
will safely assume that such techniques, while arguably 100%
scientific, were not used during the "Toupee Fallacy Study".
Which incidentally is why the media never dubbed it the
"Toupee Terror Attacks". But i digress...

Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread MRAB


On 2017-07-16 02:20, Rick Johnson wrote:
> On Saturday, July 15, 2017 at 7:29:14 PM UTC-5, Chris Angelico wrote:
>> [...] Also, that doesn't deal with
>> U+200B or U+180E, which have well-defined widths *smaller* than
>> typical Latin letters. (200B is a zero-width space. Is it a
>> character?)
>
> Of *COURSE* it's a character.
>
> Would you also consider 0 not to be a number?
>
> Sheesh!


[snip]

You need to be careful about the terminology.

Is linefeed a character? You might call it a "control character", but 
it's not really a _character_, it's control/format _code_.


Is an acute accent a character? No, it's a diacritic mark that's added 
to a character.


When you're working with Unicode strings, you're not working with 
strings of characters as such, but with strings of 'codepoints', some of 
which are characters, others combining marks, yet others format codes, 
and so on.


Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Chris Angelico

On Sun, Jul 16, 2017 at 11:20 AM, Rick Johnson
 wrote:
> On Saturday, July 15, 2017 at 7:29:14 PM UTC-5, Chris Angelico wrote:
>> [...] Also, that doesn't deal with
>> U+200B or U+180E, which have well-defined widths *smaller* than
>> typical Latin letters. (200B is a zero-width space. Is it a
>> character?)
>
> Of *COURSE* it's a character.
>
> Would you also consider 0 not to be a number?
>
> Sheesh!

Exactly. That's my point. Even in a monospaced font, U+200B is a
character, yet it is by rule a zero-width character. So even in a
monospaced font, some characters must vary in width.

ChrisA

Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Rick Johnson

On Saturday, July 15, 2017 at 7:29:14 PM UTC-5, Chris Angelico wrote:
> [...] Also, that doesn't deal with
> U+200B or U+180E, which have well-defined widths *smaller* than
> typical Latin letters. (200B is a zero-width space. Is it a
> character?)

Of *COURSE* it's a character.

Would you also consider 0 not to be a number?

Sheesh! 

When I call the `len()` function on a string containing only
three "zero-width unicode chars", i want `len` to return the
integer 3 not 0! In what upside-down/inside-out universe
would you prefer that `len` lie to you and return 0? You
can't be serious...

Doth not a string containing three characters have a
length of 3? And if not, what other length could it have?

Doth not a knapsack containing 3 items have a quantity of 3?
And if not, what other quantity could it have?

You seem to want this fine group to believe that if the 3
items in the knapsack are _visible_ to the naked eye (say,
three apples), then they are relevant to the quantity. But
what if the three objects in the knapsack are, say,
radiowaves -- yep, three radiowaves bouncing around inside a
knapsack -- are we to believe that the knapsack is empty?
And if we are, then every scientist and mathematician since
antiquity shall be rolling over in their graves.

Furthermore, why should the storage API and the display API
give a monkey's toss about the other, when they are
obviously "two sides of a mountain". 

Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Steve D'Aprano

On Sun, 16 Jul 2017 12:31 am, Rick Johnson wrote:

> I never hear Chinese or eastern Europeans
> bellyaching

Do you speak much to Chinese and Eastern Europeans who don't speak or write
English? How would you know what they say?

"All toupées are bad. I've never seen a good one that looked real."

http://rationalwiki.org/wiki/Toupee_fallacy

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.


Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Chris Angelico

On Sun, Jul 16, 2017 at 9:50 AM, Gregory Ewing
 wrote:
> Chris Angelico wrote:
>>
>> Hold on, let me just grab my MUD
>> client, which is already using a fixed width font...
>>
>>
>> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>> 忘掉那　無形鎖
>> الثلج لا يشعرني بإكتئاب
>> הקור לא מפריע לי, לא חודר
>> U+1680 is " "
>> U+200B is ""
>> U+180E is "᠎"
>> 다 잊어 다 잊어
>>
>> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>
>
> I suspect that different lines in that example are actually
> being rendered in different fonts. Characters within the *same*
> monospaced font should have the same width (otherwise it's not
> really a monospaced font!), but there are no guarantees between
> different fonts.
>
> Perhaps the meta-problem here is that Unicode being so big has
> made it impractical to have a single font that encompasses all
> the characters you might ever want to render, so you often have
> to make do with a hodgepodge of fonts that don't play well
> together.

That could explain some of it. However, Chinese characters have a
well-defined space which is significantly wider than most monospaced
fonts would use for Latin characters, so it would look ugly for most
text in Western European languages. Also, that doesn't deal with
U+200B or U+180E, which have well-defined widths *smaller* than
typical Latin letters. (200B is a zero-width space. Is it a
character?) Hebrew text is rendered right-to-left, which makes
columnar alignment *very* interesting. Arabic text, in addition to
being RTL, is written in a joined/running style, so individual letters
aren't rendered the same way that an entire word is. And in the Korean
example, half the glyphs are represented as composed syllables (U+B2E4
HANGUL SYLLABLE DA) and half are decomposed letters (U+1103 HANGUL
CHOSEONG TIKEUT followed by U+1161 HANGUL JUNGSEONG A). These are not
combining characters - they are legitimate characters in their own
right. (At least, I can't find anything in the Unicode data files that
indicates that they aren't letters. I can use them individually in
Python identifiers, for instance.)
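
The Hangul case can be reproduced with a short sketch using the standard `unicodedata` module (code points written as escapes):

```python
import unicodedata

composed = "\ub2e4"      # HANGUL SYLLABLE DA
jamo = "\u1103\u1161"    # HANGUL CHOSEONG TIKEUT + HANGUL JUNGSEONG A

print(len(composed), len(jamo))                          # 1 2
print(unicodedata.normalize("NFC", jamo) == composed)    # True
print(unicodedata.normalize("NFD", composed) == jamo)    # True
print([unicodedata.category(ch) for ch in jamo])         # ['Lo', 'Lo']
print(all(ch.isidentifier() for ch in jamo))             # True
```

Both jamo are category Lo (Letter, other), yet normalization treats the pair and the precomposed syllable as equivalent.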

So even if someone were to create a single font with every Unicode
character represented, it couldn't actually give every character the
same width, because that would result in incorrect rendering for many
scripts.

ChrisA

Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Gregory Ewing


Chris Angelico wrote:
> Hold on, let me just grab my MUD
> client, which is already using a fixed width font...

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
忘掉那　無形鎖
الثلج لا يشعرني بإكتئاب
הקור לא מפריע לי, לא חודר
U+1680 is " "
U+200B is ""
U+180E is "᠎"
다 잊어 다 잊어
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-


I suspect that different lines in that example are actually
being rendered in different fonts. Characters within the *same*
monospaced font should have the same width (otherwise it's not
really a monospaced font!), but there are no guarantees between
different fonts.

Perhaps the meta-problem here is that Unicode being so big has
made it impractical to have a single font that encompasses all
the characters you might ever want to render, so you often have
to make do with a hodgepodge of fonts that don't play well
together.

--
Greg

Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Mikhail V

On Sat, 15 Jul 2017 05:50 pm, Marko Rauhamaa wrote:
> Random access to code points is as uninteresting as random access to
> UTF-8 bytes.
> I might want random access to the "Grapheme clusters, a.k.a.real
> characters".

What _real_ characters are you referring to?
If your data has "á" (U00E1), then it is one real character,
if you have "a" (U0061) and "ˊ" (U02CA) then it is _two_
real characters. So in both cases you have access to code points =
real characters.

For metaphysical discussion -  in _my_ definition there
is no such "real" character as "á", since it is the "a" glyph with some dirt,
so according to my definition, it should be two separate characters,
both semantically and technically seen.

And, in my definition, the whole Unicode is a huge junkyard, to start with.

But opinions may vary, and in case you prefer or are forced to write "á",
then it can be impractical to store it as two characters, regardless of
encoding.

Mikhail

Re: pyserial and end-of-line specification

2017-07-15 Thread Andre Müller

Just take a look into the documentation:
https://docs.python.org/3/library/io.html#io.TextIOWrapper

And in the example of Pyserial:
http://pyserial.readthedocs.io/en/latest/shortintro.html#eol

I think it should be:
sio = io.TextIOWrapper(io.BufferedRWPair(ser, ser),
newline='yourline_ending')

But the Python documentation says:
Warning BufferedRWPair does not attempt to synchronize accesses to its
underlying raw streams. You should not pass it the same object as reader
and writer; use BufferedRandom instead.


Maybe you should also try:

sio = io.TextIOWrapper(io.BufferedRandom(ser), newline='yourline_ending')

If it's readonly:
sio = io.TextIOWrapper(io.BufferedReader(ser), newline='yourline_ending')
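
The effect of the newline argument can be tried without any hardware. This is a minimal sketch, assuming an in-memory io.BytesIO stands in for the serial port (no real serial.Serial object is used) and assuming the device terminates lines with a bare carriage return. Note that newline only accepts None, '', '\n', '\r' and '\r\n'. Also, as far as I know, io.BufferedRandom expects a seekable stream, which a serial port normally is not, so that variant may not work on a real port.

```python
import io

# Stand-in for the serial port: three lines terminated by "\r" only.
fake_port = io.BytesIO(b"first\rsecond\rthird\r")

sio = io.TextIOWrapper(
    io.BufferedReader(fake_port),
    encoding="ascii",
    newline="\r",   # split input on "\r" only and return it untranslated
)

lines = list(sio)
stripped = [line.rstrip("\r") for line in lines]
```

With an explicit newline value the terminator is kept on each line, so it has to be stripped by hand if it is not wanted.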


I never tried it, but your question led me to take a look at these cool
features of the io module.

Greetings Andre

Re: is @ operator popular now?

2017-07-15 Thread Matt Wheeler

On Sat, 15 Jul 2017, 13:49 Christian Heimes,  wrote:

> @ is an actual operator in Python. It was added in Python 3.5 as infix
> matrix multiplication operator, e.g.
>
>m3 = m1 @ m2
>

TIL

> The operator is defined in PEP 465,
> https://www.python.org/dev/peps/pep-0465/


Perhaps it should also be listed at
https://docs.python.org/3.6/genindex-Symbols.html
-- 

--
Matt Wheeler
http://funkyh.at

Re: "Edit with IDLE" doesn't work any more ?

2017-07-15 Thread jonathan . blanck89

Am Freitag, 28. April 2017 14:48:22 UTC+2 schrieb Yip, Kin:
> Hi,
> 
> I've finally worked out why. By chance, I went to the installation
> directory: C:\Program Files\Python36\Lib\tkinter
> 
> to check on files. I did "EDIT with IDLE" on the files there. It all
> works! Then I went back to the directory where I keep all my personal
> .py code. It didn't work there. Finally, I guessed, and confirmed by
> testing, that "EDIT with IDLE" doesn't work in my python directory
> because I had recently created a file called:
> 
> tkinter.py
> 
> 
> Somehow, this stops "EDIT with IDLE" from working on any file in that
> directory/folder.
> 
> After I renamed it to mytkinter.py, things work normally now!
> 
> Weird! Don't know exactly why ...?!
> 
> Sorry to bother you guys ...
> 
> Kin

you da real MVP!
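
A likely explanation, offered as an assumption since the thread never confirms it: IDLE itself imports tkinter, and when it is launched from a folder that folder comes early on the module search path, so a local tkinter.py is found before the real package. The sketch below reproduces the same kind of shadowing with a harmless stdlib module (colorsys) and a temporary directory; the resolve helper is hypothetical.

```python
import importlib.machinery
import os
import sys
import tempfile

def resolve(module_name, search_path):
    """Return the file a module name would be loaded from on this search path."""
    spec = importlib.machinery.PathFinder.find_spec(module_name, search_path)
    return spec.origin if spec else None

with tempfile.TemporaryDirectory() as script_dir:
    # An accidental local file with the same name as a stdlib module.
    shadow = os.path.join(script_dir, "colorsys.py")
    with open(shadow, "w") as f:
        f.write("# accidental local file\n")

    # The script directory is searched before the standard library.
    found = resolve("colorsys", [script_dir] + sys.path)
    shadowed = os.path.samefile(found, shadow)
```

Renaming the local file, as described above, removes the name clash, which is consistent with the fix that worked.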

Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Steve D'Aprano

On Sun, 16 Jul 2017 12:01 am, Marko Rauhamaa wrote:

> It does seem to me UTF-8 is a better waiting position than strings.
> Strings give you more trouble while not truly solving any problems.


/face-palm

Okay, that's it, this conversation is over. You have no clue what you are
talking about.

http://rationalwiki.org/wiki/Not_even_wrong

http://rationalwiki.org/wiki/Category_mistake



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.


Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Chris Angelico

On Sun, Jul 16, 2017 at 12:08 AM, Rick Johnson
 wrote:
> On Friday, July 14, 2017 at 2:40:43 AM UTC-5, Chris Angelico wrote:
>> [...]
>> What is the length of a string? How often do you actually
>> care about the number of grapheme clusters - and not, for
>> example, about the pixel width? (To columnate text, for
>> instance, you need to know about its width in pixels or
>> millimeters, not the number of characters in the line.)
>
> Not in the case of a fixed width font!

Yes, of course. How silly of me. Hold on, let me just grab my MUD
client, which is already using a fixed width font...

Here's a piece of text, copied and pasted straight from the client.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
忘掉那　無形鎖
الثلج لا يشعرني بإكتئاب
הקור לא מפריע לי, לא חודר
U+1680 is " "
U+200B is ""
U+180E is "᠎"
다 잊어 다 잊어
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

And here's how it renders.

http://imgur.com/1xTT1s0

It's so easy! Monospaced fonts solve everything. Every single
character gets the exact same number of pixels of width, because
that's how the standard stipulates it.
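
To see why "one character, one column" does not hold even on a terminal grid, here is a rough sketch based on the standard unicodedata module. It is a crude approximation under stated assumptions (combining marks and format characters take no columns, East Asian wide and fullwidth characters take two), not what any particular terminal or font actually does; column_width is a hypothetical helper.

```python
import unicodedata

def column_width(text):
    """Rough estimate of the terminal columns occupied by text."""
    width = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue          # combining marks and format characters: no columns
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2        # wide and fullwidth characters: two columns
        else:
            width += 1
    return width

widths = {
    "latin": column_width("abc"),
    "cjk": column_width("\u5FD8\u6389\u90A3"),   # 忘掉那
    "zwsp": column_width("\u200B"),              # ZERO WIDTH SPACE
}
```

Three code points can thus occupy three, six or zero columns, and that is before joining scripts such as Arabic come into play.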

>> And if you're going to group code points together because
>> some of them are combining characters, would you also group
>> them together because there's a zero-width joiner in the
>> middle? The answer will sometimes be "yes of course" and
>> sometimes "of course not".
>
> Consistency is the key. And we must remember that he who
> assembled such inconsistent strings can only blame herself.

Except that it's the same string in different contexts. There is no
inconsistency in the string.

ChrisA

Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Chris Angelico

On Sun, Jul 16, 2017 at 12:01 AM, Marko Rauhamaa  wrote:
> Steve D'Aprano :
>
>> On Sat, 15 Jul 2017 05:50 pm, Marko Rauhamaa wrote:
>>> I might want random access to the "Grapheme clusters, a.k.a.real
>>> characters".
>>
>> That would be nice to have, but the truth is that for most coders,
>> Unicode code points are the low-hanging fruit that get you 95% of the
>> way, and for many applications that's "close enough".
>
> I think "close enough" is actually dangerous. We shouldn't encourage
> that practice.
>
>> Support for the Unicode grapheme breaking algorithm would get you
>> probably 90% of the rest of the way. And then some sort of
>> configurable system where defaults were based on the locale would
>> probably get you a fairly complete grapheme-based text library.

Okay. So here's your challenge: don't get "close enough", get perfect.
Divide the following strings into "characters" by your definition;
give me a list of one-character strings. Make sure you are perfect and
consistent. I'll start with an easy one.

1) "Giờ\u00A0ra\u00A0đi, một\u00A0mình\u00A0ta"
2) "לעזוב, לעזוב"
3) "اطلقي سرك"
4) "「別讓他們進來看見」"
5) "다 잊어 다 잊어"

Your locale, should this matter, is your choice of en_AU.utf8,
en_US.utf8, tr_TR.utf8, or sv_SE.utf8.

In case the information is lost in transmission, here are the same
strings, as sequences of codepoints.

1) U+0047 U+0069 U+1EDD U+00A0 U+0072 U+0061 U+00A0 U+0111 U+0069
U+002C U+0020 U+006D U+1ED9 U+0074 U+00A0 U+006D U+00EC U+006E U+0068
U+00A0 U+0074 U+0061
2) U+05DC U+05E2 U+05D6 U+05D5 U+05D1 U+002C U+0020 U+05DC U+05E2
U+05D6 U+05D5 U+05D1
3) U+0627 U+0637 U+0644 U+0642 U+064A U+0020 U+0633 U+0631 U+0643
4) U+300C U+5225 U+8B93 U+4ED6 U+5011 U+9032 U+4F86 U+770B U+898B U+300D
5) U+B2E4 U+0020 U+C78A U+C5B4 U+0020 U+1103 U+1161 U+0020 U+110B
U+1175 U+11BD U+110B U+1165

Once this is solved, you can propose adding an iteration function that
follows these rules. Probably to the unicodedata module, although it'd
most likely have to go via PyPI first.
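
String 5 already shows part of the difficulty: per the code point list above, its second half spells the same Korean syllables with conjoining jamo instead of precomposed syllables. A small sketch with the standard unicodedata module; the codepoints helper is made up for illustration.

```python
import unicodedata

def codepoints(text):
    """Format a string as a sequence of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# String 5 from the list above, written with escapes to survive transmission.
s5 = "\uB2E4 \uC78A\uC5B4 \u1103\u1161 \u110B\u1175\u11BD\u110B\u1165"

nfc = unicodedata.normalize("NFC", s5)

before = len(s5)    # code points as transmitted
after = len(nfc)    # code points after canonical composition
halves_match = nfc == "\uB2E4 \uC78A\uC5B4 \uB2E4 \uC78A\uC5B4"
```

The same visible text has 13 code points before NFC and 9 after, so "one-character strings" cut at code point boundaries differ between the two halves unless the input is normalized first.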

ChrisA

Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Rick Johnson

On Friday, July 14, 2017 at 12:43:50 PM UTC-5, Steve D'Aprano wrote:
> Before you answer, does your answer apply to Arabic and
> Thai as well as Western European languages?

I find it interesting that those who bellyache the loudest
about the "inclusivity of regional character encodings" never
dabble much outside their _own_ basic English set. For
instance: I never hear Chinese or eastern Europeans
bellyaching about how ASCII forced them to use a standard
keyboard and denied them the "gawd given right" to become an
amateur space cadet[1]! Nope, they just learn English and move
on.

> [...]
>
> As for the legacy encodings:
> 
> - they're not 7-bit clean, except for ASCII;
> 
> - some of them are variable-width;
> 
> - none of them support the full range of Unicode, so they
> aren't universal character sets;
> 
> - in other words, you either resign yourself to being
> unable to exchange documents with other people, resign
> yourself to dealing with moji-bake, or invent some complex
> and non-backwards-compatible in-band mechanism for
> switching charsets;
> 
> - they suffer from the exact same problems as Unicode
> regarding the distinction between code points and
> graphemes;
> 
> - so not only do they lack the advantages of Unicode, but
> they have even more disadvantages.

Thanks for finally admitting that Unicode is not the cure-all
that you Unicode cultists make it out to be.

[1] Possibly with the exception of Xan Lee. ;-). BTW, what
happened to the old chap?

Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Rick Johnson

On Friday, July 14, 2017 at 2:40:43 AM UTC-5, Chris Angelico wrote:
> [...]
> What is the length of a string? How often do you actually
> care about the number of grapheme clusters - and not, for
> example, about the pixel width? (To columnate text, for
> instance, you need to know about its width in pixels or
> millimeters, not the number of characters in the line.)

Not in the case of a fixed width font!

> And if you're going to group code points together because
> some of them are combining characters, would you also group
> them together because there's a zero-width joiner in the
> middle? The answer will sometimes be "yes of course" and
> sometimes "of course not".

Consistency is the key. And we must remember that he who
assembled such inconsistent strings can only blame herself.



Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Rick Johnson

On Friday, July 14, 2017 at 2:40:43 AM UTC-5, Chris Angelico wrote:
[...]
> IMO the Python str type is adequate as a core data type. What we may
> need, though, is additional utility functions, eg:
> 
> * unicodedata.grapheme_clusters(str) - split str into a sequence of
> grapheme clusters
> * pango.get_text_extents(str) - measure the pixel dimensions of a line of text
> * platform.punish_user() - issue a platform-dependent response (such
> as an electric shock, a whack with a 2x4, or a dropped anvil) on
> someone who has just misunderstood Unicode again
> * socket.punish_user() - as above, but to the user at the opposite end
> of a socket

Chris's violent nature is obviously due to him watching so
many Looney Tunes episodes that he believes an anvil to the
head causes no damage. This is not a cartoon, Chris!


Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Marko Rauhamaa

Steve D'Aprano :

> On Sat, 15 Jul 2017 05:50 pm, Marko Rauhamaa wrote:
>> I might want random access to the "Grapheme clusters, a.k.a.real
>> characters".
>
> That would be nice to have, but the truth is that for most coders,
> Unicode code points are the low-hanging fruit that get you 95% of the
> way, and for many applications that's "close enough".

I think "close enough" is actually dangerous. We shouldn't encourage
that practice.

> Support for the Unicode grapheme breaking algorithm would get you
> probably 90% of the rest of the way. And then some sort of
> configurable system where defaults were based on the locale would
> probably get you a fairly complete grapheme-based text library.

Yes, that kind of a text class would be useful.

> I'm interested in such a thing. That's why I pointed out the issue on
> the bug tracker, to try to garner interest in it. As far as I can
> tell, you seem to be more interested in cheap point scoring, digs
> against Unicode, and an insistence that UTF-8 is better than strings
> (which doesn't even make sense).

It does seem to me UTF-8 is a better waiting position than strings.
Strings give you more trouble while not truly solving any problems.


Marko

Re: is @ operator popular now?

2017-07-15 Thread Peter Otten

Chris Angelico wrote:

> On Sat, Jul 15, 2017 at 11:05 PM, Peter Otten <__pete...@web.de> wrote:
>> Matt Wheeler wrote:
>>
>>>> as the title says. has @ been used in projects?
>>
>> numpy, probably?
>>
>>> Strictly speaking, @ is not an operator.
>>
>> In other words it's not popular, not even widely known.
>>
>> Compare:
>>
>> $ python3.4 -c '__pete...@web.de'
>>   File "<string>", line 1
>>     __pete...@web.de
>>              ^
>> SyntaxError: invalid syntax
>> $ python3.5 -c '__pete...@web.de'
>> Traceback (most recent call last):
>>   File "<string>", line 1, in <module>
>> NameError: name '__peter__' is not defined
>>
>> Starting with 3.5 my email address is valid Python syntax. Now I'm
>> waiting for the __peter__ builtin ;)
> 
> And you'll have to 'import web' too.
> 
> I've no idea what 'web.de' would be and what happens when you matmul it by
> you.
> 
> ChrisA

This is getting more complex than expected. Here's a prototype:

import builtins

def __peter__():
    class Provider:
        def __init__(self, name):
            self.name = name
        def __getattr__(self, name):
            return Provider(f"{self.name}.{name}")
        def __rmatmul__(self, user):
            assert user.email.endswith("@" + self.name)
            return user

    class User:
        def __init__(self, email):
            self.email = email
            user, at, site = email.partition("@")
            name = site.partition(".")[0]
            setattr(builtins, name, Provider(name))
        def __repr__(self):
            return self.email

    return User("__pete...@web.de")

builtins.__peter__ = __peter__()

del __peter__

$ python3.7 -i web.py
>>> __pete...@web.de
__pete...@web.de

I'm sure you won't question the feature's usefulness after this. Future 
versions may send me an email or wipe your hard disk at my discretion...



Re: is @ operator popular now?

2017-07-15 Thread Chris Angelico

On Sat, Jul 15, 2017 at 11:05 PM, Peter Otten <__pete...@web.de> wrote:
> Matt Wheeler wrote:
>
>>> as the title says. has @ been used in projects?
>
> numpy, probably?
>
>> Strictly speaking, @ is not an operator.
>
> In other words it's not popular, not even widely known.
>
> Compare:
>
> $ python3.4 -c '__pete...@web.de'
>   File "<string>", line 1
>     __pete...@web.de
>              ^
> SyntaxError: invalid syntax
> $ python3.5 -c '__pete...@web.de'
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
> NameError: name '__peter__' is not defined
>
> Starting with 3.5 my email address is valid Python syntax. Now I'm waiting
> for the __peter__ builtin ;)

And you'll have to 'import web' too.

I've no idea what 'web.de' would be and what happens when you matmul it by you.

ChrisA

Re: is @ operator popular now?

2017-07-15 Thread Peter Otten

Matt Wheeler wrote:

>> as the title says. has @ been used in projects?

numpy, probably?

> Strictly speaking, @ is not an operator.

In other words it's not popular, not even widely known.

Compare:

$ python3.4 -c '__pete...@web.de'
  File "<string>", line 1
    __pete...@web.de
             ^
SyntaxError: invalid syntax
$ python3.5 -c '__pete...@web.de'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
NameError: name '__peter__' is not defined

Starting with 3.5 my email address is valid Python syntax. Now I'm waiting 
for the __peter__ builtin ;)


Re: is @ operator popular now?

2017-07-15 Thread Christian Heimes

On 2017-07-15 14:05, Matt Wheeler wrote:
> On Sat, 15 Jul 2017, 12:35 oyster,  wrote:
> 
>> as the title says. has @ been used in projects?
>>
> 
> Strictly speaking, @ is not an operator.
> It delimits a decorator statement (in python statements and operations are
> not the same thing).
> However, to answer the question you actually asked, yes, all the time.

@ is an actual operator in Python. It was added in Python 3.5 as infix
matrix multiplication operator, e.g.

   m3 = m1 @ m2

The operator is defined in PEP 465,
https://www.python.org/dev/peps/pep-0465/
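
For anyone without numpy at hand, the operator can be tried on a home-made class. This is only an illustrative sketch: the Matrix class below is hypothetical and is not how numpy implements it; it just shows that `a @ b` dispatches to `a.__matmul__(b)` on Python 3.5 and later.

```python
class Matrix:
    """Tiny list-of-rows matrix, only to illustrate __matmul__."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    def __matmul__(self, other):
        # Row-by-column products; zip(*rows) yields the columns of other.
        columns = list(zip(*other.rows))
        return Matrix(
            [[sum(a * b for a, b in zip(row, col)) for col in columns]
             for row in self.rows]
        )

m1 = Matrix([[1, 2], [3, 4]])
m2 = Matrix([[0, 1], [1, 0]])
m3 = m1 @ m2          # calls m1.__matmul__(m2)
```

Multiplying by this m2 swaps the columns of m1, which makes the result easy to check by eye.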

Christian


Re: is @ operator popular now?

2017-07-15 Thread Matt Wheeler

On Sat, 15 Jul 2017, 12:35 oyster,  wrote:

> as the title says. has @ been used in projects?
>

Strictly speaking, @ is not an operator.
It delimits a decorator statement (in python statements and operations are
not the same thing).
However, to answer the question you actually asked, yes, all the time.

For specific examples, see:
pytest's fixtures
contextlib.contextmanager (makes creating context managers much simpler in
most cases)
@property @classmethod etc. etc. (I sometimes see these used a bit too
freely, when a plain attribute or a function at the module level would be
more appropriate)
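
Two of the decorator uses mentioned above, as a short sketch. The names tagged and Circle are made up for illustration; only contextlib.contextmanager and property come from the standard library.

```python
from contextlib import contextmanager

@contextmanager
def tagged(log, name):
    """Record entry and exit of a with-block in the given list."""
    log.append(f"enter {name}")
    try:
        yield log
    finally:
        log.append(f"exit {name}")

class Circle:
    def __init__(self, radius):
        self.radius = radius

    @property
    def diameter(self):
        # Computed on access, read like a plain attribute.
        return 2 * self.radius

events = []
with tagged(events, "block"):
    events.append("body")

diameter = Circle(2).diameter
```

Here `@` introduces a decorator line, which is a different piece of syntax from the infix `@` operator.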

> --

--
Matt Wheeler
http://funkyh.at

is @ operator popular now?

2017-07-15 Thread oyster

as the title says. has @ been used in projects?

Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Steve D'Aprano

On Sat, 15 Jul 2017 05:50 pm, Marko Rauhamaa wrote:

> Steve D'Aprano :
> 
>> On Sat, 15 Jul 2017 04:10 am, Marko Rauhamaa wrote:
>>> Python3's strings don't give me any better random access than UTF-8.
>>
>> Say what? Of course they do.
>>
>> Python 3 strings (since 3.3) are a compact form of UTF-32. Without loss of
>> generality, we can say that each string is an array of four-byte code units.
> 
> Yes, and a UTF-8 byte array gives me random access to the UTF-8
> single-byte code units.

Which is irrelevant. Single code units in UTF-8 aren't important. Nobody needs
to start a slice in the middle byte of a three byte code point in UTF-8. It's
not a useful operation, and allowing slices to occur at arbitrary positions
inside UTF-8 sequences means you soon won't have valid UTF-8 any more.
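
A quick illustration of that point, as a sketch with made-up variable names: slicing a str always lands on code point boundaries, while slicing the UTF-8 bytes at the same index can cut a code point in half.

```python
text = "Gi\u1EDD ra"             # "Giờ ra"; U+1EDD takes three bytes in UTF-8
data = text.encode("utf-8")

# Slicing the str gives three whole code points.
first_three_chars = text[:3]

# Slicing the bytes at index 3 stops inside the encoding of U+1EDD.
try:
    data[:3].decode("utf-8")
    byte_slice_valid = True
except UnicodeDecodeError:
    byte_slice_valid = False
```

The byte slice is no longer valid UTF-8, whereas the str slice is still a perfectly good string.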

Now since I am interested in a good faith discussion, I can even point out
something that supports your argument: perhaps we could introduce restrictions
on where you can slice, and ensure that they only occur at code point
boundaries. So if you try to slice string[100:120], say, what you actually get
is string[98:119] because that's where the nearest code point boundaries fall.

Or should it move forward? string[101:122], say.

Perhaps the Zen of Python is better: when faced with ambiguity, avoid the
temptation to guess. We should either prohibit slicing anywhere except on a
code point boundary, or better still use a data structure that doesn't expose
the internal implementation of code points.
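
For the record, the "snap to the nearest boundary" idea is easy to sketch, which also shows the ambiguity: backward and forward snapping give different answers. The helpers snap_back and snap_forward are hypothetical; they rely only on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx.

```python
def snap_back(data, index):
    """Move index left until it no longer points at a continuation byte."""
    while 0 < index < len(data) and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index

def snap_forward(data, index):
    """Move index right until it no longer points at a continuation byte."""
    while index < len(data) and (data[index] & 0xC0) == 0x80:
        index += 1
    return index

data = "Gi\u1EDD ra".encode("utf-8")   # bytes 2..4 encode U+1EDD

back = snap_back(data, 3)        # index 3 is inside U+1EDD
forward = snap_forward(data, 3)
prefix = data[:back].decode("utf-8")
```

The same requested index yields two different slices depending on the direction chosen, which is the guessing the Zen of Python warns against.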

Whichever way we go, it doesn't get us any closer to our ultimate aim, which is
a text data type based on graphemes rather than code points. All it does is
give us what Python's unicode strings already give us: code points.

So what does that extra complexity forced on us by UTF-8 give us, apart from a
headache? Why use UTF-8?

> Neither gives me random access to the "Grapheme clusters, a.k.a.real
> characters". For example, the HFS+ file system uses a variant of
> NFD for filenames meaning both UTF-32 and UTF-8 give you random access
> to pure ASCII filenames only.

And they're not graphemes either. Normalisation doesn't give you graphemes.

It's ironic that you give the example of Apple using NFD, since that makes the
problem you are railing against *worse* rather than better. Decomposition has
its uses, but the specific problem this thread started with is made worse due
to decomposition.

>> UTF-8 is not: it is a variable-width encoding,
> 
> UTF-32 is a variable-width encoding as well.

No it isn't. All code points are exactly one four-byte code unit in size.

> For example, "baby: medium skin tone" is U+1F476 U+1F3FD:

That's two code points, not one. Emoji modifiers, like variation selectors,
present the same issues as combining characters.
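
The numbers are easy to check. A small sketch with made-up variable names: the sequence is two code points, and UTF-32 spends exactly four bytes on each, so the encoding itself stays fixed-width per code point.

```python
baby = "\U0001F476\U0001F3FD"    # BABY followed by a skin tone modifier

code_points = len(baby)
utf32_bytes = len(baby.encode("utf-32-le"))       # no BOM with the -le codec
utf16_units = len(baby.encode("utf-16-le")) // 2  # surrogate pairs
utf8_bytes = len(baby.encode("utf-8"))
```

What varies is how many code points make up one user-perceived character, and that is true in every encoding.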

>   <http://unicode.org/emoji/charts/full-emoji-list.html#1f476_1f3fd>
> 
>> Go ignores this problem by simply not offering random access to code
>> points in strings.
> 
> Random access to code points is as uninteresting as random access to
> UTF-8 bytes.

I have random access to code points in Python right now, and I use it all the
time to extract code points and even build up new strings from slices. I
wouldn't do that with UTF-8 bytes, it's too bloody hard.

> I might want random access to the "Grapheme clusters, a.k.a.real
> characters".

That would be nice to have, but the truth is that for most coders, Unicode code
points are the low-hanging fruit that get you 95% of the way, and for many
applications that's "close enough".

Support for the Unicode grapheme breaking algorithm would get you probably 90%
of the rest of the way. And then some sort of configurable system where
defaults were based on the locale would probably get you a fairly complete
grapheme-based text library.
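
To give an idea of the gap, here is a deliberately naive sketch. It is NOT the Unicode grapheme breaking algorithm (UAX #29); it only attaches characters whose general category starts with "M" (marks) to the preceding character. The function name rough_clusters is hypothetical.

```python
import unicodedata

def rough_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch      # a mark joins the previous cluster
        else:
            clusters.append(ch)     # anything else starts a new cluster
    return clusters

accented = rough_clusters("a\u0301b")              # a + COMBINING ACUTE, b
emoji = rough_clusters("\U0001F476\U0001F3FD")     # baby + skin tone modifier
```

The accented letter comes out as one cluster, but the emoji sequence is still split in two, because a skin tone modifier is not a mark. Cases like that (and Hangul jamo, ZWJ sequences, regional indicators) are what the real algorithm adds.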

I'm interested in such a thing. That's why I pointed out the issue on the bug
tracker, to try to garner interest in it. As far as I can tell, you seem to be
more interested in cheap point scoring, digs against Unicode, and an insistence
that UTF-8 is better than strings (which doesn't even make sense).

> As you have pointed out, that wish is impossible to grant 
> unambiguously.

I never said that. Just because it is *difficult*, and that no one answer will
satisfy everyone all of the time, doesn't mean we can't solve the problem.

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.


ANN: Python bytecode assembler, xasm

2017-07-15 Thread rocky

I may regret this, but there is a very alpha Python bytecode assembler. 
https://pypi.python.org/pypi/xasm

Re: Grapheme clusters, a.k.a.real characters

2017-07-15 Thread Marko Rauhamaa

Steve D'Aprano :

> On Sat, 15 Jul 2017 04:10 am, Marko Rauhamaa wrote:
>> Python3's strings don't give me any better random access than UTF-8.
>
> Say what? Of course they do.
>
> Python 3 strings (since 3.3) are a compact form of UTF-32. Without loss of
> generality, we can say that each string is an array of four-byte code units.

Yes, and a UTF-8 byte array gives me random access to the UTF-8
single-byte code units.

Neither gives me random access to the "Grapheme clusters, a.k.a.real
characters". For example, the HFS+ file system uses a variant of
NFD for filenames meaning both UTF-32 and UTF-8 give you random access
to pure ASCII filenames only.

> UTF-8 is not: it is a variable-width encoding,

UTF-32 is a variable-width encoding as well. For example, "baby: medium
skin tone" is U+1F476 U+1F3FD:

  <http://unicode.org/emoji/charts/full-emoji-list.html#1f476_1f3fd>

> Go ignores this problem by simply not offering random access to code
> points in strings.

Random access to code points is as uninteresting as random access to
UTF-8 bytes.

I might want random access to the "Grapheme clusters, a.k.a.real
characters". As you have pointed out, that wish is impossible to grant
unambiguously.

Marko

Re: Cannot access PySide.version!

2017-07-15 Thread Paulo da Silva

Às 07:55 de 15-07-2017, Paulo da Silva escreveu:
> Hi!
> 
> The problem:
> 
> import PySide
> print(PySide.__version__)
> 
> AttributeError: 'module' object has no attribute '__version__'
> 
> How can I fix this?
> 
> Other PySide examples seem to work fine!
> 
> Thanks for any help.
> 
> Further information:
> /usr/lib64/python3.4/site-packages/PySide contains only .so files
> 
> /usr/lib64/python3.4/site-packages/PySide-1.2 contains 2 files:
> __init__.py  _utils.py

Creating links to __init__.py  _utils.py in
/usr/lib64/python3.4/site-packages/PySide fixes the problem.
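
A plausible reason for the original error, stated as an assumption since it was not confirmed here: since Python 3.3 a directory without an __init__.py is still importable as a namespace package, and such a package has no __version__ because no __init__.py was ever executed. The sketch below reproduces that with a throwaway directory; the name demo_pkg_no_init is made up.

```python
import importlib
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as root:
    # A package directory with no __init__.py, like the PySide folder above.
    os.mkdir(os.path.join(root, "demo_pkg_no_init"))
    sys.path.insert(0, root)
    try:
        importlib.invalidate_caches()
        pkg = importlib.import_module("demo_pkg_no_init")
        has_version = hasattr(pkg, "__version__")
        file_attr = getattr(pkg, "__file__", None)   # no file backs the package
    finally:
        # Leave the interpreter state as it was found.
        sys.path.remove(root)
        sys.modules.pop("demo_pkg_no_init", None)
```

Checking `PySide.__file__` or `PySide.__path__` in the failing setup would show whether this is what happened. Linking __init__.py into the directory turns it back into a regular package.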


Cannot access PySide.version!

2017-07-15 Thread Paulo da Silva

Hi!

The problem:

import PySide
print(PySide.__version__)

AttributeError: 'module' object has no attribute '__version__'

How can I fix this?

Other PySide examples seem to work fine!

Thanks for any help.

Further information:
/usr/lib64/python3.4/site-packages/PySide contains only .so files

/usr/lib64/python3.4/site-packages/PySide-1.2 contains 2 files:
__init__.py  _utils.py

__init__.py first lines:
__all__ = ['QtCore', 'QtGui', 'QtNetwork', 'QtOpenGL', 'QtSql', 'QtSvg',
'QtTest', 'QtWebKit', 'QtScript']
__version__ = "1.2.4"
__version_info__= (1, 2, 4, "final", 0)