[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

Ezio Melotti Sun, 02 Oct 2011 21:15:59 -0700

Ezio Melotti <ezio.melo...@gmail.com> added the comment:

> But it still has to happen at compile time, of course, so I don't know
> what you could do in Python.  Is there any way to change how the compiler
> behaves even vaguely along these lines?


I think things like "from __future__ import ..." do something similar, but I'm 
not sure it will work in this case (also because you will have to provide the 
list of aliases somehow).
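
To illustrate what I mean by "something similar": a __future__ feature is 
really a flag handed to the compiler.  The sketch below uses the 
"annotations" feature, which is an assumption on my part (it needs a much 
newer Python than the one discussed here) and has nothing to do with \N{...}; 
it only shows a compile-time switch changing what the compiler emits.

```python
import __future__

# Source that mentions a name which is never defined.
source = "def f(x: SomeUndefinedName): pass\n"

# Each __future__ feature exposes the flag that compile() understands.
flag = __future__.annotations.compiler_flag

# With the flag set, the compiler stores annotations as plain strings
# instead of emitting code that evaluates them.
code = compile(source, "<demo>", "exec", flags=flag)
namespace = {}
exec(code, namespace)
annotations = namespace["f"].__annotations__
print(annotations)
```

This still leaves open how the list of aliases would reach the compiler.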

>> Really?  White space makes things harder to read?  I thought Pythonistas
>> believed the opposite of that.  Whitespace is very useful for cognitive
>> chunking: you see how things logically group together.

> I was surprised at that too ;-). One person's opinion in a specific 
> context. Don't generalize.

Also don't generalize my opinion regarding *where* whitespace makes things less 
readable: I was just talking about regex.
What I was trying to say here is best summarized by a quote from Paul Graham's 
article "Succinctness is Power":
"""
If you're used to reading novels and newspaper articles, your first experience 
of reading a math paper can be dismaying. It could take half an hour to read a 
single page. And yet, I am pretty sure that the notation is not the problem, 
even though it may feel like it is. The math paper is hard to read because the 
ideas are hard. If you expressed the same ideas in prose (as mathematicians had 
to do before they evolved succinct notations), they wouldn't be any easier to 
read, because the paper would grow to the size of a book.
"""
Try replacing
  s/novels and newspaper articles|prose/Python code/g
  s/single page/single regex/
  s/math paper/regex/g.
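
For the regex case specifically, here is a small sketch of the two spellings 
being compared.  The date pattern and the names compact/verbose are made up 
for illustration; re.VERBOSE is the standard flag that allows whitespace and 
comments inside a pattern.

```python
import re

# Compact notation: the whole pattern fits in one glance.
compact = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# Same pattern spread out with re.VERBOSE; whitespace and comments
# inside the pattern are ignored by the regex compiler.
verbose = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
""", re.VERBOSE)

sample = "2011-10-02"
compact_groups = compact.fullmatch(sample).groups()
verbose_groups = verbose.fullmatch(sample).groups()
print(compact_groups == verbose_groups)
```

Both match the same strings; which one reads better depends on how familiar 
the reader is with the notation.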

To provide an example, I find:

# define a function to capitalize s
def my_capitalize(s):
    """This function capitalizes the argument s and returns it"""
    the_first_letter = s[0]  # 0 means the first char
    the_rest_of_s = s[1:]  # 1: means from the second till the end
    the_first_letter_uppercased = the_first_letter.upper()  # upper makes the string uppercase
    the_rest_of_s_lowercased = the_rest_of_s.lower()  # lower makes the string lowercase
    s_capitalized = the_first_letter_uppercased + the_rest_of_s_lowercased  # + concatenates
    return s_capitalized

less readable than:

def my_capitalize(s):
    return s[0].upper() + s[1:].lower()
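
A quick check of the short version against the built-in str.capitalize(), as 
a sketch.  The sample words are arbitrary ASCII strings chosen for 
illustration; note that, unlike the built-in, this function raises 
IndexError on an empty string because of s[0].

```python
def my_capitalize(s):
    return s[0].upper() + s[1:].lower()

# For simple ASCII input the result agrees with str.capitalize().
words = ["hELLO", "python", "X"]
results = [my_capitalize(w) for w in words]
print(results)

# The empty string is the edge case: s[0] has nothing to index.
try:
    my_capitalize("")
    empty_string_works = True
except IndexError:
    empty_string_works = False
```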

You could argue that the first is much more explicit and in a way clearer, but 
overall I think you agree with me that it is less readable.  This clearly 
depends on how well you know the notation you are reading: if you don't know it 
very well, you might still prefer the commented/verbose/extended/redundant 
version.

Another important thing to mention is that the notation of regular expressions 
is fairly simple (especially if you leave out look-arounds and Unicode-related 
things that are not used too often), but having a similarly succinct notation 
for a whole programming language (like Perl) might not work as well.  I'm not 
picking on Perl here: as you said, you can write readable programs if you 
don't abuse the notation, and the succinctness offered by the language has 
some advantages, but with Python we prefer more readable code, even if we have 
to be a little more verbose.  Another example of a trade-off between verbosity 
and succinctness is the new string formatting mini-language.
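
A small sketch of that trade-off in the formatting mini-language.  The value 
and the variable names are invented for illustration; the spelled-out version 
is just one possible equivalent for this particular input, not a general 
replacement for the format spec.

```python
value = 3.14159

# Succinct: ">" right-aligns, "10" is the field width, ".3f" keeps
# three digits after the decimal point.
compact = "{:>10.3f}".format(value)

# Spelled out step by step with ordinary method calls.
rounded = round(value, 3)
as_text = str(rounded)
spelled_out = as_text.rjust(10)

print(repr(compact))
print(repr(spelled_out))
```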

> That really isn't right.  A cased character is one with the Unicode "Cased"
> property, and a lowercase character is one with the Unicode "Lowercase"
> property.  The General Category is actually immaterial here.

You might want to take a look and possibly add a comment on #12204 about this.
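
To see the difference between the General Category and the "Lowercase" 
property, it can help to print both for a few characters.  This is a sketch: 
lowercase_report is a made-up helper, and the results shown by islower() 
reflect recent CPython versions, which may not match the version discussed in 
this thread.

```python
import unicodedata

def lowercase_report(chars):
    """Return (code point, general category, str.islower()) per character."""
    return [
        ("U+{:04X}".format(ord(ch)), unicodedata.category(ch), ch.islower())
        for ch in chars
    ]

# LATIN SMALL LETTER A, SMALL ROMAN NUMERAL ONE,
# CIRCLED LATIN SMALL LETTER A
report = lowercase_report("a\u2170\u24d0")
for row in report:
    print(row)
```

Only the first character has General Category Ll; the other two get their 
lowercase status from the property, not from the category.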

> I've spent all bloody day trying to model Python's islower, isupper, and
> istitle functions, but I get all kinds of errors, both in the definitions
> and in the models of the definitions.

If by "model" you mean "trying to figure out how they work", it's probably 
easier to look at the implementation (I assume you know enough C to understand 
what they do).  You can find the code for str.istitle() at 
http://hg.python.org/cpython/file/default/Objects/unicodeobject.c#l10358 and 
the actual implementation of some macros like Py_UNICODE_ISTITLE at 
http://hg.python.org/cpython/file/default/Objects/unicodectype.c.
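
For those who would rather not read C, the loop can be modelled in a few 
lines of Python.  This is a sketch based on my reading of the algorithm, not 
the implementation itself: model_istitle is a hypothetical name, and the 
per-character tests use the str methods in place of the C macros.

```python
def model_istitle(s):
    """Rough Python model of the loop behind str.istitle()."""
    seen_cased = False
    previous_is_cased = False
    for ch in s:
        if ch.isupper() or ch.istitle():
            # An uppercase/titlecase char may only start a cased run.
            if previous_is_cased:
                return False
            previous_is_cased = True
            seen_cased = True
        elif ch.islower():
            # A lowercase char may only continue a cased run.
            if not previous_is_cased:
                return False
            previous_is_cased = True
            seen_cased = True
        else:
            # Uncased chars (spaces, digits, apostrophes) end the run.
            previous_is_cased = False
    # At least one cased character is required.
    return seen_cased

samples = ["Hello World", "hello", "HELLO", "", "A", "They're", "1st Place"]
comparison = [(s, model_istitle(s), s.istitle()) for s in samples]
for row in comparison:
    print(row)
```

Comparing the model with the built-in on inputs like these is a quick way to 
find where the two definitions disagree.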

> I really don't understand any of these functions.  I'm very sad.  I think
> they are wrong, but maybe I am.  It is extremely confusing.

> Shall I file a separate bug report?

If after reading the code and/or the documentation you still think they are 
broken and/or that they can be improved, then you can open another issue.

BTW, instead of writing custom scripts to test things, it might be better to 
use unittest (see 
http://docs.python.org/py3k/library/unittest.html#basic-example), or even 
better write a patch for Lib/test/test_unicode.py.
Using unittest has the advantage that it is then easy to integrate those tests 
within our test suite, but on the other hand as soon as something fails the 
failure is returned without evaluating the following assertions in the method.
This has the advantage that
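
A minimal sketch of what such a test could look like.  The class name and the 
sample strings are made up for illustration, and subTest() is an assumption 
here since it was added to unittest in a later release; it lets the remaining 
cases in a method run even after one of them fails.

```python
import unittest

class CaseMethodsTest(unittest.TestCase):
    def check(self, method_name, cases):
        # Each case runs in its own subTest, so one failure does not
        # hide the results of the following cases.
        for text, expected in cases:
            with self.subTest(method=method_name, text=text):
                self.assertEqual(getattr(text, method_name)(), expected)

    def test_islower(self):
        self.check("islower", [("abc", True), ("Abc", False), ("123", False)])

    def test_isupper(self):
        self.check("isupper", [("ABC", True), ("ABC 123", True), ("Abc", False)])

    def test_istitle(self):
        self.check("istitle", [("Hello World", True), ("hello", False),
                               ("HELLO", False)])

# Run the tests without calling unittest.main(), which would exit.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CaseMethodsTest)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```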

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12753>
_______________________________________