date:20110813

[issue12747] Move devguide into cpython repo

2011-08-13 Thread Nick Coghlan


Nick Coghlan  added the comment:

I'd say the main reason the dev guide is in a separate repo is the historical 
one (i.e. Brett was working on it as a separate repo prior to the hg migration 
and we never merged it).

However, the version independent nature of the material is the main argument 
against merging it into the Docs tree - it's a document about the development 
community around CPython, not a document about CPython itself.

Personally, I'm happy with the resolution in the python-dev thread - tagging 
the test.support docs to keep them out of indices and search results, while 
leaving the dev guide in a separate version independent repo.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue11866] race condition in threading._newname()

2011-08-13 Thread Peter Saveliev


Peter Saveliev  added the comment:

counter.next() is a C routine and it is atomic from Python's point of view — if 
I understand right.

The test shows that the original threading.py leads to a (rare) race here, while
with the counter object there is no race condition.
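
A minimal sketch of the kind of check described above, written for Python 3 (where the call is spelled next(counter)) and assuming CPython with the GIL enabled. The thread and iteration counts are arbitrary choices for illustration and are not taken from the attached test:

```python
import itertools
import threading

counter = itertools.count(1)
names = []

def worker(n=1000):
    # Each next() call runs entirely in C, so under CPython's GIL no two
    # threads should ever receive the same value.
    for _ in range(n):
        names.append("Thread-%d" % next(counter))

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Total names generated versus distinct names generated.
print(len(names), len(set(names)))
```

If the two printed numbers differ, some name was handed out twice.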

--


[issue11866] race condition in threading._newname()

2011-08-13 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

I think the patch is correct.

FWIW, my style is to prebind the next method, making the counter completely 
self-contained (like a closure):

+_counter = itertools.count().next
 def _newname(template="Thread-%d"):
     global _counter
-    _counter = _counter + 1
-    return template % _counter
+    return template % _counter()

--
assignee:  -> rhettinger
nosy: +rhettinger
resolution:  -> accepted


[issue12746] normalization is affected by unicode width

2011-08-13 Thread Ezio Melotti


Changes by Ezio Melotti :


--
components: +Unicode
nosy: +ezio.melotti


[issue2857] Add "java modified utf-8" codec

2011-08-13 Thread STINNER Victor


STINNER Victor  added the comment:

> Python does have other "weird" encodings like bz2 or rot13.

No, it no longer has such weird encodings.

--


[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-08-13 Thread Antoine Pitrou


Changes by Antoine Pitrou :


--
nosy: +haypo


[issue11513] chained exception/incorrect exception from tarfile.open on a non-existent file

2011-08-13 Thread Roundup Robot


Roundup Robot  added the comment:

New changeset 843cd43206b4 by Georg Brandl in branch '3.2':
Fix #11513: wrong exception handling for the case that GzipFile itself raises 
an IOError.
http://hg.python.org/cpython/rev/843cd43206b4

--
nosy: +python-dev


[issue11513] chained exception/incorrect exception from tarfile.open on a non-existent file

2011-08-13 Thread Georg Brandl


Georg Brandl  added the comment:

Fixed in 3.2/default.

2.7 has even more primitive error handling; should the gzopen() be adapted to 
the 3.x case?

--
nosy: +georg.brandl


[issue10799] Improve webbrowser (.open) doc and behavior

2011-08-13 Thread Ezio Melotti


Changes by Ezio Melotti :


--
nosy: +ezio.melotti


[issue12737] str.title() is overzealous by upcasing combining marks inappropriately

2011-08-13 Thread Antoine Pitrou


Changes by Antoine Pitrou :


--
nosy: +haypo, loewis
stage:  -> needs patch
versions: +Python 3.3


[issue12746] normalization is affected by unicode width

2011-08-13 Thread Antoine Pitrou


Changes by Antoine Pitrou :


--
nosy: +haypo, lemburg


[issue12646] zlib.Decompress.decompress/flush do not raise any exceptions when given truncated input streams

2011-08-13 Thread Roundup Robot


Roundup Robot  added the comment:

New changeset bb6c2d5c811d by Nadeem Vawda in branch 'default':
Issue #12646: Add an 'eof' attribute to zlib.Decompress.
http://hg.python.org/cpython/rev/bb6c2d5c811d
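
A short sketch of how the new attribute could be used to detect truncated input. Python 3.3 or later is assumed; the payload and the five-byte truncation are arbitrary choices for this illustration:

```python
import zlib

payload = b"spam and eggs " * 100
compressed = zlib.compress(payload)

# Complete stream: eof becomes True once the end of the stream is reached.
full = zlib.decompressobj()
full_output = full.decompress(compressed)

# Truncated stream: no exception is raised, but eof stays False.
cut = zlib.decompressobj()
partial_output = cut.decompress(compressed[:-5])

print(full.eof, cut.eof)
```

A caller that expects a complete stream can check eof after feeding all input and treat False as a truncation error.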

--
nosy: +python-dev


[issue12669] test_curses skipped on buildbots

2011-08-13 Thread Roundup Robot


Roundup Robot  added the comment:

New changeset 4358909ee221 by Nadeem Vawda in branch 'default':
Issue #12669: Fix test_curses so that it can run on the buildbots.
http://hg.python.org/cpython/rev/4358909ee221

--
nosy: +python-dev


[issue12646] zlib.Decompress.decompress/flush do not raise any exceptions when given truncated input streams

2011-08-13 Thread Roundup Robot


Roundup Robot  added the comment:

New changeset 65d61ed991d9 by Nadeem Vawda in branch 'default':
Fix incorrect comment in zlib.Decompress.flush().
http://hg.python.org/cpython/rev/65d61ed991d9

--


[issue12723] Provide an API in tkSimpleDialog for defining custom validation functions

2011-08-13 Thread R. David Murray


R. David Murray  added the comment:

A bit of both, I think.  The current function is actually 'getvalue' and is 
responsible for retrieving the value, validating its type, and converting to 
that type (the current ones do both in the same operation).  It feels to me 
like a cleaner interface to decouple retrieval and validation/conversion, so 
that the validation function gets passed a string and returns the desired type. 
 But in that case, having the string dialog take the validation/coercion 
function makes the name of the askstring function just wrong.

So, I still think the cleaner API is to expose the class and let the 
application subclass to provide the validation function.
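
To make the subclassing approach concrete, here is a rough, Tk-free sketch. QueryDialog, raw_value, convert and result are hypothetical names invented for this illustration and are not the actual tkSimpleDialog API; the point is only that retrieval stays in the base class while a subclass supplies validation and conversion:

```python
class QueryDialog:
    """Hypothetical stand-in for a dialog class; no Tk widgets involved."""

    def __init__(self, raw_value):
        # In a real dialog this string would come from the entry widget.
        self.raw_value = raw_value

    def convert(self, text):
        # Default: accept any string unchanged (the askstring case).
        return text

    def result(self):
        # Retrieval happens here; validation/conversion is delegated.
        return self.convert(self.raw_value)


class EvenIntegerDialog(QueryDialog):
    def convert(self, text):
        value = int(text)  # raises ValueError for non-numeric input
        if value % 2:
            raise ValueError("an even integer is required")
        return value


print(QueryDialog("abc").result())
print(EvenIntegerDialog("42").result())
```

With this shape the askstring name stays accurate, since the string dialog itself never takes a coercion function.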

--


[issue12646] zlib.Decompress.decompress/flush do not raise any exceptions when given truncated input streams

2011-08-13 Thread Nadeem Vawda


Changes by Nadeem Vawda :


--
resolution:  -> fixed
stage: patch review -> committed/rejected
status: open -> closed
type: behavior -> feature request


[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread R. David Murray


R. David Murray  added the comment:

Tom, note that nobody is arguing that what you are requesting is a bad thing :)

As far as I know, Matthew is the only one currently working on the regex 
support in Python.  (Other developers will commit small fixes if someone 
proposes a patch, but no one that I've seen other than Matthew is working on 
the deeper issues.)  If you want to help out that would be great.

And as far as this particular issue goes, yes the difference between the narrow 
and wide build has been a known issue for a long time, but has become less and 
less ignorable as unicode adoption has grown. Martin's PEP that Matthew 
references is the only proposed fix that I know of.  There is a GSoC project
working on it, but I have no idea what the status is.

--


[issue12740] Add struct.Struct.nmemb

2011-08-13 Thread R. David Murray


R. David Murray  added the comment:

As a new feature, this could only go into 3.3.

--
nosy: +r.david.murray
versions:  -Python 3.2


[issue12740] Add struct.Struct.nmemb

2011-08-13 Thread Antoine Pitrou


Antoine Pitrou  added the comment:

I had never heard of "nmemb". "nmembers" would be less cryptic.
The patch needs a "versionadded" directive in the docs.
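
For reference, the value the proposed attribute would expose can already be obtained, somewhat clumsily, by unpacking a zero-filled buffer and counting the results. member_count is a helper name made up for this sketch:

```python
import struct

def member_count(fmt):
    """Number of values that struct.unpack() returns for `fmt`."""
    s = struct.Struct(fmt)
    # A zero-filled buffer of the right size is valid input for the
    # integer, float, bytes and pad codes used below.
    return len(s.unpack(bytes(s.size)))

print(member_count("<ih3s"))  # int, short, 3-byte string
print(member_count("<2xHd"))  # pad bytes produce no value
```

This costs a throwaway unpack, which is part of the argument for exposing the count directly.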

--
nosy: +pitrou


[issue12745] Python2 or Python3 page

2011-08-13 Thread Antoine Pitrou


Antoine Pitrou  added the comment:

It is a wiki page, so you can edit it yourself (you probably need to register, 
though).
If you think your modifications would be too drastic, perhaps you want to 
launch a discussion on the python-dev mailing-list about that page and its 
current contents.

--
nosy: +pitrou


[issue12740] Add struct.Struct.nmemb

2011-08-13 Thread Georg Brandl


Georg Brandl  added the comment:

While we're at it, let's add str.pbrk() ;)

--
nosy: +georg.brandl


[issue12744] inefficient pickling of long integers on 64-bit builds

2011-08-13 Thread Antoine Pitrou


Changes by Antoine Pitrou :


--
nosy: +alexandre.vassalotti


[issue9552] ssl build under Windows always rebuilds OpenSSL

2011-08-13 Thread Antoine Pitrou


Changes by Antoine Pitrou :


--
resolution:  -> fixed
stage:  -> committed/rejected
status: open -> closed


[issue12740] Add struct.Struct.nmemb

2011-08-13 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

How about __len__()?

--
nosy: +rhettinger


[issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

2011-08-13 Thread Antoine Pitrou


Antoine Pitrou  added the comment:

> However, because the \w&c issues are bigger, Java addressed the tr18 RL1.2a
> issues differently, this time by creating a new compilation flag called
> UNICODE_CHARACTER_CLASSES (with corresponding embedded "(?U)" regex flag.)
> 
> Truth be told, even Perl has secret pattern compilation flags to govern
> this sort of thing (ascii, locale, unicode), but we (well, I) hope you
> never have to use or even notice them.  
> 
> That too might be a route forward for Python, although I am not quite sure
> how much flexibility and control of your lexical scope you have.  However,
> the "from __future_" imports suggest you may have enough to do something
> slick so that only people who ask for it get it, and also importantly that
> they get it all over the place so don't have to add an extra flag or u'...'
> or whatever every single time.  

If the current behaviour is buggy or sub-optimal, I think we should
simply fix it (which might be done by replacing "re" with "regex" if
someone wants to shepherd its inclusion in the stdlib).
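
For comparison, Python 3's re does have a flag that governs the sense of \w: str patterns use Unicode character classes by default, and re.ASCII (inline "(?a)") restores the ASCII-only meaning. The sketch below only illustrates that switch; whether the Unicode sense matches UTS #18 RL1.2a is the separate question raised in this issue:

```python
import re

text = "na\u00efve caf\u00e9"  # "naive cafe" with precomposed i-diaeresis and e-acute

# Default for str patterns in Python 3: \w follows Unicode alphanumerics.
unicode_words = re.findall(r"\w+", text)

# re.ASCII limits \w to [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)
print(ascii_words)
```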

By the way, thanks for the detailed explanations, Tom.

--


[issue12740] Add struct.Struct.nmemb

2011-08-13 Thread Meador Inge


Meador Inge  added the comment:

The functionality part of the patch looks reasonable.  However, the
pseudo-randomization in the unit tests seems like a bad idea.  Say someone is
adding a new feature X and runs the unit tests, only to find one of them
failing.  They then run the tests again to investigate, and now they pass.
Unit tests should be repeatable.
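
One common way to keep randomized test data while making runs repeatable is to draw it from a generator with a fixed seed. The seed value and the helper names random_format and formats_for_run below are arbitrary choices for this sketch and are not part of the patch under review:

```python
import random

def random_format(rng, length=5):
    """Build a pseudo-random struct format string from a few codes."""
    return "<" + "".join(rng.choice("bhiqd") for _ in range(length))

def formats_for_run(seed, count=3):
    # A private Random instance with a fixed seed yields the same sequence
    # on every run, so a failing case can be reproduced exactly.
    rng = random.Random(seed)
    return [random_format(rng) for _ in range(count)]

first_run = formats_for_run(12345)
second_run = formats_for_run(12345)
print(first_run == second_run)
```

Logging the seed in the failure message gives the same reproducibility if a varying seed is preferred.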

--
nosy: +meador.inge


[issue12744] inefficient pickling of long integers on 64-bit builds

2011-08-13 Thread Roundup Robot


Roundup Robot  added the comment:

New changeset 8e824e09924a by Antoine Pitrou in branch 'default':
Issue #12744: Fix inefficient representation of integers
http://hg.python.org/cpython/rev/8e824e09924a

--
nosy: +python-dev


[issue12744] inefficient pickling of long integers on 64-bit builds

2011-08-13 Thread Antoine Pitrou


Changes by Antoine Pitrou :


--
resolution:  -> fixed
stage: needs patch -> committed/rejected
status: open -> closed


[issue11241] ctypes: subclassing an already subclassed ArrayType generates AttributeError

2011-08-13 Thread Amaury Forgeot d'Arc


Amaury Forgeot d'Arc  added the comment:

Yes, the patch looks good!

--
resolution:  -> accepted


[issue12659] Add tests for packaging.tests.support

2011-08-13 Thread Francisco Martín Brugué


Francisco Martín Brugué  added the comment:

I've started with tests for “fake_dec” and “TempdirManager”. Please let me know
if that is along the lines you want.

Thanks in advance

Francis

--
keywords: +patch
nosy: +francismb
Added file: http://bugs.python.org/file22895/issue12659_v1.patch


[issue12747] Move devguide into cpython repo

2011-08-13 Thread Eric Snow


Eric Snow  added the comment:

That's fine.  The discussion had moved away from the devguide, so I figured it 
would be worth following up.  You guys have made some good points.

--
resolution:  -> rejected
status: open -> closed


[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Tom Christiansen

Tom Christiansen  added the comment:

David Murray  wrote:

> Tom, note that nobody is arguing that what you are requesting is a bad
> thing :)

There looked to be some minor resistance, based on absolute backwards
compatibility even if wrong, regarding changing anything *at all* in re,
even things that to my jaded eye seem like actual bugs.

There are bugs, and then there are bugs.

In my survey of Unicode support across 7 programming languages for OSCON

http://training.perl.com/OSCON/index.html

I came across a lot of weirdnesses, especially at first when the learning
curve was high.  Sure, I found it odd that unlike Java, Perl, and Ruby,
Python didn't offer regular casemapping on strings, only the simple
character-based mapping.  But that doesn't make it a bug, which is why
I filed it as a feature/enhancement request/wish, not as a bug.

I always count as bugs not handling Unicode text the way Unicode says
it must be handled.  Such things would be:

Emitting CESU-8 when told to emit UTF-8.

Violating the rule that UTF-8 must be in the shortest possible encoding.

Not treating a code point as a letter when the supported version of the
UCD says it is.  (This can happen if internal rules get out of sync
with the current UCD.)

Claiming one does "the expected thing on Unicode" for case-insensitive
matches when not doing what Unicode says you must minimally do: use at
least the simple casefolds, if not in fact the full ones.

Saying \w matches Unicode word characters when one's definition of
word characters differs from that of the supported version of the UCD.
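
Two of these points can be checked directly from Python code. The sketch below assumes Python 3.3 or later, where str.casefold() implements full Unicode case folding; it shows present-day behaviour rather than that of the builds discussed in this thread, and the sample strings are arbitrary:

```python
# Full case folding: German sharp s folds to "ss", so a caseless comparison
# built on casefold() treats these two spellings as equal.
folded_equal = "Stra\u00dfe".casefold() == "STRASSE".casefold()

# Simple lowercasing is not enough for the same comparison.
lowered_equal = "Stra\u00dfe".lower() == "STRASSE".lower()

# Shortest-form rule: b"\xc0\x80" is an overlong encoding of U+0000 and a
# conforming UTF-8 decoder has to reject it.
try:
    b"\xc0\x80".decode("utf-8")
    overlong_rejected = False
except UnicodeDecodeError:
    overlong_rejected = True

print(folded_equal, lowered_equal, overlong_rejected)
```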

Supporting Unicode vX.Y.Z is more than adding more characters.  All the
behaviors specified in the UCD have to be updated too, or else you are just
ISO 10646.  I believe some of Python's Unicode bugs happened because folks
weren't aware which things in Python were defined by the UCD or by various
UTS reports yet were not being directly tracked that way.  That's why it's
important to always fully state which version of these things you follow.

Other bugs, many actually, are a result of the narrow/wide-build untransparency.

There is wiggle room in some of these.  Take the one that applies to re:
you could -- in a sense -- remove the bug by no longer claiming to do
case-insensitive matches on Unicode.  I do not find that very useful.
Javascript works this way: it doesn't do Unicode casefolding.  In Java you
have to ask nicely with the extra UNICODE_CASE flag, aka "(?u)", used with
the CASE_INSENSITIVE flag, aka "(?i)".

Sometimes languages provide different but equivalent interfaces to the same
functionality.  For example, you may not support the Unicode property
\p{NAME=foobar} in patterns but instead support \N{foobar} in patterns and
hopefully also in strings.  That's just fine.  On slightly shakier ground but
still I think defensible is how one approaches support for the standard UCD
properties:

           Case_Folding    Simple_Case_Folding
      Titlecase_Mapping    Simple_Titlecase_Mapping
      Uppercase_Mapping    Simple_Uppercase_Mapping
      Lowercase_Mapping    Simple_Lowercase_Mapping

One can support folding, for example, via (?i) and not have to
directly support a Case_Folding property like \p{Case_Folding=s},
since "(?i)s" should be the same thing as "\p{Case_Folding=s}".

> As far as I know, Matthew is the only one currently working on the
> regex support in Python.  (Other developers will commit small fixes if
> someone proposes a patch, but no one that I've seen other than Matthew
> is working on the deeper issues.)  If you want to help out that would
> be great.

Yes, I actually would.  At least as I find time for it.  I'm a competent C
programmer and Matthew's C code is very well documented, but that's very
time consuming.  For bang-for-buck, I do best on test and doc work, making
sure things are actually working the way they say they do.

I was pretty surprised and disappointed by how much trouble I had with
Unicode work in Python.  A bit of that is learning curve, a bit of it is
suboptimal defaults, but quite a bit of it is that things either don't work
the way Unicode says or are altogether missing.  I'd like to help at least
make the Python documentation clearer about what it is or is not doing in
this regard.

But be warned: one reason that Java 1.7 handles Unicode more according to
the published Unicode Standard in its Character, String, and Pattern
classes is because when they said they'd be supporting Unicode 6.0.0,
I went through those classes and every time I found something in violation
of that Standard, I filed a bug report that included a documentation patch
explaining what they weren't doing right.  Rather than apply my rather
embarrassing doc patches, they instead fixed the code. :)

> And as far as this particular issue goes, yes the difference between
> the narrow and wide build has been a known issue for a long time, but
> has become less an

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Antoine Pitrou


Antoine Pitrou  added the comment:

> Here's why I say that Python uses UTF-16 not UCS-2 on its narrow builds.
> Perhaps someone could tell me why the Python documentation says it uses
> UCS-2 on a narrow build.

There's a disagreement on that point between several developers. See an example 
sub-thread at:
http://mail.python.org/pipermail/python-dev/2010-November/105751.html

> Since you are already using a variable-width encoding, why the
> supercilious attitude toward UTF-8?

I think you are reading too much into these decisions. It's simply that no-one 
took the time to write an alternative implementation and demonstrate its 
superiority. I also believe the original implementation was UCS-2 and surrogate 
support was added progressively during the years. Hence the terminological mess 
and the ad-hoc semantics.

I agree that going with UTF-8 and a clever indexing scheme would be a better 
solution.

--
nosy: +pitrou


[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Matthew Barnett


Matthew Barnett  added the comment:

There are occasions when you want to do string slicing, often of the form:

pos = my_str.index(x)
endpos = my_str.index(y)
substring = my_str[pos : endpos]

To me that suggests that if UTF-8 is used then it may be worth profiling to see 
whether caching the last 2 positions would be beneficial.

--


[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Antoine Pitrou


Antoine Pitrou  added the comment:

> There are occasions when you want to do string slicing, often of the form:
> 
> pos = my_str.index(x)
> endpos = my_str.index(y)
> substring = my_str[pos : endpos]
> 
> To me that suggests that if UTF-8 is used then it may be worth
> profiling to see whether caching the last 2 positions would be
> beneficial.

And/or a lookup table giving the byte offset of, say, every 16th
character. It gives you a O(1) lookup with a relatively reasonable
constant cost (you have to scan for less than 16 characters after the
lookup).

On small strings (< 256 UTF-8 bytes) the space overhead for the lookup
table would be 1/16. It could also be constructed lazily whenever more
than 2 positions are cached.
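
A rough sketch of such a lookup table, operating on a bytes object that holds UTF-8. The function names and the linear scan used to build the table are assumptions of this illustration, not a proposal for the actual string implementation:

```python
STRIDE = 16

def build_offset_table(data, stride=STRIDE):
    """Byte offset of every `stride`-th code point in UTF-8 `data`."""
    table = []
    char_index = 0
    for byte_offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a code point starts here
            if char_index % stride == 0:
                table.append(byte_offset)
            char_index += 1
    return table

def char_at(data, table, index, stride=STRIDE):
    """Return code point number `index`; assumes 0 <= index < code point count."""
    offset = table[index // stride]      # constant-time jump via the table
    for _ in range(index % stride):      # then scan fewer than `stride` code points
        offset += 1
        while data[offset] & 0xC0 == 0x80:
            offset += 1
    end = offset + 1
    while end < len(data) and data[end] & 0xC0 == 0x80:
        end += 1
    return data[offset:end].decode("utf-8")

text = "h\u00e9llo w\u00f6rld, \u65e5\u672c\u8a9e \U0001F600 " * 3
data = text.encode("utf-8")
table = build_offset_table(data)
print(char_at(data, table, 20), len(table))
```

The table holds one entry per 16 code points, and each lookup scans at most 15 code points past the table entry, which is the bounded constant cost described above.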

--


[issue12740] Add struct.Struct.nmemb

2011-08-13 Thread Raymond Hettinger


Changes by Raymond Hettinger :


--
assignee:  -> rhettinger
priority: normal -> low


[issue12744] inefficient pickling of long integers on 64-bit builds

2011-08-13 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

Nice.

--
nosy: +rhettinger


[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Tom Christiansen


Tom Christiansen  added the comment:

Matthew Barnett  wrote
   on Sat, 13 Aug 2011 20:57:40 -: 

> There are occasions when you want to do string slicing, often of the form:

>   pos = my_str.index(x)
>   endpos = my_str.index(y)
>   substring = my_str[pos : endpos]

Me, I would probably give the second call to index the first  
index position to guarantee the end comes after the start:

str  = "for finding the biggest of all the strings"
x_at = str.index("big")
y_at = str.index("the", x_at)
some = str[x_at:y_at]
print("GOT", some)

But here's a serious question: is that *actually* a common usage pattern
for accessing strings in Python?  I ask because it wouldn't even *occur* to
me to go at such a problem in that way.  I would have always just written
it this way instead:

import re
str  = "for finding the biggest of all the strings"
some = re.search("(big.*?)the", str).group(1)
print("GOT", some)

I know I would use the pattern approach, just because that's 
how I always do such things in Perl:

$str  = "for finding the biggest of all the strings";
($some) = $str =~ /(big.*?)the/;
print "GOT $some\n";

Which is obviously a *whole* lot simpler than the index approach:

$str  = "for finding the biggest of all the strings";
$x_at = index($str, "big");
$y_at = index($str, "the", $x_at);
$len  = $y_at - $x_at;
$some = substr($str, $x_at, $len);
print "GOT $some\n";

With no arithmetic and no need for temporary variables (you can't really
escape needing x_at to pass to the second call to index), it's all a
lot more WYSIWYG.  See how much easier that is?

Sure, it's a bit cleaner and less noisy in Perl than it is in Python by
virtue of Perl's integrated pattern matching, but I would still use
patterns in Python for this, not index.  

I honestly find the equivalent pattern operations a lot easier to read and write
and maintain than I find the index/substring version.  It's a visual thing.  
I find patterns a win in maintainability over all that busy index monkeywork.  
The index/rindex and substring approach is one I almost never ever turn to.
I bet I use pattern matching 100 or 500 times for each time I use index, and
maybe even more.

I happen to think in patterns.  I don't expect other people to do so.  But
because of this, I usually end up picking patterns even if they might be a
little bit slower, because I think the gain in flexibility and especially
maintainability more than makes up for any minor performance concerns.

This might also show you why patterns are so important to me: they're one
of the most important tools we have for processing text.  Index isn't,
which is why I really don't care about whether it has O(1) access.  

> To me that suggests that if UTF-8 is used then it may be worth
> profiling to see whether caching the last 2 positions would be
> beneficial.

Notice how with the pattern approach, which is inherently sequential, you don't
have all that concern about running over the string more than once.  Once you
have the first piece (here, "big"), you proceed directly from there looking for
the second piece in a straightforward, WYSIWYG way.  There is no need to keep an
extra index or even two around on the string structure itself, going at it this 
way.

I would be pretty surprised if Perl could gain any speed by caching a pair of
MRU index values against its UTF-8 [but see footnote], because again, I think
the normal access pattern wouldn't make use of them.  Maybe Python programmers
don't think of strings the same way, though.  That, I really couldn't tell you.

But here's something to think about:

If it *is* true that you guys do all this index stuff that Perl programmers
just never see or do because of our differing comfort levels with regexes,
and so you think that Python might still benefit from that sort of caching
because its culture has promoted a different access pattern, then that caching
benefit would still apply even if you were to retain the current UTF-16
representation instead of going to UTF-8 (which might want it) or to UTF-32
(which wouldn't).

After all, you have the same variable-width caching issue with UTF-16 as with
UTF-8, so if it makes sense to have an MRU cache mapping character indices to
byte indices, then it doesn't matter whether you use UTF-8 or UTF-16!
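
The variable-width point is easy to see by counting code units. This sketch uses the codecs rather than the interpreter's internal representation, so it behaves the same on any Python 3 build; the helper names are made up for the illustration:

```python
def utf16_units(text):
    # Each UTF-16 code unit is two bytes.
    return len(text.encode("utf-16-le")) // 2

def utf8_units(text):
    # Each UTF-8 code unit is one byte.
    return len(text.encode("utf-8"))

bmp = "\u00e9"         # one code point inside the BMP
astral = "\U0001F600"  # one code point outside the BMP

print(utf16_units(bmp), utf16_units(astral))
print(utf8_units(bmp), utf8_units(astral))
```

In both encodings the number of code units per code point varies, so mapping a character index to a storage offset needs either a scan or some cached index.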

However, I'd want some passive comparative benchmarks using real programs with
real data, because I would be suspicious of incurring the memory cost of two
more pointers in every string in the whole program.  That's serious.

--tom

FOOTNOTE: The Perl 6 people are thinking about clever ways to set up byte
  offset indices.  You have to do this if you want O(1) access to the
  Nth element for elements that are not simple code points even if you
  use UTF-32.  That's because they want the default string element to be
  a user visible grapheme, not a code point.  I know they have clever
  ideas

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Tom Christiansen


Tom Christiansen  added the comment:

Antoine Pitrou  wrote
   on Sat, 13 Aug 2011 21:09:52 -: 

> And/or a lookup table giving the byte offset of, say, every 16th
> character. It gives you a O(1) lookup with a relatively reasonable
> constant cost (you have to scan for less than 16 characters after the
> lookup).

> On small strings (< 256 UTF-8 bytes) the space overhead for the lookup
> table would be 1/16. It could also be constructed lazily whenever more
> than 2 positions are cached.

You really should talk to the Perl 6 people to see whether their current
strategy for caching offset maps for grapheme positions might be of use to
you.  Larry explained it to me once but I no longer recall any details.

I notice though that they don't seem to think it worth doing for UTF-8 
or UTF-16, just for their synthetic "NFG" (Grapheme Normalization Form)
strings, where it would be needed even if they used UTF-32 underneath.

--tom

--


[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Matthew Barnett


Matthew Barnett  added the comment:

You're right about starting the second search from where the first finished. 
Caching the position would be an advantage there.

The memory cost of extra pointers wouldn't be so bad if UTF-8 took less space 
than the current format.

Regex isn't used as much as in Perl. BTW, the current re module was introduced 
in Python 1.5, the previous regex and regsub modules being removed in Python 
2.5.

--


[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Tom Christiansen


Tom Christiansen  added the comment:

>> Here's why I say that Python uses UTF-16 not UCS-2 on its narrow builds.
>> Perhaps someone could tell me why the Python documentation says it uses
>> UCS-2 on a narrow build.

> There's a disagreement on that point between several developers. 
> See an example sub-thread at:

>   http://mail.python.org/pipermail/python-dev/2010-November/105751.html

Some of those folks know what they're talking about, and some do not.

Most of the postings miss the mark.

Python uses UTF-16 for its narrow builds.  It does not use UCS-2.

The argument that it must be UCS-2 because it can store lone surrogates
in memory is spurious.

You have to read The Unicode Standard very *very* closely, but it is not
necessary that all internal buffers always be in well-formed UTF-whatever.
Otherwise it would be impossible to append a code unit at a time to a buffer.
I could pull out the reference if I worked at it, because I've had to find
it before.  It's in there.  Trust me.  I know.

It is also spurious to pretend that because you can produce illegal output
when telling it to generate something in UTF-16 that it is somehow not using
UTF-16.  You have simply made a mistake.  You have generated something  that
you have promised you would not generate.   I have more to say about this below.

Finally, it is spurious to argue against UTF-16 because of the code unit
interface.  Java does exactly  the same thing as Python does *in all regards*
here, and no one pretends that Java is UCS-2.  Both are UTF-16.

It is simply a design error to pretend that the number of characters
is the number of code units instead of code points.  A terrible and
ugly one, but it does not mean you are UCS-2.

You are not.  Python uses UTF-16 on narrow builds.  

The ugly terrible design error is disgusting and wrong, just as much in 
Python as in Java, and perhaps more so because of the idiocy of narrow
builds even existing.  But that doesn't make them UCS-2.

If I could wave a magic wand, I would have Python undo its code unit
blunder and go back to code points, no matter what.  That means to stop
talking about serialization schemes and start talking about logical code
points.  It means that slicing and index and length and everything only
report true code points.  This horrible code unit botch from narrow builds
is most easily cured by moving to wide builds only.

However, there is more.

I haven't checked its UTF-16 codecs, but Python's UTF-8 codec is broken
in a bunch of ways.  You should be raising an exception in all kinds of
places and you aren't.  I can see I need to bug report this stuff too.  
I don't want to be mean about this.  HONEST!  It's just the way it is.

Unicode currently reserves 66 code points as noncharacters, which it 
guarantees will never be in a legal UTF-anything stream.  I am not talking 
about surrogates, either.

To start with, any code point which when bitwise ANDed with 0xFFFE returns
0xFFFE can never appear in a valid UTF-* stream, but Python allows this
without any error.

That means that both 0xNN_FFFE and 0xNN_FFFF are illegal in all planes,
where NN is 00 through 10 in hex.  So that's 2 noncharacters times 17 
planes = 34 code points illegal for interchange that Python is passing 
through illegally.  

The remaining 32 nonsurrogate code points illegal for open interchange
are 0xFDD0 through 0xFDEF.  Those are not allowed either, but Python
doesn't seem to care.
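
The arithmetic above (2 noncharacters in each of 17 planes, plus the 32 in
0xFDD0 through 0xFDEF) can be checked with a short Python 3 sketch. The
helper name is_noncharacter is made up for this illustration and is not
part of any Python library; the sketch only enumerates the code points and
takes no position on what a codec should do with them.

```python
def is_noncharacter(cp):
    """Return True if code point cp is one of Unicode's noncharacters.

    Hypothetical helper for illustration only.
    """
    # U+FDD0..U+FDEF: a contiguous run of 32 noncharacters in the BMP.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane: U+nFFFE and U+nFFFF.
    return cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


# Enumerate the whole code space, planes 0x00 through 0x10.
noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
per_plane = [cp for cp in noncharacters if (cp & 0xFFFE) == 0xFFFE]

print(len(per_plane))       # the 2 x 17 per-plane noncharacters
print(len(noncharacters))   # the total, including 0xFDD0..0xFDEF
```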

You simply cannot say you are generating UTF-8 and then generate a byte
sequence that UTF-8 guarantees can never occur.  This is a violation.

***SIGH***

--tom

--


[issue12725] Docs: Odd phrase "floating seconds" in socket.html

2011-08-13 Thread Ben Hayden


Ben Hayden  added the comment:

I made the suggested second change - both in the docs & the socketmodule.c 
file. If there's a different way to patch documentation, someone let me know. :D

--
keywords: +patch
nosy: +beardedp
Added file: http://bugs.python.org/file22896/issue12725.patch


[issue12748] IDLE halts on osx when copy and paste

2011-08-13 Thread hy


New submission from hy :

IDLE halts on OS X whenever I copy and paste.
I tried on 10.6.8 and 10.7.
Now I can only use IDLE on Windows in VMware.

--
assignee: ronaldoussoren
components: IDLE, Macintosh
messages: 142046
nosy: hoyeung, ronaldoussoren
priority: normal
severity: normal
status: open
title: IDLE halts on osx when copy and paste
versions: Python 2.7


[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Ezio Melotti


Ezio Melotti  added the comment:

> It is simply a design error to pretend that the number of characters
> is the number of code units instead of code points.  A terrible and
> ugly one, but it does not mean you are UCS-2.

If you are referring to the value returned by len(unicode_string), it is the 
number of code units.  This is a matter of "practicality beats purity".  
Returning the number of code units is O(1) (num_of_bytes/2).  To calculate the 
number of characters it's instead necessary to scan all the string looking for 
surrogates and then count any surrogate pair as 1 character.  It was therefore 
decided that it was not worth slowing down the common case just to be 100% 
accurate in the "uncommon" case.
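
The scan described above can be sketched as follows. This is a Python 3
illustration, not the interpreter's own code: it builds the UTF-16 code
units explicitly (as a narrow build would store them) and then counts code
points by treating each high/low surrogate pair as one. The function names
are made up for this example.

```python
def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def count_code_points(units):
    """Count code points in a sequence of UTF-16 code units: O(n) scan."""
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        # A well-formed surrogate pair is one code point; anything else,
        # including a lone surrogate, is counted as one on its own.
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


text = "a\U0001F600b"               # one non-BMP character between two ASCII ones
units = utf16_code_units(text)
print(len(units))                   # number of code units
print(count_code_points(units))     # number of code points
```

Counting the code units is a single length lookup, while counting code
points needs the loop, which is the trade-off being discussed here.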

That said it would be nice to have an API (maybe in unicodedata or as new str 
methods?) able to return the number of code units, code points, graphemes, etc, 
but I'm not sure that it should be the default behavior of len().

> The ugly terrible design error is digusting and wrong, just as much
> in Python as in Java, and perhaps moreso because of the idiocy of
> narrow builds even existing.

Again, wide builds use twice as much space as narrow ones, but on the 
other hand you can have fast and correct behavior with e.g. len().  If people 
don't care about/don't need to use non-BMP chars and would rather use less 
space, they can do so.  Until we agree that the difference in space used/speed 
is no longer relevant and/or that non-BMP characters become common enough to 
prefer the "correct behavior" over the "fast-but-inaccurate" one, we will 
probably keep both.

> I haven't checked its UTF-16 codecs, but Python's UTF-8 codec is
> broken in a bunch of ways.  You should be raising as exception in
> all kinds of places and you aren't.

I am aware of some problems with the UTF-8 codec on Python 2.  It used to follow 
RFC 2279 until last year and now it's been updated to follow RFC 3629.
However, for backward compatibility, it still encodes/decodes surrogate pairs.  
This broken behavior has been kept because on Python 2, you can encode every 
code point with UTF-8, and decode it back without errors:
>>> x = [unichr(c).encode('utf-8') for c in range(0x110000)]
>>>
and breaking this invariant would probably do more harm than good.  I 
proposed to add a "real" utf-8 codec on Python 2, but no one seems to care 
enough about it.

Also note that this is fixed in Python 3:
>>> x = [chr(c).encode('utf-8') for c in range(0x110000)]
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 
0: surrogates not allowed
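
For comparison, a small Python 3 sketch of the strict behavior next to the
opt-in relaxed one. It relies on the standard 'surrogatepass' error
handler; the variable names are only for illustration.

```python
lone = "\ud800"  # a lone high surrogate

# Strict UTF-8 (the default): surrogates are rejected.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Relaxed, explicitly requested: the surrogate is encoded with the
# ordinary UTF-8 bit layout, and can be decoded back the same way.
relaxed = lone.encode("utf-8", "surrogatepass")
round_trip = relaxed.decode("utf-8", "surrogatepass")

print(strict_ok)
print(relaxed)
print(round_trip == lone)
```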

>  I can see I need to bug report this stuff to.  

If you find other places where it's broken (both on Python 2 and/or Python 3), 
please do and feel free to add me to the nosy.  If you can also provide a 
failing test case and/or point to the relevant parts of the Unicode standard, 
it would be great.

--
nosy: +ezio.melotti


[issue12748] IDLE halts on osx when copy and paste

2011-08-13 Thread Ezio Melotti


Ezio Melotti  added the comment:

Can you specify what version of Python you are using, how you copy/paste 
(e.g. ctrl+c/v, from the menu), and whether it halts regardless of what you 
copy/paste?

--
nosy: +ezio.melotti


[issue12725] Docs: Odd phrase "floating seconds" in socket.html

2011-08-13 Thread Roundup Robot


Roundup Robot  added the comment:

New changeset dfe6f0a603d2 by Ezio Melotti in branch '2.7':
#12725: fix working. Patch by Ben Hayden.
http://hg.python.org/cpython/rev/dfe6f0a603d2

New changeset ab3432a81c26 by Ezio Melotti in branch '3.2':
#12725: fix working. Patch by Ben Hayden.
http://hg.python.org/cpython/rev/ab3432a81c26

New changeset 49e9e34da512 by Ezio Melotti in branch 'default':
#12725: merge with 3.2.
http://hg.python.org/cpython/rev/49e9e34da512

--
nosy: +python-dev


[issue12725] Docs: Odd phrase "floating seconds" in socket.html

2011-08-13 Thread Ezio Melotti


Ezio Melotti  added the comment:

Fixed, thanks for the report and the patch!

--
nosy: +ezio.melotti
resolution:  -> fixed
stage: needs patch -> committed/rejected
status: open -> closed


[issue12748] IDLE halts on osx when copy and paste

2011-08-13 Thread hy


hy  added the comment:

I use the latest python 2.7.2 binary in a freshly installed os x
I use command c and command v, and also use the menu.
Also, it halts when I cut.
No matter what I cut, copy and paste, it halts.
It happens both in the shell and editor.

I have to remind myself not to use copy and paste now. Once I forget, IDLE 
halts and I have to force quit it and I lose everything unsaved.

--


[issue12748] IDLE halts on osx when copy and paste

2011-08-13 Thread Ezio Melotti


Changes by Ezio Melotti :


--
nosy: +kbk, ned.deily


[issue12741] Add function similar to shutil.move that does not overwrite

2011-08-13 Thread David Townshend


David Townshend  added the comment:

A bit of research has shown that the proposed implementation will not work 
either, so my next suggestion is something along the lines of

def move2(src, dst):
    try:
        os.link(src, dst)
    except OSError as err:
        # handle error appropriately, raise shutil.Error if dst exists,
        # or use shutil.copy2 if dst is on a different filesystem.
        pass
    else:
        os.unlink(src)
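
One way the error handling hinted at in the comment might be filled in is
sketched below. This is an assumption-laden illustration, not the proposed
patch: the name move_no_overwrite is hypothetical, it handles regular files
only, and the cross-filesystem fallback (claim dst with O_EXCL, then copy
over it) is one possible choice that can leave an empty dst behind if the
copy fails.

```python
import errno
import os
import shutil


def move_no_overwrite(src, dst):
    """Move file src to dst, refusing to overwrite an existing dst.

    Hypothetical helper for illustration only.
    """
    try:
        # os.link fails if dst already exists, so the existence check and
        # the creation of dst happen in a single step.
        os.link(src, dst)
    except OSError as err:
        if err.errno == errno.EEXIST:
            raise shutil.Error("Destination path %r already exists" % dst)
        if err.errno != errno.EXDEV:
            raise
        # dst is on a different filesystem, where hard links cannot reach.
        # Claim dst exclusively first, then copy the data over it.
        try:
            fd = os.open(dst, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            raise shutil.Error("Destination path %r already exists" % dst)
        os.close(fd)
        shutil.copy2(src, dst)
    # Reached only when dst now holds the data.
    os.unlink(src)
```

Calling move_no_overwrite("a.txt", "b.txt") when b.txt already exists
raises shutil.Error and leaves both files untouched.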

--


[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

2011-08-13 Thread Tom Christiansen


Tom Christiansen  added the comment:

Ezio Melotti  added the comment:

>> It is simply a design error to pretend that the number of characters
>> is the number of code units instead of code points.  A terrible and
>> ugly one, but it does not mean you are UCS-2.

> If you are referring to the value returned by len(unicode_string), it
> is the number of code units.  This is a matter of "practicality beats
> purity".  Returning the number of code units is O(1) (num_of_bytes/2).
> To calculate the number of characters it's instead necessary to scan
> all the string looking for surrogates and then count any surrogate
> pair as 1 character.  It was therefore decided that it was not worth
> to slow down the common case just to be 100% accurate in the
> "uncommon" case.

If speed is more important than correctness, I can make any algorithm
infinitely fast.  Given the choice between correct and quick, I will 
take correct every single time.

Plus your strings are immutable! You know how long they are and they 
never change.  Correctness comes at a negligible cost.  

It was a bad choice to return the wrong answer.

> That said it would be nice to have an API (maybe in unicodedata or as
> new str methods?) able to return the number of code units, code
> points, graphemes, etc, but I'm not sure that it should be the default
> behavior of len().

Always code points, never code units.  I even use a class whose length
method returns the grapheme count, because even code points aren't good
enough.  Yes of course graphemes have to be counted.  Big deal.   How 
would you like it if you said to move three to the left in vim and 
it *didn't* count each grapheme as one position?  Madness.
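
To make the code point versus grapheme distinction concrete, here is a
deliberately crude Python 3 sketch. It is not the class mentioned above and
it is not full grapheme cluster segmentation as defined by the Unicode
Standard: it merely assumes that every code point with a nonzero canonical
combining class attaches to the preceding base character. The function name
is made up.

```python
import unicodedata


def approx_grapheme_count(s):
    """Roughly count user-perceived characters in s.

    Crude approximation for illustration: counts only code points that
    are not combining marks (canonical combining class 0).
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


word = "e\u0301tude"    # "e" followed by COMBINING ACUTE ACCENT, then "tude"
print(len(word))                    # code points
print(approx_grapheme_count(word))  # approximate graphemes
```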

>> The ugly terrible design error is digusting and wrong, just as much
>> in Python as in Java, and perhaps moreso because of the idiocy of
>> narrow builds even existing.

> Again, wide builds use twice as much the space than narrow ones, but
> one the other hand you can have fast and correct behavior with e.g.
> len().  If people don't care about/don't need to use non-BMP chars and
> would rather use less space, they can do so.  Until we agree that the
> difference in space used/speed is no longer relevant and/or that non-
> BMP characters become common enough to prefer the "correct behavior"
> over the "fast-but-inaccurate" one, we will probably keep both.

Which is why I always put loud warnings in my Unicode-related Python
programs that they do not work right on Unicode if running under
a narrow build.  I almost feel I should just exit.

>> I haven't checked its UTF-16 codecs, but Python's UTF-8 codec is
>> broken in a bunch of ways.  You should be raising as exception in
>> all kinds of places and you aren't.

> I am aware of some problems of the UTF-8 codec on Python 2.  It used
> to follow RFC 2279 until last year and now it's been updated to follow
> RFC 3629.

Unicode says you can't put surrogates or noncharacters in a UTF-anything 
stream.  It's a bug to do so and pretend it's a UTF-whatever.

Perl has an encoding form, which it does not call "UTF-8", that you 
can use the UTF-8 algorithm on for any code point, including noncharacters
and surrogates and even non-Unicode code points far above 0x10_FFFF, up
to in fact 0xFFFF_FFFF_FFFF_FFFF on 64-bit machines.  It's the internal
format we use in memory.  But we don't call it real UTF-8, either.

It sounds like this is the kind of thing that would be useful to you.

> However, for backward compatibility, it still encodes/decodes
> surrogate pairs.  This broken behavior has been kept because on Python
> 2, you can encode every code point with UTF-8, and decode it back
> without errors:

No, that's not UTF-8 then.  By definition.  See the Unicode Standard.

 x = [unichr(c).encode('utf-8') for c in range(0x110000)]


> and breaking this invariant would probably make more harm than good.

Why?  Create something called utf8-extended or utf8-lax or utf8-nonstrict
or something.  But you really can't call it UTF-8 and do that.  

We actually equate "UTF-8" and "utf8-strict".  Our internal extended
UTF-8 is something else.  It seems like you're still doing the old
relaxed version we used to have until 2003 or so.  It seems useful
to be able to have both flavors, the strict and the relaxed one,
and to call them different things.  

Perl defaults to the relaxed one, which gives warnings not exceptions,
if you do things like setting PERLUNICODE to S or SD and such for the
default I/O encoding.  If you actually use "UTF-8" as the encoding on the
stream, though, you get the version that gives exceptions instead.

"UTF-8" = "utf8-strict"   strictly by the standard; raises exceptions otherwise
"utf8"                    loosely only; emits warnings on encoding illegal things

We currently only emit warnings or raise exceptions on I/O, not on chr
operations and such.  We used to raise exceptions on things like
chr(0xD800), but that was a mistake caused by misunderstanding t
