[Python-Dev] Automatic encoding detection [was: Re: Python3 complexity - 2 use cases]

2014-01-13 Thread Jim J. Jewett


 So when it is time to guess [at the character encoding of a file],
 a source of good guesses is an important battery to include.

 The barrier for entry to the standard library is higher than mere
 usefulness.

Agreed.  But "most programs will need it, and people will either
include (the same) 3rd-party library themselves, or write their
own workaround, or have buggy code" *is* sufficient.

The points of contention are

(1)  How many programs have to deal with documents written
 outside their control -- and probably originating on
 another system.

I'm not ready to say "most programs" in general, but I think that
barrier is met for both web clients (for which we already supply
several batteries) and quick-and-dirty utilities.

(2)  How serious are the bugs / How annoying are the workarounds?

As someone who mostly sticks to English, and who tends to manually
ignore stray bytes when dealing with a semi-binary file format,
the bugs aren't that serious for me personally.  So I may well
choose to write buggy programs, and the bug may well never get
triggered on my own machine.

But having a batch process crash one run in ten (where it didn't
crash at all under Python 2) is a bad thing.  There are environments
where (once I knew about it) I would add chardet (if I could get
approval for the 3rd-party component).

(3)  How clearcut is the *right* answer?

As I said, at one point (several years ago), the w3c and whatwg
started to standardize the right answer.  They backed that out,
because vendors wanted the option to improve their detection in
the future without violating standards.

There are certainly situations where local knowledge can do
better than a global solution like chardet,  but ... the
right answer is clear most of the time.

Just ignoring the problem is still a 99% answer, because most text
in ASCII-mostly environments really is close enough.  But that
is harder (and the One Obvious Way is less reliable) under Python 3
than it was under Python 2.

An alias for open that defaulted to surrogate-escape (or returned
the new ASCIIstr bytes hybrid) would probably be sufficient to get
back (almost) to Python 2 levels of ease and reliability.  But it
would tend to encourage ASCII/English-only assumptions.
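
A minimal sketch of such an alias (the name open_ascii is made up
here, not a proposed spelling):

    def open_ascii(path, mode="r"):
        # ASCII+ text: stray bytes survive a read/write round-trip as
        # surrogates instead of raising UnicodeDecodeError.
        return open(path, mode, encoding="ascii", errors="surrogateescape")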

You could fix most of the remaining problems by scripting a web
browser, except that scripting the browser in a cross-platform
manner is slow and problematic, even with webbrowser.py.

Whatever a recent Firefox does is (almost by definition) good
enough, and is available ... but maybe not in a convenient form,
which is one reason that chardet was created as a port thereof.
Also note that Firefox assumes you will update more often than
Python does.

Whatever chardet said at the time the Python release was cut
is almost certainly good enough too.

The browser makers go to great lengths to match each other even 
in bizarre corner cases.  (Which is one reason there aren't more
competing solutions.)  But that doesn't mean it is *impossible*
to construct a test case where they disagree -- or even one where
a recent improvement in the algorithms led to regressions for one
particular document.

That said, such regressions should be limited to documents that
were not properly labeled in the first place, and should be rare
even there.  Think of the changes as obscure bugfixes, akin to
a program starting to handle NaN properly, in a place where it
should not ever see one.

-jJ


[Python-Dev] Python3 complexity - 2 use cases

2014-01-10 Thread Jim J. Jewett

 
 Steven D'Aprano wrote:
 I think that heuristics to guess the encoding have their role to play,
 if the caller understands the risks.

Ben Finney wrote:
 In my opinion, content-type guessing heuristics certainly don't belong
 in the standard library.

It would be great if there were never any need to guess.  But in the
real world, there is -- and often the user won't know any more than
python does.  So when it is time to guess, a source of good guesses
is an important battery to include.

The HTML5 specifications go through some fairly extreme contortions
to document what browsers actually do, as opposed to what previous
standards have mandated.  They don't currently specify how to guess
(though I think a draft once tried, since the major browsers all do
it, and at the time did it similarly), but the specs do explicitly
support such a step, and do provide an implementation note
encouraging user-agents to do at least minimal auto-detection.  

http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding

My own opinion is therefore that Python SHOULD provide better support
for both of the following use cases:

(1)  "Treat this file like it came from the web" -- including
 autodetection and even overriding explicit charset
 declarations for certain charsets.

We should explicitly treat autodetection like time zone data --
there is no promise that the right answer (or at least the
best guess) won't change, even within a release.

I offer no opinion on whether chardet in particular is still
too volatile, but the docs should warn that the API is driven
by possibly changing external data.
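
For concreteness, a hedged sketch of use case (1) with the third-party
chardet package (the guess -- and therefore the output -- may change
as chardet's data tables evolve):

    import chardet  # third-party; not in the stdlib

    def read_like_the_web(path):
        with open(path, "rb") as f:
            raw = f.read()
        guess = chardet.detect(raw)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
        return raw.decode(guess["encoding"] or "latin-1", errors="replace")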

(2)  "Treat this file as ASCII+", where anything non-ASCII
 will (at most) be written back out unchanged; it doesn't
 even need to be converted to text.

At this time, I don't know whether the right answer is making it
easy to default to surrogate-escape for all error-handling, 
adding more bytes methods, encouraging use of python's latin-1
variant, offering a dedicated (new?) codec, or some new suggestion.

I do know that this use case is important, and that python 3
currently looks clumsy compared to python 2.


-jJ



[Python-Dev] Which direction is UnTransform? / Unicode is different

2013-11-19 Thread Jim J. Jewett

 
(Fri Nov 15 16:57:00 CET 2013) Stephen J. Turnbull wrote:

  Serhiy Storchaka wrote:

   If the transform() method will be added, I prefer to have only
   one transformation method and specify a direction by the
   transformation name (bzip2/unbzip2).

Me too.  Until I consider special cases like compress, or lower,
and realize that there are enough special cases to become a major wart
if generic transforms ever became popular.  

 People think about these transformations as en- or de-coding, not
 transforming, most of the time.  Even for a transformation that is
 an involution (e.g., rot13), people have a very clear idea of what's
 encoded and what's not, and they are going to prefer the names
 encode and decode for these (generic) operations in many cases.

I think this is one of the major stumbling blocks with unicode.

I originally disagreed strongly with what Stephen wrote -- but then
I realized that all my counterexamples involved unicode text.

I can tell whether something is tarred or untarred, zipped or unzipped.

But an 8-bit (even Latin-1, let alone ASCII) bytestring really doesn't
seem "encoded", and it doesn't make sense to "decode" a perfectly
readable (ASCII) string into a sequence of code units.

Nor does it help that http://www.unicode.org/glossary/#code_unit
defines code unit as "The minimal bit combination that can represent
a unit of encoded text for processing or interchange. The Unicode
Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code
units in the UTF-16 encoding form, and 32-bit code units in the UTF-32
encoding form. (See definition D77 in Section 3.9, Unicode Encoding
Forms.)"

I have to read that very carefully to avoid mentally translating it
into "Code Units are *en*coded, and there are lots of different
complicated encodings that I wouldn't use unless I were doing special
processing or interchange."  If I'm not using the network, or if my
interchange format already looks like readable ASCII, then unicode
sure sounds like a complication.  I *will* get confused over which
direction is encoding and which is decoding.  (Removing .decode()
from the (unicode) str type in 3 does help a lot, if I have a Python 3
interpreter running to check against.)
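
A quick illustration of how Python 3 pins that direction down by type:

    text = "café"
    data = text.encode("utf-8")   # str -> bytes: *en*code to a byte encoding
    back = data.decode("utf-8")   # bytes -> str: *de*code back to text
    # str has no .decode and bytes has no .encode in 3.x, so guessing
    # the wrong direction fails immediately with an AttributeError.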


I'm not sure exactly what implications the above has, but it certainly
supports separating the Text Processing from the generic codecs, both
in the documentation and in any potential new methods.

Instead of relying on introspection of .decodes_to and .encodes_to, it
would be useful to have charsetcodecs and transformcodecs as entirely
different modules, with their own separate registries.  I will even
note that the existing help(codecs) seems more appropriate for
charsetcodecs than it does for the current conjoined module.


-jJ



[Python-Dev] PEP 454 (tracemalloc) disable == clear?

2013-10-29 Thread Jim J. Jewett

 
(Tue Oct 29 12:37:52 CET 2013) Victor Stinner wrote:

 For consistency, you cannot keep traces when tracing is disabled.
 The free() must be enabled to remove allocated memory blocks, or
 next malloc() may get the same address which would raise an assertion
 error (you cannot have two memory blocks at the same address).

That seems like a quirk of the implementation, particularly since
the actual address is not returned to the user.  Nor do I see any way
of knowing when that allocation is freed.

Well, unless I missed it... I don't see how to get anything beyond
the return value of get_traces, which is a (time-ordered?) list
of allocation sizes, each with the then-current call stack.  It doesn't
mention any attribute for indicating that some entries are
de-allocations, let alone the actual address of each allocation.

 For the reason explained above, it's not possible to disable the whole
 module temporarily.

 Internally, tracemalloc uses a thread-local variable (called the
 reentrant flag) to disable temporarily tracing allocations in the
 current thread. It only disables tracing new allocations;
 deallocations are still processed.

Even assuming the restriction is needed, this just seems to mean that
disabling (or filtering) should not affect de-allocation events, for
fear of corrupting tracemalloc's internal structures.

In that case, I would expect disabling (and filtering) to stop
capturing new allocation events for me, but I would still expect
tracemalloc to do proper internal maintenance.

It would at least explain why you need both disable *and* reset;
reset would empty those internal structures, so that tracemalloc
could shortcut that maintenance.  I would NOT assume that I needed
to call reset when changing the filters, nor would I assume that
changing them threw out existing traces.
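
For reference, a minimal sketch of the module as it eventually shipped
(in 3.4); the get_traces()/disable() API discussed above was still
being revised:

    import tracemalloc

    tracemalloc.start()
    data = [str(i) for i in range(10000)]
    snapshot = tracemalloc.take_snapshot()
    tracemalloc.stop()                  # stop tracing and drop the traces
    for stat in snapshot.statistics("lineno")[:3]:
        print(stat)                     # size/count per source line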

-jJ



[Python-Dev] backported Enum

2013-06-28 Thread Jim J. Jewett

 
(On June 19, 2013) Barry Warsaw wrote about porting mailman from
flufl.enum to the stdlib.enum:


 Switching from call syntax to getitem syntax for looking up an
 enum member by name, e.g.

-delivery_mode = DeliveryMode(data['delivery_mode'])
+delivery_mode = DeliveryMode[data['delivery_mode']]

 Switching from getitem syntax to call syntax for looking up an
 enum member by value, e.g.

-return self._enum[value]
+return self._enum(value)

 Interesting that these two were exactly opposite from flufl.enum.

Is there a reason why these were reversed?

I can sort of convince myself that it makes sense because dicts
work better with strings than with ints, but ... it seems like
such a minor win that I'm not sure it is worth backwards
incompatibility.  (Of course, I also don't know how much use
stdlib.enum has already gotten with the current syntax.)



 Switching from int() to .value to get the integer value of an
 enum member, e.g.

-return (member.list_id, member.address.email, int(member.role))
+return (member.list_id, member.address.email, member.role.value)

Is this just a style preference?

Using a .value attribute certainly makes sense, but I don't see it
mentioned in the PEP as even optional, let alone recommended.  If
you care that the value be specifically an int (as opposed to any
object), then an int constructor may be better.
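
For reference, a sketch of the syntax Barry describes, as it landed in
the stdlib enum module (3.4):

    from enum import IntEnum

    class DeliveryMode(IntEnum):
        regular = 1
        digest = 2

    DeliveryMode["digest"]       # lookup by name  -> DeliveryMode.digest
    DeliveryMode(2)              # lookup by value -> DeliveryMode.digest
    DeliveryMode.digest.value    # 2; int() also works on an IntEnum member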

 [Some additional changes that mean there will be *some* changes,
 which does reduce the pressure for backwards compatibility.] ...


 An unexpected difference is that failing name lookups raise a
 KeyError instead of a ValueError.

I could understand either, as well as AttributeError, since the
instance that would represent that value isn't a class attribute.

Looking at Enum creation, I think ValueError would be better than
TypeError for complaints about duplicate names.  Was TypeError
chosen because it should only happen during setup?

I would also not be shocked if some people expect failed value
lookups to raise an IndexError, though I expect they would
adapt if they get something else that makes sense.

Would it be wrong to create an EnumError that subclasses
(ValueError, KeyError, AttributeError) and to raise that
subclass from everything but _StealthProperty and _get_mixins?


-jJ



[Python-Dev] Keyword meanings [was: Accept just PEP-0426]

2012-11-20 Thread Jim J. Jewett

 
Vinay Sajip reworded the 'Provides-Dist' definition to explicitly say:

 The use of multiple names in this field *must not* be used for
 bundling distributions together. It is intended for use when
 projects are forked and merged over time ...

(1)  Then how *should* the bundle-of-several-components case be
represented?

(2)  How is 'Provides-Dist' different from 'Obsoletes-Dist'?
The only difference I can see is that it may be a bit more polite
to people who do want to install multiple versions of a (possibly
abstract) package.


-jJ



[Python-Dev] PEP 362: 4th edition

2012-06-15 Thread Jim J. Jewett

Summary:

*Every* Parameter attribute is optional, even name.  (Think of
builtins, even if they aren't automatically supported yet.)
So go ahead and define some others that are sometimes useful.

Instead of defining a BoundArguments class, just return a copy
of the Signature, with value attributes added to the Parameters.

Use subclasses to distinguish the parameter kind.  (Replacing
most of the is_ methods from the 3rd version.)

[is_]implemented is important information, but the API isn't
quite right; even with tweaks, maybe we should wait a version
before freezing it on the base class.  But I would be happy
to have Larry create a Signature for the os.* functions,
whether that means a subclass or just an extra instance
attribute.

I favor passing a class to Signature.format, because so many of
the formatting arguments would normally change in parallel.
But my tolerance for nested structures may be unusually high.

I make some more specific suggestions below.


In http://mail.python.org/pipermail/python-dev/2012-June/120305.html
Yury Selivanov wrote:

 A Signature object has the following public attributes and methods:

 * return_annotation : object
The annotation for the return type of the function if specified.
If the function has no annotation for its return type, this
attribute is not set.

This means users must already be prepared to use hasattr with the
Signature as well as the Parameters -- in which case, I don't see any
harm in a few extra optional properties.

I would personally prefer to see the name (and qualname) and docstring,
but it would make perfect sense to implement these by keeping a weakref
to the original callable, and just delegating there unless/until the
properties are explicitly changed.  I suspect others will have a use
for additional delegated attributes, such as the self of boundmethods.

I do agree that __eq__ and __hash__ should depend at most on the
parameters (including their order) and the annotation.

 * parameters : OrderedDict
 An ordered mapping of parameters' names to the corresponding
 Parameter objects (keyword-only arguments are in the same order
 as listed in ``code.co_varnames``).

For a specification, that feels a little too tied to the specific
implementation.  How about:

 Parameters will be ordered as they are in the function declaration.

or even just:

 Positional parameters will be ordered as they are in the function
 declaration.

because:
def f(*, a=4, b=5): pass

and:
def f(*, b=5, a=4): pass

should probably have equal signatures.


Wild thought:  Instead of just *having* an OrderedDict of Parameters,
should a Signature *be* that OrderedDict (with other attributes)?
That is, should signature(testfn)["foo"] get the "foo" parameter?


 * bind(\*args, \*\*kwargs) -> BoundArguments
 Creates a mapping from positional and keyword arguments to
 parameters.  Raises a ``BindError`` (subclass of ``TypeError``)
 if the passed arguments do not match the signature.
 * bind_partial(\*args, \*\*kwargs) -> BoundArguments
 Works the same way as ``bind()``, but allows the omission
 of some required arguments (mimics ``functools.partial``
 behavior.)

Are those descriptions actually correct?

I would expect the mapping to be from parameters (or parameter names)
to values extracted from *args and **kwargs.

And I'm not sure the current patch does even that, since it seems to
instead return a non-Mapping object (but with a mapping attribute)
that could be used to re-create *args, **kwargs in canonical form.
(Though that canonicalization is valuable for calls; it might even
be worth an as_call method.)


I think it should be explicit that this mapping does not include
parameters which would be filled by default arguments.  In fact, if
you stick with this interface, I would like a 3rd method that does
fill out everything.


But I think it would be simpler to just add an optional attribute
to each Parameter instance, and let bind fill that in on the copies,
so that the return value is also a Signature.  (No need for the
BoundArguments class.)  Then the user can decide whether or not to
plug in the defaults for missing values.
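
For comparison, a sketch of the API as it eventually shipped: it kept
BoundArguments, but gave it an .arguments mapping and (in 3.5) an
explicit way to fill in defaults:

    import inspect

    def f(a, b=10, *, c=20):
        pass

    sig = inspect.signature(f)
    ba = sig.bind(1, c=3)
    print(ba.arguments)     # {'a': 1, 'c': 3} -- defaults not filled in
    ba.apply_defaults()     # added in 3.5
    print(ba.arguments)     # {'a': 1, 'b': 10, 'c': 3}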


 * format(...) -> str
 Formats the Signature object to a string.  Optional arguments allow
 for custom render functions for parameter names,
 annotations and default values, along with custom separators.

I think it should state explicitly that by default, the return value
will be a string that could be used to declare an equivalent function,
if "Signature" were replaced with "def funcname".

There are enough customization parameters that would often be changed
together (e.g., to produce HTML output) that it might make sense to use
overridable class defaults -- or even to make format a class itself.

I also think it would make sense to delegate formatting the individual
parameters to the parameter objects.  

[Python-Dev] time.clock_info() field names

2012-04-29 Thread Jim J. Jewett


In http://mail.python.org/pipermail/python-dev/2012-April/119134.html
Benjamin Peterson wrote:

 I see PEP 418 gives time.clock_info() two boolean fields named
 is_monotonic and is_adjusted. I think the is_ is unnecessary and
 a bit ugly, and they could just be renamed monotonic and adjusted.

I agree with monotonic, but I think it should be adjustable.

To me, "adjusted" and "is_adjusted" both imply that an adjustment
has already been made; "adjustable" only implies that it is possible.

I do remember concerns (including Stephen J. Turnbull's
CAL_0O19nmi0+zB+tV8poZDAffNdTnohxo9y5dbw+E2q=9rx...@mail.gmail.com )
that adjustable should imply at least a list of past adjustments,
and preferably a way to make an adjustment.

I just think that stating it is adjustable (without saying how, or
whether and when it already happened) is less wrong than claiming it
is already adjusted just in case it might have been.
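
As it happens, the field shipped in 3.3 under exactly that name:

    import time
    info = time.get_clock_info("monotonic")
    print(info.monotonic, info.adjustable, info.resolution)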

-jJ



[Python-Dev] Rename time.steady(strict=True) to time.monotonic()?

2012-03-23 Thread Jim J. Jewett


In http://mail.python.org/pipermail/python-dev/2012-March/118024.html
Steven D'Aprano wrote:

 What makes this "steady", given that it can be adjusted
 and it can go backwards?

It is best-effort for "steady", but putting "best" in the name would
be an attractive nuisance.

 Is steady() merely a convenience function to avoid the user
 having to write something like this?

  try:
 mytimer = time.monotonic
  except AttributeError:
 mytimer = time.time

That would still be worth doing.  But I think the main point is
that the clock *should* be monotonic, and *should* be as precise
as possible.

Given that it returns seconds elapsed (since an undefined start),
perhaps it should be

time.seconds()

or even

time.counter()
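
For the record, 3.3 ended up shipping no steady() at all, but two
separate clocks:

    import time
    time.monotonic()       # monotonic; unaffected by system clock updates
    time.perf_counter()    # highest-resolution counter for short intervals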

-jJ



[Python-Dev] Docs of weak stdlib modules should encourage exploration of 3rd-party alternatives

2012-03-19 Thread Jim J. Jewett


In http://mail.python.org/pipermail/python-dev/2012-March/117570.html
Steven D'Aprano posted:

 "Need" is awfully strong. I don't believe it is the responsibility
 of the standard library to be judge and reviewer of third party
 packages that it doesn't control.

It is, however, user-friendly to indicate when the stdlib selections
are particularly likely to be for reasons other than "a bunch of
experts believe this is the best way to do this."  CPython's
documentation is (de facto) the documentation for Python in
general, and pointing people towards other resources (particularly
PyPI itself) is quite reasonable.

Many modules are in the stdlib in part because they are an *acceptable*
way of doing something, and the best ways are either changing too
quickly or are so complicated that it doesn't make sense to burden
the *standard* library with specialist needs.  In those cases, I do
think the documentation should say so.

Specific examples:

http://docs.python.org/library/numeric.html quite reasonably has
subsections only for what ships with Python.  But I think the
introductory paragraph could stand to have an extra sentence
explaining why and when people should look beyond the standard
library, such as:

Applications centered around mathematics may benefit from
specialist 3rd-party libraries, such as
numpy <http://pypi.python.org/pypi/numpy/>,
gmpy <http://pypi.python.org/pypi/gmpy>, and
scipy <http://pypi.python.org/pypi/scipy>.


I would add a similar sentence to the "web" section, or the
"internet protocols" section if web is still not broken out
separately.  http://docs.python.org/dev/library/internet.html

Note that some web conventions are still evolving too quickly
for convenient encapsulation in a stable library.  Many
applications will therefore prefer functional replacements
from third parties, such as requests or httplib2, or
frameworks such as Django and Zope.  www-related products
can be found by browsing PyPI for top internet subtopic www/http.
 http://pypi.python.org/pypi?:action=browse&c=319&c=326

[I think that searching by classifier -- which first requires browse,
and can't be reached from the list of classifiers -- could be improved.]

  
 Should we recommend wxPython over Pyjamas or PyGUI or PyGtk?

Actually, I think the existing http://docs.python.org/library/othergui.html
does a pretty good job; I would not object to adding mentions of
other tools as well, but a wiki reference is probably sufficient.


-jJ



[Python-Dev] Issue #10278 -- why not just an attribute?

2012-03-19 Thread Jim J. Jewett


In http://mail.python.org/pipermail/python-dev/2012-March/117762.html
Georg Brandl posted:

 +   If available, a monotonic clock is used. By default, if *strict* is False,
 +   the function falls back to another clock if the monotonic clock failed or is
 +   not available. If *strict* is True, raise an :exc:`OSError` on error or
 +   :exc:`NotImplementedError` if no monotonic clock is available.

 This is not clear to me.  Why wouldn't it raise OSError on error even with
 strict=False?  Please clarify which exception is raised in which case.

Passing strict as an argument seems like overkill since it will always
be meaningless on some (most?) platforms.  Why not just use a function
attribute?  Those few users who do care can check the value of
time.steady.monotonic before calling time.steady(); exceptions raised
will always be whatever the clock actually raises.
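
A tiny sketch of the attribute-based check being suggested (entirely
hypothetical; time.steady was never released under that name):

    import time

    def steady():
        return time.monotonic() if steady.monotonic else time.time()

    steady.monotonic = hasattr(time, "monotonic")

    if steady.monotonic:        # callers who care check before calling
        start = steady()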

-jJ



[Python-Dev] Python install layout and the PATH on win32

2012-03-14 Thread Jim J. Jewett


In http://mail.python.org/pipermail/python-dev/2012-March/117586.html
van.lindberg at gmail.com posted:

 1) The layout for the python root directory for all platforms should be
 as follows:
 
 stdlib = {base/userbase}/lib/python{py_version_short}
 platstdlib = {base/userbase}/lib/python{py_version_short}
 purelib = {base/userbase}/lib/python{py_version_short}/site-packages
 platlib = {base/userbase}/lib/python{py_version_short}/site-packages
 include = {base/userbase}/include/python{py_version_short}
 scripts = {base/userbase}/bin
 data = {base/userbase}

Why?

Pure python vs compiled C doesn't need to be separated at the directory
level, except for cleanliness.

Some (generally unix) systems prefer to split the libraries into several
additional pieces depending on CPU architecture.

The structure listed above doesn't have a location for docs.

Some packages (such as tcl) may be better off in their own area.

What is data?  Is this an extra split compared to today, or does it
refer to things like LICENSE.txt, README.txt, and NEWS.txt?

And even once I figure out where files have moved, and assume that
the split is perfect -- what does this buy me over the current
situation?  I was under the impression that programs like distutils
already handled finding the appropriate directories for a program;
if you're rewriting that logic, you're just asking for bugs on a
strange platform that you don't use.

If you're looking for things interactively, then platform conventions
are probably more important than consistency across platforms.  If you
disagree, you are welcome to reorganize your personal linux installation
so that it matches windows, and see whether it causes you any problems.

 ... We *already* have this. The only difference in this proposal is
 that we go from py_version_nodot to py_version_short, i.e. from

 c:\python33\lib\python33 

 to

 c:\python33\lib\python3.3

I have not seen that redundancy before on windows.

I'm pretty sure that it is a relic of your Linux provider wanting
to support multiple python versions using shared filesystems.  The
Windows standard is to use a local disk, and to bundle it all up
into its own directory, similar to the way that java apps sometimes
ship with their own JVM.

Also note that using the dot in a directory name is incautious.
I haven't personally had trouble in several years, but doing so is
odd enough that some should be expected.  Python already causes
some grief by not installing in "Program Files", but that is at
least justified by the spaces-in-filenames problem; what is the
advantage of "3.3"?


I'm using windows, and I just followed the defaults at installation.
It is possible that the installer continued to do something based
on an earlier installation, but I don't think this machine has ever
had a customized installation of any python version.

C:\python32\*
Everything is under here; I assume {base/userbase} would be
set to C:\python32

As is customary for windows, the base directory contains the
license/readme/news and all executables that the user is
expected to use directly.  (python.exe, pythonw.exe.  It also
contains w9xpopen.exe that users do not use, but that too is
fairly common.)

There is no data directory.

Subdirectories are:

C:\python32\DLLs
In addition to regular DLL files, it contains .pyd files
and icons.  It looks like modules from the stdlib that happen
to be written in C.  Most users will never bother to look here.

C:\python32\Doc
A .chm file; full html would be fine too, but removing it
would be a bad idea.

C:\python32\include
These are the header files, though most users will never have
any use for them, as there isn't generally a compiler.

C:\python32\Lib
The standard library -- or at least the portion implemented
in python.

Note that site-packages is a subdirectory here.  It doesn't
happen to have an __init__.py, but to an ordinary user it
looks just like any other stdlib package, such as xml or
multiprocessing.

I personally happen to keep things in subdirectories of
site-packages, but I can't say what is standard.

Moving site-packages out of the Lib directory might make
sense, but probably isn't worth the backward compatibility hit.

C:\python32\libs
.lib files.  I'm not entirely sure what these (as opposed to
the DLLs) are for; lib files aren't that common on windows.
My machine does not appear to have any that aren't associated
with cross-platform tools or unix emulation.

C:\python32\tcl
Note that this is in addition to associated files under DLLs
and libs.  I would prefer to see them in one place, but
moving it in with non-tcl stuff would not be an improvement.
Most users will never look (or care); those that do usually
appreciate knowing that, for example, the dde subdirectory
is for tcl.

C:\python32\Tools
This has three subdirectories (i18n, 

[Python-Dev] Python install layout and the PATH on win32

2012-03-14 Thread Jim J. Jewett


In http://mail.python.org/pipermail/python-dev/2012-March/117617.html
van.lindberg at gmail.com posted:

 As noted earlier in the thread, I also change my proposal to maintain 
 the existing differences between system installs and user installs.

[Wanted lower case, which should be irrelevant; sysconfig.get_python_inc
already assumes lower case despite the configuration file.]

[Wanted bin instead of Scripts, even though they aren't binaries.]

If there are to be any changes, I *am* tempted to at least harmonize
the two install types, but to use the less redundant system form.  If
the user is deliberately trying to hide that it is version 33 (or even
that it is python), then so be it; defaulting to redundant information
is not an improvement.

Set the base/userbase at install time, with defaults of

base = %SystemDrive%\{py_version_nodot}
userbase = %USERPROFILE%\Application Data\{py_version_nodot}

usedbase = base for system installs; userbase for per-user installs.

Then let the rest default to subdirectories; sysconfig.get_config_vars
on windows explicitly doesn't provide as many variables as unix, just
INCLUDEPY (which should default to {usedbase}/include) and
LIBDEST and BINLIBDEST (both of which should default to {usedbase}/lib).

And no, I'm not forgetting data or scripts.  As best I can tell,
sysconfig doesn't actually expose them, and there is no Scripts
directory on my machine (except inside Tools).  Perhaps some
installers create it when they install their own extensions?

-jJ



[Python-Dev] problem with recursive yield from delegation

2012-03-07 Thread Jim J. Jewett


http://mail.python.org/pipermail/python-dev/2012-March/117396.html
Stefan Behnel posted:

 I found a problem in the current yield from implementation ...

[paraphrasing]

g1 yields from g2
g2 yields from g1
XXX python follows the existing delegation without checking re-entrancy
g2 (2nd call) checks re-entrancy, and raises an exception
g1 (2nd call) gets to handle the exception, and doesn't
g2 (1st call) gets to handle the exception, and does


How is this a problem?

Re-entering a generator is a bug.  Python caught it and raised an
appropriate exception.

It would be nice if python caught the generator cycle as soon as it was
created, just as it would be nice if reference cycles were collected as
soon as they became garbage.  But python doesn't promise to catch cycles
immediately, and the checks required to do so would slow down all code,
so in practice the checks are delayed.
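
The situation is easy to reproduce, and the exception Python raises
today is arguably the right outcome:

    def g1():
        yield from b        # delegate to g2's instance

    def g2():
        yield from a        # delegate back -> re-entry into g1

    a = g1()
    b = g2()
    try:
        next(a)
    except ValueError as exc:
        print(exc)          # generator already executing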


-jJ



[Python-Dev] Adding a builtins parameter to eval(), exec() and __import__().

2012-03-07 Thread Jim J. Jewett


http://mail.python.org/pipermail/python-dev/2012-March/117395.html
Brett Cannon posted:

[in reply to Mark Shannon's suggestion of adding a builtins parameter
to match locals and globals]

 It's a mess right now to try to grab the __import__()
 implementation and this would actually help clarify import semantics by
 saying that __import__() for any chained imports comes from __import__()'s
 locals, globals, or builtins arguments (in that order) or from the builtins
 module itself (i.e. tstate->builtins).

How does that differ from today?

If you're saying that the locals and (module-level) globals aren't
always checked in order, then that is a semantic change.  Probably
a good change, but still a change -- and it can be made independently
of Mark's suggestion.

Also note that I would assume this was for sandboxing, and that
missing names should *not* fall back to the real globals, although
I would understand if bootstrapping required the import statement to
get special treatment.
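
For contrast, today's mechanism: eval()/exec() already take builtins
from a "__builtins__" entry in the supplied globals mapping, with no
fallback once one is given:

    safe_globals = {"__builtins__": {"len": len}}
    print(eval("len('abc')", safe_globals))   # 3
    try:
        eval("open('x')", safe_globals)
    except NameError as exc:
        print(exc)                            # open is not available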


(Note that I like Mark's proposed change; I just don't see how it
cleans up import.)


-jJ



[Python-Dev] [RELEASED] Python 3.3.0 alpha 1

2012-03-06 Thread Jim J. Jewett


In http://mail.python.org/pipermail/python-dev/2012-March/117348.html
Georg Brandl posted:

 Python 3.3 includes a range of improvements of the 3.x series, as well as
 easier porting between 2.x and 3.x.  Major new features in the 3.3 release
 series are:

As much as it is nice to just celebrate improvements, I think
readers (particularly on the download page
http://www.python.org/download/releases/3.3.0/  ) would be better
served if there were an additional point about porting and the
hash changes.

http://docs.python.org/dev/whatsnew/3.3.html#porting-to-python-3-3
also failed to mention this, and even the changelog didn't seem to
warn people about failing tests or tell them how to work around it.

Perhaps something like:

Hash Randomization (issue 13703) is now on by default.  Unfortunately,
this does break some tests; it can be temporarily turned off by setting
the environment variable PYTHONHASHSEED to 0 before launching python.
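
For what it's worth, both the symptom and the workaround are easy to
demonstrate (sketch; run under 3.3):

    import os, subprocess, sys

    cmd = [sys.executable, "-c", "print(hash('abc'))"]
    print(subprocess.check_output(cmd))            # differs between runs...
    print(subprocess.check_output(cmd))

    env = dict(os.environ, PYTHONHASHSEED="0")
    print(subprocess.check_output(cmd, env=env))   # ...but stable with the
    print(subprocess.check_output(cmd, env=env))   # seed pinned to 0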


-jJ



[Python-Dev] PEP 416: Add a frozendict builtin type

2012-02-29 Thread Jim J. Jewett


In http://mail.python.org/pipermail/python-dev/2012-February/117113.html
Victor Stinner posted:

 An immutable mapping can be implemented using frozendict::

 class immutabledict(frozendict):
 def __new__(cls, *args, **kw):
 # ensure that all values are immutable
 for key, value in itertools.chain(args, kw.items()):
 if not isinstance(value, (int, float, complex, str, bytes)):
 hash(value)
 # frozendict ensures that all keys are immutable
 return frozendict.__new__(cls, *args, **kw)

What is the purpose of this?  Is it just a hashable frozendict?

If it is for security (as some other messages suggest), then I don't
think it really helps.

class Proxy:
    def __init__(self, value): self.value = value
    def __eq__(self, other): return self.value == other
    def __hash__(self): return hash(self.value)

An instance of Proxy is hashable, and the hash is not object.hash,
but it is still mutable.  You're welcome to call that buggy, but a
secure sandbox will have to deal with much worse.

-jJ



[Python-Dev] PEP 414 - Unicode Literals for Python 3

2012-02-28 Thread Jim J. Jewett


In http://mail.python.org/pipermail/python-dev/2012-February/117070.html
Vinay Sajip wrote:

 It's moot, but as I see it: the purpose of PEP 414 is to facilitate a
 single codebase across 2.x and 3.x. However, it only does this if your
 3.x interest is 3.3+

For many people -- particularly those who haven't ported yet -- 3.x
will mean 3.3+.  There are some who will support 3.2 because it is a
LTS release on some distribution, just as there were some who supported
Python 1.5 (but not 1.6) long into the 2.x cycle, but I expect them to
be the minority.

I certainly don't expect 3.2 to remain a primary development target,
the way that 2.7 is.  IIRC, the only ways to use 3.2 even today are:

  (a)  Make an explicit choice to use something other than the default
  (b)  Download directly and choose 3.x without OS support
  (c)  Use Arch Linux

These are the sort of people who can be expected to upgrade.

Now also remember that we're talking specifically about projects that
have *not* been ported to 3.x (== no existing users to support), and
that won't be ported until 3.2 is already in maintenance mode.

 If you also want to or need to support 3.0 - 3.2, it makes your
 workflow more painful,

Compared to dropping 3.2, yes.  Compared to supporting 3.2 today?
I don't see how.

 because you can't run tests on 2.x or 3.3 and then run them on 3.2
 without an intermediate source conversion step - just like the 2to3
 step that people find painful when it's part of maintenance workflow,
 and which in part prompted the PEP in the first place.

So the only differences compared to today are that:

(a)  Fewer branches are after the auto-conversion.
(b)  No current branches are after the auto-conversion.
(c)  The auto-conversion is much more limited in scope.


-jJ



[Python-Dev] PEP 414 - Unicode Literals for Python 3

2012-02-27 Thread Jim J. Jewett


In http://mail.python.org/pipermail/python-dev/2012-February/116953.html
Terry J. Reedy wrote:

 I presume that most 2.6 code has problems other than u'' when
 attempting to run under 3.x.

Why?

If you're talking about generic code that has seen minimal changes
since 2.0, sure.  But I think this request is specifically for
projects that are thinking about python 3, but are trying to use
a single source base regardless of version.  

Using an automatic translation step means that python (or at least
python 3) would no longer be the actual source code.  I've worked
with enough generated source code in other languages that it is
worth some pain to avoid even a slippery slope.

By the time you drop 2.5, the subset language is already pretty
good; if I have to write something version-specific, I prefer to
treat that as a sign that I am using the wrong approach.


-jJ



[Python-Dev] Add a frozendict builtin type

2012-02-27 Thread Jim J. Jewett


In http://mail.python.org/pipermail/python-dev/2012-February/116955.html
Victor Stinner proposed:

 The blacklist implementation has a major issue: it is still possible
 to call write methods of the dict class (e.g. dict.set(my_frozendict,
 key, value)).

It is also possible to use ctypes and violate even more invariants.
For most purposes, this falls under consenting adults.

 The whitelist implementation has an issue: frozendict and dict are not
 compatible, dict is not a subclass of frozendict (and frozendict is
 not a subclass of dict).

And because of Liskov substitutability, they shouldn't be; they should
be sibling children of a basedict that doesn't have the mutating
methods, but also doesn't *promise* not to mutate.

  * frozendict values must be immutable, as dict keys

Why?  That may be useful, but an immutable dict whose values
might mutate is also useful; by forcing that choice, it starts
to feel too specialized for a builtin.

  * Add a hash field to the PyDictObject structure

That is another indication that it should really be a sibling class;
most of the uses I have had for immutable dicts still didn't need
hashing.  It might be a worth adding anyhow, but only to immutable
dicts -- not to every instance dict or keywords parameter.

  * frozendict.__hash__ computes hash(frozenset(self.items())) and
 caches the result in its private hash attribute

Why?  hash(frozenset(self.keys())) would still meet the hash contract,
but it would be approximately twice as fast, and I can think of only
one case where it wouldn't work just as well.  (That case is wanting
to store a dict of alternative configuration dicts (with no defaulting
of values), but ALSO wanting to use the configurations themselves
(as opposed to their names) as the dict keys.)
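
Spelled out, the two candidate definitions (sketch only):

    def hash_by_items(d):
        return hash(frozenset(d.items()))   # the PEP's proposal

    def hash_by_keys(d):
        return hash(frozenset(d))           # cheaper; equal dicts have equal
                                            # key sets, so the hash contract
                                            # (equal => same hash) still holds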

-jJ



[Python-Dev] PEP for new dictionary implementation

2012-02-16 Thread Jim J. Jewett


PEP author Mark Shannon wrote
(in 
http://mail.python.org/pipermail/python-dev/attachments/20120208/05be469a/attachment.txt):

 ... allows ... (the ``__dict__`` attribute of an object) to share
 keys with other attribute dictionaries of instances of the same class.

Is "the same class" a deliberate restriction, or just a convenience
of implementation?  I have often created subclasses (or even families
of subclasses) where instances (as opposed to the type) aren't likely
to have additional attributes.  These would benefit from key-sharing
across classes, but I grant that it is a minority use case that isn't
worth optimizing if it complicates the implementation.

 By separating the keys (and hashes) from the values it is possible
 to share the keys between multiple dictionaries and improve memory use.

Have you timed not storing the hash (in the dict) at all, at least for
(unicode) str-only dicts?  Going to the string for its own cached hash
breaks locality a bit more, but saves 1/3 of the memory for combined
tables, and may make a big difference for classes that have relatively
few instances.

 Reduction in memory use is directly related to the number of dictionaries
 with shared keys in existence at any time. These dictionaries are typically
 half the size of the current dictionary implementation.

How do you measure that?  The limit for huge N across huge numbers
of dicts should be 1/3 (because both hashes and keys are shared); I
assume that gets swamped by object overhead in typical small dicts.

 If a table is split the values in the keys table are ignored,
 instead the values are held in a separate array.

If they're just dead weight, then why not use them to hold indices
into the array, so that values arrays only have to be as long as
the number of keys, rather than rounding them up to a large-enough
power-of-two?  (On average, this should save half the slots.)

 A combined-table dictionary never becomes a split-table dictionary.

I thought it did (at least temporarily) as part of resizing; are you
saying that it will be re-split by the time another thread is
allowed to see it, so that it is never observed as combined?



Given that this optimization is limited to class instances, I think
there should be some explanation of why you didn't just automatically
add slots for each variable assigned (by hard-coded name) within a
method; the keys would still be stored on the type, and array storage
could still be used for the values; the __dict__ slot could initially
be a NULL pointer, and instance dicts could be added exactly when they
were needed, covering only the oddball keys.
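
The closest existing mechanism is __slots__, which already keeps the
names once on the class and the values in a fixed per-instance array:

    class Point:
        __slots__ = ("x", "y")   # no per-instance __dict__ at all

    p = Point()
    p.x = 1.0                    # lands in the slot array, not in a dict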


I would reword (or at least reformat) the Cons section; at the
moment, it looks like there are four separate objections, and seems
to be a bit dismissive towards backwards compatibility.  Perhaps
something like:

While this PEP does not change any documented APIs or invariants,
it does break some de facto invariants.

C extension modules may be relying on the current physical layout
of a dictionary.  That said, extensions which rely on internals may
already need to be recompiled with each feature release; there are
already changes planned for both Unicode (for efficiency) and dicts
(for security) that would require authors of these extensions to
at least review their code.

Because iteration (and repr) order can depend on the order in which
keys are inserted, it will be possible to construct instances that
iterate in a different order than they would under the current
implementation.  Note, however, that this will happen very rarely
in code which does not deliberately trigger the differences, and
that test cases which rely on a particular iteration order will
already need to be corrected in order to take advantage of the
security enhancements being discussed under hash randomization, or
for use with Jython and PyPy.



-jJ



[Python-Dev] Store timestamps as decimal.Decimal objects

2012-02-16 Thread Jim J. Jewett


In http://mail.python.org/pipermail/python-dev/2012-February/116073.html
Nick Coghlan wrote:

 Besides, float128 is a bad example - such a type could just be
 returned directly where we return float64 now. (The only reason we
 can't do that with Decimal is because we deliberately don't allow
 implicit conversion of float values to Decimal values in binary
 operations).

If we could really replace float with another type, then there is
no reason that type couldn't be a nearly trivial Decimal subclass
which simply flips the default value of the (never used by any
caller) allow_float parameter to internal function _convert_other.

Since decimal inherits straight from object, this subtype could
even be made to inherit from float as well, and to store the lower-
precision value there.  It could even produce the decimal version
lazily, so as to minimize slowdown on cases that do not need the
greater precision.
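
A rough sketch of that subclass (illustrative only; the real shortcut
would be flipping decimal's private allow_float flag rather than
overriding operators one by one):

    from decimal import Decimal

    class Timestamp(Decimal):
        def __add__(self, other):
            if isinstance(other, float):
                other = Decimal(other)    # accept the float mix-in
            return Timestamp(Decimal.__add__(self, other))
        __radd__ = __add__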

Of course, that still doesn't answer questions on whether the higher
precision is a good idea ...

-jJ



[Python-Dev] plugging the hash attack

2012-02-16 Thread Jim J. Jewett


In http://mail.python.org/pipermail/python-dev/2012-January/116003.html


  Benjamin Peterson wrote:
  2. It will be off by default in stable releases ... This will
  prevent code breakage ...

 2012/1/27 Steven D'Aprano steve at pearwood.info:
  ... it will become on by default in some future release?

 On Fri, Jan 27, 2012, Benjamin Peterson benjamin at python.org wrote:
 Yes, 3.3. The solution in 3.3 could even be one of the more
 sophisticated proposals we have today.

Brett Cannon (Mon Jan 30) wrote:

 I think that would be good. And I would  even argue we remove support for
 turning it off to force people to no longer lean on dict ordering as a
 crutch (in 3.3 obviously).

Turning it on by default is fine.

Removing the ability to turn it off is bad.

If regression tests fail with python 3, the easiest thing to do is just
not to migrate to python 3.  Some decisions (certainly around unittest,
but I think even around hash codes) were settled precisely because tests
shouldn't break unless the functionality has really changed.  Python 3
isn't yet so dominant as to change that tradeoff.

I would go so far as to add an extra step in the porting recommendations;
before porting to python 3.x, run your test suite several times with
hash randomization turned on; any failures at this point are relying on
formally undefined behavior and should be fixed, but can *probably* be
fixed just by wrapping the results in sorted.

(I would offer a patch to the porting-to-py3 recommendation, except that
I couldn't find any not associated specifically with 3.0)

-jJ



[Python-Dev] Counting collisions for the win

2012-02-16 Thread Jim J. Jewett


In http://mail.python.org/pipermail/python-dev/2012-January/115715.html
Frank Sievertsen wrote:

On 20.01.2012 13:08, Victor Stinner wrote:
 I'm surprised we haven't seen bug reports about it from users
 of 64-bit Pythons long ago
 A Python dictionary only uses the lower bits of a hash value. If your
 dictionary has less than 2**32 items, the dictionary order is exactly
 the same on 32- and 64-bit systems: hash32(str) & mask == hash64(str)
 & mask for mask <= 2**32-1.

 No, that's not true.
 Whenever a collision happens, other bits are mixed in very fast.

 Frank

Bits are mixed in quickly from a denial-of-service standpoint, but
Victor is correct from a Why don't the tests already fail? standpoint.

A dict with 2**12 slots, holding over 2700 entries, will be far larger
than most test cases -- particularly those with visible output.  In a
dict that size, 32-bit and 64-bit machines will still probe the same
first, second, third, fourth, fifth, and sixth slots.  Even on the
rare cases when there are at least 6 collisions, the next slots may
well be either the same, or close enough that it doesn't show up in a
changed iteration order.
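
For reference, the classic probe sequence (transcribed from the
comments in Objects/dictobject.c); the higher hash bits only enter
through perturb, i.e. only after collisions:

    def probes(h, mask, count):
        perturb = h
        i = h & mask
        for _ in range(count):
            yield i
            perturb >>= 5                     # PERTURB_SHIFT
            i = (5 * i + perturb + 1) & mask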

-jJ



[Python-Dev] PEP 410 (Decimal timestamp): the implementation is ready for a review

2012-02-15 Thread Jim J. Jewett


PEP author Victor asked
(in http://mail.python.org/pipermail/python-dev/2012-February/116499.html):

 Maybe I missed the answer, but how do you handle timestamps with an
 unspecified starting point like os.times() or time.clock()? Should we
 leave these functions unchanged?

If *all* you know is that it is monotonic, then you can't -- but then
you don't really have resolution either, as the clock may well speed up
or slow down.

If you do have resolution, and the only problem is that you don't know
what the epoch was, then you can figure that out well enough by (once
per type per process) comparing it to something that does have an epoch,
like time.gmtime().
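
A sketch of that one-off calibration (accuracy limited by the gap
between the two clock reads):

    import time

    offset = time.time() - time.monotonic()    # estimate the epoch once
    # ...later, convert a monotonic stamp to (approximate) wall time:
    wall = offset + time.monotonic()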


-jJ



[Python-Dev] PEP 411: Provisional packages in the Python standard library

2012-02-10 Thread Jim J. Jewett

Eli Bendersky wrote (in
http://mail.python.org/pipermail/python-dev/2012-February/116393.html ):

 A package will be marked provisional by including the 
 following paragraph as a note at the top of its
 documentation page:

I really would like some marker available from within Python 
itself.  

Use cases:

(1)  During development, the documentation I normally read 
first is whatever results from import module; help(module),
or possibly dir(module).

(2)  At BigCorp, there were scheduled times to move as much
as possible to the current (or current-1) version.  
Regardless of policy, full regression test suites don't 
generally exist.  If Python were viewed as part of the 
infrastructure (rather than as part of a specific 
application), or if I were responsible for maintaining an
internal application built on python, that would be the time 
to upgrade python -- and I would want an easy way to figure 
out which applications and libraries I should concentrate on 
for testing.

 * Encapsulation of the import state (PEP 368)

Wrong PEP number.  I'm guessing that you meant 406.


