Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-28 Thread Gregory P. Smith
On Fri, Jun 27, 2014 at 2:58 PM, Nick Coghlan ncogh...@gmail.com wrote:

 * -1 on including Windows-specific globbing support in the API
 * -0 on including cross-platform globbing support in the initial iteration
 of the API (that could be done later as a separate RFE instead)

Agreed.  Globbing or filtering support should not hold this up.  If that
part isn't settled, just don't include it and work out what it should be as
a future enhancement.
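
For illustration only, a minimal sketch of why filtering need not block the core API: a caller can layer it on top with the existing fnmatch module. It assumes the os.scandir() that later shipped in the standard library; the helper name scandir_matching is hypothetical and not part of the PEP.

```python
import fnmatch
import os


def scandir_matching(path, pattern):
    """Yield entries of `path` whose names match a shell-style pattern.

    Hypothetical helper: the filtering happens in Python, after the
    directory scan, so it needs no platform-specific globbing support.
    """
    with os.scandir(path) as entries:
        for entry in entries:
            # fnmatch.fnmatch() applies the platform's case rules to names.
            if fnmatch.fnmatch(entry.name, pattern):
                yield entry
```

Usage would look like `[e.name for e in scandir_matching('.', '*.jpg')]`.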

 * +1 on a new section in the PEP covering rejected design options (calling
 it iterdir, returning a 2-tuple instead of a dedicated DirEntry type)

+1.  IMNSHO, one of the most important parts of PEPs: capturing the entire
decision process to document the why-nots.

 * regarding why not a 2-tuple, we know from experience that operating
 systems evolve and we end up wanting to add additional info to this kind of
 API. A dedicated DirEntry type lets us adjust the information returned over
 time, without breaking backwards compatibility and without resorting to
 ugly hacks like those in some of the time and stat APIs (or even our own
 codec info APIs)
 * it would be nice to see some relative performance numbers for NFS and
 CIFS network shares - the additional network round trips can make excessive
 stat calls absolutely brutal from a speed perspective when using a network
 drive (that's why the stat caching added to the import system in 3.3
 dramatically sped up the case of having network drives on sys.path, and why
 I thought AJ had a point when he was complaining about the fact we didn't
 expose the dirent data from os.listdir)

fwiw, I wouldn't wait for benchmark numbers.

A needless stat call when you've got the information from an earlier API
call is already brutal. It is easy to compute from existing ballparks:
remote file server / cloud access: ~100ms; local spinning disk seek+read:
~10ms; fetch of stat info cached in memory on a file server on the local
network: ~500us.  You can go down further to local system call overhead,
which can vary wildly but should likely be assumed to be at least 10us.

You don't need a benchmark to tell you that adding needless >= 500us-100ms
blocking operations to your program is bad. :)
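
As a rough worked example of that arithmetic, the sketch below multiplies the ballpark latencies above by a tree size. The 10,000-entry tree is a made-up figure chosen only for illustration, and the function name is hypothetical.

```python
# Ballpark per-call latencies quoted above, in seconds.
LATENCY = {
    "remote file server / cloud access": 100e-3,
    "local spinning disk seek+read": 10e-3,
    "stat cached in memory on LAN file server": 500e-6,
    "local system call overhead": 10e-6,
}


def wasted_seconds(num_entries, per_call_latency):
    """Time spent on one needless blocking stat call per directory entry."""
    return num_entries * per_call_latency


if __name__ == "__main__":
    entries = 10_000  # hypothetical tree size, for illustration only
    for name, latency in LATENCY.items():
        print(f"{name}: {wasted_seconds(entries, latency):.2f} s")
```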

-gps
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-28 Thread Nick Coghlan
On 28 June 2014 16:17, Gregory P. Smith g...@krypto.org wrote:
 On Fri, Jun 27, 2014 at 2:58 PM, Nick Coghlan ncogh...@gmail.com wrote:
 * it would be nice to see some relative performance numbers for NFS and
 CIFS network shares - the additional network round trips can make excessive
 stat calls absolutely brutal from a speed perspective when using a network
 drive (that's why the stat caching added to the import system in 3.3
 dramatically sped up the case of having network drives on sys.path, and why
 I thought AJ had a point when he was complaining about the fact we didn't
 expose the dirent data from os.listdir)

 fwiw, I wouldn't wait for benchmark numbers.

 A needless stat call when you've got the information from an earlier API
 call is already brutal. It is easy to compute from existing ballparks:
 remote file server / cloud access: ~100ms; local spinning disk seek+read:
 ~10ms; fetch of stat info cached in memory on a file server on the local
 network: ~500us.  You can go down further to local system call overhead,
 which can vary wildly but should likely be assumed to be at least 10us.

 You don't need a benchmark to tell you that adding needless >= 500us-100ms
 blocking operations to your program is bad. :)

Agreed, but walking even a moderately large tree over the network can
really hammer home the point that this offers a significant
performance enhancement as the latency of access increases. I've found
that kind of comparison can be eye-opening for folks that are used to
only operating on local disks (even spinning disks, let alone SSDs)
and/or relatively small trees (distro build trees aren't *that* big,
but they're big enough for this kind of difference in access overhead
to start getting annoying).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Fix Unicode-disabled build of Python 2.7

2014-06-28 Thread Paul Sokolovsky
Hello,

On Thu, 26 Jun 2014 22:49:40 +1000
Chris Angelico ros...@gmail.com wrote:

 On Thu, Jun 26, 2014 at 9:04 PM, Antoine Pitrou anto...@python.org
 wrote:
  For the same reason, I agree with Victor that we should ditch the
  threading-disabled builds. It's too much of a hassle for no actual,
  practical benefit. People who want a threadless unicodeless Python
  can install Python 1.5.2 for all I care.
 
 Or some other implementation of Python. It's looking like micropython
 will be permanently supporting a non-Unicode build 

Yes.

 (although I stepped
 away from the project after a strong disagreement over what would and
 would not make sense, and haven't been following it since). 

Your patches with my further additions were finally merged. Unicode
strings still cannot be enabled by default due to
https://github.com/micropython/micropython/issues/726 . Any help with
reviewing/testing what's currently available is welcome.

 If someone
 wants a Python that doesn't have stuff that the core CPython devs
 treat as essential, s/he probably wants something like uPy anyway.

I hinted at it during previous discussions of MicroPython, and would like
to say it again: MicroPython has already embraced a lot of ideas
rejected from CPython, like GC-only operation (which alone is not
something to be proud of, but can you start up and do something in a 2K
heap?) or tagged pointers
(https://mail.python.org/pipermail/python-dev/2004-July/046139.html).
So it should be a good vehicle for trying out unorthodox ideas(*) or
implementations.


(*) MicroPython already implements intra-module constants, for example.



-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-28 Thread Akira Li
Ben Hoyt benh...@gmail.com writes:

 Hi Python dev folks,

 I've written a PEP proposing a specific os.scandir() API for a
 directory iterator that returns the stat-like info from the OS, *the
 main advantage of which is to speed up os.walk() and similar
 operations between 4-20x, depending on your OS and file system.*
 ...
 http://legacy.python.org/dev/peps/pep-0471/
 ...
 Specifically, this PEP proposes adding a single function to the ``os``
 module in the standard library, ``scandir``, that takes a single,
 optional string as its argument::

 scandir(path='.') -> generator of DirEntry objects


Have you considered adding support for paths relative to directory
descriptors [1] via a keyword-only dir_fd=None parameter, if it may lead to
more efficient implementations on some platforms?

[1]: https://docs.python.org/3.4/library/os.html#dir-fd


--
akira



Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-28 Thread Chris Angelico
On Sat, Jun 28, 2014 at 11:05 PM, Akira Li 4kir4...@gmail.com wrote:
 Have you considered adding support for paths relative to directory
 descriptors [1] via a keyword-only dir_fd=None parameter, if it may lead to
 more efficient implementations on some platforms?

 [1]: https://docs.python.org/3.4/library/os.html#dir-fd

Potentially more efficient and also potentially safer (see 'man
openat')... but an enhancement that can wait, if necessary.
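
For readers unfamiliar with dir_fd, here is a minimal sketch of the existing mechanism as os.stat() already exposes it (see the docs link above). It does not show a scandir feature; the helper name stat_in_dir is hypothetical, and the fallback branch is an assumption for platforms without dir_fd support.

```python
import os


def stat_in_dir(dir_path, names):
    """Return {name: size}, stat-ing each name relative to one open directory."""
    if os.stat not in os.supports_dir_fd:
        # Fallback for platforms without dir_fd support (e.g. Windows).
        return {n: os.stat(os.path.join(dir_path, n)).st_size for n in names}
    fd = os.open(dir_path, os.O_RDONLY)
    try:
        # Names are resolved relative to the open descriptor (fstatat-style),
        # so a later rename of dir_path cannot redirect the lookups.
        return {n: os.stat(n, dir_fd=fd).st_size for n in names}
    finally:
        os.close(fd)
```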

ChrisA


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-28 Thread Ben Hoyt
 But the underlying system calls -- ``FindFirstFile`` /
 ``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --

 What about FreeBSD, OpenBSD, NetBSD, Solaris, etc. They don't provide readdir?

I guess it'd be better to say "Windows and Unix-based OSs"
throughout the PEP? Because all of these (including Mac OS X) are
Unix-based.

 It looks like the WIN32_FIND_DATA has a dwFileAttributes field. So we
 should mimic stat_result recent addition: the new
 stat_result.file_attributes field. Add DirEntry.file_attributes which
 would only be available on Windows.

 The Windows structure also contains

   FILETIME ftCreationTime;
   FILETIME ftLastAccessTime;
   FILETIME ftLastWriteTime;
   DWORD    nFileSizeHigh;
   DWORD    nFileSizeLow;

 It would be nice to expose them as well. I'm  no more surprised that
 the exact API is different depending on the OS for functions of the os
 module.

I think you've misunderstood how DirEntry.lstat() works on Windows --
it's basically a no-op, as Windows returns the full stat information
with the original FindFirst/FindNext OS calls. This is fairly explicit
in the PEP, but I'm sure I could make it clearer:

DirEntry.lstat(): like os.lstat(), but requires no system calls on Windows

So you can already get the dwFileAttributes for free by saying
entry.lstat().st_file_attributes. You can also get all the other
fields you mentioned for free via .lstat() with no additional OS calls
on Windows, for example: entry.lstat().st_size.

Feel free to suggest changes to the PEP or scandir docs if this isn't
clear. Note that is_dir()/is_file()/is_symlink() are free on all
systems, but .lstat() is only free on Windows.
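
A minimal sketch of that usage pattern. It assumes the os.scandir() that eventually shipped in Python 3.5, where the draft's lstat() is spelled stat(follow_symlinks=False); the helper name list_sizes is hypothetical.

```python
import os


def list_sizes(path="."):
    """Return {name: size} for regular files in `path` from one directory scan."""
    sizes = {}
    with os.scandir(path) as entries:
        for entry in entries:
            # The file type normally comes from the directory scan itself.
            if entry.is_file(follow_symlinks=False):
                # No extra system call on Windows; on other systems this may
                # cost one stat call, whose result is then cached on the entry.
                sizes[entry.name] = entry.stat(follow_symlinks=False).st_size
    return sizes
```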

 Does your implementation uses a free list to avoid the cost of memory
 allocation? A short free list of 10 or maybe just 1 may help. The free
 list may be stored directly in the generator object.

No, it doesn't. I might add this to the PEP under possible
improvements. However, I think the speed increase by removing the
extra OS call and/or disk seek is going to be way more than memory
allocation improvements, so I'm not sure this would be worth it.

 Does it support also bytes filenames on UNIX?

 Python now supports undecodable filenames thanks to the PEP 383
 (surrogateescape). I prefer to use the same type for filenames on
 Linux and Windows, so Unicode is better. But some users might prefer
 bytes for other reasons.

I forget exactly now what my scandir module does, but for os.scandir()
I think this should behave exactly like os.listdir() does for
Unicode/bytes filenames.

 Crazy idea: would it be possible to convert a DirEntry object to a
 pathlib.Path object without losing the cache? I guess that
 pathlib.Path expects a full  stat_result object.

The main problem is that pathlib.Path objects explicitly don't cache
stat info (and Guido doesn't want them to, for good reason I think).
There's a thread on python-dev about this earlier. I'll add it to a
Rejected ideas section.

 I don't understand how you can build a full lstat() result without
 really calling stat. I see that WIN32_FIND_DATA contains the size, but
 here you call lstat().

See above.

 Do you plan to continue to maintain your module for Python < 3.5, but
 upgrade your module for the final PEP?

Yes, I intend to maintain the standalone scandir module for 2.6 <=
Python < 3.5, at least for a good while. For integration into the
Python 3.5 stdlib, the implementation will be integrated into
posixmodule.c, of course.

 Should there be a way to access the full path?
 ----------------------------------------------

 Should ``DirEntry``'s have a way to get the full path without using
 ``os.path.join(path, entry.name)``? This is a pretty common pattern,
 and it may be useful to add pathlib-like ``str(entry)`` functionality.
 This functionality has also been requested in `issue 13`_ on GitHub.

 .. _`issue 13`: https://github.com/benhoyt/scandir/issues/13

 I think that it would be very convenient to store the directory name
 in the DirEntry. It should be light, it's just a reference.

 And provide a fullname() name which would just return
 os.path.join(path, entry.name) without trying to resolve path to get
 an absolute path.

Yeah, fair suggestion. I'm still slightly on the fence about this, but
I think an explicit fullname() is a good suggestion. Ideally I think
it'd be better to mimic pathlib.Path.__str__() which is kind of the
equivalent of fullname(). But how does pathlib deal with unicode/bytes
issues if it's the str function which has to return a str object? Or
at least, it'd be very weird if __str__() returned bytes. But I think
it'd need to if you passed bytes into scandir(). Do others have
thoughts?
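
To make the str/bytes question concrete, here is a small sketch of the fullname() idea written as a free function. Both helper names are hypothetical. It relies only on os.path.join(), which returns the same type it is given, so a bytes path passed to scandir() yields bytes full names.

```python
import os


def fullname(dir_path, entry):
    """Hypothetical helper: join the scanned path and the entry name.

    No attempt is made to resolve dir_path to an absolute path.
    """
    return os.path.join(dir_path, entry.name)


def full_names(dir_path):
    """Return sorted full names for every entry in dir_path (str or bytes)."""
    with os.scandir(dir_path) as entries:
        return sorted(fullname(dir_path, entry) for entry in entries)
```

An explicit method or function can return bytes without surprise, which is the difficulty with routing the same thing through __str__().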

 Would it be hard to implement the wildcard feature on UNIX to compare
 performances of scandir('*.jpg') with and without the wildcard built
 in os.scandir?

It's a good idea, the problem with this is that the Windows wildcard
implementation has a bunch of crazy edge cases where *.ext will catch
more 

Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-28 Thread Ben Hoyt
Re is_dir etc being properties rather than methods:

 I find this behaviour a bit misleading: using methods and have them
 return cached results. How much (implementation and/or performance
 and/or memory) overhead would incur by using property-like access here?
 I think this would underline the static nature of the data.

 This would break the semantics with respect to pathlib, but they're only
 marginally equal anyways -- and as far as I understand it, pathlib won't
 cache, so I think this has a fair point here.

 Indeed - using properties rather than methods may help emphasise the
 deliberate *difference* from pathlib in this case (i.e. value when the
 result was retrieved from the OS, rather than the value right now). The main
 benefit is that switching from using the DirEntry object to a pathlib Path
 will require touching all the places where the performance characteristics
 switch from memory access to system call. This benefit is also the main
 downside, so I'd actually be OK with either decision on this one.

The problem with this is that properties look free, they look just
like attribute access, so you wouldn't normally handle exceptions when
accessing them. But .lstat() and .is_dir() etc may do an OS call, so
if you're needing to be careful with error handling, you may want to
handle errors on them. Hence I think it's best practice to make them
functions().
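
A short sketch of what careful error handling looks like when these are method calls. It assumes the os.scandir() that later shipped in the standard library; the helper name safe_subdirs is hypothetical.

```python
import os


def safe_subdirs(path):
    """Yield names of subdirectories, skipping entries whose type can't be read."""
    with os.scandir(path) as entries:
        for entry in entries:
            try:
                # A call makes it visible that this may hit the OS and fail.
                is_dir = entry.is_dir()
            except OSError:
                continue  # e.g. permission denied while resolving a symlink
            if is_dir:
                yield entry.name
```

With a property, the same try/except would wrap what looks like a plain attribute read, which is easy to overlook.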

Some of us discussed this on python-dev or python-ideas a while back,
and I think there was general agreement with what I've stated above
and therefore they should be methods. But I'll dig up the links and
add to a Rejected ideas section.

 * +1 on a new section in the PEP covering rejected design options (calling
 it iterdir, returning a 2-tuple instead of a dedicated DirEntry type)

Great idea. I'll add a bunch of stuff, including the above, to a new
section, Rejected Design Options.

 * regarding why not a 2-tuple, we know from experience that operating
 systems evolve and we end up wanting to add additional info to this kind of
 API. A dedicated DirEntry type lets us adjust the information returned over
 time, without breaking backwards compatibility and without resorting to ugly
 hacks like those in some of the time and stat APIs (or even our own codec
 info APIs)

Fully agreed.

 * it would be nice to see some relative performance numbers for NFS and CIFS
 network shares - the additional network round trips can make excessive stat
 calls absolutely brutal from a speed perspective when using a network drive
 (that's why the stat caching added to the import system in 3.3 dramatically
 sped up the case of having network drives on sys.path, and why I thought AJ
 had a point when he was complaining about the fact we didn't expose the
 dirent data from os.listdir)

Don't know if you saw, but there are actually some benchmarks,
including one over NFS, on the scandir GitHub page:

https://github.com/benhoyt/scandir#benchmarks

os.walk() was 23 times faster with scandir() than the current
listdir() + stat() implementation on the Windows NFS file system I
tried. Pretty good speedup!
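
For context on where the speedup comes from, here is a deliberately simplified top-down walk built on scandir. It is a sketch and not the implementation in the scandir module: it has no error handling or topdown/onerror options, and it assumes the os.scandir() that later shipped in the standard library.

```python
import os


def walk(top):
    """Simplified top-down walk yielding (dirpath, dirnames, filenames)."""
    dirs, files, to_visit = [], [], []
    with os.scandir(top) as entries:
        for entry in entries:
            # The entry type usually comes from the directory scan itself,
            # so no separate stat() per entry is needed to classify it.
            if entry.is_dir():
                dirs.append(entry.name)
                if not entry.is_symlink():
                    to_visit.append(entry.path)  # do not follow symlinks
            else:
                files.append(entry.name)
    yield top, dirs, files
    for sub in to_visit:
        yield from walk(sub)
```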

-Ben


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-28 Thread Nick Coghlan
On 29 June 2014 05:55, Ben Hoyt benh...@gmail.com wrote:
 Re is_dir etc being properties rather than methods:

 I find this behaviour a bit misleading: using methods and have them
 return cached results. How much (implementation and/or performance
 and/or memory) overhead would incur by using property-like access here?
 I think this would underline the static nature of the data.

 This would break the semantics with respect to pathlib, but they're only
 marginally equal anyways -- and as far as I understand it, pathlib won't
 cache, so I think this has a fair point here.

 Indeed - using properties rather than methods may help emphasise the
 deliberate *difference* from pathlib in this case (i.e. value when the
 result was retrieved from the OS, rather than the value right now). The main
 benefit is that switching from using the DirEntry object to a pathlib Path
 will require touching all the places where the performance characteristics
 switch from memory access to system call. This benefit is also the main
 downside, so I'd actually be OK with either decision on this one.

 The problem with this is that properties look free, they look just
 like attribute access, so you wouldn't normally handle exceptions when
 accessing them. But .lstat() and .is_dir() etc may do an OS call, so
 if you're needing to be careful with error handling, you may want to
 handle errors on them. Hence I think it's best practice to make them
 functions().

 Some of us discussed this on python-dev or python-ideas a while back,
 and I think there was general agreement with what I've stated above
 and therefore they should be methods. But I'll dig up the links and
 add to a Rejected ideas section.

Yes, only the stuff that *never* needs a system call (regardless of
OS) would be a candidate for handling as a property rather than a
method call. Consistency of access would likely trump that idea
anyway, but it would still be worth ensuring that the PEP is clear on
which values are guaranteed to reflect the state at the time of the
directory scanning and which may imply an additional stat call.

 * it would be nice to see some relative performance numbers for NFS and CIFS
 network shares - the additional network round trips can make excessive stat
 calls absolutely brutal from a speed perspective when using a network drive
 (that's why the stat caching added to the import system in 3.3 dramatically
 sped up the case of having network drives on sys.path, and why I thought AJ
 had a point when he was complaining about the fact we didn't expose the
 dirent data from os.listdir)

 Don't know if you saw, but there are actually some benchmarks,
 including one over NFS, on the scandir GitHub page:

 https://github.com/benhoyt/scandir#benchmarks

No, I hadn't seen those - may be worth referencing explicitly from the
PEP (and if there's already a reference... oops!)

 os.walk() was 23 times faster with scandir() than the current
 listdir() + stat() implementation on the Windows NFS file system I
 tried. Pretty good speedup!

Ah, nice!

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia