Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-07-01 Thread Victor Stinner
2014-07-01 4:04 GMT+02:00 Glenn Linderman v+pyt...@g.nevcal.com:
 +0 for stat fields to be None on all platforms unless ensure_lstat=True.

 This won't work well if lstat info is only needed for some entries. Is
 that a common use-case? It was mentioned earlier in the thread.

 If it is, use ensure_lstat=False, and use the proposed (by me) .refresh()
 API to update the data for those that need it.

We should make DirEntry as simple as possible. In Python, the classic
behaviour is to not define an attribute if it's not available on a
platform. For example, stat().st_file_attributes is only available on
Windows.

I don't like the idea of the ensure_lstat parameter because os.scandir
would have to make two system calls, which makes it harder to tell which
syscall failed (readdir or lstat). If you need lstat on UNIX, write:

    if hasattr(entry, 'lstat_result'):
        size = entry.lstat_result.st_size
    else:
        size = os.lstat(entry.fullname()).st_size

Victor
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-07-01 Thread Janzert

On 6/26/2014 6:59 PM, Ben Hoyt wrote:

Rationale
=========

Python's built-in ``os.walk()`` is significantly slower than it needs
to be, because -- in addition to calling ``os.listdir()`` on each
directory -- it executes the system call ``os.stat()`` or
``GetFileAttributes()`` on each file to determine whether the entry is
a directory or not.

But the underlying system calls -- ``FindFirstFile`` /
``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
already tell you whether the files returned are directories or not, so
no further system calls are needed. In short, you can reduce the
number of system calls from approximately 2N to N, where N is the
total number of files and directories in the tree. (And because
directory trees are usually much wider than they are deep, it's often
much better than this.)
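
To make the saving concrete, here is a minimal sketch of a tree walk that relies only on the entry type reported by the directory scan. It uses os.scandir() in the form that later shipped in Python 3.5 (the `with` form needs 3.6 or later); the helper name count_dirs_and_files is made up for this illustration and is not part of the PEP.

```python
import os

def count_dirs_and_files(top):
    """Return (n_dirs, n_files) found below top, without following symlinks.

    count_dirs_and_files is a hypothetical helper name used for this sketch.
    """
    n_dirs = n_files = 0
    stack = [top]
    while stack:
        current = stack.pop()
        with os.scandir(current) as it:
            for entry in it:
                # is_dir() normally uses the type reported by the directory
                # scan itself, so no per-entry stat call is needed here.
                if entry.is_dir(follow_symlinks=False):
                    n_dirs += 1
                    stack.append(entry.path)
                else:
                    n_files += 1
    return n_dirs, n_files
```

Some filesystems do not report the entry type during the scan, in which case is_dir() falls back to a stat call, so the "N instead of 2N" figure is the typical case rather than a guarantee.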



One of the major reasons for this seems to be efficiently using 
information that is already available from the OS for free. 
Unfortunately it seems that the current API and most of the leading 
alternate proposals hide from the user what information is actually 
there free and what is going to incur an extra cost.


I would prefer an API that simply gives whatever came for free from the 
OS and then let the user decide if the extra expense is worth the extra 
information. Maybe that stat information was only going to be used for 
an informational log that can be skipped if it's going to incur extra 
expense?


Janzert



Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-30 Thread Ben Hoyt
 So, here's my alternative proposal: add an ensure_lstat flag to
 scandir() itself, and don't have *any* methods on DirEntry, only
 attributes.

 That would make the DirEntry attributes:

 is_dir: boolean, always populated
 is_file: boolean, always populated
 is_symlink: boolean, always populated
 lstat_result: stat result, may be None on POSIX systems if
 ensure_lstat is False

 (I'm not particularly sold on lstat_result as the name, but lstat
 reads as a verb to me, so doesn't sound right as an attribute name)

 What this would allow:

 - by default, scanning is efficient everywhere, but lstat_result may
 be None on POSIX systems
 - if you always need the lstat result, setting ensure_lstat will
 trigger the extra system call implicitly
 - if you only sometimes need the stat result, you can call os.lstat()
 explicitly when the DirEntry lstat attribute is None

 Most importantly, *regardless of platform*, the cached stat result (if
 not None) would reflect the state of the entry at the time the
 directory was scanned, rather than at some arbitrary later point in
 time when lstat() was first called on the DirEntry object.

 There'd still be a slight window of discrepancy (since the filesystem
 state may change between reading the directory entry and making the
 lstat() call), but this could be effectively eliminated from the
 perspective of the Python code by making the result of the lstat()
 call authoritative for the whole DirEntry object.

Yeah, I quite like this. It does make the caching more explicit and
consistent. It's slightly annoying that it's less like pathlib.Path
now, but DirEntry was never pathlib.Path anyway, so maybe it doesn't
matter. The differences in naming may highlight the difference in
caching, so maybe it's a good thing.

Two further questions from me:

1) How does error handling work? Now os.stat() will/may be called
during iteration, so in __next__. But it's hard to catch errors because
you don't call __next__ explicitly. Is this a problem? How do other
iterators that make system calls or raise errors handle this?

2) There's still the open question in the PEP of whether to include a
way to access the full path. This is cheap to build, it has to be
built anyway on POSIX systems, and it's quite useful for further
operations on the file. I think the best way to handle this is a
.fullname or .full_name attribute as suggested elsewhere. Thoughts?

-Ben


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-30 Thread Tim Delaney
On 1 July 2014 03:05, Ben Hoyt benh...@gmail.com wrote:

  So, here's my alternative proposal: add an ensure_lstat flag to
  scandir() itself, and don't have *any* methods on DirEntry, only
  attributes.
 ...
  Most importantly, *regardless of platform*, the cached stat result (if
  not None) would reflect the state of the entry at the time the
  directory was scanned, rather than at some arbitrary later point in
  time when lstat() was first called on the DirEntry object.


I'm torn between whether I'd prefer the stat fields to be populated on
Windows if ensure_lstat=False or not. There are good arguments each way,
but overall I'm inclining towards having it consistent with POSIX - don't
populate them unless ensure_lstat=True.

+0 for stat fields to be None on all platforms unless ensure_lstat=True.


 Yeah, I quite like this. It does make the caching more explicit and
 consistent. It's slightly annoying that it's less like pathlib.Path
 now, but DirEntry was never pathlib.Path anyway, so maybe it doesn't
 matter. The differences in naming may highlight the difference in
 caching, so maybe it's a good thing.


See my comments below on .fullname.


 Two further questions from me:

 1) How does error handling work? Now os.stat() will/may be called
 during iteration, so in __next__. But it's hard to catch errors because
 you don't call __next__ explicitly. Is this a problem? How do other
 iterators that make system calls or raise errors handle this?


I think it just needs to be documented that iterating may throw the same
exceptions as os.lstat(). It's a little trickier if you don't want the
scope of your exception to be too broad, but you can always wrap the
iteration in a generator to catch and handle the exceptions you care about,
and allow the rest to propagate.

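    import os

    def scandir_accessible(path='.'):
        gen = os.scandir(path)

        while True:
            try:
                yield next(gen)
            except StopIteration:
                # Stop cleanly; letting StopIteration escape a generator
                # is turned into RuntimeError by PEP 479.
                return
            except PermissionError:
                pass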

2) There's still the open question in the PEP of whether to include a
 way to access the full path. This is cheap to build, it has to be
 built anyway on POSIX systems, and it's quite useful for further
 operations on the file. I think the best way to handle this is a
 .fullname or .full_name attribute as suggested elsewhere. Thoughts?


+1 for .fullname. The earlier suggestion to have __str__ return the name is
killed I think by the fact that .fullname could be bytes.

It would be nice if pathlib.Path objects were enhanced to take a DirEntry
and use the .fullname automatically, but you could always call
Path(direntry.fullname).

Tim Delaney


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-30 Thread Ethan Furman

On 06/30/2014 03:07 PM, Tim Delaney wrote:

On 1 July 2014 03:05, Ben Hoyt wrote:


So, here's my alternative proposal: add an ensure_lstat flag to
scandir() itself, and don't have *any* methods on DirEntry, only
attributes.
...
Most importantly, *regardless of platform*, the cached stat result (if
not None) would reflect the state of the entry at the time the
directory was scanned, rather than at some arbitrary later point in
time when lstat() was first called on the DirEntry object.


I'm torn between whether I'd prefer the stat fields to be populated
on Windows if ensure_lstat=False or not. There are good arguments each
 way, but overall I'm inclining towards having it consistent with POSIX
- don't populate them unless ensure_lstat=True.

+0 for stat fields to be None on all platforms unless ensure_lstat=True.


If a Windows user just needs the free info, why should s/he have to pay the price of a full stat call?  I see no reason 
to hold the Windows side back and not take advantage of what it has available.  There are plenty of posix calls that 
Windows is not able to use, after all.


--
~Ethan~


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-30 Thread Tim Delaney
On 1 July 2014 08:38, Ethan Furman et...@stoneleaf.us wrote:

 On 06/30/2014 03:07 PM, Tim Delaney wrote:

 I'm torn between whether I'd prefer the stat fields to be populated
 on Windows if ensure_lstat=False or not. There are good arguments each
  way, but overall I'm inclining towards having it consistent with POSIX
 - don't populate them unless ensure_lstat=True.

 +0 for stat fields to be None on all platforms unless ensure_lstat=True.


 If a Windows user just needs the free info, why should s/he have to pay
 the price of a full stat call?  I see no reason to hold the Windows side
 back and not take advantage of what it has available.  There are plenty of
 posix calls that Windows is not able to use, after all.


On Windows ensure_lstat would either be a NOP (if the fields are
always populated), or it simply determines if the fields get populated. No
extra stat call.

On POSIX it's the difference between an extra stat call or not.

Tim Delaney


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-30 Thread Devin Jeanpierre
On Mon, Jun 30, 2014 at 3:07 PM, Tim Delaney
timothy.c.dela...@gmail.com wrote:
 On 1 July 2014 03:05, Ben Hoyt benh...@gmail.com wrote:

  So, here's my alternative proposal: add an ensure_lstat flag to
  scandir() itself, and don't have *any* methods on DirEntry, only
  attributes.
 ...

  Most importantly, *regardless of platform*, the cached stat result (if
  not None) would reflect the state of the entry at the time the
  directory was scanned, rather than at some arbitrary later point in
  time when lstat() was first called on the DirEntry object.


 I'm torn between whether I'd prefer the stat fields to be populated on
 Windows if ensure_lstat=False or not. There are good arguments each way, but
 overall I'm inclining towards having it consistent with POSIX - don't
 populate them unless ensure_lstat=True.

 +0 for stat fields to be None on all platforms unless ensure_lstat=True.

This won't work well if lstat info is only needed for some entries. Is
that a common use-case? It was mentioned earlier in the thread.

-- Devin


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-30 Thread Ethan Furman

On 06/30/2014 04:15 PM, Tim Delaney wrote:

On 1 July 2014 08:38, Ethan Furman wrote:

On 06/30/2014 03:07 PM, Tim Delaney wrote:


I'm torn between whether I'd prefer the stat fields to be populated
on Windows if ensure_lstat=False or not. There are good arguments each
way, but overall I'm inclining towards having it consistent with POSIX
- don't populate them unless ensure_lstat=True.

+0 for stat fields to be None on all platforms unless ensure_lstat=True.


If a Windows user just needs the free info, why should s/he have to pay
the price of a full stat call?  I see no reason to hold the Windows side
 back and not take advantage of what it has available.  There are plenty
of posix calls that Windows is not able to use, after all.


On Windows ensure_lstat would either be a NOP (if the fields are
always populated), or it simply determines if the fields get populated.
 No extra stat call.


I suppose the exact behavior is still under discussion, as there are only two or three fields one gets for free on 
Windows (I think...), whereas an os.stat call would get everything available for the platform.




On POSIX it's the difference between an extra stat call or not.


Agreed on this part.

Still, no reason to slow down the Windows side by throwing away info 
unnecessarily -- that's why this PEP exists, after all.

--
~Ethan~


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-30 Thread Ben Hoyt
 I suppose the exact behavior is still under discussion, as there are only
 two or three fields one gets for free on Windows (I think...), whereas an
 os.stat call would get everything available for the platform.

No, Windows is nice enough to give you all the same stat_result fields
during scandir (via FindFirstFile/FindNextFile) as a regular
os.stat().

-Ben


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-30 Thread Glenn Linderman

On 6/30/2014 4:25 PM, Devin Jeanpierre wrote:

On Mon, Jun 30, 2014 at 3:07 PM, Tim Delaney
timothy.c.dela...@gmail.com wrote:

On 1 July 2014 03:05, Ben Hoyt benh...@gmail.com wrote:

So, here's my alternative proposal: add an ensure_lstat flag to
scandir() itself, and don't have *any* methods on DirEntry, only
attributes.

...


Most importantly, *regardless of platform*, the cached stat result (if
not None) would reflect the state of the entry at the time the
directory was scanned, rather than at some arbitrary later point in
time when lstat() was first called on the DirEntry object.


I'm torn between whether I'd prefer the stat fields to be populated on
Windows if ensure_lstat=False or not. There are good arguments each way, but
overall I'm inclining towards having it consistent with POSIX - don't
populate them unless ensure_lstat=True.

+0 for stat fields to be None on all platforms unless ensure_lstat=True.

This won't work well if lstat info is only needed for some entries. Is
that a common use-case? It was mentioned earlier in the thread.


If it is, use ensure_lstat=False, and use the proposed (by me) 
.refresh() API to update the data for those that need it.


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-30 Thread Nick Coghlan
On 30 Jun 2014 19:13, Glenn Linderman v+pyt...@g.nevcal.com wrote:


 If it is, use ensure_lstat=False, and use the proposed (by me) .refresh()
API to update the data for those that need it.

I'm -1 on a refresh API for DirEntry - just use pathlib in that case.

Cheers,
Nick.




Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-30 Thread Ethan Furman

On 06/30/2014 06:28 PM, Ben Hoyt wrote:

I suppose the exact behavior is still under discussion, as there are only
two or three fields one gets for free on Windows (I think...), whereas an
os.stat call would get everything available for the platform.


No, Windows is nice enough to give you all the same stat_result fields
during scandir (via FindFirstFile/FindNextFile) as a regular
os.stat().


Very nice.  Even less reason then to throw it away.  :)

--
~Ethan~


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-30 Thread Eric V. Smith
On 6/30/2014 10:17 PM, Nick Coghlan wrote:
 
 On 30 Jun 2014 19:13, Glenn Linderman v+pyt...@g.nevcal.com wrote:


 If it is, use ensure_lstat=False, and use the proposed (by me)
 .refresh() API to update the data for those that need it.
 
 I'm -1 on a refresh API for DirEntry - just use pathlib in that case.

I'm not sure refresh() is the best name, but I think a
get_stat_info_from_direntry_or_call_stat() (hah!) makes sense. If you
really need the stat info, then you can write simple code like:

    for entry in os.scandir(path):
        mtime = entry.get_stat_info_from_direntry_or_call_stat().st_mtime

And it won't call stat() any more times than needed. Once per file on
Posix, zero times per file on Windows.

Without an API like this, you'll need a check in the application code on
whether or not to call stat().
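
For comparison, this is roughly the check that application code would otherwise have to carry around itself. It is only a sketch under the attribute-style proposal from this thread: it assumes entry.lstat_result is either a stat result or None and that entry.fullname is the joined path, neither of which is settled API. The function name stat_from_entry_or_call is made up.

```python
import os

def stat_from_entry_or_call(entry):
    """Return cached stat data if the scan supplied it, else call os.lstat().

    Assumes the attribute-style proposal from this thread: entry.lstat_result
    is a stat result or None, and entry.fullname is the joined path.
    """
    cached = getattr(entry, "lstat_result", None)
    if cached is not None:
        return cached                # free: came with the directory scan
    return os.lstat(entry.fullname)  # one extra system call for this entry
```

Note that this helper does not store the result back on the entry, so calling it twice on an entry without cached data would make two lstat() calls.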

Eric.


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-30 Thread Terry Reedy

On 6/30/2014 9:44 PM, Ethan Furman wrote:

On 06/30/2014 06:28 PM, Ben Hoyt wrote:

I suppose the exact behavior is still under discussion, as there are only
two or three fields one gets for free on Windows (I think...), whereas an
os.stat call would get everything available for the platform.


No, Windows is nice enough to give you all the same stat_result fields
during scandir (via FindFirstFile/FindNextFile) as a regular
os.stat().


Very nice.  Even less reason then to throw it away.  :)


I agree.

--
Terry Jan Reedy



Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Gregory P. Smith
On Jun 28, 2014 12:49 PM, Ben Hoyt benh...@gmail.com wrote:

  But the underlying system calls -- ``FindFirstFile`` /
  ``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
 
  What about FreeBSD, OpenBSD, NetBSD, Solaris, etc. They don't provide
readdir?

 I guess it'd be better to say Windows and Unix-based OSs
 throughout the PEP? Because all of these (including Mac OS X) are
 Unix-based.

No, Just say POSIX.


  It looks like the WIN32_FIND_DATA has a dwFileAttributes field. So we
  should mimic stat_result recent addition: the new
  stat_result.file_attributes field. Add DirEntry.file_attributes which
  would only be available on Windows.
 
  The Windows structure also contains
 
    FILETIME ftCreationTime;
    FILETIME ftLastAccessTime;
    FILETIME ftLastWriteTime;
    DWORD    nFileSizeHigh;
    DWORD    nFileSizeLow;
 
  It would be nice to expose them as well. I'm no longer surprised that
  the exact API is different depending on the OS for functions of the os
  module.

 I think you've misunderstood how DirEntry.lstat() works on Windows --
 it's basically a no-op, as Windows returns the full stat information
 with the original FindFirst/FindNext OS calls. This is fairly explicit
 in the PEP, but I'm sure I could make it clearer:

 DirEntry.lstat(): like os.lstat(), but requires no system calls on
Windows

 So you can already get the dwFileAttributes for free by saying
 entry.lstat().st_file_attributes. You can also get all the other
 fields you mentioned for free via .lstat() with no additional OS calls
 on Windows, for example: entry.lstat().st_size.

 Feel free to suggest changes to the PEP or scandir docs if this isn't
 clear. Note that is_dir()/is_file()/is_symlink() are free on all
 systems, but .lstat() is only free on Windows.
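
A small sketch of what "free on Windows" means in practice, written against the API as it later shipped in Python 3.5, where the method discussed here as entry.lstat() is spelled entry.stat(follow_symlinks=False). The helper name total_size is hypothetical.

```python
import os

def total_size(path):
    """Sum the sizes of the regular files directly inside path."""
    total = 0
    with os.scandir(path) as it:
        for entry in it:
            # The type check comes from the directory scan on most systems.
            if entry.is_file(follow_symlinks=False):
                # Spelled entry.lstat() in the PEP draft discussed here.
                # On Windows this reuses data from the scan; on POSIX it
                # costs one lstat() system call per file.
                total += entry.stat(follow_symlinks=False).st_size
    return total
```

The same loop runs on both platforms; only the number of system calls behind the stat line differs.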

  Does your implementation use a free list to avoid the cost of memory
  allocation? A short free list of 10 or maybe just 1 may help. The free
  list may be stored directly in the generator object.

 No, it doesn't. I might add this to the PEP under possible
 improvements. However, I think the speed increase by removing the
 extra OS call and/or disk seek is going to be way more than memory
 allocation improvements, so I'm not sure this would be worth it.

  Does it also support bytes filenames on UNIX?

  Python now supports undecodable filenames thanks to the PEP 383
  (surrogateescape). I prefer to use the same type for filenames on
  Linux and Windows, so Unicode is better. But some users might prefer
  bytes for other reasons.

 I forget exactly now what my scandir module does, but for os.scandir()
 I think this should behave exactly like os.listdir() does for
 Unicode/bytes filenames.

  Crazy idea: would it be possible to convert a DirEntry object to a
  pathlib.Path object without losing the cache? I guess that
  pathlib.Path expects a full  stat_result object.

 The main problem is that pathlib.Path objects explicitly don't cache
 stat info (and Guido doesn't want them to, for good reason I think).
 There's a thread on python-dev about this earlier. I'll add it to a
 Rejected ideas section.

  I don't understand how you can build a full lstat() result without
  really calling stat. I see that WIN32_FIND_DATA contains the size, but
  here you call lstat().

 See above.

  Do you plan to continue to maintain your module for Python < 3.5, but
  upgrade your module for the final PEP?

 Yes, I intend to maintain the standalone scandir module for 2.6 <=
 Python < 3.5, at least for a good while. For integration into the
 Python 3.5 stdlib, the implementation will be integrated into
 posixmodule.c, of course.

  Should there be a way to access the full path?
  ----------------------------------------------
 
  Should ``DirEntry``'s have a way to get the full path without using
  ``os.path.join(path, entry.name)``? This is a pretty common pattern,
  and it may be useful to add pathlib-like ``str(entry)`` functionality.
  This functionality has also been requested in `issue 13`_ on GitHub.
 
  .. _`issue 13`: https://github.com/benhoyt/scandir/issues/13
 
  I think that it would be very convenient to store the directory name
  in the DirEntry. It should be light, it's just a reference.
 
  And provide a fullname() method which would just return
  os.path.join(path, entry.name) without trying to resolve path to get
  an absolute path.
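
The suggestion above could look something like the following. This is a minimal sketch, not the scandir module's implementation: EntrySketch is a hypothetical stand-in for DirEntry, and the only real call used is os.path.join().

```python
import os

class EntrySketch:
    """Hypothetical stand-in for DirEntry that remembers its directory."""

    def __init__(self, dirpath, name):
        self.dirpath = dirpath  # just a reference to the path given to the scan
        self.name = name

    def fullname(self):
        # Plain join: no abspath()/realpath(), so a relative path stays
        # relative, and bytes in gives bytes out.
        return os.path.join(self.dirpath, self.name)
```

Because os.path.join() returns the same type it is given, fullname() would return bytes when the directory was scanned with a bytes path, which is the case that makes a __str__()-based spelling awkward.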

 Yeah, fair suggestion. I'm still slightly on the fence about this, but
 I think an explicit fullname() is a good suggestion. Ideally I think
 it'd be better to mimic pathlib.Path.__str__() which is kind of the
 equivalent of fullname(). But how does pathlib deal with unicode/bytes
 issues if it's the str function which has to return a str object? Or
 at least, it'd be very weird if __str__() returned bytes. But I think
 it'd need to if you passed bytes into scandir(). Do others have
 thoughts?

  Would it be hard to implement the wildcard feature on UNIX to compare
  performances of scandir('*.jpg') with and without the 

Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Walter Dörwald

On 28 Jun 2014, at 21:48, Ben Hoyt wrote:


[...]

Crazy idea: would it be possible to convert a DirEntry object to a
pathlib.Path object without losing the cache? I guess that
pathlib.Path expects a full  stat_result object.


The main problem is that pathlib.Path objects explicitly don't cache
stat info (and Guido doesn't want them to, for good reason I think).
There's a thread on python-dev about this earlier. I'll add it to a
Rejected ideas section.


However, it would be bad to have two implementations of the concept of 
filename with different attribute and method names.


The best way to ensure compatible APIs would be if one class was derived 
from the other.



[...]


Servus,
   Walter


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Steven D'Aprano
On Sat, Jun 28, 2014 at 03:55:00PM -0400, Ben Hoyt wrote:
 Re is_dir etc being properties rather than methods:
[...]
 The problem with this is that properties look free, they look just
 like attribute access, so you wouldn't normally handle exceptions when
 accessing them. But .lstat() and .is_dir() etc may do an OS call, so
 if you're needing to be careful with error handling, you may want to
 handle errors on them. Hence I think it's best practice to make them
 functions().

I think this one could go either way. Methods look like they actually 
re-test the value each time you call it. I can easily see people not 
realising that the value is cached and writing code like this toy 
example:


    # Detect a file change (the_file is a DirEntry obtained earlier).
    from time import sleep

    t = the_file.lstat().st_mtime
    while the_file.lstat().st_mtime == t:
        sleep(0.1)
    print("Changed!")


I know that's not the best way to detect file changes, but I'm sure 
people will do something like that and not realise that the call to 
lstat is cached.
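
The pitfall can be reproduced with a toy class. This is a sketch only: CachingEntry is a made-up stand-in that caches on first call, as the draft DirEntry.lstat() is described as doing on POSIX.

```python
import os

class CachingEntry:
    """Toy stand-in for a DirEntry whose lstat() result is cached after first use."""

    def __init__(self, path):
        self.path = path
        self._lstat = None

    def lstat(self):
        if self._lstat is None:
            self._lstat = os.lstat(self.path)  # only the first call hits the OS
        return self._lstat
```

If the file grows after the first entry.lstat() call, entry.lstat().st_size keeps reporting the old size while os.lstat(entry.path).st_size reports the new one, so a polling loop like the one above would never terminate.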

Personally, I would prefer a property. If I forget to wrap a call in a 
try...except, it will fail hard and I will get an exception. But with a 
method call, the failure is silent and I keep getting the cached result.

Speaking of caching, is there a way to freshen the cached values?


-- 
Steven


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Nick Coghlan
On 29 June 2014 20:52, Steven D'Aprano st...@pearwood.info wrote:
 Speaking of caching, is there a way to freshen the cached values?

Switch to a full Path object instead of relying on the cached DirEntry data.

This is what makes me wary of including lstat, even though Windows
offers it without the extra stat call. Caching behaviour is *really*
hard to make intuitive, especially when it *sometimes* returns data
that looks fresh (as it does on first call on POSIX systems).

Regards,
Nick.


-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Paul Moore
On 29 June 2014 12:08, Nick Coghlan ncogh...@gmail.com wrote:
 This is what makes me wary of including lstat, even though Windows
 offers it without the extra stat call. Caching behaviour is *really*
 hard to make intuitive, especially when it *sometimes* returns data
 that looks fresh (as it does on first call on POSIX systems).

If it matters that much we *could* simply call it cached_lstat(). It's
ugly, but I really don't like the idea of throwing the information
away - after all, the fact that we currently throw data away is why
there's even a need for scandir. Let's not make the same mistake
again...

Paul
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Nick Coghlan
On 29 June 2014 21:45, Paul Moore p.f.mo...@gmail.com wrote:
 On 29 June 2014 12:08, Nick Coghlan ncogh...@gmail.com wrote:
 This is what makes me wary of including lstat, even though Windows
 offers it without the extra stat call. Caching behaviour is *really*
 hard to make intuitive, especially when it *sometimes* returns data
 that looks fresh (as it does on first call on POSIX systems).

 If it matters that much we *could* simply call it cached_lstat(). It's
 ugly, but I really don't like the idea of throwing the information
 away - after all, the fact that we currently throw data away is why
 there's even a need for scandir. Let's not make the same mistake
 again...

Future-proofing is the reason DirEntry is a full fledged class in the
first place, though.

Effectively communicating the behavioural difference between DirEntry
and pathlib.Path is the main thing that makes me nervous about
adhering too closely to the Path API.

To restate the problem and the alternative proposal, these are the
DirEntry methods under discussion:

is_dir(): like os.path.isdir(), but requires no system calls on at
least POSIX and Windows
is_file(): like os.path.isfile(), but requires no system calls on
at least POSIX and Windows
is_symlink(): like os.path.islink(), but requires no system calls
on at least POSIX and Windows
lstat(): like os.lstat(), but requires no system calls on Windows

For the almost-certain-to-be-cached items, the suggestion is to make
them properties (or just ordinary attributes):

is_dir
is_file
is_symlink

What to do with lstat() is currently less clear, since POSIX directory
scanning doesn't provide that level of detail by default.

The PEP also doesn't currently state whether the is_dir(), is_file()
and is_symlink() results would be updated if a call to lstat()
produced different answers than the original directory scanning
process, which further suggests to me that allowing the stat call to
be delayed on POSIX systems is a potentially problematic and
inherently confusing design. We would have two options:

- update them, meaning calling lstat() may change those results from
being a snapshot of the setting at the time the directory was scanned
- leave them alone, meaning the DirEntry object and the
DirEntry.lstat() result may give different answers

Those both sound ugly to me.

So, here's my alternative proposal: add an ensure_lstat flag to
scandir() itself, and don't have *any* methods on DirEntry, only
attributes.

That would make the DirEntry attributes:

    is_dir: boolean, always populated
    is_file: boolean, always populated
    is_symlink: boolean, always populated
    lstat_result: stat result, may be None on POSIX systems if
        ensure_lstat is False

(I'm not particularly sold on lstat_result as the name, but lstat
reads as a verb to me, so doesn't sound right as an attribute name)

What this would allow:

- by default, scanning is efficient everywhere, but lstat_result may
be None on POSIX systems
- if you always need the lstat result, setting ensure_lstat will
trigger the extra system call implicitly
- if you only sometimes need the stat result, you can call os.lstat()
explicitly when the DirEntry lstat_result attribute is None

Most importantly, *regardless of platform*, the cached stat result (if
not None) would reflect the state of the entry at the time the
directory was scanned, rather than at some arbitrary later point in
time when lstat() was first called on the DirEntry object.

There'd still be a slight window of discrepancy (since the filesystem
state may change between reading the directory entry and making the
lstat() call), but this could be effectively eliminated from the
perspective of the Python code by making the result of the lstat()
call authoritative for the whole DirEntry object.
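
For illustration, here is a minimal sketch of the attribute-only design
described above, emulated on top of os.listdir() and os.lstat(). The names
SketchEntry and scandir_sketch are hypothetical. The emulation always pays for
an lstat() call, which a real implementation would avoid by taking the type
flags straight from the directory scan; only the shape of the API is meant to
be representative.

```python
import os
import stat


class SketchEntry:
    """Hypothetical stand-in for the proposed attribute-only DirEntry."""

    __slots__ = ("name", "is_dir", "is_file", "is_symlink", "lstat_result")

    def __init__(self, name, st, keep_stat):
        self.name = name
        # All flags come from one lstat() snapshot, so they can never
        # disagree with lstat_result.
        self.is_dir = stat.S_ISDIR(st.st_mode)
        self.is_file = stat.S_ISREG(st.st_mode)
        self.is_symlink = stat.S_ISLNK(st.st_mode)
        self.lstat_result = st if keep_stat else None


def scandir_sketch(path=".", ensure_lstat=False):
    """Yield SketchEntry objects (emulation: always calls os.lstat())."""
    # Assumption: Windows has the stat data for free, POSIX only on request.
    keep_stat = ensure_lstat or os.name == "nt"
    for name in os.listdir(path):
        st = os.lstat(os.path.join(path, name))
        yield SketchEntry(name, st, keep_stat)
```

Code that only sometimes needs the stat data would test
"entry.lstat_result is None" and fall back to an explicit os.lstat() call.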

Regards,
Nick.

P.S. We'd be generating quite a few of these, so we can use __slots__
to keep the memory overhead to a minimum (that's just a general
comment - it's really irrelevant to the methods-or-attributes
question).


-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Jonas Wielicki
On 29.06.2014 13:08, Nick Coghlan wrote:
 On 29 June 2014 20:52, Steven D'Aprano st...@pearwood.info wrote:
 Speaking of caching, is there a way to freshen the cached values?
 
 Switch to a full Path object instead of relying on the cached DirEntry data.
 
 This is what makes me wary of including lstat, even though Windows
 offers it without the extra stat call. Caching behaviour is *really*
 hard to make intuitive, especially when it *sometimes* returns data
 that looks fresh (as it does on first call on POSIX systems).

This bugs me too. An idea I had was adding a keyword argument to scandir
which specifies whether stat data should be added to the direntry or not.

If the flag is set to True, this would implicitly call lstat on POSIX
before returning the DirEntry, and use the available data on Windows.

If the flag is set to False, all the fields in the DirEntry will be
None, for consistency, even on Windows.


This is not optimal in cases where the stat information is needed only
for some of the DirEntry objects, but would also reduce the required
logic in the DirEntry object.

Thoughts?

 
 Regards,
 Nick.
 
 



Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Ethan Furman

On 06/29/2014 05:28 AM, Nick Coghlan wrote:


So, here's my alternative proposal: add an ensure_lstat flag to
scandir() itself, and don't have *any* methods on DirEntry, only
attributes.

That would make the DirEntry attributes:

 is_dir: boolean, always populated
 is_file: boolean, always populated
 is_symlink: boolean, always populated
 lstat_result: stat result, may be None on POSIX systems if
ensure_lstat is False

(I'm not particularly sold on lstat_result as the name, but lstat
reads as a verb to me, so doesn't sound right as an attribute name)


+1

--
~Ethan~


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Ethan Furman

On 06/29/2014 04:12 AM, Jonas Wielicki wrote:


If the flag is set to False, all the fields in the DirEntry will be
None, for consistency, even on Windows.


-1

This consistency is unnecessary.

--
~Ethan~


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Akira Li
Chris Angelico ros...@gmail.com writes:

 On Sat, Jun 28, 2014 at 11:05 PM, Akira Li 4kir4...@gmail.com wrote:
 Have you considered adding support for paths relative to directory
 descriptors [1] via keyword only dir_fd=None parameter if it may lead to
 more efficient implementations on some platforms?

 [1]: https://docs.python.org/3.4/library/os.html#dir-fd

 Potentially more efficient and also potentially safer (see 'man
 openat')... but an enhancement that can wait, if necessary.


Introducing the feature later creates unnecessary incompatibilities.
Either it should be supported from the start, or it should be explicitly
rejected in PEP 471 and something like
`os.scandir(os.open(relative_path, dir_fd=fd))` recommended instead
(assuming `os.scandir in os.supports_fd`, as with `os.listdir()`).

At C level it could be implemented using fdopendir/openat or scandirat.
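
As a rough sketch of the fd-based alternative, here is the same pattern
written against os.listdir(), which already accepts an open directory
descriptor on platforms listed in os.supports_fd. The helper name
listdir_relative is hypothetical, and os.listdir() is only a stand-in for the
proposed scandir(). Note that os.open() needs an explicit flags argument.

```python
import os


def listdir_relative(relative_path, dir_fd):
    """List relative_path, resolved against the directory open as dir_fd."""
    if os.open not in os.supports_dir_fd or os.listdir not in os.supports_fd:
        raise NotImplementedError("fd-based directory access is unavailable")
    # O_RDONLY is enough to open a directory for reading its entries.
    fd = os.open(relative_path, os.O_RDONLY, dir_fd=dir_fd)
    try:
        return sorted(os.listdir(fd))
    finally:
        os.close(fd)
```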

Here's the function description using Argument Clinic DSL:

/*[clinic input]

os.scandir

    path : path_t(allow_fd=True, nullable=True) = '.'

        *path* can be specified as either str or bytes. On some
        platforms, *path* may also be specified as an open file
        descriptor; the file descriptor must refer to a directory.  If
        this functionality is unavailable, using it raises
        NotImplementedError.

    *

    dir_fd : dir_fd = None

        If not None, it should be a file descriptor open to a
        directory, and *path* should be a relative string; path will
        then be relative to that directory.  If *dir_fd* is
        unavailable, using it raises NotImplementedError.

Yield a DirEntry object for each file and directory in *path*.

Just like os.listdir, the '.' and '..' pseudo-directories are skipped,
and the entries are yielded in system-dependent order.

{parameters}
It's an error to use *dir_fd* when specifying *path* as an open file
descriptor.

[clinic start generated code]*/


And corresponding tests (from test_posix:PosixTester), to show the
compatibility with os.listdir argument parsing in detail:

    def test_scandir_default(self):
        # When scandir is called without argument,
        # it's the same as scandir(os.curdir).
        self.assertIn(support.TESTFN, [e.name for e in posix.scandir()])

    def _test_scandir(self, curdir):
        filenames = sorted(e.name for e in posix.scandir(curdir))
        self.assertIn(support.TESTFN, filenames)
        #NOTE: assume listdir, scandir accept the same types on the platform
        self.assertEqual(sorted(posix.listdir(curdir)), filenames)

    def test_scandir(self):
        self._test_scandir(os.curdir)

    def test_scandir_none(self):
        # it's the same as scandir(os.curdir).
        self._test_scandir(None)

    def test_scandir_bytes(self):
        # When scandir is called with a bytes object,
        # the returned entries names are still of type str.
        # Call `os.fsencode(entry.name)` to get bytes
        self.assertIn('a', {'a'})
        self.assertNotIn(b'a', {'a'})
        self._test_scandir(b'.')

    @unittest.skipUnless(posix.scandir in os.supports_fd,
                         "test needs fd support for posix.scandir()")
    def test_scandir_fd_minus_one(self):
        # it's the same as scandir(os.curdir).
        self._test_scandir(-1)

    def test_scandir_float(self):
        # invalid args
        self.assertRaises(TypeError, posix.scandir, -1.0)

    @unittest.skipUnless(posix.scandir in os.supports_fd,
                         "test needs fd support for posix.scandir()")
    def test_scandir_fd(self):
        fd = posix.open(posix.getcwd(), posix.O_RDONLY)
        self.addCleanup(posix.close, fd)
        self._test_scandir(fd)
        self.assertEqual(
            sorted(posix.scandir('.')),
            sorted(posix.scandir(fd)))
        # call 2nd time to test rewind
        self.assertEqual(
            sorted(posix.scandir('.')),
            sorted(posix.scandir(fd)))

    @unittest.skipUnless(posix.scandir in os.supports_dir_fd,
                         "test needs dir_fd support for os.scandir()")
    def test_scandir_dir_fd(self):
        relpath = 'relative_path'
        with support.temp_dir() as parent:
            fullpath = os.path.join(parent, relpath)
            with support.temp_dir(path=fullpath):
                support.create_empty_file(os.path.join(parent, 'a'))
                support.create_empty_file(os.path.join(fullpath, 'b'))
                fd = posix.open(parent, posix.O_RDONLY)
                self.addCleanup(posix.close, fd)
                self.assertEqual(
                    sorted(posix.scandir(relpath, dir_fd=fd)),
                    sorted(posix.scandir(fullpath)))
                # check that fd is still useful
                self.assertEqual(
                    sorted(posix.scandir(relpath, dir_fd=fd)),
                    sorted(posix.scandir(fullpath)))


--
Akira


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Jonas Wielicki
On 29.06.2014 19:04, Ethan Furman wrote:
 On 06/29/2014 04:12 AM, Jonas Wielicki wrote:

 If the flag is set to False, all the fields in the DirEntry will be
 None, for consistency, even on Windows.
 
 -1

 This consistency is unnecessary.

I’m not sure -- similar to the windows_wildcard option this might be a
temptation to write platform dependent code, although possibly by
accident (i.e. not reading the docs carefully).

 
 -- 
 ~Ethan~


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Glenn Linderman

On 6/29/2014 5:28 AM, Nick Coghlan wrote:

There'd still be a slight window of discrepancy (since the filesystem
state may change between reading the directory entry and making the
lstat() call), but this could be effectively eliminated from the
perspective of the Python code by making the result of the lstat()
call authoritative for the whole DirEntry object.


+1 to this in particular, but this whole refresh of the semantics sounds 
better overall.


Finally, for the case where someone does want to keep the DirEntry 
around, a .refresh() API could rerun lstat() and update all the data.


And with that (initial data potentially always populated, or None, and 
an explicit refresh() API), the data could all be returned as 
properties, implying that they aren't fetching new data themselves, 
because they wouldn't be.
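
A minimal sketch of that combination, assuming a hypothetical RefreshableEntry
class (the name, and the choice of a size property, are illustrative only):
the properties read cached data and refresh() is the only place a system call
happens.

```python
import os
import stat


class RefreshableEntry:
    """Hypothetical entry whose data changes only on an explicit refresh()."""

    __slots__ = ("path", "lstat_result")

    def __init__(self, path):
        self.path = path
        self.lstat_result = None
        self.refresh()

    def refresh(self):
        # The only method that touches the filesystem.
        self.lstat_result = os.lstat(self.path)

    @property
    def is_dir(self):
        # Pure memory access: derived from the cached snapshot.
        return stat.S_ISDIR(self.lstat_result.st_mode)

    @property
    def size(self):
        return self.lstat_result.st_size
```

Until refresh() is called again, the properties keep describing the entry as
it was at the last snapshot, even if the file has changed on disk since.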


Glenn


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-28 Thread Gregory P. Smith
On Fri, Jun 27, 2014 at 2:58 PM, Nick Coghlan ncogh...@gmail.com wrote:

  * -1 on including Windows specific globbing support in the API
 * -0 on including cross platform globbing support in the initial iteration
 of the API (that could be done later as a separate RFE instead)

Agreed.  Globbing or filtering support should not hold this up.  If that
part isn't settled, just don't include it and work out what it should be as
a future enhancement.

 * +1 on a new section in the PEP covering rejected design options (calling
 it iterdir, returning a 2-tuple instead of a dedicated DirEntry type)

+1.  IMNSHO, one of the most important parts of PEPs: capturing the entire
decision process to document the why nots.

 * regarding why not a 2-tuple, we know from experience that operating
 systems evolve and we end up wanting to add additional info to this kind of
 API. A dedicated DirEntry type lets us adjust the information returned over
 time, without breaking backwards compatibility and without resorting to
 ugly hacks like those in some of the time and stat APIs (or even our own
 codec info APIs)
 * it would be nice to see some relative performance numbers for NFS and
 CIFS network shares - the additional network round trips can make excessive
 stat calls absolutely brutal from a speed perspective when using a network
 drive (that's why the stat caching added to the import system in 3.3
 dramatically sped up the case of having network drives on sys.path, and why
 I thought AJ had a point when he was complaining about the fact we didn't
 expose the dirent data from os.listdir)

fwiw, I wouldn't wait for benchmark numbers.

A needless stat call when you've got the information from an earlier API
call is already brutal. It is easy to compute from existing ballparks:
remote file server / cloud access: ~100ms; local spinning disk seek+read:
~10ms; fetch of stat info cached in memory on a file server on the local
network: ~500us.  You can go down further to local system call overhead,
which can vary wildly but should likely be assumed to be at least 10us.

You don't need a benchmark to tell you that adding needless >= 500us-100ms
blocking operations to your program is bad. :)
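
To make the arithmetic concrete, here is a back-of-the-envelope sketch using
the ballpark latencies above. The tree size of 10,000 entries is an arbitrary
assumption, as are the names LATENCY and wasted_seconds.

```python
# Ballpark per-call latencies in seconds, taken from the figures above.
LATENCY = {
    "remote file server / cloud": 100e-3,
    "local spinning disk seek+read": 10e-3,
    "stat cached on LAN file server": 500e-6,
    "local system call": 10e-6,
}


def wasted_seconds(entries, per_call):
    """Time spent on one avoidable stat call per directory entry."""
    return entries * per_call


for label, per_call in LATENCY.items():
    cost = wasted_seconds(10_000, per_call)
    print(f"{label}: {cost:g} s per 10,000 entries")
```

At 100ms per call the avoidable stat calls alone add up to roughly a thousand
seconds for such a tree, against about a tenth of a second at 10us per call.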

-gps


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-28 Thread Nick Coghlan
On 28 June 2014 16:17, Gregory P. Smith g...@krypto.org wrote:
 On Fri, Jun 27, 2014 at 2:58 PM, Nick Coghlan ncogh...@gmail.com wrote:
 * it would be nice to see some relative performance numbers for NFS and
 CIFS network shares - the additional network round trips can make excessive
 stat calls absolutely brutal from a speed perspective when using a network
 drive (that's why the stat caching added to the import system in 3.3
 dramatically sped up the case of having network drives on sys.path, and why
 I thought AJ had a point when he was complaining about the fact we didn't
 expose the dirent data from os.listdir)

 fwiw, I wouldn't wait for benchmark numbers.

 A needless stat call when you've got the information from an earlier API
 call is already brutal. It is easy to compute from existing ballparks remote
 file server / cloud access: ~100ms, local spinning disk seek+read: ~10ms.
 fetch of stat info cached in memory on file server on the local network:
 ~500us.  You can go down further to local system call overhead which can
 vary wildly but should likely be assumed to be at least 10us.

 You don't need a benchmark to tell you that adding needless >= 500us-100ms
 blocking operations to your program is bad. :)

Agreed, but walking even a moderately large tree over the network can
really hammer home the point that this offers a significant
performance enhancement as the latency of access increases. I've found
that kind of comparison can be eye-opening for folks that are used to
only operating on local disks (even spinning disks, let alone SSDs)
and/or relatively small trees (distro build trees aren't *that* big,
but they're big enough for this kind of difference in access overhead
to start getting annoying).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-28 Thread Akira Li
Ben Hoyt benh...@gmail.com writes:

 Hi Python dev folks,

 I've written a PEP proposing a specific os.scandir() API for a
 directory iterator that returns the stat-like info from the OS, *the
 main advantage of which is to speed up os.walk() and similar
 operations between 4-20x, depending on your OS and file system.*
 ...
 http://legacy.python.org/dev/peps/pep-0471/
 ...
 Specifically, this PEP proposes adding a single function to the ``os``
 module in the standard library, ``scandir``, that takes a single,
 optional string as its argument::

 scandir(path='.') -> generator of DirEntry objects


Have you considered adding support for paths relative to directory
descriptors [1] via keyword only dir_fd=None parameter if it may lead to
more efficient implementations on some platforms?

[1]: https://docs.python.org/3.4/library/os.html#dir-fd


--
akira



Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-28 Thread Chris Angelico
On Sat, Jun 28, 2014 at 11:05 PM, Akira Li 4kir4...@gmail.com wrote:
 Have you considered adding support for paths relative to directory
 descriptors [1] via keyword only dir_fd=None parameter if it may lead to
 more efficient implementations on some platforms?

 [1]: https://docs.python.org/3.4/library/os.html#dir-fd

Potentially more efficient and also potentially safer (see 'man
openat')... but an enhancement that can wait, if necessary.

ChrisA


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-28 Thread Ben Hoyt
 But the underlying system calls -- ``FindFirstFile`` /
 ``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --

 What about FreeBSD, OpenBSD, NetBSD, Solaris, etc. They don't provide readdir?

I guess it'd be better to say "Windows" and "Unix-based OSs"
throughout the PEP? Because all of these (including Mac OS X) are
Unix-based.

 It looks like the WIN32_FIND_DATA has a dwFileAttributes field. So we
 should mimic stat_result recent addition: the new
 stat_result.file_attributes field. Add DirEntry.file_attributes which
 would only be available on Windows.

 The Windows structure also contains

   FILETIME ftCreationTime;
   FILETIME ftLastAccessTime;
   FILETIME ftLastWriteTime;
    DWORD    nFileSizeHigh;
    DWORD    nFileSizeLow;

 It would be nice to expose them as well. I'm no longer surprised that
 the exact API is different depending on the OS for functions of the os
 module.

I think you've misunderstood how DirEntry.lstat() works on Windows --
it's basically a no-op, as Windows returns the full stat information
with the original FindFirst/FindNext OS calls. This is fairly explicit
in the PEP, but I'm sure I could make it clearer:

DirEntry.lstat(): like os.lstat(), but requires no system calls on Windows

So you can already get the dwFileAttributes for free by saying
entry.lstat().st_file_attributes. You can also get all the other
fields you mentioned for free via .lstat() with no additional OS calls
on Windows, for example: entry.lstat().st_size.

Feel free to suggest changes to the PEP or scandir docs if this isn't
clear. Note that is_dir()/is_file()/is_symlink() are free on all
systems, but .lstat() is only free on Windows.
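
A minimal sketch of the caching this implies on POSIX, assuming a hypothetical
CachingEntry class: the first lstat() call pays for a system call, later calls
return the same cached object, and on Windows the cache would already be
filled by the directory scan.

```python
import os


class CachingEntry:
    """Hypothetical entry whose lstat() caches its first result."""

    __slots__ = ("path", "_lstat")

    def __init__(self, path):
        self.path = path
        self._lstat = None  # would be pre-filled by the scan on Windows

    def lstat(self):
        if self._lstat is None:
            # First use: the one system call this object ever makes.
            self._lstat = os.lstat(self.path)
        return self._lstat
```

Code that needs current values, for example to watch a file grow, would call
os.lstat() directly instead of going through the cached entry.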

 Does your implementation use a free list to avoid the cost of memory
 allocation? A short free list of 10 or maybe just 1 may help. The free
 list may be stored directly in the generator object.

No, it doesn't. I might add this to the PEP under possible
improvements. However, I think the speed increase by removing the
extra OS call and/or disk seek is going to be way more than memory
allocation improvements, so I'm not sure this would be worth it.

 Does it also support bytes filenames on UNIX?

 Python now supports undecodable filenames thanks to the PEP 383
 (surrogateescape). I prefer to use the same type for filenames on
 Linux and Windows, so Unicode is better. But some users might prefer
 bytes for other reasons.

I forget exactly now what my scandir module does, but for os.scandir()
I think this should behave exactly like os.listdir() does for
Unicode/bytes filenames.

 Crazy idea: would it be possible to convert a DirEntry object to a
 pathlib.Path object without losing the cache? I guess that
 pathlib.Path expects a full  stat_result object.

The main problem is that pathlib.Path objects explicitly don't cache
stat info (and Guido doesn't want them to, for good reason I think).
There's a thread on python-dev about this earlier. I'll add it to a
Rejected ideas section.

 I don't understand how you can build a full lstat() result without
 really calling stat. I see that WIN32_FIND_DATA contains the size, but
 here you call lstat().

See above.

 Do you plan to continue to maintain your module for Python < 3.5, but
 upgrade your module for the final PEP?

Yes, I intend to maintain the standalone scandir module for 2.6 <=
Python < 3.5, at least for a good while. For integration into the
Python 3.5 stdlib, the implementation will be integrated into
posixmodule.c, of course.

 Should there be a way to access the full path?
 --

 Should ``DirEntry``'s have a way to get the full path without using
 ``os.path.join(path, entry.name)``? This is a pretty common pattern,
 and it may be useful to add pathlib-like ``str(entry)`` functionality.
 This functionality has also been requested in `issue 13`_ on GitHub.

 .. _`issue 13`: https://github.com/benhoyt/scandir/issues/13

 I think that it would be very convenient to store the directory name
 in the DirEntry. It should be light, it's just a reference.

 And provide a fullname() method which would just return
 os.path.join(path, entry.name) without trying to resolve path to get
 an absolute path.

Yeah, fair suggestion. I'm still slightly on the fence about this, but
I think an explicit fullname() is a good suggestion. Ideally I think
it'd be better to mimic pathlib.Path.__str__() which is kind of the
equivalent of fullname(). But how does pathlib deal with unicode/bytes
issues if it's the str function which has to return a str object? Or
at least, it'd be very weird if __str__() returned bytes. But I think
it'd need to if you passed bytes into scandir(). Do others have
thoughts?
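
To illustrate the str/bytes question, here is a small sketch of what an
explicit fullname() could do. It is written as a free function, and
display_name is a hypothetical helper. os.path.join() returns the same type it
is given and refuses to mix the two, while os.fsdecode() gives a str for
display without forcing the path itself to change type.

```python
import os


def fullname(directory, name):
    """Join without resolving: str in gives str out, bytes in gives bytes out."""
    return os.path.join(directory, name)


def display_name(path):
    """Return a str suitable for printing, whatever type the path has."""
    return os.fsdecode(path)


print(type(fullname("photos", "a.jpg")).__name__)
print(type(fullname(b"photos", b"a.jpg")).__name__)
print(display_name(fullname(b"photos", b"a.jpg")))
```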

 Would it be hard to implement the wildcard feature on UNIX to compare
 performances of scandir('*.jpg') with and without the wildcard built
 in os.scandir?

It's a good idea, the problem with this is that the Windows wildcard
implementation has a bunch of crazy edge cases where *.ext will catch
more 

Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-28 Thread Ben Hoyt
Re is_dir etc being properties rather than methods:

 I find this behaviour a bit misleading: using methods and have them
 return cached results. How much (implementation and/or performance
 and/or memory) overhead would incur by using property-like access here?
 I think this would underline the static nature of the data.

 This would break the semantics with respect to pathlib, but they're only
 marginally equal anyways -- and as far as I understand it, pathlib won't
 cache, so I think this has a fair point here.

 Indeed - using properties rather than methods may help emphasise the
 deliberate *difference* from pathlib in this case (i.e. value when the
 result was retrieved from the OS, rather than the value right now). The main
 benefit is that switching from using the DirEntry object to a pathlib Path
 will require touching all the places where the performance characteristics
 switch from memory access to system call. This benefit is also the main
 downside, so I'd actually be OK with either decision on this one.

The problem with this is that properties look free, they look just
like attribute access, so you wouldn't normally handle exceptions when
accessing them. But .lstat() and .is_dir() etc may do an OS call, so
if you're needing to be careful with error handling, you may want to
handle errors on them. Hence I think it's best practice to make them
functions().
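
A short sketch of the kind of error handling meant here, using plain
os.lstat() and a hypothetical helper name: a call that may hit the OS is
something you can naturally wrap in try/except, which is less obvious for
something that looks like a plain attribute.

```python
import os


def size_or_none(path):
    """Return the size reported by lstat(), or None if the call fails."""
    try:
        return os.lstat(path).st_size
    except OSError:
        # e.g. the file was removed between the directory scan and this call
        return None
```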

Some of us discussed this on python-dev or python-ideas a while back,
and I think there was general agreement with what I've stated above
and therefore they should be methods. But I'll dig up the links and
add to a Rejected ideas section.

 * +1 on a new section in the PEP covering rejected design options (calling
 it iterdir, returning a 2-tuple instead of a dedicated DirEntry type)

Great idea. I'll add a bunch of stuff, including the above, to a new
section, Rejected Design Options.

 * regarding why not a 2-tuple, we know from experience that operating
 systems evolve and we end up wanting to add additional info to this kind of
 API. A dedicated DirEntry type lets us adjust the information returned over
 time, without breaking backwards compatibility and without resorting to ugly
 hacks like those in some of the time and stat APIs (or even our own codec
 info APIs)

Fully agreed.

 * it would be nice to see some relative performance numbers for NFS and CIFS
 network shares - the additional network round trips can make excessive stat
 calls absolutely brutal from a speed perspective when using a network drive
 (that's why the stat caching added to the import system in 3.3 dramatically
 sped up the case of having network drives on sys.path, and why I thought AJ
 had a point when he was complaining about the fact we didn't expose the
 dirent data from os.listdir)

Don't know if you saw, but there are actually some benchmarks,
including one over NFS, on the scandir GitHub page:

https://github.com/benhoyt/scandir#benchmarks

os.walk() was 23 times faster with scandir() than the current
listdir() + stat() implementation on the Windows NFS file system I
tried. Pretty good speedup!

-Ben


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-28 Thread Nick Coghlan
On 29 June 2014 05:55, Ben Hoyt benh...@gmail.com wrote:
 Re is_dir etc being properties rather than methods:

 I find this behaviour a bit misleading: using methods and have them
 return cached results. How much (implementation and/or performance
 and/or memory) overhead would incur by using property-like access here?
 I think this would underline the static nature of the data.

 This would break the semantics with respect to pathlib, but they're only
 marginally equal anyways -- and as far as I understand it, pathlib won't
 cache, so I think this has a fair point here.

 Indeed - using properties rather than methods may help emphasise the
 deliberate *difference* from pathlib in this case (i.e. value when the
 result was retrieved from the OS, rather than the value right now). The main
 benefit is that switching from using the DirEntry object to a pathlib Path
 will require touching all the places where the performance characteristics
 switch from memory access to system call. This benefit is also the main
 downside, so I'd actually be OK with either decision on this one.

 The problem with this is that properties look free, they look just
 like attribute access, so you wouldn't normally handle exceptions when
 accessing them. But .lstat() and .is_dir() etc may do an OS call, so
 if you're needing to be careful with error handling, you may want to
 handle errors on them. Hence I think it's best practice to make them
 functions().

 Some of us discussed this on python-dev or python-ideas a while back,
 and I think there was general agreement with what I've stated above
 and therefore they should be methods. But I'll dig up the links and
 add to a Rejected ideas section.

Yes, only the stuff that *never* needs a system call (regardless of
OS) would be a candidate for handling as a property rather than a
method call. Consistency of access would likely trump that idea
anyway, but it would still be worth ensuring that the PEP is clear on
which values are guaranteed to reflect the state at the time of the
directory scanning and which may imply an additional stat call.

 * it would be nice to see some relative performance numbers for NFS and CIFS
 network shares - the additional network round trips can make excessive stat
 calls absolutely brutal from a speed perspective when using a network drive
 (that's why the stat caching added to the import system in 3.3 dramatically
 sped up the case of having network drives on sys.path, and why I thought AJ
 had a point when he was complaining about the fact we didn't expose the
 dirent data from os.listdir)

 Don't know if you saw, but there are actually some benchmarks,
 including one over NFS, on the scandir GitHub page:

 https://github.com/benhoyt/scandir#benchmarks

No, I hadn't seen those - may be worth referencing explicitly from the
PEP (and if there's already a reference... oops!)

 os.walk() was 23 times faster with scandir() than the current
 listdir() + stat() implementation on the Windows NFS file system I
 tried. Pretty good speedup!

Ah, nice!

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-27 Thread Paul Moore
On 26 June 2014 23:59, Ben Hoyt benh...@gmail.com wrote:
 Would love feedback on the PEP, but also of course on the proposal itself.

A solid +1 from me.

Some specific points:

- I'm in favour of it being in the os module. It's more discoverable
there, as well as the other reasons mentioned.
- I prefer scandir as the name, for the reason you gave (the output
isn't the same as an iterator version of listdir)
- I'm mildly against windows_wildcard (even though I'm a windows user)
- You mention the caching behaviour of DirEntry objects. The
limitations should be clearly covered in the final docs, as it's the
sort of thing people will get wrong otherwise.

Paul


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-27 Thread Victor Stinner
Hi,

You wrote a great PEP, Ben, thanks :-) But now it's time for comments!

 But the underlying system calls -- ``FindFirstFile`` /
 ``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --

What about FreeBSD, OpenBSD, NetBSD, Solaris, etc. They don't provide readdir?

You should add a link to FindFirstFile doc:
http://msdn.microsoft.com/en-us/library/windows/desktop/aa364418%28v=vs.85%29.aspx

It looks like the WIN32_FIND_DATA has a dwFileAttributes field. So we
should mimic stat_result recent addition: the new
stat_result.file_attributes field. Add DirEntry.file_attributes which
would only be available on Windows.

The Windows structure also contains

  FILETIME ftCreationTime;
  FILETIME ftLastAccessTime;
  FILETIME ftLastWriteTime;
  DWORD    nFileSizeHigh;
  DWORD    nFileSizeLow;

It would be nice to expose them as well. I'm no longer surprised that
the exact API is different depending on the OS for functions of the os
module.

 * Instead of bare filename strings, it returns lightweight
   ``DirEntry`` objects that hold the filename string and provide
   simple methods that allow access to the stat-like data the operating
   system returned.

Does your implementation use a free list to avoid the cost of memory
allocation? A short free list of 10 or maybe just 1 may help. The free
list may be stored directly in the generator object.

 ``scandir()`` yields a ``DirEntry`` object for each file and directory
 in ``path``. Just like ``listdir``, the ``'.'`` and ``'..'``
 pseudo-directories are skipped, and the entries are yielded in
 system-dependent order. Each ``DirEntry`` object has the following
 attributes and methods:

Does it also support bytes filenames on UNIX?

Python now supports undecodable filenames thanks to PEP 383
(surrogateescape). I prefer to use the same type for filenames on
Linux and Windows, so Unicode is better. But some users might prefer
bytes for other reasons.
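
For reference, a minimal sketch of how PEP 383 lets an undecodable bytes
filename round-trip through str. It decodes with an explicit UTF-8 codec so
the result is deterministic; on POSIX, os.fsdecode() and os.fsencode() apply
the same surrogateescape handler using the filesystem encoding. The filename
is made up for illustration.

```python
# A filename containing a byte (0xff) that is not valid UTF-8.
# The name itself is hypothetical.
raw_name = b"report-\xff.txt"

# With the surrogateescape error handler, the undecodable byte is mapped
# to a lone surrogate code point (U+DCFF) instead of raising an error.
text_name = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text_name.encode("utf-8", "surrogateescape")
```

This is why a str-only API can still represent every filename on UNIX,
although it does not settle whether some users would rather get bytes back.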

 The ``DirEntry`` attribute and method names were chosen to be the same
 as those in the new ``pathlib`` module for consistency.

Great! That's exactly what I expected :-) Consistency with other modules.

 Notes on caching
 

 The ``DirEntry`` objects are relatively dumb -- the ``name`` attribute
 is obviously always cached, and the ``is_X`` and ``lstat`` methods
 cache their values (immediately on Windows via ``FindNextFile``, and
 on first use on Linux / OS X via a ``stat`` call) and never refetch
 from the system.

 For this reason, ``DirEntry`` objects are intended to be used and
 thrown away after iteration, not stored in long-lived data structures
 and the methods called again and again.

 If a user wants to do that (for example, for watching a file's size
 change), they'll need to call the regular ``os.lstat()`` or
 ``os.path.getsize()`` functions which force a new system call each
 time.

Crazy idea: would it be possible to convert a DirEntry object to a
pathlib.Path object without losing the cache? I guess that
pathlib.Path expects a full stat_result object.

 Or, for getting the total size of files in a directory tree -- showing
 use of the ``DirEntry.lstat()`` method::

     def get_tree_size(path):
         """Return total size of files in path and subdirs."""
         size = 0
         for entry in scandir(path):
             if entry.is_dir():
                 sub_path = os.path.join(path, entry.name)
                 size += get_tree_size(sub_path)
             else:
                 size += entry.lstat().st_size
         return size

 Note that ``get_tree_size()`` will get a huge speed boost on Windows,
 because no extra stat calls are needed, but on Linux and OS X the size
 information is not returned by the directory iteration functions, so
 this function won't gain anything there.

I don't understand how you can build a full lstat() result without
really calling stat. I see that WIN32_FIND_DATA contains the size, but
here you call lstat(). If you know that it's not a symlink, you
already know the size, but you still have to call stat() to retrieve
all fields required to build a stat_result no?
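
For reference, the quoted example can be written against os.scandir() as it
eventually shipped in Python 3.5. That API differs from the draft discussed
here: DirEntry.lstat() became DirEntry.stat(follow_symlinks=False), and each
entry carries a path attribute. A sketch on that assumption:

```python
import os

def get_tree_size(path):
    """Return total size in bytes of the files under path and its subdirs."""
    total = 0
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            # entry.path is the directory path joined with entry.name.
            total += get_tree_size(entry.path)
        else:
            # The result is cached on the entry after the first call.
            total += entry.stat(follow_symlinks=False).st_size
    return total
```

Whether stat() costs a system call depends on the platform: on Windows the
size comes from the directory listing, while on POSIX systems the first call
for each entry still goes to the OS.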

 Support
 ===

 The scandir module on GitHub has been forked and used quite a bit (see
 Use in the wild in this PEP),

Do you plan to continue maintaining your module for Python < 3.5, and to
update it to match the final PEP?

 Should scandir be in its own module?
 

 Should the function be included in the standard library in a new
 module, ``scandir.scandir()``, or just as ``os.scandir()`` as
 discussed? The preference of this PEP's author (Ben Hoyt) would be
 ``os.scandir()``, as it's just a single function.

Yes, put it in the os module which is already bloated :-)

 Should there be a way to access the full path?
 --

 Should ``DirEntry``'s have a way to get the full path without using
 ``os.path.join(path, entry.name)``? This is a pretty common pattern,
 and it may be useful to add pathlib-like 

Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-27 Thread Chris Barker - NOAA Federal
On Jun 26, 2014, at 4:38 PM, Tim Delaney timothy.c.dela...@gmail.com
wrote:

On 27 June 2014 09:28, MRAB pyt...@mrabarnett.plus.com wrote:


 -1 for windows_wildcard (it would be an attractive nuisance to write
windows-only code)


Could you emulate it on other platforms?

+1 on the rest of it.

-Chris


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-27 Thread Paul Sokolovsky
Hello,

On Thu, 26 Jun 2014 21:52:43 -0400
Ben Hoyt benh...@gmail.com wrote:

[]

 It's a fair point that os.walk() can be implemented efficiently
 without adding a new function and API. However, often you'll want more
 info, like the file size, which scandir() can give you via
 DirEntry.lstat(), which is free on Windows. So opening up this
 efficient API is beneficial.
 
 In CPython, I think the DirEntry objects are as lightweight as
 stat_result objects.
 
 I'm an embedded developer by background, so I know the constraints
 here, but I really don't think Python's development should be tailored
 to fit MicroPython. If os.scandir() is not very efficient on
 MicroPython, so be it -- 99% of all desktop/server users will gain
 from it.

Tailoring Python to MicroPython's needs is certainly not what I'm
suggesting. It was an example of an alternative implementation that
optimized os.walk() without needing any additional public module APIs.
Conversely, the high-level nature of an API call like os.walk(), and the
underspecification of low-level details (such as which function is
implemented in terms of which others), allows MicroPython to provide an
optimized implementation even within its resource constraints. So the
power of high-level interfaces and underspecification should not be
underestimated ;-).

But I don't want to argue that os.scandir() is not needed, because
that's hardly productive. Something I'd like to prototype in uPy, and
ideally take further to PEP status, is iterator-based string methods,
and I can pretty much expect a "we lived without it" response. So I
don't want to go the same way regarding the addition of other
iterator-based APIs; it's clear that more iterator/generator-based APIs
are a good direction for Python to evolve in.

  It would be better if os.scandir() was specified to return a struct
  (named tuple) compatible with return value of os.stat() (with only
  fields relevant to underlying readdir()-like system call). The
  grounds for that are obvious: it's already existing data interface
  in module os, which is also based on open standard for operating
  systems - POSIX, so if one is to expect something about file
  attributes, it's what one can reasonably base expectations on.
 
 Yes, we considered this early on (see the python-ideas and python-dev
 threads referenced in the PEP), but decided it wasn't a great API to
 overload stat_result further, and have most of the attributes None or
 not present on Linux.
 
[]

 
 However, for scandir() to be useful, you also need the name. My
 original version of this directory iterator returned two-tuples of
 (name, stat_result). But most people didn't like the API, and I don't
 really either. You could overload stat_result with a .name attribute
 in this case, but it still isn't a nice API to have most of the
 attributes None, and then you have to test for that, etc.

Yes, returning (name, stat_result) would be my first move too. I
don't see why anyone would dislike a pair of two values, each of an
obvious type and with obvious semantics within the os module. Regarding
the stat result: os.stat() provides full information about a file, and
intuitively one may expect os.scandir() to provide a subset of that
info, approaching what os.stat() provides as far as OS capabilities
allow. So, if a truly OS-independent interface is wanted for getting
more data out of a directory scan, the os.stat struct is hard to ignore
as the data interface.


But well, if it was rejected already, what can be said? Perhaps the PEP
could at least be extended to explicitly mention the other approaches
that were discussed and rejected, rather than just linking to a
discussion archive (other PEPs I have read often contain such
subsections, so I hope this suggestion is not unfounded).

 
 So basically we tweaked the API to do what was best, and ended up with
 it returning DirEntry objects with is_file() and similar methods.
 
 Hope that helps give a bit more context. If you haven't read the
 relevant python-ideas and python-dev threads, those are interesting
 too.
 
 -Ben



-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-27 Thread Paul Sokolovsky
Hello,

On Fri, 27 Jun 2014 12:08:41 +1000
Steven D'Aprano st...@pearwood.info wrote:

 On Fri, Jun 27, 2014 at 03:07:46AM +0300, Paul Sokolovsky wrote:
 
  With my MicroPython hat on, os.scandir() would make things only
  worse. With current interface, one can either have inefficient
  implementation (like CPython chose) or efficient implementation
  (like MicroPython chose) - all transparently. os.scandir()
  supposedly opens up efficient implementation for everyone, but at
  the price of bloating API and introducing heavy-weight objects to
  wrap info. 
 
 os.scandir is not part of the Python API, it is not a built-in
 function. It is part of the CPython standard library. 

OK, so the standard library also has an API, and that's the API being
discussed.

 That means (in
 my opinion) that there is an expectation that other Pythons should
 provide it, but not an absolute requirement. Especially for the os
 module, which by definition is platform-specific. 

Yes, that's intuitive, but not strict and formal, so it is subject to
interpretation. As a developer working on an alternative Python
implementation, I'd like to have a better understanding of what needs to
be done to be a compliant implementation (in particular, because I need
to pass that info on to the users). So, I was told that
https://docs.python.org/3/reference/index.html describes Python, not
CPython. The next step is figuring out whether
https://docs.python.org/3/library/index.html describes Python or
CPython, and if the latter, how to separate the essence of Python's
stdlib from the extended library CPython provides.

 In my opinion that
 means you have four options:
 
 1. provide os.scandir, with exactly the same semantics as on CPython;
 
 2. provide os.scandir, but change its semantics to be more
 lightweight (e.g. return an ordinary tuple, as you already suggest);
 
 3. don't provide os.scandir at all; or
 
 4. do something different depending on whether the platform is Linux
or an embedded system.
 
 I would consider any of those acceptable for a library feature, but
 not for a language feature.

Good, thanks. If that represents the shared opinion of (C)Python
developers (so there won't be claims like "MicroPython is not Python
because it doesn't provide os.scandir()" (or a hundred other missing
stdlib functions ;-) )), that's good enough already.

With that in mind, I'd wish for any Python implementation to be as
complete and as efficient as possible, and one way to achieve that is
to not add stdlib entities without real need (be it more API calls or
more data types). So I'm glad to know that os.scandir() passed through
Occam's Razor in this respect and is specified the way it is for the
common good.


[]

-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-27 Thread Jonas Wielicki
On 27.06.2014 03:50, MRAB wrote:
 On 2014-06-27 02:37, Ben Hoyt wrote:
 I don't mind iterdir() and would take it :-), but I'll just say why I
 chose the name scandir() -- though it wasn't my suggestion originally:

 iterdir() sounds like just an iterator version of listdir(), kinda
 like keys() and iterkeys() in Python 2. Whereas in actual fact the
 return values are quite different (DirEntry objects vs strings), and
 so the name change reflects that difference a little.

 [snip]
 
 The re module has 'findall', which returns a list of strings, and
 'finditer', which returns an iterator that yields match objects, so
 there's a precedent. :-)

A bad precedent in my opinion though -- I was just recently bitten by
that, and I find it very atypical for Python.

regards,
Jonas


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-27 Thread Jonas Wielicki
On 27.06.2014 00:59, Ben Hoyt wrote:
 Specifics of proposal
 =
 [snip] Each ``DirEntry`` object has the following
 attributes and methods:
 [snip]
 Notes on caching
 
 
 The ``DirEntry`` objects are relatively dumb -- the ``name`` attribute
 is obviously always cached, and the ``is_X`` and ``lstat`` methods
 cache their values (immediately on Windows via ``FindNextFile``, and
 on first use on Linux / OS X via a ``stat`` call) and never refetch
 from the system.

I find this behaviour a bit misleading: methods are used, yet they
return cached results. How much (implementation and/or performance
and/or memory) overhead would be incurred by using property-like access
here? I think this would underline the static nature of the data.

This would break the semantics with respect to pathlib, but they’re only
marginally equal anyways -- and as far as I understand it, pathlib won’t
cache, so I think this has a fair point here.

regards,
jwi


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-27 Thread Nick Coghlan
On 28 Jun 2014 01:27, Jonas Wielicki j.wieli...@sotecware.net wrote:

 On 27.06.2014 00:59, Ben Hoyt wrote:
  Specifics of proposal
  =
  [snip] Each ``DirEntry`` object has the following
  attributes and methods:
  [snip]
  Notes on caching
  
 
  The ``DirEntry`` objects are relatively dumb -- the ``name`` attribute
  is obviously always cached, and the ``is_X`` and ``lstat`` methods
  cache their values (immediately on Windows via ``FindNextFile``, and
  on first use on Linux / OS X via a ``stat`` call) and never refetch
  from the system.

 I find this behaviour a bit misleading: using methods and have them
 return cached results. How much (implementation and/or performance
 and/or memory) overhead would incur by using property-like access here?
 I think this would underline the static nature of the data.

 This would break the semantics with respect to pathlib, but they’re only
 marginally equal anyways -- and as far as I understand it, pathlib won’t
 cache, so I think this has a fair point here.

Indeed - using properties rather than methods may help emphasise the
deliberate *difference* from pathlib in this case (i.e. value when the
result was retrieved from the OS, rather than the value right now). The
main benefit is that switching from using the DirEntry object to a pathlib
Path will require touching all the places where the performance
characteristics switch from memory access to system call. This benefit
is also the main downside, so I'd actually be OK with either decision on
this one.
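
The trade-off above can be sketched with a small stand-in class. CachedEntry
and its size property are hypothetical names used only for illustration;
this is not how DirEntry is implemented.

```python
import os

class CachedEntry:
    """Hypothetical stand-in for DirEntry; not the real implementation."""

    def __init__(self, directory, name):
        self.name = name
        self._full_path = os.path.join(directory, name)
        self._lstat = None

    def lstat(self):
        # Method style (as in the draft PEP): it looks like a fresh call,
        # but only the first call reaches the OS.
        if self._lstat is None:
            self._lstat = os.lstat(self._full_path)
        return self._lstat

    @property
    def size(self):
        # Property style: reads like stored data, which matches the
        # cached, point-in-time nature of the value.
        return self.lstat().st_size
```

Either way the value is a snapshot from the first lookup. Code that needs
the current size still has to call os.lstat() or os.path.getsize() itself.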

Other comments:

* +1 on the general idea
* +1 on scandir() over iterdir, since it *isn't* just an iterator version
of listdir
* -1 on including Windows specific globbing support in the API
* -0 on including cross platform globbing support in the initial iteration
of the API (that could be done later as a separate RFE instead)
* +1 on a new section in the PEP covering rejected design options (calling
it iterdir, returning a 2-tuple instead of a dedicated DirEntry type)
* regarding why not a 2-tuple, we know from experience that operating
systems evolve and we end up wanting to add additional info to this kind of
API. A dedicated DirEntry type lets us adjust the information returned over
time, without breaking backwards compatibility and without resorting to
ugly hacks like those in some of the time and stat APIs (or even our own
codec info APIs)
* it would be nice to see some relative performance numbers for NFS and
CIFS network shares - the additional network round trips can make excessive
stat calls absolutely brutal from a speed perspective when using a network
drive (that's why the stat caching added to the import system in 3.3
dramatically sped up the case of having network drives on sys.path, and why
I thought AJ had a point when he was complaining about the fact we didn't
expose the dirent data from os.listdir)

Regards,
Nick.


 regards,
 jwi


[Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-26 Thread Ben Hoyt
Hi Python dev folks,

I've written a PEP proposing a specific os.scandir() API for a
directory iterator that returns the stat-like info from the OS, the
main advantage of which is to speed up os.walk() and similar
operations between 4-20x, depending on your OS and file system. Full
details, background info, and context links are in the PEP, which
Victor Stinner has uploaded at the following URL, and I've also copied
inline below.

http://legacy.python.org/dev/peps/pep-0471/

Would love feedback on the PEP, but also of course on the proposal itself.

-Ben


PEP: 471
Title: os.scandir() function -- a better and faster directory iterator
Version: $Revision$
Last-Modified: $Date$
Author: Ben Hoyt benh...@gmail.com
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 30-May-2014
Python-Version: 3.5


Abstract
========

This PEP proposes including a new directory iteration function,
``os.scandir()``, in the standard library. This new function adds
useful functionality and increases the speed of ``os.walk()`` by 2-10
times (depending on the platform and file system) by significantly
reducing the number of times ``stat()`` needs to be called.


Rationale
=========

Python's built-in ``os.walk()`` is significantly slower than it needs
to be, because -- in addition to calling ``os.listdir()`` on each
directory -- it executes the system call ``os.stat()`` or
``GetFileAttributes()`` on each file to determine whether the entry is
a directory or not.

But the underlying system calls -- ``FindFirstFile`` /
``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
already tell you whether the files returned are directories or not, so
no further system calls are needed. In short, you can reduce the
number of system calls from approximately 2N to N, where N is the
total number of files and directories in the tree. (And because
directory trees are usually much wider than they are deep, it's often
much better than this.)
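
A minimal sketch of the difference, assuming ``os.scandir()`` as it later
shipped in Python 3.5. The two helper names are hypothetical; both return
the names of the subdirectories of ``path``.

```python
import os

def subdirs_with_listdir(path):
    # listdir() reads the directory, then isdir() issues one stat() call
    # per entry on top of that.
    return sorted(
        name for name in os.listdir(path)
        if os.path.isdir(os.path.join(path, name))
    )

def subdirs_with_scandir(path):
    # The entry type usually comes back with the directory listing itself,
    # so is_dir() normally needs no extra system call.
    return sorted(entry.name for entry in os.scandir(path) if entry.is_dir())
```

The results are the same; only the number of system calls differs, which is
where the speedup comes from.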

In practice, removing all those extra system calls makes ``os.walk()``
about **8-9 times as fast on Windows**, and about **2-3 times as fast
on Linux and Mac OS X**. So we're not talking about micro-
optimizations. See more `benchmarks`_.

.. _`benchmarks`: https://github.com/benhoyt/scandir#benchmarks

Somewhat relatedly, many people (see Python `Issue 11406`_) are also
keen on a version of ``os.listdir()`` that yields filenames as it
iterates instead of returning them as one big list. This improves
memory efficiency for iterating very large directories.

So as well as providing a ``scandir()`` iterator function for calling
directly, Python's existing ``os.walk()`` function could be sped up a
huge amount.

.. _`Issue 11406`: http://bugs.python.org/issue11406


Implementation
==============

The implementation of this proposal was written by Ben Hoyt (initial
version) and Tim Golden (who helped a lot with the C extension
module). It lives on GitHub at `benhoyt/scandir`_.

.. _`benhoyt/scandir`: https://github.com/benhoyt/scandir

Note that this module has been used and tested (see the "Use in the wild"
section in this PEP), so it's more than a proof-of-concept. However,
it is marked as beta software and is not extensively battle-tested.
It will need some cleanup and more thorough testing before going into
the standard library, as well as integration into `posixmodule.c`.



Specifics of proposal
=====================

Specifically, this PEP proposes adding a single function to the ``os``
module in the standard library, ``scandir``, that takes a single,
optional string as its argument::

    scandir(path='.') -> generator of DirEntry objects

Like ``listdir``, ``scandir`` calls the operating system's directory
iteration system calls to get the names of the files in the ``path``
directory, but it's different from ``listdir`` in two ways:

* Instead of bare filename strings, it returns lightweight
  ``DirEntry`` objects that hold the filename string and provide
  simple methods that allow access to the stat-like data the operating
  system returned.

* It returns a generator instead of a list, so that ``scandir`` acts
  as a true iterator instead of returning the full list immediately.

``scandir()`` yields a ``DirEntry`` object for each file and directory
in ``path``. Just like ``listdir``, the ``'.'`` and ``'..'``
pseudo-directories are skipped, and the entries are yielded in
system-dependent order. Each ``DirEntry`` object has the following
attributes and methods:

* ``name``: the entry's filename, relative to ``path`` (corresponds to
  the return values of ``os.listdir``)

* ``is_dir()``: like ``os.path.isdir()``, but requires no system calls
  on most systems (Linux, Windows, OS X)

* ``is_file()``: like ``os.path.isfile()``, but requires no system
  calls on most systems (Linux, Windows, OS X)

* ``is_symlink()``: like ``os.path.islink()``, but requires no system
  calls on most systems (Linux, Windows, OS X)

* ``lstat()``: like ``os.lstat()``, but requires no system calls on
  Windows
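
A short sketch of how these attributes and methods might be used together,
assuming ``os.scandir()`` as it later shipped in Python 3.5. The
``classify`` helper is a hypothetical name, not part of the proposal.

```python
import os

def classify(path="."):
    """Map each entry name in path to 'symlink', 'dir', 'file' or 'other'."""
    kinds = {}
    for entry in os.scandir(path):
        # Check is_symlink() first: is_dir() and is_file() follow
        # symlinks by default in the shipped API.
        if entry.is_symlink():
            kinds[entry.name] = "symlink"
        elif entry.is_dir():
            kinds[entry.name] = "dir"
        elif entry.is_file():
            kinds[entry.name] = "file"
        else:
            kinds[entry.name] = "other"
    return kinds
```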


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-26 Thread MRAB

On 2014-06-26 23:59, Ben Hoyt wrote:

Hi Python dev folks,

I've written a PEP proposing a specific os.scandir() API for a
directory iterator that returns the stat-like info from the OS, the
main advantage of which is to speed up os.walk() and similar
operations between 4-20x, depending on your OS and file system. Full
details, background info, and context links are in the PEP, which
Victor Stinner has uploaded at the following URL, and I've also
copied inline below.

http://legacy.python.org/dev/peps/pep-0471/

Would love feedback on the PEP, but also of course on the proposal
itself.


[snip]
Personally, I'd prefer the name 'iterdir' because it emphasises that
it's an iterator.


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-26 Thread Tim Delaney
On 27 June 2014 09:28, MRAB pyt...@mrabarnett.plus.com wrote:

 Personally, I'd prefer the name 'iterdir' because it emphasises that
 it's an iterator.


Exactly what I was going to post (with the added note that there's an
obvious symmetry with listdir).

+1 for iterdir rather than scandir

Other than that:

+1 for adding scandir to the stdlib
-1 for windows_wildcard (it would be an attractive nuisance to write
windows-only code)

Tim Delaney


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-26 Thread Paul Sokolovsky
Hello,

On Thu, 26 Jun 2014 18:59:45 -0400
Ben Hoyt benh...@gmail.com wrote:

 Hi Python dev folks,
 
 I've written a PEP proposing a specific os.scandir() API for a
 directory iterator that returns the stat-like info from the OS, the
 main advantage of which is to speed up os.walk() and similar
 operations between 4-20x, depending on your OS and file system. Full
 details, background info, and context links are in the PEP, which
 Victor Stinner has uploaded at the following URL, and I've also copied
 inline below.

I noticed the obvious inefficiency of os.walk() being implemented in terms of
os.listdir() when I worked on the os module for MicroPython. I essentially
did what your PEP suggests - introduced internal generator function
(ilistdir_ex() in
https://github.com/micropython/micropython-lib/blob/master/os/os/__init__.py#L85
), in terms of which both os.listdir() and os.walk() are implemented.
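
The general shape of that approach can be sketched as follows. This is not
MicroPython's code: _iter_entries is a hypothetical internal generator, and
it is built on os.scandir() (as later shipped in Python 3.5) purely for
illustration.

```python
import os

def _iter_entries(path):
    # Hypothetical internal generator yielding (name, is_dir) pairs.
    # Symlinks are not followed, so a link to a directory counts as a file.
    for entry in os.scandir(path):
        yield entry.name, entry.is_dir(follow_symlinks=False)

def listdir(path="."):
    # Public listdir(): keep the names, drop the type information.
    return [name for name, _ in _iter_entries(path)]

def walk(top):
    # Public walk(): reuse the type information, no per-entry stat().
    dirs, files = [], []
    for name, is_dir in _iter_entries(top):
        (dirs if is_dir else files).append(name)
    yield top, dirs, files
    for name in dirs:
        yield from walk(os.path.join(top, name))
```

The public functions keep their existing signatures; only the private
generator knows about entry types.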


With my MicroPython hat on, os.scandir() would only make things worse.
With the current interface, one can have either an inefficient
implementation (as CPython chose) or an efficient implementation (as
MicroPython chose), all transparently. os.scandir() supposedly opens up
an efficient implementation for everyone, but at the price of bloating
the API and introducing heavy-weight objects to wrap the info. The PEP
calls them "lightweight" DirEntry objects, but that cannot be true,
because all Python objects are heavy-weight, especially those which
have methods.

It would be better if os.scandir() were specified to return a struct
(named tuple) compatible with the return value of os.stat() (with only
the fields relevant to the underlying readdir()-like system call). The
grounds for that are obvious: it's an already existing data interface in
the os module, which is also based on an open standard for operating
systems, POSIX, so if one is to expect something about file attributes,
it's what one can reasonably base expectations on.


But reusing the os.stat struct is glaringly not what's proposed. And
it's clear where that comes from: "[DirEntry.]lstat(): like os.lstat(),
but requires no system calls on Windows". Nice, but OS FooBar can do
much more than Windows: it has a system call to send a file by email,
right when scanning the directory containing it. So why not have a
DirEntry.send_by_email(recipient) method? I hear the answer: it's
because CPython strives to support Windows well, while it doesn't care
about FooBar OS.

And then it again leads to the question I have posed several times:
where's the line between CPython and Python? Is it justified for CPython
to add to (or remove from) the Python stdlib something which is useful
for its users, but useless or complicating for other Python
implementations? Especially taking into account that there's a win32api
module allowing Windows users to use all the wonders of its API?
Especially given that the os.stat struct is itself pretty extensible
(https://docs.python.org/3.4/library/os.html#os.stat : "On other Unix
systems (such as FreeBSD), the following attributes may be
available ...", "On Mac OS systems...", so extra fields can be added
for Windows just the same, if really needed).


 
 http://legacy.python.org/dev/peps/pep-0471/
 
 Would love feedback on the PEP, but also of course on the proposal
 itself.
 
 -Ben
 

[]

-- 
Best regards,
 Paul  mailto:pmis...@gmail.com


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-26 Thread Ethan Furman

On 06/26/2014 04:36 PM, Tim Delaney wrote:

On 27 June 2014 09:28, MRAB wrote:


Personally, I'd prefer the name 'iterdir' because it emphasises that
it's an iterator.


Exactly what I was going to post (with the added note that there's an obvious
symmetry with listdir).

+1 for iterdir rather than scandir

Other than that:

+1 for adding [it] to the stdlib


+1 for all of above

--
~Ethan~


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-26 Thread Benjamin Peterson
On Thu, Jun 26, 2014, at 17:07, Paul Sokolovsky wrote:
 
 With my MicroPython hat on, os.scandir() would make things only worse.
 With current interface, one can either have inefficient implementation
 (like CPython chose) or efficient implementation (like MicroPython
 chose) - all transparently. os.scandir() supposedly opens up efficient
 implementation for everyone, but at the price of bloating API and
 introducing heavy-weight objects to wrap info. PEP calls it
 lightweight DirEntry objects, but that cannot be true, because all
 Python objects are heavy-weight, especially those which have methods.

Why do you think methods make an object more heavyweight? namedtuples
have methods.
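
A quick illustration of that point. The Entry type below is hypothetical,
loosely modelled on a directory entry.

```python
from collections import namedtuple

# Hypothetical record type, used only for this example.
Entry = namedtuple("Entry", ["name", "is_dir"])

e = Entry("docs", True)

# namedtuple instances come with methods such as _replace() and _asdict().
# The methods live on the class, so they add no per-instance storage, and
# each instance is still just a tuple.
renamed = e._replace(name="src")
as_dict = e._asdict()
```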


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-26 Thread Ryan
+1 for scandir.
-1 for iterdir (scandir sounds fancier).
- for windows_wildcard.

Tim Delaney timothy.c.dela...@gmail.com wrote:
On 27 June 2014 09:28, MRAB pyt...@mrabarnett.plus.com wrote:

 Personally, I'd prefer the name 'iterdir' because it emphasises that
 it's an iterator.


Exactly what I was going to post (with the added note that there's an
obvious symmetry with listdir).

+1 for iterdir rather than scandir

Other than that:

+1 for adding scandir to the stdlib
-1 for windows_wildcard (it would be an attractive nuisance to write
windows-only code)

Tim Delaney






Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-26 Thread Ben Hoyt
I don't mind iterdir() and would take it :-), but I'll just say why I
chose the name scandir() -- though it wasn't my suggestion originally:

iterdir() sounds like just an iterator version of listdir(), kinda
like keys() and iterkeys() in Python 2. Whereas in actual fact the
return values are quite different (DirEntry objects vs strings), and
so the name change reflects that difference a little.

I'm also -1 on windows_wildcard. I think it's asking for trouble, and
wouldn't gain much on Windows in most cases anyway.

-Ben

On Thu, Jun 26, 2014 at 7:43 PM, Ethan Furman et...@stoneleaf.us wrote:
 On 06/26/2014 04:36 PM, Tim Delaney wrote:

 On 27 June 2014 09:28, MRAB wrote:


 Personally, I'd prefer the name 'iterdir' because it emphasises that
 it's an iterator.


 Exactly what I was going to post (with the added note that there's an
 obvious symmetry with listdir).

 +1 for iterdir rather than scandir

 Other than that:

 +1 for adding [it] to the stdlib


 +1 for all of above

 --
 ~Ethan~



Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-26 Thread MRAB

On 2014-06-27 02:37, Ben Hoyt wrote:

I don't mind iterdir() and would take it :-), but I'll just say why I
chose the name scandir() -- though it wasn't my suggestion originally:

iterdir() sounds like just an iterator version of listdir(), kinda
like keys() and iterkeys() in Python 2. Whereas in actual fact the
return values are quite different (DirEntry objects vs strings), and
so the name change reflects that difference a little.


[snip]

The re module has 'findall', which returns a list of strings, and
'finditer', which returns an iterator that yields match objects, so
there's a precedent. :-)
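
The precedent can be seen in a short sketch; the sample text and pattern below are made up for the example:

```python
import re

text = "scandir iterdir listdir"
pattern = r"\w+dir"

as_list = re.findall(pattern, text)     # a list of strings
as_iter = re.finditer(pattern, text)    # an iterator of match objects

# Match objects carry more than the matched text, e.g. the start offset.
spans = [(m.group(), m.start()) for m in as_iter]

print(as_list)
print(spans)
```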



Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-26 Thread Ben Hoyt
 os.listdir() when I worked on os module for MicroPython. I essentially
 did what your PEP suggests - introduced internal generator function
 (ilistdir_ex() in
 https://github.com/micropython/micropython-lib/blob/master/os/os/__init__.py#L85
 ), in terms of which both os.listdir() and os.walk() are implemented.

Nice (though I see the implementation is very *nix specific).

 With my MicroPython hat on, os.scandir() would make things only worse.
 With current interface, one can either have inefficient implementation
 (like CPython chose) or efficient implementation (like MicroPython
 chose) - all transparently. os.scandir() supposedly opens up efficient
 implementation for everyone, but at the price of bloating API and
 introducing heavy-weight objects to wrap info. PEP calls it
 lightweight DirEntry objects, but that cannot be true, because all
 Python objects are heavy-weight, especially those which have methods.

It's a fair point that os.walk() can be implemented efficiently
without adding a new function and API. However, often you'll want more
info, like the file size, which scandir() can give you via
DirEntry.lstat(), which is free on Windows. So opening up this
efficient API is beneficial.
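
A rough sketch of the file-size case. It assumes the API as it eventually shipped in Python 3.5, where the ``DirEntry.lstat()`` discussed here is spelled ``entry.stat(follow_symlinks=False)``; the helper name ``total_size`` is hypothetical:

```python
import os


def total_size(path):
    """Sum the sizes of regular files directly inside path (non-recursive)."""
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            # The file type normally comes from the directory listing itself,
            # so is_file() rarely needs an extra system call.
            if entry.is_file(follow_symlinks=False):
                # Cached on the entry after the first call; on POSIX this
                # first call still costs one lstat-style system call.
                total += entry.stat(follow_symlinks=False).st_size
    return total
```

Calling ``total_size(".")`` returns the combined size in bytes of the regular files in the current directory, skipping subdirectories and symlinks.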

In CPython, I think the DirEntry objects are as lightweight as
stat_result objects.

I'm an embedded developer by background, so I know the constraints
here, but I really don't think Python's development should be tailored
to fit MicroPython. If os.scandir() is not very efficient on
MicroPython, so be it -- 99% of all desktop/server users will gain
from it.

 It would be better if os.scandir() was specified to return a struct
 (named tuple) compatible with return value of os.stat() (with only
 fields relevant to underlying readdir()-like system call). The grounds
 for that are obvious: it's already existing data interface in module
 os, which is also based on open standard for operating systems -
 POSIX, so if one is to expect something about file attributes, it's
 what one can reasonably base expectations on.

Yes, we considered this early on (see the python-ideas and python-dev
threads referenced in the PEP), but decided it wasn't a great API to
overload stat_result further, and have most of the attributes None or
not present on Linux.

 Especially that os.stat struct is itself pretty extensible
 (https://docs.python.org/3.4/library/os.html#os.stat : On other Unix
 systems (such as FreeBSD), the following attributes may be
 available ..., On Mac OS systems..., - so extra fields can be added
 for Windows just the same, if really needed).

Yes. Incidentally, I just submitted an (accepted) patch for Python 3.5
that adds the full Win32 file attribute data to stat_result objects on
Windows (see https://docs.python.org/3.5/whatsnew/3.5.html#os).
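
As a sketch of how portable code might consume that field, assuming the ``st_file_attributes`` attribute and the ``stat.FILE_ATTRIBUTE_*`` constants from Python 3.5. The helper ``is_hidden`` and its dot-file fallback on non-Windows systems are assumptions made for illustration:

```python
import os
import stat


def is_hidden(path):
    """Best-effort 'hidden file' check that works with or without Win32 data."""
    st = os.stat(path)
    attrs = getattr(st, "st_file_attributes", None)   # only set on Windows
    if attrs is None:
        # Assumption: fall back to the Unix dot-file convention.
        return os.path.basename(path).startswith(".")
    return bool(attrs & stat.FILE_ATTRIBUTE_HIDDEN)
```

Using ``getattr`` with a default follows the convention that an attribute is simply absent on platforms that cannot provide it.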

However, for scandir() to be useful, you also need the name. My
original version of this directory iterator returned two-tuples of
(name, stat_result). But most people didn't like the API, and I don't
really either. You could overload stat_result with a .name attribute
in this case, but it still isn't a nice API to have most of the
attributes None, and then you have to test for that, etc.

So basically we tweaked the API to do what was best, and ended up with
it returning DirEntry objects with is_file() and similar methods.

Hope that helps give a bit more context. If you haven't read the
relevant python-ideas and python-dev threads, those are interesting
too.

-Ben


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-26 Thread Gregory P. Smith
+1 on getting this in for 3.5.

If the only objection people are having is the stupid paint color of the
name, I don't care what it's called!  scandir matches the libc API of the
same name.  iterdir also makes sense to anyone reading it.  Whoever checks
this in can pick one and be done with it.  We have other Python APIs with
iter in the name, and we tend not to mirror C so much these days, so the
iterdir folks do have a valid point.

I'm not a huge fan of the DirEntry object and the method calls on it
instead of simply yielding tuples of (filename,
partially_filled_in_stat_result) but I don't *really* care which is used as
they both work fine and it is trivial to wrap with another generator
expression to turn it into exactly what you want anyways.
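
A minimal sketch of such a wrapper, assuming the ``os.scandir()`` API as shipped in Python 3.5; the name ``scandir_tuples`` is hypothetical. Note that it yields a full ``stat_result`` (possibly one extra system call per entry on POSIX) rather than a partially filled one:

```python
import os


def scandir_tuples(path):
    """Yield (name, stat_result) pairs built from DirEntry objects."""
    with os.scandir(path) as entries:
        for entry in entries:
            yield entry.name, entry.stat(follow_symlinks=False)
```

For example, ``dict(scandir_tuples("."))`` maps each name in the current directory to its ``stat_result``.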

Python not having the ability to operate on large directories means Python
simply cannot be used for common system maintenance tasks.  Python being
slow to walk a file system because of unnecessary stat calls (each often an
entire I/O operation requiring a disk seek!), made only because listdir
throws away information the OS has already returned, is similarly a
problem. This addresses both.

IMNSHO, it is a single function, it belongs in the os module right next to
listdir.

-gps



On Thu, Jun 26, 2014 at 6:37 PM, Ben Hoyt benh...@gmail.com wrote:

 I don't mind iterdir() and would take it :-), but I'll just say why I
 chose the name scandir() -- though it wasn't my suggestion originally:

 iterdir() sounds like just an iterator version of listdir(), kinda
 like keys() and iterkeys() in Python 2. Whereas in actual fact the
 return values are quite different (DirEntry objects vs strings), and
 so the name change reflects that difference a little.

 I'm also -1 on windows_wildcard. I think it's asking for trouble, and
 wouldn't gain much on Windows in most cases anyway.

 -Ben

 On Thu, Jun 26, 2014 at 7:43 PM, Ethan Furman et...@stoneleaf.us wrote:
  On 06/26/2014 04:36 PM, Tim Delaney wrote:
 
  On 27 June 2014 09:28, MRAB wrote:
 
 
  Personally, I'd prefer the name 'iterdir' because it emphasises that
  it's an iterator.
 
 
  Exactly what I was going to post (with the added note that there's an
  obvious symmetry with listdir).
 
  +1 for iterdir rather than scandir
 
  Other than that:
 
  +1 for adding [it] to the stdlib
 
 
  +1 for all of above
 
  --
  ~Ethan~
 


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-26 Thread Steven D'Aprano
On Fri, Jun 27, 2014 at 03:07:46AM +0300, Paul Sokolovsky wrote:

 With my MicroPython hat on, os.scandir() would make things only worse.
 With current interface, one can either have inefficient implementation
 (like CPython chose) or efficient implementation (like MicroPython
 chose) - all transparently. os.scandir() supposedly opens up efficient
 implementation for everyone, but at the price of bloating API and
 introducing heavy-weight objects to wrap info. 

os.scandir is not part of the Python API, it is not a built-in function. 
It is part of the CPython standard library. That means (in my opinion) 
that there is an expectation that other Pythons should provide it, but 
not an absolute requirement. Especially for the os module, which by 
definition is platform-specific. In my opinion that means you have four 
options:

1. provide os.scandir, with exactly the same semantics as on CPython;

2. provide os.scandir, but change its semantics to be more lightweight 
   (e.g. return an ordinary tuple, as you already suggest);

3. don't provide os.scandir at all; or

4. do something different depending on whether the platform is Linux
   or an embedded system.

I would consider any of those acceptable for a library feature, but not 
for a language feature.
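
For code that must also run on an implementation that takes option 3, one way to cope is feature detection. A minimal sketch, assuming the ``os.scandir()`` API as shipped in CPython 3.5; the helper name ``iter_dir_flags`` is hypothetical:

```python
import os


def iter_dir_flags(path):
    """Yield (name, is_dir) pairs, using os.scandir only where it exists."""
    scandir = getattr(os, "scandir", None)
    if scandir is not None:
        with scandir(path) as entries:
            for entry in entries:
                yield entry.name, entry.is_dir()
    else:
        # Fallback: one extra stat call per name, as os.walk() did before.
        for name in os.listdir(path):
            yield name, os.path.isdir(os.path.join(path, name))
```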


[...]
 But reusing os.stat struct is glaringly not what's proposed. And
 it's clear where that comes from - [DirEntry.]lstat(): like os.lstat(),
 but requires no system calls on Windows. Nice, but OS FooBar can do
 much more than Windows - it has a system call to send a file by email,
 right when scanning a directory containing it. So, why not have a
 DirEntry.send_by_email(recipient) method? I hear the answer - it's
 because CPython strives to support Windows well, while it doesn't care
 about FooBar OS.

Correct. If there is sufficient demand for FooBar, then CPython may 
support it. Until then, FooBarPython can support it, and offer whatever 
platform-specific features are needed within its standard library.


 And then it again leads to the question I posed several times - where's
 the line between CPython and Python? Is it justified for CPython to add
 to (or remove from) the Python stdlib something which is useful for its
 users, but useless or complicating for other Python implementations?

I think so. And other implementations are free to do the same thing.

Of course there is an expectation that the standard library of most 
implementations will be broadly similar, but not that they will be 
identical.

I am surprised that both Jython and IronPython offer a non-functioning
dis module: you can import it successfully, but if there's a way to 
actually use it, I haven't found it:


steve@orac:~$ jython
Jython 2.5.1+ (Release_2_5_1, Aug 4 2010, 07:18:19)
[OpenJDK Server VM (Sun Microsystems Inc.)] on java1.6.0_27
Type "help", "copyright", "credits" or "license" for more information.
>>> import dis
>>> dis.dis(lambda x: x+1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/share/jython/Lib/dis.py", line 42, in dis
    disassemble(x)
  File "/usr/share/jython/Lib/dis.py", line 64, in disassemble
    linestarts = dict(findlinestarts(co))
  File "/usr/share/jython/Lib/dis.py", line 183, in findlinestarts
    byte_increments = [ord(c) for c in code.co_lnotab[0::2]]
AttributeError: 'tablecode' object has no attribute 'co_lnotab'


IronPython gives a different exception:

steve@orac:~$ ipy
IronPython 2.6 Beta 2 DEBUG (2.6.0.20) on .NET 2.0.50727.1433
Type "help", "copyright", "credits" or "license" for more information.
>>> import dis
>>> dis.dis(lambda x: x+1)
Traceback (most recent call last):
TypeError: don't know how to disassemble code objects


It's quite annoying; I would rather they had just removed the
module altogether. Better still would have been to disassemble code
objects to whatever byte code the Java and .Net platforms use. But 
there's surely no requirement to disassemble to CPython byte code!
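
Portable code that wants a disassembly therefore has to guard the call. A minimal sketch, assuming Python 3.4 or later (where ``dis.dis()`` accepts a ``file`` argument) and treating the two exception types seen above as the failure modes; the helper name ``try_disassemble`` is hypothetical:

```python
import dis
import io


def try_disassemble(func):
    """Return the disassembly of func as text, or None if dis cannot cope."""
    buf = io.StringIO()
    try:
        dis.dis(func, file=buf)
    except (AttributeError, TypeError):
        # The failures shown in the Jython and IronPython sessions above.
        return None
    return buf.getvalue()


listing = try_disassemble(lambda x: x + 1)
print(listing if listing is not None else "dis is not usable here")
```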



-- 
Steven


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-26 Thread Steven D'Aprano
On Thu, Jun 26, 2014 at 09:37:50PM -0400, Ben Hoyt wrote:
 I don't mind iterdir() and would take it :-), but I'll just say why I
 chose the name scandir() -- though it wasn't my suggestion originally:
 
 iterdir() sounds like just an iterator version of listdir(), kinda
 like keys() and iterkeys() in Python 2. Whereas in actual fact the
 return values are quite different (DirEntry objects vs strings), and
 so the name change reflects that difference a little.

+1 

I think that's a good objective reason to prefer scandir, which suits 
me, because my subjective opinion is that iterdir is an inelegant 
and less than attractive name.


-- 
Steven


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-26 Thread Glenn Linderman

I'm generally +1, with opinions noted below on these two topics.

On 6/26/2014 3:59 PM, Ben Hoyt wrote:

Should there be a way to access the full path?
----------------------------------------------

Should ``DirEntry``'s have a way to get the full path without using
``os.path.join(path, entry.name)``? This is a pretty common pattern,
and it may be useful to add pathlib-like ``str(entry)`` functionality.
This functionality has also been requested in `issue 13`_ on GitHub.

.. _`issue 13`: https://github.com/benhoyt/scandir/issues/13


+1
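
A sketch of the two spellings side by side. It assumes the ``DirEntry.path`` attribute that the shipped Python 3.5 API ended up providing; the helper name ``full_paths`` is hypothetical:

```python
import os


def full_paths(path):
    """Return sorted (joined, entry.path) pairs for each entry in path."""
    with os.scandir(path) as entries:
        return sorted((os.path.join(path, e.name), e.path) for e in entries)


for joined, from_entry in full_paths(os.curdir):
    print(joined, from_entry)
```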


Should it expose Windows wildcard functionality?
------------------------------------------------

Should ``scandir()`` have a way of exposing the wildcard functionality
in the Windows ``FindFirstFile`` / ``FindNextFile`` functions? The
scandir module on GitHub exposes this as a ``windows_wildcard``
keyword argument, allowing Windows power users the option to pass a
custom wildcard to ``FindFirstFile``, which may avoid the need to use
``fnmatch`` or similar on the resulting names. It is named the
unwieldy ``windows_wildcard`` to remind you you're writing
power-user, Windows-only code if you use it.

This boils down to whether ``scandir`` should be about exposing all of
the system's directory iteration features, or simply providing a fast,
simple, cross-platform directory iteration API.

This PEP's author votes for not including ``windows_wildcard`` in the
standard library version, because even though it could be useful in
rare cases (say the Windows Dropbox client?), it'd be too easy to use
it just because you're a Windows developer, and create code that is
not cross-platform.


Because another common pattern is to check whether a name matches a
pattern, I think it would be good to have a feature that provides this.
I do that in my own private directory listing extensions, and some
command lines also expose it to the user.  Where exposed to the user, I
use -p windows-pattern and -P regexp. My implementation converts the
windows-pattern to a regexp and then uses common code. For this
particular API, though, because the windows_wildcard can be optimized by
the Windows API call used, it would make more sense to pass
windows_wildcard directly to FindFirstFile on Windows, and to convert it
to a regexp on *nix. Both Windows and *nix would call re to process
pattern matches, except when a Windows pattern is passed in on Windows.
The alternate parameter could simply be called wildcard, and would be a
regexp. If desired, other flavors of wildcard (bsd_wildcard?) could also
be implemented, but I'm not sure there are any benefits to them since,
as far as I am aware, those systems have no optimizations for those
patterns.
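
The convert-to-regexp path can be sketched with the standard library alone. This is a rough illustration, not the proposed API: it assumes ``os.scandir()`` as shipped in Python 3.5, uses ``fnmatch.translate()`` for the conversion (shell-style wildcards, which are not identical to ``FindFirstFile`` matching rules), and the name ``scandir_matching`` is hypothetical:

```python
import fnmatch
import os
import re


def scandir_matching(path, wildcard):
    """Yield DirEntry objects whose names match a shell-style wildcard."""
    # translate() produces an anchored regexp, so match() tests the whole name.
    regex = re.compile(fnmatch.translate(wildcard))
    with os.scandir(path) as entries:
        for entry in entries:
            if regex.match(entry.name):
                yield entry
```

For example, ``[e.name for e in scandir_matching(".", "*.py")]`` lists the Python files in the current directory on any platform.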