Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Gregory P. Smith
On Jun 28, 2014 12:49 PM, Ben Hoyt benh...@gmail.com wrote:

  But the underlying system calls -- ``FindFirstFile`` /
  ``FindNextFile`` on Windows and ``readdir`` on Linux and OS X --
 
  What about FreeBSD, OpenBSD, NetBSD, Solaris, etc.? Don't they provide
readdir?

 I guess it'd be better to say Windows and Unix-based OSs
 throughout the PEP? Because all of these (including Mac OS X) are
 Unix-based.

No, just say POSIX.


  It looks like the WIN32_FIND_DATA has a dwFileAttributes field. So we
  should mimic stat_result recent addition: the new
  stat_result.file_attributes field. Add DirEntry.file_attributes which
  would only be available on Windows.
 
  The Windows structure also contains
 
      FILETIME ftCreationTime;
      FILETIME ftLastAccessTime;
      FILETIME ftLastWriteTime;
      DWORD    nFileSizeHigh;
      DWORD    nFileSizeLow;
 
  It would be nice to expose them as well. I'm no longer surprised that
  the exact API differs depending on the OS for functions in the os
  module.

 I think you've misunderstood how DirEntry.lstat() works on Windows --
 it's basically a no-op, as Windows returns the full stat information
 with the original FindFirst/FindNext OS calls. This is fairly explicit
 in the PEP, but I'm sure I could make it clearer:

 DirEntry.lstat(): like os.lstat(), but requires no system calls on
Windows

 So you can already get the dwFileAttributes for free by saying
 entry.lstat().st_file_attributes. You can also get all the other
 fields you mentioned for free via .lstat() with no additional OS calls
 on Windows, for example: entry.lstat().st_size.

 Feel free to suggest changes to the PEP or scandir docs if this isn't
 clear. Note that is_dir()/is_file()/is_symlink() are free on all
 systems, but .lstat() is only free on Windows.
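
The cost model described above can be sketched with a small tree-size helper. This is only an illustration, and it assumes the API as it eventually shipped in Python 3.5, where `entry.stat(follow_symlinks=False)` plays the role of the `lstat()` method discussed in this thread. `tree_size` is a hypothetical helper name, not part of the PEP.

```python
import os
import tempfile


def tree_size(path):
    """Return the total size in bytes of regular files under path."""
    total = 0
    for entry in os.scandir(path):
        # is_dir()/is_file() normally need no extra system call, because
        # the directory scan already reported the entry type.
        if entry.is_dir(follow_symlinks=False):
            total += tree_size(entry.path)
        elif entry.is_file(follow_symlinks=False):
            # Free on Windows; costs one lstat call per file on POSIX.
            total += entry.stat(follow_symlinks=False).st_size
        # Symlinks and other special entries are skipped.
    return total


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    with open(os.path.join(root, "a.bin"), "wb") as f:
        f.write(b"x" * 10)
    with open(os.path.join(root, "sub", "b.bin"), "wb") as f:
        f.write(b"y" * 5)
    demo_total = tree_size(root)
```

Only the size lookup can trigger an extra OS call here, which is why the type checks are cheap on every platform while the stat data is cheap only on Windows.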

  Does your implementation use a free list to avoid the cost of memory
  allocation? A short free list of 10 or maybe just 1 may help. The free
  list may be stored directly in the generator object.

 No, it doesn't. I might add this to the PEP under possible
 improvements. However, I think the speed increase by removing the
 extra OS call and/or disk seek is going to be way more than memory
 allocation improvements, so I'm not sure this would be worth it.

  Does it also support bytes filenames on UNIX?

  Python now supports undecodable filenames thanks to PEP 383
  (surrogateescape). I prefer to use the same type for filenames on
  Linux and Windows, so Unicode is better. But some users might prefer
  bytes for other reasons.

 I forget exactly now what my scandir module does, but for os.scandir()
 I think this should behave exactly like os.listdir() does for
 Unicode/bytes filenames.
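
As a rough illustration of "behave exactly like os.listdir()": `os.listdir()` returns names of the same type as its argument, and the `os.scandir()` that shipped in Python 3.5 mirrors that for `entry.name`. The file name `spam.txt` below is just a placeholder for the demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "spam.txt"), "w").close()

    # str path in, str names out.
    str_names = os.listdir(root)
    scan_str = [entry.name for entry in os.scandir(root)]

    # bytes path in, bytes names out (encoded with the filesystem encoding).
    bytes_root = os.fsencode(root)
    bytes_names = os.listdir(bytes_root)
    scan_bytes = [entry.name for entry in os.scandir(bytes_root)]
```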

  Crazy idea: would it be possible to convert a DirEntry object to a
  pathlib.Path object without losing the cache? I guess that
  pathlib.Path expects a full  stat_result object.

 The main problem is that pathlib.Path objects explicitly don't cache
 stat info (and Guido doesn't want them to, for good reason I think).
 There's a thread on python-dev about this earlier. I'll add it to a
 Rejected ideas section.

  I don't understand how you can build a full lstat() result without
  really calling stat. I see that WIN32_FIND_DATA contains the size, but
  here you call lstat().

 See above.

  Do you plan to continue to maintain your module for Python < 3.5, but
  upgrade your module for the final PEP?

 Yes, I intend to maintain the standalone scandir module for 2.6 <=
 Python < 3.5, at least for a good while. For integration into the
 Python 3.5 stdlib, the implementation will be integrated into
 posixmodule.c, of course.

  Should there be a way to access the full path?
  ----------------------------------------------
 
  Should ``DirEntry``'s have a way to get the full path without using
  ``os.path.join(path, entry.name)``? This is a pretty common pattern,
  and it may be useful to add pathlib-like ``str(entry)`` functionality.
  This functionality has also been requested in `issue 13`_ on GitHub.
 
  .. _`issue 13`: https://github.com/benhoyt/scandir/issues/13
 
  I think that it would be very convenient to store the directory name
  in the DirEntry. It should be light, it's just a reference.
 
  And provide a fullname() method which would just return
  os.path.join(path, entry.name) without trying to resolve path to get
  an absolute path.

 Yeah, fair suggestion. I'm still slightly on the fence about this, but
 I think an explicit fullname() is a good suggestion. Ideally I think
 it'd be better to mimic pathlib.Path.__str__() which is kind of the
 equivalent of fullname(). But how does pathlib deal with unicode/bytes
 issues if it's the str function which has to return a str object? Or
 at least, it'd be very weird if __str__() returned bytes. But I think
 it'd need to if you passed bytes into scandir(). Do others have
 thoughts?
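
A minimal sketch of the `fullname()` idea, assuming a hypothetical `NamedEntry` class that simply keeps a reference to the directory it came from. It also shows why a method sidesteps the `__str__()` problem: `os.path.join()` returns bytes when given bytes, which `__str__()` is not allowed to do. (The API that eventually shipped exposes the same value as an `entry.path` attribute.)

```python
import os


class NamedEntry:
    """Hypothetical stand-in for DirEntry that remembers its directory."""

    __slots__ = ("dirpath", "name")

    def __init__(self, dirpath, name):
        self.dirpath = dirpath
        self.name = name

    def fullname(self):
        # Plain join only: no abspath()/realpath() resolution is attempted.
        return os.path.join(self.dirpath, self.name)


str_entry = NamedEntry("photos", "cat.jpg")
bytes_entry = NamedEntry(b"photos", b"cat.jpg")

str_full = str_entry.fullname()      # str in, str out
bytes_full = bytes_entry.fullname()  # bytes in, bytes out
```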

  Would it be hard to implement the wildcard feature on UNIX to compare
  performances of scandir('*.jpg') with and without the 

Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Walter Dörwald

On 28 Jun 2014, at 21:48, Ben Hoyt wrote:


[...]

Crazy idea: would it be possible to convert a DirEntry object to a
pathlib.Path object without losing the cache? I guess that
pathlib.Path expects a full  stat_result object.


The main problem is that pathlib.Path objects explicitly don't cache
stat info (and Guido doesn't want them to, for good reason I think).
There's a thread on python-dev about this earlier. I'll add it to a
Rejected ideas section.


However, it would be bad to have two implementations of the concept of 
filename with different attribute and method names.


The best way to ensure compatible APIs would be if one class was derived 
from the other.



[...]


Servus,
   Walter
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Steven D'Aprano
On Sat, Jun 28, 2014 at 03:55:00PM -0400, Ben Hoyt wrote:
 Re is_dir etc being properties rather than methods:
[...]
 The problem with this is that properties look free, they look just
 like attribute access, so you wouldn't normally handle exceptions when
 accessing them. But .lstat() and .is_dir() etc may do an OS call, so
 if you're needing to be careful with error handling, you may want to
 handle errors on them. Hence I think it's best practice to make them
 functions().

I think this one could go either way. Methods look like they actually 
re-test the value each time you call it. I can easily see people not 
realising that the value is cached and writing code like this toy 
example:


# Detect a file change.
t = the_file.lstat().st_mtime
while the_file.lstat().st_mtime == t:
    sleep(0.1)
print("Changed!")


I know that's not the best way to detect file changes, but I'm sure 
people will do something like that and not realise that the call to 
lstat is cached.
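
The pitfall can be reproduced with a small stand-in class. `CachingEntry` is hypothetical and only mimics the "fetch once, then reuse" behaviour under discussion. The timestamps are set explicitly with `os.utime()` so the outcome does not depend on clock resolution.

```python
import os
import tempfile


class CachingEntry:
    """Hypothetical entry whose lstat() result is fetched once and reused."""

    def __init__(self, path):
        self.path = path
        self._lstat = None

    def lstat(self):
        if self._lstat is None:
            self._lstat = os.lstat(self.path)  # the only OS call ever made
        return self._lstat


with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "watched.txt")
    open(path, "w").close()
    os.utime(path, (1000, 1000))   # set atime and mtime to a known value

    the_file = CachingEntry(path)
    before = the_file.lstat().st_mtime

    os.utime(path, (2000, 2000))   # the file "changes" on disk

    cached_after = the_file.lstat().st_mtime   # still the old value
    fresh_after = os.lstat(path).st_mtime      # what the OS reports now
```

With such a cache, the polling loop in the toy example would never terminate, because the second and later calls never reach the filesystem.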

Personally, I would prefer a property. If I forget to wrap a call in a 
try...except, it will fail hard and I will get an exception. But with a 
method call, the failure is silent and I keep getting the cached result.

Speaking of caching, is there a way to freshen the cached values?


-- 
Steven


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Nick Coghlan
On 29 June 2014 20:52, Steven D'Aprano st...@pearwood.info wrote:
 Speaking of caching, is there a way to freshen the cached values?

Switch to a full Path object instead of relying on the cached DirEntry data.

This is what makes me wary of including lstat, even though Windows
offers it without the extra stat call. Caching behaviour is *really*
hard to make intuitive, especially when it *sometimes* returns data
that looks fresh (as it does on first call on POSIX systems).

Regards,
Nick.


-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Paul Moore
On 29 June 2014 12:08, Nick Coghlan ncogh...@gmail.com wrote:
 This is what makes me wary of including lstat, even though Windows
 offers it without the extra stat call. Caching behaviour is *really*
 hard to make intuitive, especially when it *sometimes* returns data
 that looks fresh (as it does on first call on POSIX systems).

If it matters that much we *could* simply call it cached_lstat(). It's
ugly, but I really don't like the idea of throwing the information
away - after all, the fact that we currently throw data away is why
there's even a need for scandir. Let's not make the same mistake
again...

Paul


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Nick Coghlan
On 29 June 2014 21:45, Paul Moore p.f.mo...@gmail.com wrote:
 On 29 June 2014 12:08, Nick Coghlan ncogh...@gmail.com wrote:
 This is what makes me wary of including lstat, even though Windows
 offers it without the extra stat call. Caching behaviour is *really*
 hard to make intuitive, especially when it *sometimes* returns data
 that looks fresh (as it does on first call on POSIX systems).

 If it matters that much we *could* simply call it cached_lstat(). It's
 ugly, but I really don't like the idea of throwing the information
 away - after all, the fact that we currently throw data away is why
 there's even a need for scandir. Let's not make the same mistake
 again...

Future-proofing is the reason DirEntry is a full fledged class in the
first place, though.

Effectively communicating the behavioural difference between DirEntry
and pathlib.Path is the main thing that makes me nervous about
adhering too closely to the Path API.

To restate the problem and the alternative proposal, these are the
DirEntry methods under discussion:

is_dir(): like os.path.isdir(), but requires no system calls on at
least POSIX and Windows
is_file(): like os.path.isfile(), but requires no system calls on
at least POSIX and Windows
is_symlink(): like os.path.islink(), but requires no system calls
on at least POSIX and Windows
lstat(): like os.lstat(), but requires no system calls on Windows

For the almost-certain-to-be-cached items, the suggestion is to make
them properties (or just ordinary attributes):

is_dir
is_file
is_symlink

What to do with lstat() is currently less clear, since POSIX directory
scanning doesn't provide that level of detail by default.

The PEP also doesn't currently state whether the is_dir(), is_file()
and is_symlink() results would be updated if a call to lstat()
produced different answers than the original directory scanning
process, which further suggests to me that allowing the stat call to
be delayed on POSIX systems is a potentially problematic and
inherently confusing design. We would have two options:

- update them, meaning calling lstat() may change those results from
being a snapshot of the setting at the time the directory was scanned
- leave them alone, meaning the DirEntry object and the
DirEntry.lstat() result may give different answers

Those both sound ugly to me.

So, here's my alternative proposal: add an ensure_lstat flag to
scandir() itself, and don't have *any* methods on DirEntry, only
attributes.

That would make the DirEntry attributes:

is_dir: boolean, always populated
is_file: boolean, always populated
is_symlink: boolean, always populated
lstat_result: stat result, may be None on POSIX systems if
ensure_lstat is False

(I'm not particularly sold on lstat_result as the name, but lstat
reads as a verb to me, so doesn't sound right as an attribute name)

What this would allow:

- by default, scanning is efficient everywhere, but lstat_result may
be None on POSIX systems
- if you always need the lstat result, setting ensure_lstat will
trigger the extra system call implicitly
- if you only sometimes need the stat result, you can call os.lstat()
explicitly when the DirEntry lstat attribute is None

Most importantly, *regardless of platform*, the cached stat result (if
not None) would reflect the state of the entry at the time the
directory was scanned, rather than at some arbitrary later point in
time when lstat() was first called on the DirEntry object.

There'd still be a slight window of discrepancy (since the filesystem
state may change between reading the directory entry and making the
lstat() call), but this could be effectively eliminated from the
perspective of the Python code by making the result of the lstat()
call authoritative for the whole DirEntry object.
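
The proposal above can be sketched as a thin wrapper over the `os.scandir()` that later shipped in Python 3.5. Everything named here (`SnapshotEntry`, `scandir_snapshot`, and the `ensure_lstat` and `lstat_result` spellings taken from this message) is hypothetical. As a simplification, `lstat_result` is None on every platform when `ensure_lstat` is false, whereas the proposal would populate it on Windows for free.

```python
import os
import stat
import tempfile


class SnapshotEntry:
    """Attribute-only entry; all values are fixed at scan time."""

    __slots__ = ("name", "is_dir", "is_file", "is_symlink", "lstat_result")

    def __init__(self, name, is_dir, is_file, is_symlink, lstat_result):
        self.name = name
        self.is_dir = is_dir
        self.is_file = is_file
        self.is_symlink = is_symlink
        self.lstat_result = lstat_result


def scandir_snapshot(path=".", ensure_lstat=False):
    """Yield SnapshotEntry objects for the entries in path."""
    with os.scandir(path) as it:
        for entry in it:
            if ensure_lstat:
                # The lstat data is authoritative: the type flags are
                # derived from it, so the entry cannot disagree with itself.
                st = entry.stat(follow_symlinks=False)
                mode = st.st_mode
                yield SnapshotEntry(entry.name, stat.S_ISDIR(mode),
                                    stat.S_ISREG(mode), stat.S_ISLNK(mode), st)
            else:
                # Cheap path: rely on what the directory scan reported.
                yield SnapshotEntry(entry.name,
                                    entry.is_dir(follow_symlinks=False),
                                    entry.is_file(follow_symlinks=False),
                                    entry.is_symlink(), None)


with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    open(os.path.join(root, "f.txt"), "w").close()
    cheap = {e.name: e for e in scandir_snapshot(root)}
    full = {e.name: e for e in scandir_snapshot(root, ensure_lstat=True)}
```

A caller that only sometimes needs stat data would use the cheap form and call `os.lstat()` itself whenever `lstat_result` is None.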

Regards,
Nick.

P.S. We'd be generating quite a few of these, so we can use __slots__
to keep the memory overhead to a minimum (that's just a general
comment - it's really irrelevant to the methods-or-attributes
question).


-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Jonas Wielicki
On 29.06.2014 13:08, Nick Coghlan wrote:
 On 29 June 2014 20:52, Steven D'Aprano st...@pearwood.info wrote:
 Speaking of caching, is there a way to freshen the cached values?
 
 Switch to a full Path object instead of relying on the cached DirEntry data.
 
 This is what makes me wary of including lstat, even though Windows
 offers it without the extra stat call. Caching behaviour is *really*
 hard to make intuitive, especially when it *sometimes* returns data
 that looks fresh (as it does on first call on POSIX systems).

This bugs me too. An idea I had was adding a keyword argument to scandir
which specifies whether stat data should be added to the direntry or not.

If the flag is set to True, this would implicitly call lstat on POSIX
before returning the DirEntry, and use the available data on Windows.

If the flag is set to False, all the fields in the DirEntry will be
None, for consistency, even on Windows.


This is not optimal in cases where the stat information is needed only
for some of the DirEntry objects, but would also reduce the required
logic in the DirEntry object.

Thoughts?

 
 Regards,
 Nick.
 
 



Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Ethan Furman

On 06/29/2014 05:28 AM, Nick Coghlan wrote:


So, here's my alternative proposal: add an ensure_lstat flag to
scandir() itself, and don't have *any* methods on DirEntry, only
attributes.

That would make the DirEntry attributes:

 is_dir: boolean, always populated
 is_file: boolean, always populated
 is_symlink: boolean, always populated
 lstat_result: stat result, may be None on POSIX systems if
ensure_lstat is False

(I'm not particularly sold on lstat_result as the name, but lstat
reads as a verb to me, so doesn't sound right as an attribute name)


+1

--
~Ethan~


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Ethan Furman

On 06/29/2014 04:12 AM, Jonas Wielicki wrote:


If the flag is set to False, all the fields in the DirEntry will be
None, for consistency, even on Windows.


-1

This consistency is unnecessary.

--
~Ethan~


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Akira Li
Chris Angelico ros...@gmail.com writes:

 On Sat, Jun 28, 2014 at 11:05 PM, Akira Li 4kir4...@gmail.com wrote:
 Have you considered adding support for paths relative to directory
 descriptors [1] via keyword only dir_fd=None parameter if it may lead to
 more efficient implementations on some platforms?

 [1]: https://docs.python.org/3.4/library/os.html#dir-fd

 Potentially more efficient and also potentially safer (see 'man
 openat')... but an enhancement that can wait, if necessary.


Introducing the feature later creates unnecessary incompatibilities.
It should either be added now or be explicitly rejected in PEP 471, with
something like `os.scandir(os.open(relative_path, dir_fd=fd))` recommended
instead (assuming `os.scandir in os.supports_fd`, as with `os.listdir()`).
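
The workaround spelled out above can be sketched as follows. It assumes a platform and Python version where `os.scandir()` accepts a file descriptor (which the released function does on some Unix systems from Python 3.7 on), so the example checks `os.supports_fd` and `os.supports_dir_fd` first. `names_relative_to` is a hypothetical helper name.

```python
import os
import tempfile


def names_relative_to(dir_fd, relative_path):
    """Return sorted entry names of relative_path, resolved against dir_fd."""
    fd = os.open(relative_path, os.O_RDONLY, dir_fd=dir_fd)
    try:
        with os.scandir(fd) as it:
            return sorted(entry.name for entry in it)
    finally:
        # scandir() does not take ownership of the descriptor it is given.
        os.close(fd)


if os.scandir in os.supports_fd and os.open in os.supports_dir_fd:
    with tempfile.TemporaryDirectory() as parent:
        os.mkdir(os.path.join(parent, "child"))
        open(os.path.join(parent, "child", "b.txt"), "w").close()
        parent_fd = os.open(parent, os.O_RDONLY)
        try:
            found = names_relative_to(parent_fd, "child")
        finally:
            os.close(parent_fd)
else:
    found = None  # feature unavailable on this platform
```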

At C level it could be implemented using fdopendir/openat or scandirat.

Here's the function description using Argument Clinic DSL:

/*[clinic input]

os.scandir

    path : path_t(allow_fd=True, nullable=True) = '.'

        *path* can be specified as either str or bytes. On some
        platforms, *path* may also be specified as an open file
        descriptor; the file descriptor must refer to a directory.  If
        this functionality is unavailable, using it raises
        NotImplementedError.

    *

    dir_fd : dir_fd = None

        If not None, it should be a file descriptor open to a
        directory, and *path* should be a relative string; path will
        then be relative to that directory.  If *dir_fd* is
        unavailable, using it raises NotImplementedError.

Yield a DirEntry object for each file and directory in *path*.

Just like os.listdir, the '.' and '..' pseudo-directories are skipped,
and the entries are yielded in system-dependent order.

{parameters}
It's an error to use *dir_fd* when specifying *path* as an open file
descriptor.

[clinic start generated code]*/


And corresponding tests (from test_posix:PosixTester), to show the
compatibility with os.listdir argument parsing in detail:

def test_scandir_default(self):
    # When scandir is called without argument,
    # it's the same as scandir(os.curdir).
    self.assertIn(support.TESTFN, [e.name for e in posix.scandir()])

def _test_scandir(self, curdir):
    filenames = sorted(e.name for e in posix.scandir(curdir))
    self.assertIn(support.TESTFN, filenames)
    #NOTE: assume listdir, scandir accept the same types on the platform
    self.assertEqual(sorted(posix.listdir(curdir)), filenames)

def test_scandir(self):
    self._test_scandir(os.curdir)

def test_scandir_none(self):
    # it's the same as scandir(os.curdir).
    self._test_scandir(None)

def test_scandir_bytes(self):
    # When scandir is called with a bytes object,
    # the returned entries names are still of type str.
    # Call `os.fsencode(entry.name)` to get bytes
    self.assertIn('a', {'a'})
    self.assertNotIn(b'a', {'a'})
    self._test_scandir(b'.')

@unittest.skipUnless(posix.scandir in os.supports_fd,
                     "test needs fd support for posix.scandir()")
def test_scandir_fd_minus_one(self):
    # it's the same as scandir(os.curdir).
    self._test_scandir(-1)

def test_scandir_float(self):
    # invalid args
    self.assertRaises(TypeError, posix.scandir, -1.0)

@unittest.skipUnless(posix.scandir in os.supports_fd,
                     "test needs fd support for posix.scandir()")
def test_scandir_fd(self):
    fd = posix.open(posix.getcwd(), posix.O_RDONLY)
    self.addCleanup(posix.close, fd)
    self._test_scandir(fd)
    self.assertEqual(
        sorted(posix.scandir('.')),
        sorted(posix.scandir(fd)))
    # call 2nd time to test rewind
    self.assertEqual(
        sorted(posix.scandir('.')),
        sorted(posix.scandir(fd)))

@unittest.skipUnless(posix.scandir in os.supports_dir_fd,
                     "test needs dir_fd support for os.scandir()")
def test_scandir_dir_fd(self):
    relpath = 'relative_path'
    with support.temp_dir() as parent:
        fullpath = os.path.join(parent, relpath)
        with support.temp_dir(path=fullpath):
            support.create_empty_file(os.path.join(parent, 'a'))
            support.create_empty_file(os.path.join(fullpath, 'b'))
            fd = posix.open(parent, posix.O_RDONLY)
            self.addCleanup(posix.close, fd)
            self.assertEqual(
                sorted(posix.scandir(relpath, dir_fd=fd)),
                sorted(posix.scandir(fullpath)))
            # check that fd is still useful
            self.assertEqual(
                sorted(posix.scandir(relpath, dir_fd=fd)),
                sorted(posix.scandir(fullpath)))


--
Akira


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Jonas Wielicki
On 29.06.2014 19:04, Ethan Furman wrote:
 On 06/29/2014 04:12 AM, Jonas Wielicki wrote:

 If the flag is set to False, all the fields in the DirEntry will be
 None, for consistency, even on Windows.
 
 -1

 This consistency is unnecessary.

I’m not sure -- similar to the windows_wildcard option this might be a
temptation to write platform dependent code, although possibly by
accident (i.e. not reading the docs carefully).

 
 -- 
 ~Ethan~
 



Re: [Python-Dev] Fix Unicode-disabled build of Python 2.7

2014-06-29 Thread Berker Peksağ
On Sat, Jun 28, 2014 at 2:51 AM, Victor Stinner
victor.stin...@gmail.com wrote:
 2014-06-26 13:04 GMT+02:00 Antoine Pitrou anto...@python.org:
 For the same reason, I agree with Victor that we should ditch the
 threading-disabled builds. It's too much of a hassle for no actual,
 practical benefit. People who want a threadless unicodeless Python can
 install Python 1.5.2 for all I care.

 By the way, adding a buildbot for testing Python without thread
 support is not enough. The buildbot has been broken for more
 than a month and nobody noticed :-p

I opened http://bugs.python.org/issue21755 a couple of weeks ago to fix
the test.

--Berker


 http://buildbot.python.org/all/builders/AMD64%20Fedora%20without%20threads%203.x/

 Ok, I noticed, but I consider that I spent too much time on this minor
 use case. I prefer to leave such task to someone else :-)

 Victor


Re: [Python-Dev] PEP 471 -- os.scandir() function -- a better and faster directory iterator

2014-06-29 Thread Glenn Linderman

On 6/29/2014 5:28 AM, Nick Coghlan wrote:

There'd still be a slight window of discrepancy (since the filesystem
state may change between reading the directory entry and making the
lstat() call), but this could be effectively eliminated from the
perspective of the Python code by making the result of the lstat()
call authoritative for the whole DirEntry object.


+1 to this in particular, but this whole refresh of the semantics sounds 
better overall.


Finally, for the case where someone does want to keep the DirEntry 
around, a .refresh() API could rerun lstat() and update all the data.


And with that (initial data potentially always populated, or None, and 
an explicit refresh() API), the data could all be returned as 
properties, implying that they aren't fetching new data themselves, 
because they wouldn't be.
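
A minimal sketch of that shape, assuming the "initial data always populated" variant. `RefreshableEntry` and its attribute names are hypothetical. The properties only read the stored snapshot, and `refresh()` is the single place that goes back to the filesystem.

```python
import os
import stat
import tempfile


class RefreshableEntry:
    """Hypothetical entry: data read via properties, updated only on request."""

    def __init__(self, path):
        self.path = path
        self._st = None
        self.refresh()  # populate the initial snapshot

    def refresh(self):
        """Rerun lstat() and replace the stored snapshot."""
        self._st = os.lstat(self.path)

    @property
    def is_dir(self):
        return stat.S_ISDIR(self._st.st_mode)

    @property
    def is_file(self):
        return stat.S_ISREG(self._st.st_mode)

    @property
    def is_symlink(self):
        return stat.S_ISLNK(self._st.st_mode)

    @property
    def lstat_result(self):
        return self._st


with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "data.txt")
    open(path, "w").close()

    entry = RefreshableEntry(path)
    size_before = entry.lstat_result.st_size

    with open(path, "w") as f:
        f.write("abc")

    size_stale = entry.lstat_result.st_size   # snapshot is unchanged
    entry.refresh()
    size_fresh = entry.lstat_result.st_size   # reflects the write
```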


Glenn