[issue25596] regular files handled as directories in the glob module

2016-01-10 Thread Xavier de Gaye

Changes by Xavier de Gaye :


Added file: http://bugs.python.org/file41571/glob_scandir_3.patch

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25596] regular files handled as directories in the glob module

2016-01-10 Thread Xavier de Gaye

Xavier de Gaye added the comment:

FWIW I have followed the idea of having _iterdir() yielding the DirEntry entry 
instead of the name in Serhiy's patch.  There is a slight performance gain. Now 
_glob0() and _glob1() do return a list of directories when dironly is true but 
there is now another place where OSError must be tracked, so it is not clear if 
this is worth it.

glob_scandir_2_diff.patch is the differential patch between 
glob_scandir_2.patch and glob_scandir_3.patch.

Here are the performance tests run with both patches.

$ ./python -m timeit -s "from glob import glob" -- "glob('**/*', 
recursive=True)"
glob_scandir_2.patch: 33.1 msec per loop
glob_scandir_3.patch: 33.8 msec per loop

$ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/*', 
recursive=True)"
glob_scandir_2.patch: 927 msec per loop
glob_scandir_3.patch: 850 msec per loop

$ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/', 
recursive=True)"
glob_scandir_2.patch: 423 msec per loop
glob_scandir_3.patch: 337 msec per loop

--
Added file: http://bugs.python.org/file41570/glob_scandir_2_diff.patch

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25596] regular files handled as directories in the glob module

2016-01-10 Thread Xavier de Gaye

Xavier de Gaye added the comment:

Actually the microbenchmarks comparison between glob_scandir_2.patch and 
glob_scandir_3.patch must be made by removing the call to entry.is_symlink() 
also in glob_scandir_2.patch.
In that case the microbenchmarks give about the same results for each patch.
Sorry for the noise.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25596] regular files handled as directories in the glob module

2016-01-08 Thread Xavier de Gaye

Xavier de Gaye added the comment:

Adding a doc patch.

--
Added file: http://bugs.python.org/file41533/glob_scandir_doc.patch

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25596] regular files handled as directories in the glob module

2016-01-06 Thread Guido van Rossum

Guido van Rossum added the comment:

(IOW once this patch has been applied maybe you can do the same for globbing in 
pathlib as requested in issue #26032.)

--
nosy: +gvanrossum

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25596] regular files handled as directories in the glob module

2016-01-06 Thread STINNER Victor

STINNER Victor added the comment:

Related issue: #26032 "Use scandir() to speed up pathlib globbing".

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25596] regular files handled as directories in the glob module

2016-01-06 Thread R. David Murray

Changes by R. David Murray :


--
nosy:  -r.david.murray

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25596] regular files handled as directories in the glob module

2016-01-04 Thread Xavier de Gaye

Xavier de Gaye added the comment:

> and since _iterdir() results are always accumulated in a list

Right, I failed to note that point. And so, since the file descriptor opened by 
os.scandir() is closed within each call to recursive _rlistdir(), then my other 
comment about EMFILE does not stand as well.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25596] regular files handled as directories in the glob module

2016-01-03 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

I think so too. I just wanted someone to confirmed that it is not 
overcautiousness. For now glob1() is used only in one place in the stdlib 
(issue16620).

But there was other problem with previous patch, the same as with current 
implementation of os.walk() (issue25995). It makes glob to use a lot of file 
descriptors.

Updated patch lefts deprecated glob1() and glob2() and makes glob to consume 
all scandir iterator before starting to yield values (but the problem with fd 
leaks on non-refcounted implementations is still left, issue25994). This 
doesn't affect performance, but lefts the issue with delaying (issue22167).

--
dependencies: +File descriptor leaks in os.scandir()
Added file: http://bugs.python.org/file41481/glob_scandir_2.patch

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25596] regular files handled as directories in the glob module

2016-01-03 Thread Xavier de Gaye

Xavier de Gaye added the comment:

I may be missing something, anyway here are few comments on the patch:
Is the call to entry.is_symlink() in _iterdir() necessary since is_dir() 
follows symlinks ?
If _iterdir() would yield the DirEntry instance instead of DirEntry.name, then 
_rlistdir() could use x.is_dir() to know whether it is correct to iterate over 
_rlistdir(x.path, dironly)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25596] regular files handled as directories in the glob module

2016-01-03 Thread Xavier de Gaye

Xavier de Gaye added the comment:

os.scandir() is called recursively in the last patch and the file descriptors 
are not closed until returning from the recursion.
The glob functions should fail explicitly when scandir() raises OSERROR with 
posix errno set to EMFILE (The process has too many files open), or even 
better, silently ignore only the PermissionError exception instead of ignoring 
OSERROR.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25596] regular files handled as directories in the glob module

2016-01-03 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

> os.scandir() is called recursively in the last patch and the file descriptors 
> are not closed until returning from the recursion.

No, os.scandir() is called non-recursively in _iterdir(), and since _iterdir() 
results are always accumulated in a list, a recursion starts only after 
exhausting the os.scandir() iterator and closing the file descriptor. We need 
first to resolve issue25994 to close the file descriptor explicitly.

> The glob functions should fail explicitly when scandir() raises OSERROR with 
> posix errno set to EMFILE (The process has too many files open), or even 
> better, silently ignore only the PermissionError exception instead of 
> ignoring OSERROR.

Patched code passes existing tests. Do you have additional tests?

> Is the call to entry.is_symlink() in _iterdir() necessary since is_dir() 
> follows symlinks ?

Ah, I thought is_dir() doesn't follow symlinks. Yes, now the call to 
entry.is_symlink() is not necessary.

> If _iterdir() would yield the DirEntry instance instead of DirEntry.name, 
> then _rlistdir() could use x.is_dir() to know whether it is correct to 
> iterate over _rlistdir(x.path, dironly)

Yes, but this can complicate the rest of the code. _rlistdir() is called with 
dironly=False only when the pattern ends with '**'. I'm not sure this is enough 
important case for optimization. In most cases '**' is used in the middle of 
the pattern, and all names yielded by _iterdir() are directory names (or broken 
symlinks).

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25596] regular files handled as directories in the glob module

2016-01-02 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
dependencies: +Avoid using private function glob.glob1() in msi module and tools

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25596] regular files handled as directories in the glob module

2016-01-02 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

In general the patch LGTM. It allows to speed up glob by 5-10% in warn test. 
But much more gain we can achieve by using os.scandir(). Here are results of 
microbenchmarks:

$ ./python -m timeit -s "from glob import glob" -- "glob('**/*', 
recursive=True)"
Unpatched:201 msec per loop
Using isdir():181 msec per loop
Using scandir():  65.2 msec per loop

$ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/*', 
recursive=True)"
Unpatched:2.06 sec per loop
Using isdir():1.92 sec per loop
Using scandir():  820 msec per loop

$ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/', 
recursive=True)"
Unpatched:1.77 sec per loop
Using isdir():1.61 sec per loop
Using scandir():  431 msec per loop

Yet one benefit is that iglob() now yields path names without the delay for 
reading the full content of a directory (see issue22167).

--
nosy: +benhoyt, haypo
priority: low -> normal
Added file: http://bugs.python.org/file41474/glob_scandir.patch

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25596] regular files handled as directories in the glob module

2015-11-10 Thread Xavier de Gaye

Xavier de Gaye added the comment:

This second patch rewrites the conditionals that decide whether to call 
_iglob() recursively, in a more natural way and without changing the behavior 
from the first patch.
Note that when 'dirname == pathname', then basename is empty and glob2() or 
glob1() are not called.

The patch also refactors the code with a new _listdir() function.

--
Added file: http://bugs.python.org/file40999/glob_isdir_2.patch

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25596] regular files handled as directories in the glob module

2015-11-10 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
assignee:  -> serhiy.storchaka
nosy: +serhiy.storchaka
priority: normal -> low
stage:  -> patch review

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25596] regular files handled as directories in the glob module

2015-11-10 Thread Xavier de Gaye

New submission from Xavier de Gaye:

The glob module happily joins names of regular files together with 
os.path.join() or attempts to list the files contained into a regular file 
(sic). The same 'except os.error' statement is used to handle both these cases 
and the case of a non readable directory.

The attached patch makes the code more correct and easier to understand.

--
components: Library (Lib)
files: glob_isdir.patch
keywords: patch
messages: 254440
nosy: xdegaye
priority: normal
severity: normal
status: open
title: regular files handled as directories in the glob module
type: enhancement
versions: Python 3.6
Added file: http://bugs.python.org/file40996/glob_isdir.patch

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com