[issue22167] iglob() has misleading documentation (does indeed store names internally)

2021-06-21 Thread Andrei Kulakov


Andrei Kulakov  added the comment:

I have put up a PR here: https://github.com/python/cpython/pull/25767

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22167] iglob() has misleading documentation (does indeed store names internally)

2021-04-30 Thread Andrei Kulakov


Change by Andrei Kulakov :


--
keywords: +patch
nosy: +andrei.avk
nosy_count: 8.0 -> 9.0
pull_requests: +24460
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/25767

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22167] iglob() has misleading documentation (does indeed store names internally)

2020-06-16 Thread Guido van Rossum


Guido van Rossum  added the comment:

Sounds good.

FWIW, and totally off-topic, I find it annoying that pathlib's .glob() method 
supports ** in patterns, but its cousing .match() does not:

>>> p = pathlib.Path("Lib/test/support/os_helper.py")
>>> p.match("Lib/**/*.py")
False
>>>

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22167] iglob() has misleading documentation (does indeed store names internally)

2020-06-16 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

I am going to add the issue38144 feature first. Then maybe implement a dir_fd 
based optimization.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22167] iglob() has misleading documentation (does indeed store names internally)

2020-06-15 Thread Guido van Rossum


Guido van Rossum  added the comment:

How's this going?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22167] iglob() has misleading documentation (does indeed store names internally)

2020-06-01 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

Brilliant idea! I played with it yesterday, and it is easy to generalize it to 
work with a*/b*/c*/d/e/f and to "use not more than N simultaneously opened file 
descriptors per glob iterator". The only problem is if we want to use this idea 
with recursive "**" without losing performance. It is just a technical problem 
which can be solved by adding more code. Although I need to do some benchmarks.

And I need to test how this idea will work with issue38144.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22167] iglob() has misleading documentation (does indeed store names internally)

2020-05-31 Thread Guido van Rossum


Guido van Rossum  added the comment:

I hope some volunteer will submit a doc PR.

In the meantime, throwing out one more idea: perhaps my first idea, to make
_glob1() an iterator, could work if we only do it for leaves of the
pattern, so for the a*/b*/c* example, only for the c* part.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22167] iglob() has misleading documentation (does indeed store names internally)

2020-05-31 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

This is an interesting idea, but I afraid it will complicate the code too much, 
destroying the remnants of the initial elegant design. I am going to try to 
experiment with this idea after implementing other features, so it will not 
block them.

For now we need to document that iglob() may collect some names internally 
before starting to emit them. If my English be better I would did it myself.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22167] iglob() has misleading documentation (does indeed store names internally)

2020-05-31 Thread Guido van Rossum


Guido van Rossum  added the comment:

Hm, yeah.

Perhaps we can add some buffering so that for directories with a small
number of files (say < 1000) the FD is closed before recursing into it,
while for large directories it keeps the FD open until the last block of
1000 DirEntries is read? It's unlikely that someone has a tree that's both
deep *and* wide at many levels, since that would mean an inordinate number
of files indeed.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22167] iglob() has misleading documentation (does indeed store names internally)

2020-05-31 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

Yes, for the pattern 'a*/b*/c*' you will have an open file descriptor for every 
component with metacharacters:

for a in scandir('.'):
if fnmatch(a.name, 'a*'):
for b in scandir(a.path):
if fnmatch(b.name, 'b*'):
for c in scandir(b.path):
if fnmatch(c.name, 'c*'):
yield c.path

You can have hundreds nested directories. Looks not bad, because by default the 
limit on the number of file descriptors is 1024 on Linux. But imagine you run a 
server and it handles tens requests simultaneously. Some of them or all of them 
will fail, and not just return an error, but return an incorrect result, 
because all OSError, including "Too many open files", are silenced in glob().

Also all these file descriptors will not be closed until you finish the 
iteration, or, in case of error, until the garbage collector close them 
(because interrupted generators tend to create reference loops).

So it is vital to close the file descriptor before you open other file 
descriptors in the recursion.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22167] iglob() has misleading documentation (does indeed store names internally)

2020-05-31 Thread Guido van Rossum


Guido van Rossum  added the comment:

Serhiy, what do you mean by "otherwise we could run out of file descriptiors"? 
I looked a bit at the code and there are different kinds of algorithms involved 
for different forms of patterns, and the code also takes vastly different paths 
for recursive matches.

I found one bit of code that looked like it *could* be improved, with some 
effort: _glob1(). This constructs a list of all files in one directory and then 
filters then. It looks like this could be a problem if there are e.g. 100_000 
files in one directory. To fix, we could implement fnmatch.ifilter() which 
would be like fnmatch.filter() but uses `yield name` instead of 
`result.append(name)`; then _glob1() could be rewritten as follows (untested):

def _glob1(dirname, pattern, dironly):
names = _iterdir(dirname, dironly))
if not _ishidden(pattern):
yield from fnmatch.ifilter(x for x in names if not _ishidden(x))
else:
yield from fnmatch.ifilter(names, pattern)

Thoughts? Would this increase the number of open file descriptors in some edge 
case?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22167] iglob() has misleading documentation (does indeed store names internally)

2020-05-31 Thread Serhiy Storchaka


Change by Serhiy Storchaka :


--
versions: +Python 3.10, Python 3.6, Python 3.7, Python 3.8, Python 3.9 -Python 
2.7

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22167] iglob() has misleading documentation (does indeed store names internally)

2018-02-17 Thread Serhiy Storchaka

Serhiy Storchaka  added the comment:

Unfortunately issue25596 didn't change anything about this issue. iglob() still 
stores names (actually DirEntry objects) of all files in a directory before 
starting yielding the first of them. Otherwise we cold exceed the limit of open 
file descriptors.

--
nosy: +serhiy.storchaka

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22167] iglob() has misleading documentation (does indeed store names internally)

2018-02-16 Thread Roger Erens

Roger Erens  added the comment:

http://bugs.python.org/issue25596 has been closed...

--
nosy: +Roger Erens

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22167] iglob() has misleading documentation (does indeed store names internally)

2016-01-11 Thread Guido van Rossum

Guido van Rossum added the comment:

Once http://bugs.python.org/issue25596 (switching glob to use scandir) is 
solved this issue can be closed IMO.

--
nosy: +gvanrossum

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22167] iglob() has misleading documentation (does indeed store names internally)

2014-08-08 Thread Roy Smith

Roy Smith added the comment:

How about something like this:

Note: The current iglob() implementation is optimized for the case of many 
files distributed in a large directory tree.  Internally, it iterates over the 
directory tree, and stores all the names from each directory at once.  This 
will lead to pathologically inefficient behavior when any individual directory 
has a large number of files in it.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22167
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22167] iglob() has misleading documentation (does indeed store names internally)

2014-08-08 Thread R. David Murray

R. David Murray added the comment:

IMO the documentation isn't *wrong*, just misleading :)

What it is saying is that *your program* doesn't have to store the full list 
returned by iglob before being able to use it (ie: iglob doesn't return a 
list).  It says nothing about what resources are used internally, other than an 
implied contract that there is *some* efficiency over calling glob; which, as 
explained above, there is.  The fact that the implementation uses lots of 
memory if any single directory is large is then a performance bug, which can 
theoretically be fixed in 3.5 using scandir.

The reason iglob was introduced, if you check the revision history, is that 
glob used to call itself recursively for each sub-directory, which meant it 
held *all* of the files in *all* of the scanned tree in memory at one time.  It 
is literally true that the difference between glob and iglob is that with iglob 
your program doesn't have to store the full list of matches from all 
subdirectories, but talking about your program is not something we typically 
do in python docs, it is implied.

Perhaps in 2.7/3.4 we can mention in the module docs that at most one 
directory's worth of data will be held in memory during the globbing process, 
but it feels a little weird to document an implementation detail like that.  
Still, if someone can come up with improved wording for the docs, we can add it.

--
nosy: +r.david.murray

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22167
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22167] iglob() has misleading documentation (does indeed store names internally)

2014-08-07 Thread Roy Smith

New submission from Roy Smith:

For background, see:

https://mail.python.org/pipermail/python-list/2014-August/676291.html

In a nutshell, the iglob() docs say, Return an iterator which yields the same 
values as glob() without actually storing them all simultaneously.  The 
problem is, internally, it calls os.listdir(), which apparently *does* store 
the entire list internally, defeating the whole purpose of iglob()

I recognize that iglob() is not going to get fixed in 2.7, but at least the 
documentation should be updated to point out that it doesn't really do what it 
says it does.  Or rather, it doesn't really not do what it says it doesn't :-)

--
assignee: docs@python
components: Documentation
messages: 225048
nosy: docs@python, roysmith
priority: normal
severity: normal
status: open
title: iglob() has misleading documentation (does indeed store names internally)
type: enhancement
versions: Python 2.7

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22167
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22167] iglob() has misleading documentation (does indeed store names internally)

2014-08-07 Thread Tim Chase

Changes by Tim Chase python.b...@tim.thechases.com:


--
nosy: +Gumnos

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22167
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22167] iglob() has misleading documentation (does indeed store names internally)

2014-08-07 Thread Steven D'Aprano

Steven D'Aprano added the comment:

I agree that the documentation could be improved, but it's not really *wrong*. 
Consider a glob like spam/[abc]/*.txt. What iglob does is conceptually closer 
to:

(1) generate the list of files matching spam/a/*.txt and yield them;
(2) generate the list of files matching spam/b/*.txt and yield them;
(3) generate the list of files matching spam/c/*.txt and yield them

rather than:

(1) generate the list of files matching spam/a/*.txt;
(2) append the files matching spam/b/*.txt;
(3) append the files matching spam/c/*.txt;
(4) finally yield them

(see the source code here: http://hg.python.org/cpython/file/3.4/Lib/glob.py ). 
I think the documentation is trying to say that iglob doesn't *always* store 
all the matching files, without implying that it *never* stores all the 
matching files. I can't think of a clean way to explain that, so a doc patch is 
welcome.

--
nosy: +steven.daprano

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22167
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22167] iglob() has misleading documentation (does indeed store names internally)

2014-08-07 Thread Roy Smith

Roy Smith added the comment:

The thread that led to this started out with the use case of a directory that 
had 200k files in it.  If I ran iglob() on that and discovered that it had 
internally generated a list of all 200k names in memory at the same time, I 
would be pretty darn surprised, based on what the docs say now.

We're shooting for principle of least astonishment here.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22167
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com