On Wed, Feb 29, 2012 at 7:57 PM, Erin Sheldon <erin.shel...@gmail.com> wrote:
> Excerpts from Nathaniel Smith's message of Wed Feb 29 13:17:53 -0500 2012:
> > On Wed, Feb 29, 2012 at 3:11 PM, Erin Sheldon <erin.shel...@gmail.com> wrote:
> > > Excerpts from Nathaniel Smith's message of Tue Feb 28 17:22:16 -0500 2012:
> > >> > Even for binary, there are pathological cases, e.g. 1) reading a random
> > >> > subset of nearly all rows. 2) reading a single column when rows are
> > >> > small. In case 2 you will only go this route in the first place if you
> > >> > need to save memory. The user should be aware of these issues.
> > >>
> > >> FWIW, this route actually doesn't save any memory as compared to np.memmap.
> > >
> > > Actually, for numpy.memmap you will read the whole file if you try to
> > > grab a single column and read a large fraction of the rows. Here is an
> > > example that will end up pulling the entire file into memory
> > >
> > >     mm=numpy.memmap(fname, dtype=dtype)
> > >     rows=numpy.arange(mm.size)
> > >     x=mm['x'][rows]
> > >
> > > I just tested this on a 3G binary file and I'm sitting at 3G memory
> > > usage. I believe this is because numpy.memmap only understands rows. I
> > > don't fully understand the reason for that, but I suspect it is related
> > > to the fact that the ndarray really only has a concept of itemsize, and
> > > the fields are really just a reinterpretation of those bytes. It may be
> > > that one could tweak the ndarray code to get around this. But I would
> > > appreciate enlightenment on this subject.
> >
> > Ahh, that makes sense. But, the tool you are using to measure memory
> > usage is misleading you -- you haven't mentioned what platform you're
> > on, but AFAICT none of them have very good tools for describing memory
> > usage when mmap is in use. (There isn't a very good way to handle it.)
> >
> > What's happening is this: numpy read out just that column from the
> > mmap'ed memory region. The OS saw this and decided to read the entire
> > file, for reasons discussed previously.
> > Then, since it had read the
> > entire file, it decided to keep it around in memory for now, just in
> > case some program wanted it again in the near future.
> >
> > Now, if you instead fetched just those bytes from the file using
> > seek+read or whatever, the OS would treat that request in the exact
> > same way: it'd still read the entire file, and it would still keep the
> > whole thing around in memory. On Linux, you could test this by
> > dropping caches (echo 1 > /proc/sys/vm/drop_caches), checking how much
> > memory is listed as "free" in top, and then using your code to read
> > the same file -- you'll see that the 'free' memory drops by 3
> > gigabytes, and the 'buffers' or 'cached' numbers will grow by 3
> > gigabytes.
> >
> > [Note: if you try this experiment, make sure that you don't have the
> > same file opened with np.memmap -- for some reason Linux seems to
> > ignore the request to drop_caches for files that are mmap'ed.]
> >
> > The difference between mmap and reading is that in the former case,
> > this cache memory will be "counted against" your process's
> > resident set size. The same memory is used either way -- it's just
> > that it gets reported differently by your tool. And in fact, this
> > memory is not really "used" at all, in the way we usually mean that
> > term -- it's just a cache that the OS keeps, and it will immediately
> > throw it away if there's a better use for that memory. The only reason
> > it's loading the whole 3 gigabytes into memory in the first place is
> > that you have >3 gigabytes of memory to spare.
> >
> > You might even be able to tell the OS that you *won't* be reading that
> > file again, so there's no point in keeping it all cached -- on Unix
> > this is done via the madvise() or posix_fadvise() syscalls. (No
> > guarantee the OS will actually listen, though.)
>
> This is interesting, and on my machine I think I've verified that what
> you say is actually true.
>
> This all makes theoretical sense, but goes against some experiments I
> and my colleagues have done. For example, a colleague of mine was able
> to read a couple of large files in using my code but not using memmap.
> The combined files were greater than memory size. With memmap the code
> started swapping. This was on 32-bit OSX. But as I said, I just tested
> this on my linux box and it works fine with numpy.memmap. I don't have
> an OSX box to test this.
>

I've seen this on OS X too. Here's another example on Linux:
http://thread.gmane.org/gmane.comp.python.numeric.general/43965. Using
tcmalloc was reported by a couple of people to solve that particular issue.

Ralf
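
As a rough illustration of the two access routes discussed in this thread,
the sketch below reads one field of a record file first through numpy.memmap
and then through explicit seek+read, and afterwards passes the
posix_fadvise() hint mentioned in the quoted text. The record layout, the
file name, and all helper names (write_demo_file, read_column_memmap,
read_column_seek, advise_dontneed, demo) are made up for this example and are
not taken from Erin's code. os.posix_fadvise is only available on some Unix
platforms, and the kernel is free to ignore the hint.

```python
import os
import tempfile

import numpy as np

# Hypothetical record layout; the thread does not give the real dtype.
dtype = np.dtype([("x", "f8"), ("y", "f8"), ("id", "i4")])


def write_demo_file(path, nrows):
    """Write nrows fixed-size records so there is something to read back."""
    data = np.zeros(nrows, dtype=dtype)
    data["x"] = np.arange(nrows, dtype="f8")
    data.tofile(path)


def read_column_memmap(path, field):
    """Copy one field out of a memory-mapped record array."""
    mm = np.memmap(path, dtype=dtype, mode="r")
    # Fields are interleaved on disk, so copying one field still touches
    # pages spread across the whole file.
    col = np.array(mm[field])
    del mm  # drop our reference to the mapping
    return col


def read_column_seek(path, field):
    """Copy one field with explicit seek+read, one record at a time."""
    ftype, offset = dtype.fields[field][:2]
    nrows = os.path.getsize(path) // dtype.itemsize
    out = np.empty(nrows, dtype=ftype)
    with open(path, "rb") as f:
        for i in range(nrows):
            f.seek(i * dtype.itemsize + offset)
            out[i] = np.frombuffer(f.read(ftype.itemsize), dtype=ftype)[0]
    return out


def advise_dontneed(path):
    """Hint that cached pages of this file are no longer needed.

    Returns False where os.posix_fadvise is unavailable. A True result only
    means the hint was passed on, not that the OS acted on it.
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0, len=0 covers the whole file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return True


def demo(nrows=1000):
    """Run both readers on a small temporary file and return their results."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.bin")
        write_demo_file(path, nrows)
        via_mmap = read_column_memmap(path, "x")
        via_seek = read_column_seek(path, "x")
        advised = advise_dontneed(path)
    return via_mmap, via_seek, advised
```

Going by the explanation quoted above, both readers pull the same file pages
through the OS cache, and only the way that memory is reported differs. The
per-record seek loop is deliberately naive and slow; it is only there to make
the access pattern explicit.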
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion