On Wed, Feb 29, 2012 at 3:11 PM, Erin Sheldon <erin.shel...@gmail.com> wrote:
> Excerpts from Nathaniel Smith's message of Tue Feb 28 17:22:16 -0500 2012:
>> > Even for binary, there are pathological cases, e.g. 1) reading a random
>> > subset of nearly all rows. 2) reading a single column when rows are
>> > small. In case 2 you will only go this route in the first place if you
>> > need to save memory. The user should be aware of these issues.
>>
>> FWIW, this route actually doesn't save any memory as compared to np.memmap.
>
> Actually, for numpy.memmap you will read the whole file if you try to
> grab a single column and read a large fraction of the rows. Here is an
> example that will end up pulling the entire file into memory:
>
>     mm = numpy.memmap(fname, dtype=dtype)
>     rows = numpy.arange(mm.size)
>     x = mm['x'][rows]
>
> I just tested this on a 3G binary file and I'm sitting at 3G memory
> usage. I believe this is because numpy.memmap only understands rows. I
> don't fully understand the reason for that, but I suspect it is related
> to the fact that the ndarray really only has a concept of itemsize, and
> the fields are really just a reinterpretation of those bytes. It may be
> that one could tweak the ndarray code to get around this. But I would
> appreciate enlightenment on this subject.
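[For concreteness, here is a miniature, runnable version of the quoted example; the two-field dtype, the field names 'x' and 'y', and the record count are made up for illustration:]

```python
import os
import tempfile
import numpy as np

# Hypothetical structured layout standing in for the 3G file: records of
# two float64 fields, so each record is 16 bytes on disk.
dtype = np.dtype([('x', 'f8'), ('y', 'f8')])
path = os.path.join(tempfile.mkdtemp(), 'data.bin')
np.zeros(1000, dtype=dtype).tofile(path)

# memmap the whole file, then fancy-index a single column, as above.
mm = np.memmap(path, dtype=dtype, mode='r')
rows = np.arange(mm.size)
x = mm['x'][rows]

# Because the 'x' values are interleaved with 'y' on disk, touching one
# value from (nearly) every record faults in every page of the file --
# the OS ends up reading the whole thing, even though the result is
# only half the file's size.
print(x.shape)  # (1000,)
```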
Ahh, that makes sense. But the tool you are using to measure memory usage is misleading you -- you haven't mentioned what platform you're on, but AFAICT none of them have very good tools for describing memory usage when mmap is in use. (There isn't a very good way to handle it.)

What's happening is this: numpy read out just that column from the mmap'ed memory region. The OS saw this and decided to read the entire file, for the reasons discussed previously. Then, since it had read the entire file, it decided to keep it around in memory for now, just in case some program wanted it again in the near future.

Now, if you instead fetched just those bytes from the file using seek+read or whatever, the OS would treat that request in exactly the same way: it would still read the entire file, and it would still keep the whole thing around in memory. On Linux, you can test this by dropping the caches (echo 1 > /proc/sys/vm/drop_caches), checking how much memory is listed as "free" in top, and then using your code to read the same file -- you'll see that the "free" memory drops by 3 gigabytes, and the "buffers" or "cached" numbers grow by 3 gigabytes.

[Note: if you try this experiment, make sure you don't have the same file opened with np.memmap -- for some reason Linux seems to ignore the drop_caches request for files that are mmap'ed.]

The difference between mmap and read is that with mmap, this cache memory is "counted against" your process's resident set size. The same memory is used either way -- it's just reported differently by your tool. And in fact, this memory is not really "used" at all, in the way we usually mean that term -- it's just a cache that the OS keeps, and it will immediately throw it away if there's a better use for that memory. The only reason it loads the whole 3 gigabytes into memory in the first place is that you have >3 gigabytes of memory to spare.
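[To make the seek+read route concrete, here is a sketch that pulls a single column out of a binary record file without mmap; the dtype, field names, and file layout are the same made-up ones as before, not anything from Erin's actual code:]

```python
import os
import tempfile
import numpy as np

# Hypothetical record layout: ('x', 'y') pairs of float64, 16 bytes each.
dtype = np.dtype([('x', 'f8'), ('y', 'f8')])
path = os.path.join(tempfile.mkdtemp(), 'data.bin')
arr = np.zeros(1000, dtype=dtype)
arr['x'] = np.arange(1000)
arr.tofile(path)

itemsize = dtype.itemsize            # 16 bytes per record
x_offset = dtype.fields['x'][1]      # byte offset of 'x' within a record (0)

# seek+read each 'x' value individually.  No mmap involved, but the OS
# still services these requests through the same page cache, reading
# whole pages/readahead windows underneath -- it just isn't charged to
# this process's resident set size.
out = np.empty(1000, dtype='f8')
with open(path, 'rb') as f:
    for i in range(1000):
        f.seek(i * itemsize + x_offset)
        out[i] = np.frombuffer(f.read(8), dtype='f8')[0]

print(out[:3])  # [0. 1. 2.]
```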
You might even be able to tell the OS that you *won't* be reading that file again, so there's no point in keeping it all cached -- on Unix this is done via the madvise() or posix_fadvise() syscalls. (No guarantee the OS will actually listen, though.)

> This fact was the original motivator for writing my code; the text
> reading ability came later.
>
>> Cool. I'm just a little concerned that, since we seem to have like...
>> 5 different implementations of this stuff all being worked on at the
>> same time, we need to get some consensus on which features actually
>> matter, so they can be melded together into the Single Best File
>> Reader Evar. An interface where indexing and file-reading are combined
>> is significantly more complicated than one where the core file-reading
>> inner-loop can ignore indexing. So far I'm not sure why this
>> complexity would be worthwhile, so that's what I'm trying to
>> understand.
>
> I think I've addressed the reason why the low level C code was written.
> And I think a unified, high level interface to binary and text files,
> which the Recfile class provides, is worthwhile.
>
> Can you please say more about "...one where the core file-reading
> inner-loop can ignore indexing"? I didn't catch the meaning.

Sure, sorry. What I mean is just that it's easier to write code that only knows how to do a dumb sequential read, and doesn't know how to seek to particular places and pick out just the fields that are being requested. And it's easier to maintain, optimize, document, add features to, and so forth. (And we can still have a high-level interface on top of it, if that's useful.) So I'm trying to understand whether there's really a compelling advantage to building seeking smarts into our low-level C code that we can't get otherwise.

-- Nathaniel

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion