Re: [Numpy-discussion] Change in memmap behaviour
On Tue, Jul 3, 2012 at 4:08 PM, Nathaniel Smith wrote:
> On Tue, Jul 3, 2012 at 10:35 AM, Thouis (Ray) Jones wrote:
>> [snip]
>>
>> I put this in a github repo and added tests (author credit to Sveinung):
>> https://github.com/thouis/numpy/tree/mmap_children
>>
>> I'm not sure which branch to issue a PR against, though.
>
> Looks good to me, thanks to both of you!
>
> Obviously it should be merged to master; beyond that I'm not sure. We
> definitely want it in 1.7, but I'm not sure if that's been branched
> yet or not. (Or rather, it has been branched, but then maybe it was
> unbranched again? Travis?) Since it was a 1.6 regression, it'd make
> sense to cherry-pick to the 1.6 branch too, just in case it gets
> another release.

Merged into master and maintenance/1.6.x, but not maintenance/1.7.x; I'll let Ondrej or Travis figure that out...

-N
Re: [Numpy-discussion] Change in memmap behaviour
On Tue, Jul 3, 2012 at 10:35 AM, Thouis (Ray) Jones wrote:
> [snip]
>
> I put this in a github repo and added tests (author credit to Sveinung):
> https://github.com/thouis/numpy/tree/mmap_children
>
> I'm not sure which branch to issue a PR against, though.

Looks good to me, thanks to both of you!

Obviously it should be merged to master; beyond that I'm not sure. We definitely want it in 1.7, but I'm not sure if that's been branched yet or not. (Or rather, it has been branched, but then maybe it was unbranched again? Travis?) Since it was a 1.6 regression, it'd make sense to cherry-pick to the 1.6 branch too, just in case it gets another release.

-n
Re: [Numpy-discussion] Change in memmap behaviour
On Mon, Jul 2, 2012 at 11:52 PM, Sveinung Gundersen wrote:
> [snip]
>
> I have now tried to add a patch, in the way you proposed, but I may have
> gotten it wrong:
>
> http://projects.scipy.org/numpy/ticket/2179

I put this in a github repo and added tests (author credit to Sveinung):
https://github.com/thouis/numpy/tree/mmap_children

I'm not sure which branch to issue a PR against, though.
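(For reference, a regression test along these lines might look roughly like the following -- a minimal sketch assuming the patched behaviour; the real tests live in the branch above, and the function name here is hypothetical:

import os
import tempfile

import numpy as np


def test_mmap_reference_not_propagated():
    fd, fname = tempfile.mkstemp()
    os.close(fd)
    try:
        a = np.memmap(fname, dtype='int32', mode='w+', shape=20)
        a[:] = 2
        # Slices share memory with the parent, so they must keep the
        # mapping alive:
        assert a[0:2]._mmap is a._mmap
        # Results that own their own memory should not hold a reference
        # to the mapping (the patched behaviour):
        assert a.sum()._mmap is None
        assert (a + 10)._mmap is None
        assert a[[1, 2, 3]]._mmap is None
    finally:
        del a
        os.remove(fname)

)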
Re: [Numpy-discussion] Change in memmap behaviour
On 2. juli 2012, at 22.40, Nathaniel Smith wrote:
> Your memory measurement tools are misleading you. The same memory is
> resident in both cases; in one case your tools say it is operating
> system disk cache (and not attributed to your app), while in the other
> case that same memory, treated in the same way by the OS, is shown as
> part of your app's resident memory. Virtual memory is confusing...

But the crucial difference is perhaps that the disk cache can be cleared by the OS if needed, whereas application memory cannot be reclaimed in the same way and must instead be swapped to disk? Or am I still confused?

(snip)

>> Great! Any idea on whether such a patch may be included in 1.7?
>
> Not really; if I or you or someone else gets inspired to take the time
> to write a patch soon then it will be, otherwise not...
>
> -N

I have now tried to add a patch, in the way you proposed, but I may have gotten it wrong:

http://projects.scipy.org/numpy/ticket/2179

Sveinung
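(For readers following along, the change Nathaniel proposed would look roughly like this -- a minimal sketch against 1.6-era numpy internals, not the actual contents of the ticket:

import numpy as np

# Sketch of the proposed fix to numpy.core.memmap: only propagate
# ._mmap to results that really share memory with the parent array.
class memmap(np.ndarray):

    def __array_finalize__(self, obj):
        if (obj is not None and hasattr(obj, '_mmap')
                and np.may_share_memory(self, obj)):
            # self is a view into the mapped buffer: keep the mmap
            # alive so the underlying memory stays valid.
            self._mmap = obj._mmap
        else:
            # self owns its own memory (arithmetic, reductions, fancy
            # indexing): drop the reference so the mapping can be
            # garbage collected.
            self._mmap = None

)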
Re: [Numpy-discussion] Change in memmap behaviour
On Mon, Jul 2, 2012 at 6:54 PM, Sveinung Gundersen wrote:
> [snip]
>
> As I understand it, memmap objects load the contents of the mapped
> file into memory lazily, as they are first read. Thus, when reading a
> slice of a 24GB file, only that part resides in memory.
>
> [snip]
>
> The memory usage of this code on a 24GB file (one value for each
> nucleotide in the human DNA!) is 23g resident memory after the loop is
> finished (not 24g for some reason..).
>
> Running the same code on 1.5.1rc1 gives a resident memory of 23m after
> the loop.

Your memory measurement tools are misleading you. The same memory is resident in both cases; in one case your tools say it is operating system disk cache (and not attributed to your app), while in the other case that same memory, treated in the same way by the OS, is shown as part of your app's resident memory. Virtual memory is confusing...

> Great! Any idea on whether such a patch may be included in 1.7?

Not really; if I or you or someone else gets inspired to take the time to write a patch soon then it will be, otherwise not...

-N
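(One way to see the distinction Nathaniel is drawing, at least on Linux, is to read the process's resident set directly rather than trusting a process monitor's summary -- a quick sketch, Linux-specific since it reads /proc:

def vm_rss_kb():
    # Resident set size of the current process, in kB. Pages of a
    # mapped file that the process no longer references live on in the
    # kernel page cache, which is reported system-wide (e.g. the
    # "cached" column of free) rather than against any process.
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])

print('resident set size: %s kB' % vm_rss_kb())

)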
Re: [Numpy-discussion] Change in memmap behaviour
[snip]

> Your actual memory usage may not have increased as much as you think,
> since memmap objects don't necessarily take much memory -- it sounds
> like you're leaking virtual memory, but your resident set size
> shouldn't go up as much.

As I understand it, memmap objects load the contents of the mapped file into memory lazily, as they are first read. Thus, when reading a slice of a 24GB file, only that part resides in memory. Our system reads a slice of a memmap, calculates something (say, the sum), and then deletes the memmap. It then loops through consecutive slices this way, retaining a low memory usage. Consider the following code:

import numpy as np
res = []
vecLen = 3095677412
for i in xrange(vecLen/10**8+1):
    x = i * 10**8
    y = min((i+1) * 10**8, vecLen)
    res.append(np.memmap('val.float64', dtype='float64')[x:y].sum())

The memory usage of this code on a 24GB file (one value for each nucleotide in the human DNA!) is 23g resident memory after the loop is finished (not 24g for some reason..).

Running the same code on 1.5.1rc1 gives a resident memory of 23m after the loop.

> That said, this is clearly a bug, and it's even worse than you mention
> -- *all* operations on memmap arrays are holding onto references to
> the original mmap object, regardless of whether they share any memory:
>
> [snip]
>
> In the short term, the numpy-upstream fix is to change
> numpy.core.memmap:memmap.__array_finalize__ so that it only copies
> over the ._mmap attribute of its parent if np.may_share_memory(self,
> parent) is True. Patches gratefully accepted ;-)

Great! Any idea on whether such a patch may be included in 1.7?

> In the meantime, you have a few options for hacky workarounds. You
> could monkeypatch the above fix into the memmap class. You could
> manually assign None to the _mmap attribute of offending arrays (being
> careful only to do this to arrays where you know it is safe!). And for
> reduction operations like sum() in particular, what you have right now
> is not actually a scalar object -- it is a 0-dimensional array that
> holds a single scalar. You can pull this scalar out by calling .item()
> on the array and then throw away the array itself -- the scalar won't
> have any _mmap attribute.
>
> def scalarify(scalar_or_0d_array):
>     if isinstance(scalar_or_0d_array, np.ndarray):
>         return scalar_or_0d_array.item()
>     else:
>         return scalar_or_0d_array
>
> # works on both numpy 1.5 and numpy 1.6:
> total = scalarify(a.sum())

Thank you for this! However, such a solution would have to be scattered throughout the code (probably over 100 places), and I would rather not do that. I guess the abovementioned patch would be the best solution. I do not have experience with the numpy core code, so I am also eagerly awaiting such a patch!

Sveinung

--
Sveinung Gundersen
PhD Student, Bioinformatics, Dept. of Tumor Biology, Inst. for Cancer Research, The Norwegian Radium Hospital, Montebello, 0310 Oslo, Norway
E-mail: sveinung.gunder...@medisin.uio.no, Phone: +47 93 00 94 54
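(Since the fix is a small policy change, the monkeypatch route Nathaniel mentioned can at least be applied in a single place rather than at every call site -- a sketch, assuming 1.6-era memmap internals:

import numpy as np

_orig_finalize = np.memmap.__array_finalize__

def _patched_finalize(self, obj):
    _orig_finalize(self, obj)
    # Undo the over-eager propagation: keep ._mmap only on results
    # that can actually share memory with their parent.
    if getattr(self, '_mmap', None) is not None:
        if obj is None or not np.may_share_memory(self, obj):
            self._mmap = None

# Apply once at program startup:
np.memmap.__array_finalize__ = _patched_finalize

)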
Re: [Numpy-discussion] Change in memmap behaviour
On Mon, Jul 2, 2012 at 3:53 PM, Sveinung Gundersen wrote:
> [snip]
>
> The problem is that calculations on memmap objects with scalar results
> previously returned a numpy scalar, with no reference to the memmap
> object. We could then just keep the result and mark the memmap for
> garbage collection. Now, the memory usage of the system has increased
> dramatically, as we no longer have this option.

Your actual memory usage may not have increased as much as you think, since memmap objects don't necessarily take much memory -- it sounds like you're leaking virtual memory, but your resident set size shouldn't go up as much.

That said, this is clearly a bug, and it's even worse than you mention -- *all* operations on memmap arrays are holding onto references to the original mmap object, regardless of whether they share any memory:

>>> a = np.memmap("/etc/passwd", np.uint8, "r")

# arithmetic
>>> (a + 10)._mmap is a._mmap
True

# fancy indexing (doesn't return a view!)
>>> a[[1, 2, 3]]._mmap is a._mmap
True

>>> a.sum()._mmap is a._mmap
True

Really, only slicing should be returning a np.memmap object at all. Unfortunately, it is currently impossible to create an ndarray subclass that returns base-class ndarrays from any operations -- __array_finalize__() has no way to do this. And this is the third ndarray subclass in a row that I've looked at that wanted to be able to do this, so I guess maybe it's something we should implement...

In the short term, the numpy-upstream fix is to change numpy.core.memmap:memmap.__array_finalize__ so that it only copies over the ._mmap attribute of its parent if np.may_share_memory(self, parent) is True. Patches gratefully accepted ;-)

In the meantime, you have a few options for hacky workarounds. You could monkeypatch the above fix into the memmap class. You could manually assign None to the _mmap attribute of offending arrays (being careful only to do this to arrays where you know it is safe!). And for reduction operations like sum() in particular, what you have right now is not actually a scalar object -- it is a 0-dimensional array that holds a single scalar. You can pull this scalar out by calling .item() on the array and then throw away the array itself -- the scalar won't have any _mmap attribute.

def scalarify(scalar_or_0d_array):
    if isinstance(scalar_or_0d_array, np.ndarray):
        return scalar_or_0d_array.item()
    else:
        return scalar_or_0d_array

# works on both numpy 1.5 and numpy 1.6:
total = scalarify(a.sum())

-N
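(Applied to the slice-summing loop from earlier in the thread, the .item() workaround would look something like this -- a sketch; 'val.float64' and vecLen are taken from Sveinung's example:

import numpy as np

def scalarify(scalar_or_0d_array):
    if isinstance(scalar_or_0d_array, np.ndarray):
        return scalar_or_0d_array.item()
    else:
        return scalar_or_0d_array

res = []
vecLen = 3095677412
for i in xrange(vecLen / 10**8 + 1):
    x = i * 10**8
    y = min((i + 1) * 10**8, vecLen)
    chunk = np.memmap('val.float64', dtype='float64')[x:y]
    # res now holds plain Python scalars, so nothing keeps the mapping
    # alive once chunk goes out of scope.
    res.append(scalarify(chunk.sum()))
    del chunk

)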
[Numpy-discussion] Change in memmap behaviour
Hi,

We are developing a large project for genome analysis (http://hyperbrowser.uio.no), where we use memmap vectors as the basic data structure for storage. The stored data are accessed in slices and used as the basis for calculations. As the stored data may be large (up to 24 GB), the memory footprint is important.

We experienced a problem with 64-bit addressing in the concatenate function (using the quite old numpy version 1.5.1rc1), and have thus updated to numpy 1.7.0.dev-651ef74, where that problem has been fixed. We have, however, run into another problem, connected to a change in memmap behaviour. This change seems to have come with the 1.6 release.

Before (1.5.1rc1):

>>> import platform; print platform.python_version()
2.7.0
>>> import numpy as np
>>> np.version.version
'1.5.1rc1'
>>> a = np.memmap('testmemmap', 'int32', 'w+', shape=20)
>>> a[:] = 2
>>> a[0:2]
memmap([2, 2], dtype=int32)
>>> a[0:2]._mmap
>>> a.sum()
40
>>> a.sum()._mmap
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'numpy.int64' object has no attribute '_mmap'

After (1.6.2):

>>> import platform; print platform.python_version()
2.7.0
>>> import numpy as np
>>> np.version.version
'1.6.2'
>>> a = np.memmap('testmemmap', 'int32', 'w+', shape=20)
>>> a[:] = 2
>>> a[0:2]
memmap([2, 2], dtype=int32)
>>> a[0:2]._mmap
>>> a.sum()
memmap(40)
>>> a.sum()._mmap

The problem is that calculations on memmap objects with scalar results previously returned a numpy scalar, with no reference to the memmap object. We could then just keep the result and mark the memmap for garbage collection. Now, the memory usage of the system has increased dramatically, as we no longer have this option.

So, the question is twofold:

1) What is the reason behind this change? It makes sense to keep the reference to the mmap when slicing, but going from a scalar value back to the mmap does not seem very useful. Is there a possibility of returning to the old behaviour?

2) If not, do you have any advice on how we can retain the old behaviour without rewriting the system? We could cast the results of all functions on the memmap, but these are scattered throughout the system and would probably cause much headache. So we would rather implement a general solution, for instance wrapping the memmap object somehow. Do you have any ideas?

Connected to this is the rather puzzling fact that the 'new' memmap scalar object has an __iter__ method, but no length. Should not the __iter__ method be removed, as it wrongly signals that the object is iterable?

Before (1.5.1rc1):

>>> a[0:2].__iter__()
>>> len(a[0:2])
2
>>> a.sum().__iter__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'numpy.int64' object has no attribute '__iter__'
>>> len(a.sum())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'numpy.int64' has no len()

After (1.6.2):

>>> a[0:2].__iter__()
>>> len(a[0:2])
2
>>> a.sum().__iter__
>>> len(a.sum())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: len() of unsized object
>>> [x for x in a.sum()]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: iteration over a 0-d array

Regards,
Sveinung Gundersen
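(As a sketch of the "wrapping" idea in (2) -- hypothetical, not something proposed on the list -- a thin proxy can hand out plain ndarray views of the mapped file, so downstream calculations return ordinary numpy scalars and arrays that carry no _mmap reference:

import numpy as np

class MemmapVector(object):
    def __init__(self, filename, dtype):
        self._mm = np.memmap(filename, dtype=dtype, mode='r')

    def __len__(self):
        return len(self._mm)

    def __getitem__(self, index):
        # The ndarray view's .base chain keeps the mapping alive while
        # the slice is in use; once the view is dropped, the mmap can
        # be garbage collected.
        return self._mm[index].view(np.ndarray)

# Usage: behaves like the old 1.5 memmap for this access pattern.
# v = MemmapVector('val.float64', 'float64')
# total = v[0:10**8].sum()   # a numpy scalar, no _mmap attribute

)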