On 11/16/12 6:02 PM, Jon Wilson wrote:
> Hi all,
> I am trying to find the best way to make histograms from large data
> sets.  Up to now, I've just been loading entire columns into in-memory
> numpy arrays and making histograms from those.  However, I'm currently
> working on a handful of datasets where this is prohibitively memory
> intensive (causing an out-of-memory kernel panic on a shared machine
> that takes a support ticket to get rebooted makes you a little
> gun-shy), so I am now exploring other options.
>
> I know that the Column object is rather nicely set up to act, in some
> circumstances, like a numpy ndarray.  So my first thought was to try
> creating the histogram from the Column object directly.  This is,
> however, 1000x slower than loading the column into memory and creating
> the histogram from the in-memory array.  Please see my test notebook at:
> http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html
>
> For such a small table, loading into memory is not an issue.  For larger
> tables, though, it is a problem, and I had hoped that PyTables was
> optimized so that histogramming directly from disk would proceed no
> slower than loading into memory and histogramming.  Is there some other
> way of accessing the column (or Array or CArray) data that will make
> faster histograms?

Indeed, a 1000x slowdown is quite a lot, but it is important to stress 
that you are doing a disk operation every time you access a data 
element, and that takes time.  Using Array or CArray might improve 
times a bit, but frankly, I don't think it is going to buy you much 
speed.
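
That said, if you want to stay with PyTables, you can amortize the disk 
access by reading the column in big slices and accumulating the 
histogram chunk by chunk.  Here is a rough sketch (the file name, table 
node, column name and bin range are made up; adapt them to your 
dataset):

import numpy as np
import tables

chunksize = 1000000               # rows per read; tune to your memory budget
edges = np.linspace(0., 1., 101)  # fixed bin edges so partial counts add up
counts = np.zeros(len(edges) - 1, dtype=np.int64)

f = tables.openFile('data.h5', 'r')
try:
    table = f.root.data
    for start in range(0, table.nrows, chunksize):
        # one big sequential read per chunk instead of one read per element
        stop = min(start + chunksize, table.nrows)
        chunk = table.read(start=start, stop=stop, field='x')
        hist, _ = np.histogram(chunk, bins=edges)
        counts += hist
finally:
    f.close()

This keeps memory bounded to one chunk while still doing large 
sequential reads.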

The problem here is that you have too many layers, and this makes access 
slower.  You may have better luck with carray 
(https://github.com/FrancescAlted/carray), which supports this sort of 
operation but uses a much simpler persistence machinery.  At any rate, 
the results are far better than with PyTables:

In [6]: import numpy as np

In [7]: import carray as ca

In [8]: N = int(1e7)

In [9]: a = np.random.rand(N)

In [10]: %time h = np.histogram(a)
CPU times: user 0.55 s, sys: 0.00 s, total: 0.55 s
Wall time: 0.55 s

In [11]: ad = ca.carray(a, rootdir='/tmp/a.carray')

In [12]: %time h = np.histogram(ad)
CPU times: user 5.72 s, sys: 0.07 s, total: 5.79 s
Wall time: 5.81 s

So the overhead of using a disk-based array is just 10x (not 1000x as 
with PyTables).  I don't know whether a 10x slowdown is acceptable to 
you, but if you need more speed, you could implement the histogram as a 
method of the carray class in Cython:

https://github.com/FrancescAlted/carray/blob/master/carray/carrayExtension.pyx#L651

It should not be too difficult to come up with an efficient 
implementation using a chunk-based approach; see the sketch below.
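
For reference, the same chunk-based idea can be prototyped in pure 
Python before moving it to Cython, by slicing the carray in blocks 
(the block size and bin range below are just assumptions):

import numpy as np

def carray_histogram(cdata, edges, blocksize=1000000):
    # Histogram a (possibly disk-based) carray block by block.
    # Fixed bin edges let the per-block counts be summed exactly.
    counts = np.zeros(len(edges) - 1, dtype=np.int64)
    for start in range(0, len(cdata), blocksize):
        block = cdata[start:start + blocksize]  # decompresses one slice
        hist, _ = np.histogram(block, bins=edges)
        counts += hist
    return counts

# e.g., with the `ad` array created above:
# counts = carray_histogram(ad, np.linspace(0., 1., 11))

A Cython version would follow the same loop, but could work directly on 
the decompressed chunks and avoid the slicing overhead.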

-- 
Francesc Alted

