Back in 2006/07 I wrote an optimized histogram function for PyTables + NumPy. The main steps were:

- Read in chunksize sections of the PyTables array so the HDF5 library just needs to decompress full blocks of data from disk into memory; this eliminates subsequent copying/merging of partial data blocks.
- Modify NumPy's bincount function to be more suitable for high-speed histograms by avoiding data type conversions, eliminating the initial pass to determine bounds, etc.
- Modify NumPy's histogram function to update existing histogram counts. This meant huge PyTables datasets could be histogrammed by reading in successive chunks.
- Write NumPy functions in C to do weighted averages and simple joins.

The net result of optimising both the PyTables data storage and the NumPy histogramming was probably a 50x increase in speed. Certainly I was getting >1M rows/sec for weighted-average histograms, using a 2005 Dell laptop.

I had plans to submit it as a patch to NumPy, but work priorities at the time took me in another direction. One email about it with some C code is here:
http://mail.scipy.org/pipermail/numpy-discussion/2007-March/026472.html

I can send a proper Python source package for it if anyone is interested.

Regards
Stephen

------------------------------

Message: 3
Date: Sat, 17 Nov 2012 23:54:39 +0100
From: Francesc Alted <fal...@gmail.com>
Subject: Re: [Pytables-users] Histogramming 1000x too slow
To: Discussion list for PyTables <pytables-users@lists.sourceforge.net>
Message-ID: <50a815af.20...@gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

On 11/16/12 6:02 PM, Jon Wilson wrote:
> Hi all,
> I am trying to find the best way to make histograms from large data
> sets. Up to now, I've been just loading entire columns into in-memory
> numpy arrays and making histograms from those. However, I'm currently
> working on a handful of datasets where this is prohibitively memory
> intensive (causing an out-of-memory kernel panic on a shared machine
> that you have to open a ticket to have rebooted makes you a little
> gun-shy), so I am now exploring other options.
>
> I know that the Column object is rather nicely set up to act, in some
> circumstances, like a numpy ndarray. So my first thought is to try just
> creating the histogram out of the Column object directly. This is,
> however, 1000x slower than loading it into memory and creating the
> histogram from the in-memory array. Please see my test notebook at:
> http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html
>
> For such a small table, loading into memory is not an issue. For larger
> tables, though, it is a problem, and I had hoped that pytables was
> optimized so that histogramming directly from disk would proceed no
> slower than loading into memory and histogramming. Is there some other
> way of accessing the column (or Array or CArray) data that will make
> faster histograms?

Indeed, a 1000x slowdown is quite a lot, but it is important to stress
that you are doing a disk operation whenever you access a data element,
and that takes time. Perhaps using Array or CArray would make times a
bit better, but frankly, I don't think this is going to buy you much
speed. The problem here is that you have too many layers, and this
makes access slower. You may have better luck with carray
(https://github.com/FrancescAlted/carray), which supports this sort of
operation but uses a much simpler persistence machinery.
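[Editor's note: the chunk-at-a-time accumulation Stephen describes above can be sketched in plain NumPy. This is a minimal illustration, not code from either post; `chunked_histogram` and the synthetic data are the editor's own names. The key point is that the bin edges are fixed up front, so per-chunk counts can simply be summed.]

```python
import numpy as np

def chunked_histogram(chunks, bin_edges):
    """Accumulate histogram counts over successive chunks.

    `chunks` is any iterable of 1-D arrays (e.g. slices read from a
    PyTables column or a carray); `bin_edges` must be fixed in advance
    so every chunk's counts are compatible and can be added together.
    """
    counts = np.zeros(len(bin_edges) - 1, dtype=np.int64)
    for chunk in chunks:
        c, _ = np.histogram(chunk, bins=bin_edges)
        counts += c
    return counts

# Example: histogram 1e6 values in 100k-element chunks, so only one
# chunk needs to be in memory at a time.
rng = np.random.default_rng(0)
a = rng.random(1_000_000)
edges = np.linspace(0.0, 1.0, 11)

chunked = chunked_histogram(
    (a[i:i + 100_000] for i in range(0, a.size, 100_000)), edges)
full, _ = np.histogram(a, bins=edges)
assert (chunked == full).all()  # chunked accumulation matches one pass
```

Because only one chunk is resident at a time, peak memory is set by the chunk size rather than the column size, which is exactly what avoids the out-of-memory situation Jon describes.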
At any rate, the results are far better than PyTables:

In [6]: import numpy as np

In [7]: import carray as ca

In [8]: N = 1e7

In [9]: a = np.random.rand(N)

In [10]: %time h = np.histogram(a)
CPU times: user 0.55 s, sys: 0.00 s, total: 0.55 s
Wall time: 0.55 s

In [11]: ad = ca.carray(a, rootdir='/tmp/a.carray')

In [12]: %time h = np.histogram(ad)
CPU times: user 5.72 s, sys: 0.07 s, total: 5.79 s
Wall time: 5.81 s

So, the overhead for using a disk-based array is just 10x (not 1000x as
in PyTables). I don't know if a 10x slowdown is acceptable to you, but
in case you need more speed, you could probably implement the histogram
as a method of the carray class in Cython:

https://github.com/FrancescAlted/carray/blob/master/carray/carrayExtension.pyx#L651

It should not be too difficult to come up with an optimal implementation
using a chunk-based approach.

-- 
Francesc Alted

------------------------------

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
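[Editor's note: the "bincount with known bounds" optimisation Stephen mentions, and the single-pass chunk kernel Francesc suggests, can be sketched as follows. This is an illustrative sketch, not code from the thread; `uniform_hist_chunk` and its parameters are hypothetical names. With uniform bins and fixed bounds, each chunk needs only one arithmetic pass and one bincount, with no per-chunk min/max scan.]

```python
import numpy as np

def uniform_hist_chunk(counts, chunk, lo, hi):
    """Add one chunk's counts to an existing uniform-bin histogram.

    Because lo, hi, and the bin count are fixed in advance, the bin
    index is a single multiply-and-truncate per element; no pass to
    determine bounds is needed.
    """
    nbins = counts.size
    scale = nbins / (hi - lo)
    idx = ((chunk - lo) * scale).astype(np.intp)
    # Fold out-of-range values into the edge bins (a design choice;
    # np.histogram would instead drop values outside [lo, hi]).
    np.clip(idx, 0, nbins - 1, out=idx)
    counts += np.bincount(idx, minlength=nbins)
    return counts

rng = np.random.default_rng(1)
counts = np.zeros(16, dtype=np.int64)
for _ in range(4):  # stand-in for four successive chunk reads from disk
    uniform_hist_chunk(counts, rng.random(250_000), 0.0, 1.0)
assert counts.sum() == 1_000_000  # every value landed in some bin
```

The same loop body is what one would move into a Cython method on the carray class, operating on each decompressed chunk in turn.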