Back in 2006/07 I wrote an optimized histogram function for pytables +
numpy. The main steps were:

- Read chunksize-sized sections of the pytables array, so the HDF5
  library only has to decompress full blocks of data from disk into
  memory; this eliminates the subsequent copying/merging of partial
  data blocks.
- Modify numpy's bincount function to be more suitable for high-speed
  histograms: avoid data type conversions, eliminate the initial pass
  to determine bounds, etc.
- Modify numpy's histogram function to update existing histogram
  counts. This meant huge pytables datasets could be histogrammed by
  reading in successive chunks.
- I also wrote numpy functions in C to do weighted averages and simple
  joins.

The net result of optimising both the pytables data storage and the
numpy histogramming was probably a 50x increase in speed. Certainly I
was getting >1M rows/sec for weighted-average histograms, using a 2005
Dell laptop. I had plans to submit it as a patch to numpy, but work
priorities at the time took me in another direction. One email about it,
with some C code, is here:
http://mail.scipy.org/pipermail/numpy-discussion/2007-March/026472.html
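
The chunked-update idea above can be sketched in plain numpy (a rough
illustration, not the original patch: the function name and chunk loop
are mine, and the pytables reads are simulated with in-memory slices
where a real run would slice the on-disk array):

```python
import numpy as np

def chunked_weighted_histogram(chunks, edges):
    """Accumulate counts and weight sums over an iterable of
    (values, weights) chunks.  Running totals are updated in place, so
    the full dataset never has to fit in memory; fixed bin edges avoid
    the initial min/max pass a fresh histogram call would need."""
    nbins = len(edges) - 1
    counts = np.zeros(nbins, dtype=np.int64)
    wsums = np.zeros(nbins, dtype=np.float64)
    for values, weights in chunks:
        c, _ = np.histogram(values, bins=edges)
        w, _ = np.histogram(values, bins=edges, weights=weights)
        counts += c
        wsums += w
    # Per-bin weighted average; empty bins are left at zero.
    means = np.divide(wsums, counts, out=np.zeros(nbins), where=counts > 0)
    return counts, wsums, means

rng = np.random.default_rng(0)
values = rng.random(200_000)
weights = rng.random(200_000)
edges = np.linspace(0.0, 1.0, 11)

# Stand-in for reading a pytables array chunk by chunk, e.g.
# arr[start:start + chunksize] for successive values of start.
chunksize = 50_000
chunks = ((values[i:i + chunksize], weights[i:i + chunksize])
          for i in range(0, values.size, chunksize))
counts, wsums, means = chunked_weighted_histogram(chunks, edges)

# Accumulating per chunk matches histogramming the whole array at once.
ref_counts, _ = np.histogram(values, bins=edges)
assert np.array_equal(counts, ref_counts)
```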
I can send a proper Python source package for it if anyone is
interested.

Regards,
Stephen

------------------------------

Message: 3
Date: Sat, 17 Nov 2012 23:54:39 +0100
From: Francesc Alted <fal...@gmail.com>
Subject: Re: [Pytables-users] Histogramming 1000x too slow
To: Discussion list for PyTables <pytables-users@lists.sourceforge.net>
Message-ID: <50a815af.20...@gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

On 11/16/12 6:02 PM, Jon Wilson wrote:

> Hi all,
> I am trying to find the best way to make histograms from large data
> sets.  Up to now, I've been just loading entire columns into in-memory
> numpy arrays and making histograms from those.  However, I'm currently
> working on a handful of datasets where this is prohibitively memory
> intensive (causing an out-of-memory kernel panic on a shared machine
> that you have to open a ticket to have rebooted makes you a little
> gun-shy), so I am now exploring other options.
>
> I know that the Column object is rather nicely set up to act, in some
> circumstances, like a numpy ndarray.  So my first thought is to try just
> creating the histogram out of the Column object directly. This is,
> however, 1000x slower than loading it into memory and creating the
> histogram from the in-memory array.  Please see my test notebook at:
> http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html
>
> For such a small table, loading into memory is not an issue.  For larger
> tables, though, it is a problem, and I had hoped that pytables was
> optimized so that histogramming directly from disk would proceed no
> slower than loading into memory and histogramming. Is there some other
> way of accessing the column (or Array or CArray) data that will make
> faster histograms?

Indeed, a 1000x slowdown is quite a lot, but it is important to stress
that you are doing a disk operation whenever you access a data
element, and that takes time.  Perhaps using Array or CArray would make
times a bit better, but frankly, I don't think this is going to buy you
much speed.

The problem here is that you have too many layers, and this makes access
slower.  You may have better luck with carray
(https://github.com/FrancescAlted/carray), which supports this sort of
operation but uses a much simpler persistence machinery.  At any
rate, the results are far better than with PyTables:

In [6]: import numpy as np

In [7]: import carray as ca

In [8]: N = 1e7

In [9]: a = np.random.rand(N)

In [10]: %time h = np.histogram(a)
CPU times: user 0.55 s, sys: 0.00 s, total: 0.55 s
Wall time: 0.55 s

In [11]: ad = ca.carray(a, rootdir='/tmp/a.carray')

In [12]: %time h = np.histogram(ad)
CPU times: user 5.72 s, sys: 0.07 s, total: 5.79 s
Wall time: 5.81 s

So, the overhead for using a disk-based array is just 10x (not 1000x as
in PyTables).  I don't know if a 10x slowdown is acceptable to you, but
in case you need more speed, you could probably implement the histogram
as a method of the carray class in Cython:

https://github.com/FrancescAlted/carray/blob/master/carray/carrayExtension.pyx#L651

It should not be too difficult to come up with an optimal implementation
using a chunk-based approach.
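
As a rough sketch of that chunk-based approach (pure numpy standing in
for the Cython kernel; the function name and the digitize/bincount
formulation are my illustration, not carray's API):

```python
import numpy as np

def histogram_chunk(counts, chunk, edges):
    """Accumulate one decompressed chunk into an existing counts array.

    This is the per-chunk kernel a Cython method on carray could run:
    digitize maps each value to its bin, bincount tallies the bins, and
    the running counts are updated in place, so only one chunk is ever
    decompressed in memory at a time."""
    idx = np.digitize(chunk, edges) - 1           # bin index per value
    valid = (idx >= 0) & (idx < len(counts))      # drop out-of-range values
    counts += np.bincount(idx[valid], minlength=len(counts))
    return counts

edges = np.linspace(0.0, 1.0, 11)
counts = np.zeros(10, dtype=np.int64)

rng = np.random.default_rng(1)
data = rng.random(50_000)
# Stand-in for iterating over the compressed chunks of a carray.
for start in range(0, data.size, 10_000):
    histogram_chunk(counts, data[start:start + 10_000], edges)

# The chunked result matches a single whole-array histogram.
ref, _ = np.histogram(data, bins=edges)
assert np.array_equal(counts, ref)
```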

--
Francesc Alted

------------------------------

