Yes _please_ Stephen. It would be much appreciated.
On 19/11/2012, at 8:12 AM, Jon Wilson wrote:
> Hi Stephen,
> This sounds fantastic, and exactly what I'm looking for. I'll take a closer
> look tomorrow.
> Jon
>
> Stephen Simmons <m...@stevesimmons.com> wrote:
> Back in 2006/07 I wrote an optimized histogram function for pytables +
> numpy. The main steps were:
>
> - Read in chunksize sections of the pytables array so the HDF5 library
>   just needs to decompress full blocks of data from disk into memory;
>   this eliminates subsequent copying/merging of partial data blocks.
> - Modify numpy's bincount function to be more suitable for high-speed
>   histograms by avoiding data type conversions, eliminating the initial
>   pass to determine bounds, etc.
> - Modify the numpy histogram function to update existing histogram
>   counts. This meant huge pytables datasets could be histogrammed by
>   reading in successive chunks.
> - Write numpy functions in C to do weighted averages and simple joins.
>
> The net result of optimising both the pytables data storage and the
> numpy histogramming was probably a 50x increase in speed. Certainly I
> was getting >1m rows/sec for weighted average histograms, using a 2005
> Dell laptop. I had plans to submit it as a patch to numpy, but work
> priorities at the time took me in another direction. One email about it
> with some C code is here:
>
> http://mail.scipy.org/pipermail/numpy-discussion/2007-March/026472.html
>
> I can send a proper Python source package for it if anyone is
> interested.
>
> Regards,
> Stephen
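Stephen's chunk-at-a-time idea can be sketched in plain numpy. This is a minimal sketch, not his actual package: a plain ndarray stands in for the pytables array (with pytables you would read chunksize-aligned slices instead), and the bin edges are assumed to be known up front so counts can simply accumulate across chunks.

```python
import numpy as np

def histogram_in_chunks(data, edges, chunksize=1_000_000):
    """Accumulate a histogram over successive chunks of `data`."""
    counts = np.zeros(len(edges) - 1, dtype=np.int64)
    for start in range(0, len(data), chunksize):
        chunk = data[start:start + chunksize]   # stands in for one block read
        h, _ = np.histogram(chunk, bins=edges)
        counts += h                             # update existing counts
    return counts

data = np.random.rand(10_000_000)
edges = np.linspace(0.0, 1.0, 11)
chunked = histogram_in_chunks(data, edges)
full, _ = np.histogram(data, bins=edges)
assert (chunked == full).all()                  # identical to one-shot histogram
```

Because each element lands in the same bin whether the array is binned whole or in pieces, the accumulated counts match the one-shot histogram exactly, while peak memory stays at one chunk.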
> Message: 3
> Date: Sat, 17 Nov 2012 23:54:39 +0100
> From: Francesc Alted <fal...@gmail.com>
> Subject: Re: [Pytables-users] Histogramming 1000x too slow
> To: Discussion list for PyTables <pytables-users@lists.sourceforge.net>
> Message-ID: <50a815af.20...@gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> On 11/16/12 6:02 PM, Jon Wilson wrote:
>
> Hi all,
> I am trying to find the best way to make histograms from large data
> sets. Up to now, I've been just loading entire columns into in-memory
> numpy arrays and making histograms from those. However, I'm currently
> working on a handful of datasets where this is prohibitively memory
> intensive (causing an out-of-memory kernel panic on a shared machine
> that you have to open a ticket to have rebooted makes you a little
> gun-shy), so I am now exploring other options.
>
> I know that the Column object is rather nicely set up to act, in some
> circumstances, like a numpy ndarray. So my first thought is to try just
> creating the histogram out of the Column object directly. This is,
> however, 1000x slower than loading it into memory and creating the
> histogram from the in-memory array. Please see my test notebook at:
> http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html
>
> For such a small table, loading into memory is not an issue. For larger
> tables, though, it is a problem, and I had hoped that pytables was
> optimized so that histogramming directly from disk would proceed no
> slower than loading into memory and histogramming. Is there some other
> way of accessing the column (or Array or CArray) data that will make
> faster histograms?
>
> Indeed, a 1000x slowdown is quite a lot, but it is important to stress
> that you are doing a disk operation every time you access a data
> element, and that takes time. Perhaps using Array or CArray would make
> times a bit better, but frankly, I don't think this is going to buy you
> much speed.
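The per-element cost Francesc describes shows up even in pure numpy, with no disk involved at all. A toy illustration (not a pytables benchmark): summing an array one element at a time pays the access-and-boxing overhead a million times, while one vectorized call pays it once.

```python
import time
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)

t0 = time.perf_counter()
s_loop = 0.0
for i in range(len(a)):     # one indexed access (and Python float boxing) per element
    s_loop += a[i]
t1 = time.perf_counter()

s_bulk = a.sum()            # one call over the whole buffer
t2 = time.perf_counter()

# Exact equality holds here: every partial sum is an integer below 2**53.
assert s_loop == s_bulk
print(f"loop: {t1 - t0:.3f}s  bulk: {t2 - t1:.3f}s")
```

Going through pytables' indexing layers adds a disk read and HDF5 bookkeeping on top of that per-element overhead, which is where the 1000x comes from.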
>
> The problem here is that you have too many layers, and this makes access
> slower. You may have better luck with carray
> (https://github.com/FrancescAlted/carray), which supports this sort of
> operation but uses a much simpler persistence machinery. At any rate,
> the results are far better than PyTables:
>
> In [6]: import numpy as np
>
> In [7]: import carray as ca
>
> In [8]: N = int(1e7)
>
> In [9]: a = np.random.rand(N)
>
> In [10]: %time h = np.histogram(a)
> CPU times: user 0.55 s, sys: 0.00 s, total: 0.55 s
> Wall time: 0.55 s
>
> In [11]: ad = ca.carray(a, rootdir='/tmp/a.carray')
>
> In [12]: %time h = np.histogram(ad)
> CPU times: user 5.72 s, sys: 0.07 s, total: 5.79 s
> Wall time: 5.81 s
>
> So, the overhead for using a disk-based array is just 10x (not 1000x as
> in PyTables). I don't know if a 10x slowdown is acceptable to you, but
> in case you need more speed, you could probably implement the histogram
> as a method of the carray class in
> Cython:
>
> https://github.com/FrancescAlted/carray/blob/master/carray/carrayExtension.pyx#L651
>
> It should not be too difficult to come up with an optimal implementation
> using a chunk-based approach.
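A pure-Python sketch of that chunk-based idea, combining it with the bincount trick Stephen described. This is not the Cython version: `chunklen` here is just a loop stride, and the uniform-bin index computation stands in for a modified bincount that skips the generic edge search.

```python
import numpy as np

def uniform_hist_chunked(data, lo, hi, nbins, chunklen=65_536):
    # For uniform bins the bin index is a single multiply, so np.bincount
    # can replace np.histogram's more general (and slower) edge handling.
    counts = np.zeros(nbins, dtype=np.int64)
    scale = nbins / (hi - lo)
    for start in range(0, len(data), chunklen):
        chunk = data[start:start + chunklen]
        idx = ((chunk - lo) * scale).astype(np.intp)
        np.clip(idx, 0, nbins - 1, out=idx)   # fold the hi edge into the last bin
        counts += np.bincount(idx, minlength=nbins)
    return counts

data = np.random.rand(1_000_000)
counts = uniform_hist_chunked(data, 0.0, 1.0, 10)
assert counts.sum() == len(data)              # every element lands in some bin
```

Moving the same loop body into a carray method in Cython, operating on one decompressed chunk at a time, is the optimization Francesc is pointing at.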
>
> -- Francesc Alted
>
>
>
>
> Monitor your physical, virtual and cloud infrastructure from a single
> web console. Get in-depth insight into apps, servers, databases, vmware,
> SAP, cloud infrastructure, etc. Download 30-day Free Trial.
> Pricing starts from $795 for 25 servers or applications!
> http://p.sf.net/sfu/zoho_dev2dev_nov
>
> Pytables-users mailing list
> Pytables-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/pytables-users
>
> --
> Sent from my Android phone with K-9 Mail. Please excuse my brevity.
_________________________________________________
experimental polymedia: www.avatar.com.au
Sonic Communications Research Group,
University of Canberra: creative.canberra.edu.au/scrg