Yes _please_ Stephen. It would be much appreciated.
On 19/11/2012, at 8:12 AM, Jon Wilson wrote:
> Hi Stephen,
> This sounds fantastic, and exactly what I'm looking for. I'll take a closer
> look tomorrow.
> Jon
>
> Stephen Simmons <m...@stevesimmons.com> wrote:
> Back in 2006/07 I wrote an optimized histogram function for pytables +
> numpy. The main steps were:
>
> - Read in chunksize sections of the pytables array so the HDF5 library
>   just needs to decompress full blocks of data from disk into memory;
>   this eliminates subsequent copying/merging of partial data blocks.
> - Modify numpy's bincount function to be more suitable for high-speed
>   histograms by avoiding data type conversions, eliminating the initial
>   pass to determine bounds, etc.
> - Modify the numpy histogram function to update existing histogram
>   counts. This meant huge pytables datasets could be histogrammed by
>   reading in successive chunks.
> - Write numpy functions in C to do weighted averages and simple joins.
>
> The net result of optimising both the pytables data storage and the
> numpy histogramming was probably a 50x increase in speed. Certainly I
> was getting >1m rows/sec for weighted average histograms, using a 2005
> Dell laptop. I had plans to submit it as a patch to numpy, but work
> priorities at the time took me in another direction. One email about it
> with some C code is here:
>
> http://mail.scipy.org/pipermail/numpy-discussion/2007-March/026472.html
>
> I can send a proper Python source package for it if anyone is
> interested.
>
> Regards,
> Stephen
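Stephen's chunk-at-a-time idea can be sketched in plain numpy. This is a minimal sketch, not his actual package: a plain ndarray stands in for the pytables array (with pytables you would read chunksize-aligned slices instead), and the bin edges are assumed to be known up front so counts can simply accumulate across chunks.

```python
import numpy as np

def histogram_in_chunks(data, edges, chunksize=1_000_000):
    """Accumulate a histogram over successive chunks of `data`."""
    counts = np.zeros(len(edges) - 1, dtype=np.int64)
    for start in range(0, len(data), chunksize):
        chunk = data[start:start + chunksize]   # stands in for one block read
        h, _ = np.histogram(chunk, bins=edges)
        counts += h                             # update existing counts
    return counts

data = np.random.rand(10_000_000)
edges = np.linspace(0.0, 1.0, 11)
chunked = histogram_in_chunks(data, edges)
full, _ = np.histogram(data, bins=edges)
assert (chunked == full).all()                  # identical to one-shot histogram
```

Because each element lands in the same bin whether the array is binned whole or in pieces, the accumulated counts match the one-shot histogram exactly, while peak memory stays at one chunk.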
> Message: 3
> Date: Sat, 17 Nov 2012 23:54:39 +0100
> From: Francesc Alted <fal...@gmail.com>
> Subject: Re: [Pytables-users] Histogramming 1000x too slow
> To: Discussion list for PyTables <pytables-users@lists.sourceforge.net>
> Message-ID: <50a815af.20...@gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> On 11/16/12 6:02 PM, Jon Wilson wrote:
>
> Hi all,
> I am trying to find the best way to make histograms from large data
> sets. Up to now, I've been just loading entire columns into in-memory
> numpy arrays and making histograms from those. However, I'm currently
> working on a handful of datasets where this is prohibitively memory
> intensive (causing an out-of-memory kernel panic on a shared machine
> that you have to open a ticket to have rebooted makes you a little
> gun-shy), so I am now exploring other options.
>
> I know that the Column object is rather nicely set up to act, in some
> circumstances, like a numpy ndarray. So my first thought is to try just
> creating the histogram out of the Column object directly. This is,
> however, 1000x slower than loading it into memory and creating the
> histogram from the in-memory array. Please see my test notebook at:
> http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html
>
> For such a small table, loading into memory is not an issue. For larger
> tables, though, it is a problem, and I had hoped that pytables was
> optimized so that histogramming directly from disk would proceed no
> slower than loading into memory and histogramming. Is there some other
> way of accessing the column (or Array or CArray) data that will make
> faster histograms?
>
> Indeed, a 1000x slowdown is quite a lot, but it is important to stress
> that you are doing a disk operation every time you access a data
> element, and that takes time. Perhaps using Array or CArray would make
> times a bit better, but frankly, I don't think this is going to buy you
> much speed.
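The per-element cost Francesc describes shows up even in pure numpy, with no disk involved at all. A toy illustration (not a pytables benchmark): summing an array one element at a time pays the access-and-boxing overhead a million times, while one vectorized call pays it once.

```python
import time
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)

t0 = time.perf_counter()
s_loop = 0.0
for i in range(len(a)):     # one indexed access (and Python float boxing) per element
    s_loop += a[i]
t1 = time.perf_counter()

s_bulk = a.sum()            # one call over the whole buffer
t2 = time.perf_counter()

# Exact equality holds here: every partial sum is an integer below 2**53.
assert s_loop == s_bulk
print(f"loop: {t1 - t0:.3f}s  bulk: {t2 - t1:.3f}s")
```

Going through pytables' indexing layers adds a disk read and HDF5 bookkeeping on top of that per-element overhead, which is where the 1000x comes from.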
>
> The problem here is that you have too many layers, and this makes access
> slower. You may have better luck with carray
> (https://github.com/FrancescAlted/carray), which supports this sort of
> operation but uses a much simpler persistence machinery. At any rate,
> the results are far better than PyTables:
>
> In [6]: import numpy as np
>
> In [7]: import carray as ca
>
> In [8]: N = int(1e7)
>
> In [9]: a = np.random.rand(N)
>
> In [10]: %time h = np.histogram(a)
> CPU times: user 0.55 s, sys: 0.00 s, total: 0.55 s
> Wall time: 0.55 s
>
> In [11]: ad = ca.carray(a, rootdir='/tmp/a.carray')
>
> In [12]: %time h = np.histogram(ad)
> CPU times: user 5.72 s, sys: 0.07 s, total: 5.79 s
> Wall time: 5.81 s
>
> So, the overhead for using a disk-based array is just 10x (not 1000x as
> in PyTables). I don't know if a 10x slowdown is acceptable to you, but
> in case you need more speed, you could probably implement the histogram
> as a method of the carray class in
> Cython:
>
> https://github.com/FrancescAlted/carray/blob/master/carray/carrayExtension.pyx#L651
>
> It should not be too difficult to come up with an optimal implementation
> using a chunk-based approach.
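A pure-Python sketch of that chunk-based idea, combining it with the bincount trick Stephen described. This is not the Cython version: `chunklen` here is just a loop stride, and the uniform-bin index computation stands in for a modified bincount that skips the generic edge search.

```python
import numpy as np

def uniform_hist_chunked(data, lo, hi, nbins, chunklen=65_536):
    # For uniform bins the bin index is a single multiply, so np.bincount
    # can replace np.histogram's more general (and slower) edge handling.
    counts = np.zeros(nbins, dtype=np.int64)
    scale = nbins / (hi - lo)
    for start in range(0, len(data), chunklen):
        chunk = data[start:start + chunklen]
        idx = ((chunk - lo) * scale).astype(np.intp)
        np.clip(idx, 0, nbins - 1, out=idx)   # fold the hi edge into the last bin
        counts += np.bincount(idx, minlength=nbins)
    return counts

data = np.random.rand(1_000_000)
counts = uniform_hist_chunked(data, 0.0, 1.0, 10)
assert counts.sum() == len(data)              # every element lands in some bin
```

Moving the same loop body into a carray method in Cython, operating on one decompressed chunk at a time, is the optimization Francesc is pointing at.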
>
> -- Francesc Alted
>
>
>
>
> Monitor your physical, virtual and cloud infrastructure from a single
> web console. Get in-depth insight into apps, servers, databases, vmware,
> SAP, cloud infrastructure, etc. Download 30-day Free Trial.
> Pricing starts from $795 for 25 servers or applications!
> http://p.sf.net/sfu/zoho_dev2dev_nov
>
> Pytables-users mailing list
> Pytables-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/pytables-users
>
> --
> Sent from my Android phone with K-9 Mail. Please excuse my brevity.
_________________________________________________
experimental polymedia: www.avatar.com.au
Sonic Communications Research Group,
University of Canberra: creative.canberra.edu.au/scrg