Re: [Pytables-users] Histogramming 1000x too slow

2012-11-25 Thread Anthony Scopatz
On Mon, Nov 19, 2012 at 12:59 PM, Jon Wilson  wrote:

>  Hi Anthony,
>
>
>
>
> On 11/17/2012 11:49 AM, Anthony Scopatz wrote:
>
>  Hi Jon,
>
>  Barring changes to numexpr itself, this is exactly what I am suggesting.
>  Well, either writing one query expr per bin or (more cleverly) writing
> one expr which, when evaluated for a row, returns the integer bin number
> (1, 2, 3, ...) that the row falls in.  Then you can simply count() for
> each bin number.
>
>  For example, if you wanted to histogram data which ran from [0,100] into
> 10 bins, then the expr "r/10" into a dtype=int would do the trick.  This
> has the advantage of only running over the data once.  (Also, I am not
> convinced that running over the data multiple times is less efficient than
> doing row-based iteration.  You would have to test it on your data to find
> out.)
>
>
>>  It is a reduction operation, and would greatly benefit from chunking, I
>> expect. Not unlike sum(), which is implemented as a specially supported
>> reduction operation inside numexpr (buggily, last I checked). I suspect
>> that a substantial improvement in histogramming requires direct support
>> from either pytables or from numexpr. I don't suppose that there might be a
>> chunked-reduction interface exposed somewhere that I could hook into?
>>
>
>  This is definitely a feature to request from numexpr.
>
> I've been fiddling around with Stephen's code a bit, and it looks like the
> best way to do things is to read chunks (whether exactly of table.chunksize
> or not is a matter for optimization) of the data in at a time, and create
> histograms of those chunks.  Then combining the histograms is a trivial sum
> operation.  This type of approach can be generically applied in many cases,
> I suspect, where row-by-row iteration is prohibitively slow, but the
> dataset is too large to fit into memory.  As I understand, this idea is the
> primary win of PyTables in the first place!
>
> So, I think it would be extraordinarily helpful to provide a
> chunked-iteration interface for this sort of use case.  It can be as simple
> as a wrapper around Table.read():
>
> class Table:
>     def chunkiter(self, field=None):
>         n = 0
>         while n*self.chunksize < self.nrows:
>             yield self.read(n*self.chunksize, (n+1)*self.chunksize,
>                             field=field)
>             n += 1
>
> Then I can write something like
> bins = linspace(-1, 1, 101)
> hist = sum(histogram(chunk, bins=bins)[0] for chunk in
>            mytable.chunkiter(myfield))
>
> Preliminary tests seem to indicate that, for a table with 1 column and 10M
> rows, reading in "chunks" of 10x chunksize gives the best
> read-time-per-row.  This is perhaps naive as regards chunksize black magic,
> though...
>

Hello Jon,

Sorry about the slow reply, but I think that what is proposed in issue #27
[1] would solve the above by default, right?  Maybe you could pull Josh's
code and test it on the above example to make sure.  And then we could go
ahead and merge this in :).


> And of course, if implemented by numexpr, it could benefit from the nice
> automatic multithreading there.
>

This would be nice, but as you point out, not totally necessary here.


>
> Also, I might dig in a bit and see about extending the "field" argument to
> read so it can read multiple fields at once (to do N-dimensional
> histograms), as you suggested in a previous mail some months ago.
>

Also super cool, but not immediate ;)

Be Well
Anthony

1. https://github.com/PyTables/PyTables/issues/27


> Best Regards,
> Jon
>


Re: [Pytables-users] Histogramming 1000x too slow

2012-11-19 Thread Jon Wilson

Hi Anthony,



On 11/17/2012 11:49 AM, Anthony Scopatz wrote:

Hi Jon,

Barring changes to numexpr itself, this is exactly what I am
suggesting.  Well, either writing one query expr per bin or (more
cleverly) writing one expr which, when evaluated for a row, returns the
integer bin number (1, 2, 3, ...) that the row falls in.  Then you can
simply count() for each bin number.


For example, if you wanted to histogram data which ran from [0,100] 
into 10 bins, then the expr "r/10" into a dtype=int would do the 
trick.  This has the advantage of only running over the data once. 
 (Also, I am not convinced that running over the data multiple times 
is less efficient than doing row-based iteration.  You would have to 
test it on your data to find out.)


It is a reduction operation, and would greatly benefit from
chunking, I expect. Not unlike sum(), which is implemented as a
specially supported reduction operation inside numexpr (buggily,
last I checked). I suspect that a substantial improvement in
histogramming requires direct support from either pytables or from
numexpr. I don't suppose that there might be a chunked-reduction
interface exposed somewhere that I could hook into?


This is definitely a feature to request from numexpr.
I've been fiddling around with Stephen's code a bit, and it looks like 
the best way to do things is to read chunks (whether exactly of 
table.chunksize or not is a matter for optimization) of the data in at a 
time, and create histograms of those chunks.  Then combining the 
histograms is a trivial sum operation.  This type of approach can be 
generically applied in many cases, I suspect, where row-by-row iteration 
is prohibitively slow, but the dataset is too large to fit into memory.  
As I understand, this idea is the primary win of PyTables in the first 
place!


So, I think it would be extraordinarily helpful to provide a 
chunked-iteration interface for this sort of use case.  It can be as 
simple as a wrapper around Table.read():


class Table:
    def chunkiter(self, field=None):
        n = 0
        while n*self.chunksize < self.nrows:
            yield self.read(n*self.chunksize, (n+1)*self.chunksize,
                            field=field)
            n += 1


Then I can write something like
bins = linspace(-1, 1, 101)
hist = sum(histogram(chunk, bins=bins)[0] for chunk in
           mytable.chunkiter(myfield))


Preliminary tests seem to indicate that, for a table with 1 column and 
10M rows, reading in "chunks" of 10x chunksize gives the best 
read-time-per-row.  This is perhaps naive as regards chunksize black 
magic, though...
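
Putting those pieces together, a rough, untested sketch of the
read-in-blocks pattern (the file and column names are made up;
table.chunkshape[0] is the on-disk chunk length that PyTables exposes, and
blocks_per_read is the "10x" factor mentioned above):

import numpy as np
import tables as tb

def chunked_hist(table, field, bins, blocks_per_read=10):
    # Accumulate a 1-D histogram, reading several chunks per pass.
    counts = np.zeros(len(bins) - 1, dtype=np.int64)
    step = table.chunkshape[0] * blocks_per_read
    for start in range(0, table.nrows, step):
        stop = min(start + step, table.nrows)
        counts += np.histogram(table.read(start, stop, field=field),
                               bins=bins)[0]
    return counts

with tb.open_file("data.h5", "r") as f:
    hist = chunked_hist(f.root.mytable, "myfield", np.linspace(-1, 1, 101))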


And of course, if implemented by numexpr, it could benefit from the nice 
automatic multithreading there.


Also, I might dig in a bit and see about extending the "field" argument 
to read so it can read multiple fields at once (to do N-dimensional 
histograms), as you suggested in a previous mail some months ago.
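
In the meantime, an N-dimensional histogram can be faked by calling read()
once per field over the same row range and feeding the columns to
numpy.histogramdd; a rough sketch (field names invented):

import numpy as np

def chunked_histdd(table, fields, bins, blocksize=100000):
    # Accumulate an N-D histogram over blocks of rows.
    counts = None
    for start in range(0, table.nrows, blocksize):
        stop = min(start + blocksize, table.nrows)
        cols = [table.read(start, stop, field=f) for f in fields]
        h, _ = np.histogramdd(cols, bins=bins)
        counts = h if counts is None else counts + h
    return counts

# e.g. a 2-D histogram of two hypothetical columns "x" and "y":
# counts = chunked_histdd(mytable, ["x", "y"], bins=[xedges, yedges])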

Best Regards,
Jon


Re: [Pytables-users] Histogramming 1000x too slow

2012-11-18 Thread David Worrall
Yes _please_ Stephen. It would be much appreciated.

On 19/11/2012, at 8:12 AM, Jon Wilson wrote:

> Hi Stephen,
> This sounds fantastic, and exactly what i'm looking for. I'll take a closer 
> look tomorrow.
> Jon
> 
> Stephen Simmons  wrote:
> Back in 2006/07 I wrote an optimized histogram function for pytables + 
> numpy. The main steps were: - Read in chunksize-sections of the pytables 
> array so the HDF5 library just needs to decompress full blocks of data 
> from disk into memory; eliminates subsequent copying/merging of partial 
> data blocks - Modify numpy's bincount function to be more suitable for 
> high-speed histograms by avoiding data type conversions, eliminate 
> initial pass to determine bounds, etc. - Also I modified the numpy 
> histogram function to update existing histogram counts. This meant huge 
> pytables datasets could be histogrammed by reading in successive chunks. 
> - I also wrote numpy function in C to do weighted averages and simple 
> joins. Net result of optimising both the pytables data storage and the 
> numpy histogramming was probably a 50x increase in
> speed. Certainly I 
> was getting >1m rows/sec for weighted average histograms, using a 2005 
> Dell laptop. I had plans to submit it as a patch to numpy, but work 
> priorities at the time took me in another direction. One email about it 
> with some C code is here: 
> http://mail.scipy.org/pipermail/numpy-discussion/2007-March/026472.html 
> I can send a proper Python source package for it if anyone is 
> interested. Regards Stephen



> Message: 3 
> Date: Sat, 17 Nov 2012 23:54:39 +0100 From: Francesc Alted 
>  Subject: Re: [Pytables-users] Histogramming 1000x too 
> slow To: Discussion list for PyTables 
>  Message-ID: 
> <50a815af.20...@gmail.com> Content-Type: text/plain; charset=ISO-8859-1; 
> format=flowed On 11/16/12 6:02 PM, Jon Wilson wrote:
> 
> Hi all,
> I am trying to find the best way to make histograms from large data
> sets.  Up to now, I've been just loading entire columns into in-memory
> numpy arrays and making histograms from those.  However, I'm currently
> working on a handful of datasets where this is prohibitively memory
> intensive (causing an out-of-memory kernel panic on a shared machine
> that you have to open a ticket to have rebooted makes you a little
> gun-shy), so I am now exploring other options.
> 
> I know that the Column object is rather nicely set up to act, in some
> circumstances, like a numpy ndarray.  So my first thought is to try just
> creating the histogram out of the Column object directly. This is,
> however, 1000x slower than loading it into memory and creating the
> histogram from the in-memory array.  Please see my test notebook at:
> http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html
> 
> For such a small table, loading into memory is not an issue.  For larger
> tables, though, it is a problem, and I had hoped that pytables was
> optimized so that histogramming directly from disk would proceed no
> slower than loading into memory and histogramming. Is there some other
> way of accessing the column (or Array or CArray) data that will make
> faster histograms?
> 
> Indeed a 1000x slowness is quite a lot, but it is important to stress
> that you are doing an disk operation whenever you are accessing a data
> element, and that takes time.  Perhaps using Array or CArray would make
> times a bit better, but frankly, I don't think this is going to buy you
> too much speed.
> 
> The problem here is that you have too many layers, and this makes access
> slower.  You may have better luck with
> carray
> (https://github.com/FrancescAlted/carray), that supports this sort of
> operations, but using a much simpler persistence machinery.  At any
> rate, the results are far better than PyTables:
> 
> In [6]: import numpy as np
> 
> In [7]: import carray as ca
> 
> In [8]: N = 1e7
> 
> In [9]: a = np.random.rand(N)
> 
> In [10]: %time h = np.histogram(a)
> CPU times: user 0.55 s, sys: 0.00 s, total: 0.55 s
> Wall time: 0.55 s
> 
> In [11]: ad = ca.carray(a, rootdir='/tmp/a.carray')
> 
> In [12]: %time h = np.histogram(ad)
> CPU times: user 5.72 s, sys: 0.07 s, total: 5.79 s
> Wall time: 5.81 s
> 
> So, the overhead for using a disk-based array is just 10x (not 1000x as
> in PyTables).  I don't know if a 10x slowdown is acceptable to you, but
> in case you need more speed, you could probably implement the histogram
> as a method of the carray class in
> Cython:
> 
> https://github.com/FrancescAlted/carray/blob/master/carray/carrayExtension.pyx#L651
> 
> It should not be too difficult to come up with an optimal implementation
> using a chunk-based approach.

Re: [Pytables-users] Histogramming 1000x too slow

2012-11-18 Thread Jon Wilson
Hi Stephen,
This sounds fantastic, and exactly what i'm looking for. I'll take a closer 
look tomorrow.
Jon

Stephen Simmons  wrote:

>Back in 2006/07 I wrote an optimized histogram function for pytables + 
>numpy. The main steps were: - Read in chunksize-sections of the
>pytables 
>array so the HDF5 library just needs to decompress full blocks of data 
>from disk into memory; eliminates subsequent copying/merging of partial
>
>data blocks - Modify numpy's bincount function to be more suitable for 
>high-speed histograms by avoiding data type conversions, eliminate 
>initial pass to determine bounds, etc. - Also I modified the numpy 
>histogram function to update existing histogram counts. This meant huge
>
>pytables datasets could be histogrammed by reading in successive
>chunks. 
>- I also wrote numpy function in C to do weighted averages and simple 
>joins. Net result of optimising both the pytables data storage and the 
>numpy histogramming was probably a 50x increase in speed. Certainly I 
>was getting >1m rows/sec for weighted average histograms, using a 2005 
>Dell laptop. I had plans to submit it as a patch to numpy, but work 
>priorities at the time took me in another direction. One email about it
>
>with some C code is here: 
>http://mail.scipy.org/pipermail/numpy-discussion/2007-March/026472.html
>
>I can send a proper Python source package for it if anyone is 
>interested. Regards Stephen ---------- Message: 3 
>Date: Sat, 17 Nov 2012 23:54:39 +0100 From: Francesc Alted 
> Subject: Re: [Pytables-users] Histogramming 1000x
>too 
>slow To: Discussion list for PyTables 
> Message-ID: 
><50a815af.20...@gmail.com> Content-Type: text/plain;
>charset=ISO-8859-1; 
>format=flowed On 11/16/12 6:02 PM, Jon Wilson wrote:
>
>> Hi all,
>> I am trying to find the best way to make histograms from large data
>> sets.  Up to now, I've been just loading entire columns into
>in-memory
>> numpy arrays and making histograms from those.  However, I'm
>currently
>> working on a handful of datasets where this is prohibitively memory
>> intensive (causing an out-of-memory kernel panic on a shared machine
>> that you have to open a ticket to have rebooted makes you a little
>> gun-shy), so I am now exploring other options.
>>
>> I know that the Column object is rather nicely set up to act, in some
>> circumstances, like a numpy ndarray.  So my first thought is to try
>just
>> creating the histogram out of the Column object directly. This is,
>> however, 1000x slower than loading it into memory and creating the
>> histogram from the in-memory array.  Please see my test notebook at:
>> http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html
>>
>> For such a small table, loading into memory is not an issue.  For
>larger
>> tables, though, it is a problem, and I had hoped that pytables was
>> optimized so that histogramming directly from disk would proceed no
>> slower than loading into memory and histogramming. Is there some
>other
>> way of accessing the column (or Array or CArray) data that will make
>> faster histograms?
>
>Indeed a 1000x slowness is quite a lot, but it is important to stress
>that you are doing an disk operation whenever you are accessing a data
>element, and that takes time.  Perhaps using Array or CArray would make
>times a bit better, but frankly, I don't think this is going to buy you
>too much speed.
>
>The problem here is that you have too many layers, and this makes
>access
>slower.  You may have better luck with carray
>(https://github.com/FrancescAlted/carray), that supports this sort of
>operations, but using a much simpler persistence machinery.  At any
>rate, the results are far better than PyTables:
>
>In [6]: import numpy as np
>
>In [7]: import carray as ca
>
>In [8]: N = 1e7
>
>In [9]: a = np.random.rand(N)
>
>In [10]: %time h = np.histogram(a)
>CPU times: user 0.55 s, sys: 0.00 s, total: 0.55 s
>Wall time: 0.55 s
>
>In [11]: ad = ca.carray(a, rootdir='/tmp/a.carray')
>
>In [12]: %time h = np.histogram(ad)
>CPU times: user 5.72 s, sys: 0.07 s, total: 5.79 s
>Wall time: 5.81 s
>
>So, the overhead for using a disk-based array is just 10x (not 1000x as
>in PyTables).  I don't know if a 10x slowdown is acceptable to you, but
>in case you need more speed, you could probably implement the histogram
>as a method of the carray class in Cython:
>
>https://github.com/FrancescAlted/carray/blob/master/carray/carrayExtension.pyx#L651
>
>It should not be too difficult to come up with an optimal
>implementation
>using a chunk-based approach.
>
>

Re: [Pytables-users] Histogramming 1000x too slow

2012-11-18 Thread Stephen Simmons
Back in 2006/07 I wrote an optimized histogram function for pytables +
numpy. The main steps were:

- Read in chunksize-sections of the pytables array so the HDF5 library just
  needs to decompress full blocks of data from disk into memory; this
  eliminates subsequent copying/merging of partial data blocks.
- Modify numpy's bincount function to be more suitable for high-speed
  histograms by avoiding data type conversions, eliminating the initial
  pass to determine bounds, etc.
- Also I modified the numpy histogram function to update existing histogram
  counts. This meant huge pytables datasets could be histogrammed by
  reading in successive chunks.
- I also wrote a numpy function in C to do weighted averages and simple
  joins.

The net result of optimising both the pytables data storage and the numpy
histogramming was probably a 50x increase in speed. Certainly I was getting
more than 1m rows/sec for weighted average histograms, using a 2005 Dell
laptop. I had plans to submit it as a patch to numpy, but work priorities
at the time took me in another direction. One email about it with some C
code is here:
http://mail.scipy.org/pipermail/numpy-discussion/2007-March/026472.html

I can send a proper Python source package for it if anyone is interested.

Regards
Stephen
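
Until then, the core of that approach can be approximated in plain numpy:
compute integer bin indices per chunk and let np.bincount update a running
(optionally weighted) counts array. The column names and bin edges below
are invented for illustration:

import numpy as np
import tables as tb

nbins, lo, hi = 100, 0.0, 1.0
counts = np.zeros(nbins, dtype=np.float64)       # running weighted histogram

with tb.open_file("data.h5", "r") as f:          # hypothetical file
    t = f.root.mytable
    chunklen = t.chunkshape[0]
    for start in range(0, t.nrows, chunklen):
        stop = min(start + chunklen, t.nrows)
        x = t.read(start, stop, field="x")       # values to histogram
        w = t.read(start, stop, field="weight")  # per-row weights
        idx = ((x - lo) * nbins / (hi - lo)).astype(np.intp)
        np.clip(idx, 0, nbins - 1, out=idx)      # clamp values on the edges
        counts += np.bincount(idx, weights=w, minlength=nbins)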
----------
Message: 3
Date: Sat, 17 Nov 2012 23:54:39 +0100
From: Francesc Alted
Subject: Re: [Pytables-users] Histogramming 1000x too slow
To: Discussion list for PyTables

On 11/16/12 6:02 PM, Jon Wilson wrote:

> Hi all,
> I am trying to find the best way to make histograms from large data
> sets.  Up to now, I've been just loading entire columns into in-memory
> numpy arrays and making histograms from those.  However, I'm currently
> working on a handful of datasets where this is prohibitively memory
> intensive (causing an out-of-memory kernel panic on a shared machine
> that you have to open a ticket to have rebooted makes you a little
> gun-shy), so I am now exploring other options.
>
> I know that the Column object is rather nicely set up to act, in some
> circumstances, like a numpy ndarray.  So my first thought is to try just
> creating the histogram out of the Column object directly. This is,
> however, 1000x slower than loading it into memory and creating the
> histogram from the in-memory array.  Please see my test notebook at:
> http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html
>
> For such a small table, loading into memory is not an issue.  For larger
> tables, though, it is a problem, and I had hoped that pytables was
> optimized so that histogramming directly from disk would proceed no
> slower than loading into memory and histogramming. Is there some other
> way of accessing the column (or Array or CArray) data that will make
> faster histograms?

Indeed, a 1000x slowdown is quite a lot, but it is important to stress
that you are doing a disk operation whenever you access a data element,
and that takes time.  Perhaps using Array or CArray would make times a
bit better, but frankly, I don't think this is going to buy you too much
speed.

The problem here is that you have too many layers, and this makes access
slower.  You may have better luck with carray
(https://github.com/FrancescAlted/carray), which supports this sort of
operation but uses a much simpler persistence machinery.  At any rate,
the results are far better than PyTables:

In [6]: import numpy as np

In [7]: import carray as ca

In [8]: N = 1e7

In [9]: a = np.random.rand(N)

In [10]: %time h = np.histogram(a)
CPU times: user 0.55 s, sys: 0.00 s, total: 0.55 s
Wall time: 0.55 s

In [11]: ad = ca.carray(a, rootdir='/tmp/a.carray')

In [12]: %time h = np.histogram(ad)
CPU times: user 5.72 s, sys: 0.07 s, total: 5.79 s
Wall time: 5.81 s

So, the overhead for using a disk-based array is just 10x (not 1000x as
in PyTables).  I don't know if a 10x slowdown is acceptable to you, but
in case you need more speed, you could probably implement the histogram
as a method of the carray class in Cython:

https://github.com/FrancescAlted/carray/blob/master/carray/carrayExtension.pyx#L651

It should not be too difficult to come up with an optimal implementation
using a chunk-based approach.

-- 
Francesc Alted




Re: [Pytables-users] Histogramming 1000x too slow

2012-11-17 Thread Francesc Alted
On 11/16/12 6:02 PM, Jon Wilson wrote:
> Hi all,
> I am trying to find the best way to make histograms from large data
> sets.  Up to now, I've been just loading entire columns into in-memory
> numpy arrays and making histograms from those.  However, I'm currently
> working on a handful of datasets where this is prohibitively memory
> intensive (causing an out-of-memory kernel panic on a shared machine
> that you have to open a ticket to have rebooted makes you a little
> gun-shy), so I am now exploring other options.
>
> I know that the Column object is rather nicely set up to act, in some
> circumstances, like a numpy ndarray.  So my first thought is to try just
> creating the histogram out of the Column object directly. This is,
> however, 1000x slower than loading it into memory and creating the
> histogram from the in-memory array.  Please see my test notebook at:
> http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html
>
> For such a small table, loading into memory is not an issue.  For larger
> tables, though, it is a problem, and I had hoped that pytables was
> optimized so that histogramming directly from disk would proceed no
> slower than loading into memory and histogramming. Is there some other
> way of accessing the column (or Array or CArray) data that will make
> faster histograms?

Indeed, a 1000x slowdown is quite a lot, but it is important to stress
that you are doing a disk operation whenever you access a data element,
and that takes time.  Perhaps using Array or CArray would make times a
bit better, but frankly, I don't think this is going to buy you too much
speed.

The problem here is that you have too many layers, and this makes access
slower.  You may have better luck with carray
(https://github.com/FrancescAlted/carray), which supports this sort of
operation but uses a much simpler persistence machinery.  At any rate,
the results are far better than PyTables:

In [6]: import numpy as np

In [7]: import carray as ca

In [8]: N = 1e7

In [9]: a = np.random.rand(N)

In [10]: %time h = np.histogram(a)
CPU times: user 0.55 s, sys: 0.00 s, total: 0.55 s
Wall time: 0.55 s

In [11]: ad = ca.carray(a, rootdir='/tmp/a.carray')

In [12]: %time h = np.histogram(ad)
CPU times: user 5.72 s, sys: 0.07 s, total: 5.79 s
Wall time: 5.81 s

So, the overhead for using a disk-based array is just 10x (not 1000x as 
in PyTables).  I don't know if a 10x slowdown is acceptable to you, but 
in case you need more speed, you could probably implement the histogram 
as a method of the carray class in Cython:

https://github.com/FrancescAlted/carray/blob/master/carray/carrayExtension.pyx#L651

It should not be too difficult to come up with an optimal implementation 
using a chunk-based approach.
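
Short of writing that Cython method, the chunk-based idea can be tried from
pure Python by slicing the carray in blocks, so only one block at a time is
decompressed into memory. A rough sketch, reusing the ad object from the
session above (the chunklen attribute name is assumed from the carray docs):

import numpy as np

bins = np.linspace(0.0, 1.0, 11)
counts = np.zeros(len(bins) - 1, dtype=np.int64)
blen = ad.chunklen * 10                    # a few compressed chunks per pass
for start in range(0, len(ad), blen):
    block = ad[start:start + blen]         # decompress just this block
    counts += np.histogram(block, bins=bins)[0]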

-- 
Francesc Alted




Re: [Pytables-users] Histogramming 1000x too slow

2012-11-17 Thread David Wilson
I've been using (and recommend) Pandas http://pandas.pydata.org/ along with
this book:
http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CDIQFjAA&url=http%3A%2F%2Fshop.oreilly.com%2Fproduct%2F0636920023784.do&ei=GfSnUJSbGqm5ywH7poCwDA&usg=AFQjCNEJuio5DbubgyNQR4Tp9iM1RClZHA


Good luck,
Dave

On Fri, Nov 16, 2012 at 11:02 AM, Jon Wilson  wrote:

> Hi all,
> I am trying to find the best way to make histograms from large data
> sets.  Up to now, I've been just loading entire columns into in-memory
> numpy arrays and making histograms from those.  However, I'm currently
> working on a handful of datasets where this is prohibitively memory
> intensive (causing an out-of-memory kernel panic on a shared machine
> that you have to open a ticket to have rebooted makes you a little
> gun-shy), so I am now exploring other options.
>
> I know that the Column object is rather nicely set up to act, in some
> circumstances, like a numpy ndarray.  So my first thought is to try just
> creating the histogram out of the Column object directly. This is,
> however, 1000x slower than loading it into memory and creating the
> histogram from the in-memory array.  Please see my test notebook at:
> http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html
>
> For such a small table, loading into memory is not an issue.  For larger
> tables, though, it is a problem, and I had hoped that pytables was
> optimized so that histogramming directly from disk would proceed no
> slower than loading into memory and histogramming. Is there some other
> way of accessing the column (or Array or CArray) data that will make
> faster histograms?
> Regards,
> Jon
>
>
>



-- 
David C. Wilson
(612) 460-1329
david.craig.wil...@gmail.com
http://www.linkedin.com/in/davidcwilson


Re: [Pytables-users] Histogramming 1000x too slow

2012-11-17 Thread Anthony Scopatz
On Fri, Nov 16, 2012 at 7:33 PM, Jon Wilson  wrote:

> Hi Anthony,
> I don't think that either of these help me here (unless I've misunderstood
> something). I need to fill the histogram with every row in the table, so
> querying doesn't gain me anything. (especially since the query just returns
> an iterator over rows) I also don't need (at the moment) to compute any
> function of the column data, just count (weighted) entries into various
> bins. I suppose I could write one Expr for each bin of my histogram, but
> that seems dreadfully inefficient and probably difficult to maintain.
>

Hi Jon,

Barring changes to numexpr itself, this is exactly what I am suggesting.
Well, either writing one query expr per bin or (more cleverly) writing
one expr which, when evaluated for a row, returns the integer bin number
(1, 2, 3, ...) that the row falls in.  Then you can simply count() for
each bin number.

For example, if you wanted to histogram data which ran from [0,100] into 10
bins, then the expr "r/10" into a dtype=int would do the trick.  This has
the advantage of only running over the data once.  (Also, I am not
convinced that running over the data multiple times is less efficient than
doing row-based iteration.  You would have to test it on your data to find
out.)
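
A minimal sketch of that second approach with tables.Expr (file and column
names are invented; the expression is evaluated chunk-wise by numexpr,
though eval() here still returns the full bin-number array in memory):

import numpy as np
import tables as tb

with tb.open_file("data.h5", "r") as f:
    r = f.root.mytable.cols.r                      # values in [0, 100)
    binno = tb.Expr("r / 10", uservars={"r": r}).eval().astype(np.intp)
    counts = np.bincount(binno, minlength=10)      # counts per bin 0..9

For data that really does not fit in memory, the bin-number array could be
sent to a disk-based array with Expr.set_output() and counted in blocks
instead.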


> It is a reduction operation, and would greatly benefit from chunking, I
> expect. Not unlike sum(), which is implemented as a specially supported
> reduction operation inside numexpr (buggily, last I checked). I suspect
> that a substantial improvement in histogramming requires direct support
> from either pytables or from numexpr. I don't suppose that there might be a
> chunked-reduction interface exposed somewhere that I could hook into?
>

This is definitely a feature to request from numexpr.

Be Well
Anthony


>  Jon
>
> Anthony Scopatz  wrote:
>>
>>  On Fri, Nov 16, 2012 at 9:02 AM, Jon Wilson  wrote:
>>
>>> Hi all,
>>> I am trying to find the best way to make histograms from large data
>>> sets.  Up to now, I've been just loading entire columns into in-memory
>>> numpy arrays and making histograms from those.  However, I'm currently
>>> working on a handful of datasets where this is prohibitively memory
>>> intensive (causing an out-of-memory kernel panic on a shared machine
>>> that you have to open a ticket to have rebooted makes you a little
>>> gun-shy), so I am now exploring other options.
>>>
>>> I know that the Column object is rather nicely set up to act, in some
>>> circumstances, like a numpy ndarray.  So my first thought is to try just
>>> creating the histogram out of the Column object directly. This is,
>>> however, 1000x slower than loading it into memory and creating the
>>> histogram from the in-memory array.  Please see my test notebook at:
>>> http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html
>>>
>>> For such a small table, loading into memory is not an issue.  For larger
>>> tables, though, it is a problem, and I had hoped that pytables was
>>> optimized so that histogramming directly from disk would proceed no
>>> slower than loading into memory and histogramming. Is there some other
>>> way of accessing the column (or Array or CArray) data that will make
>>> faster histograms?
>>>
>>
>> Hi Jon,
>>
>> This is not surprising since the column object itself is going to be
>> iterated
>> over per row.  As you found, reading in each row individually will be
>> prohibitively expensive as compared to reading in all the data at one.
>>
>> To do this in the right way for data that is larger than system memory,
>> you
>> need to read it in in chunks.  Luckily there are tools to help you
>> automate
>> this process already in PyTables.  I would recommend that you use
>> expressions [1] or queries [2] to do your historgramming more efficiently.
>>
>> Be Well
>> Anthony
>>
>> 1. http://pytables.github.com/usersguide/libref/expr_class.html
>> 2.
>> http://pytables.github.com/usersguide/libref/structured_storage.html?#table-methods-querying
>>
>>
>>
>>> Regards,
>>> Jon
>>>
>>>
>>>
>>

Re: [Pytables-users] Histogramming 1000x too slow

2012-11-16 Thread Jon Wilson
Hi Anthony,
I don't think that either of these help me here (unless I've misunderstood 
something). I need to fill the histogram with every row in the table, so 
querying doesn't gain me anything. (especially since the query just returns an 
iterator over rows)  I also don't need (at the moment) to compute any function 
of the column data, just count (weighted) entries into various bins. I suppose 
I could write one Expr for each bin of my histogram, but that seems dreadfully 
inefficient and probably difficult to maintain.

It is a reduction operation, and would greatly benefit from chunking, I expect. 
Not unlike sum(), which is implemented as a specially supported reduction 
operation inside numexpr (buggily, last I checked). I suspect that a 
substantial improvement in histogramming requires direct support from either 
pytables or from numexpr.  I don't suppose that there might be a 
chunked-reduction interface exposed somewhere that I could hook into?
Jon

Anthony Scopatz  wrote:

>On Fri, Nov 16, 2012 at 9:02 AM, Jon Wilson  wrote:
>
>> Hi all,
>> I am trying to find the best way to make histograms from large data
>> sets.  Up to now, I've been just loading entire columns into
>in-memory
>> numpy arrays and making histograms from those.  However, I'm
>currently
>> working on a handful of datasets where this is prohibitively memory
>> intensive (causing an out-of-memory kernel panic on a shared machine
>> that you have to open a ticket to have rebooted makes you a little
>> gun-shy), so I am now exploring other options.
>>
>> I know that the Column object is rather nicely set up to act, in some
>> circumstances, like a numpy ndarray.  So my first thought is to try
>just
>> creating the histogram out of the Column object directly. This is,
>> however, 1000x slower than loading it into memory and creating the
>> histogram from the in-memory array.  Please see my test notebook at:
>> http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html
>>
>> For such a small table, loading into memory is not an issue.  For
>larger
>> tables, though, it is a problem, and I had hoped that pytables was
>> optimized so that histogramming directly from disk would proceed no
>> slower than loading into memory and histogramming. Is there some
>other
>> way of accessing the column (or Array or CArray) data that will make
>> faster histograms?
>>
>
>Hi Jon,
>
>This is not surprising since the column object itself is going to be
>iterated
>over per row.  As you found, reading in each row individually will be
>prohibitively expensive as compared to reading in all the data at one.
>
>To do this in the right way for data that is larger than system memory,
>you
>need to read it in in chunks.  Luckily there are tools to help you
>automate
>this process already in PyTables.  I would recommend that you use
>expressions [1] or queries [2] to do your historgramming more
>efficiently.
>
>Be Well
>Anthony
>
>1. http://pytables.github.com/usersguide/libref/expr_class.html
>2.
>http://pytables.github.com/usersguide/libref/structured_storage.html?#table-methods-querying
>
>
>
>> Regards,
>> Jon
>>
>>
>>
>>
>
>
>
>

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

Re: [Pytables-users] Histogramming 1000x too slow

2012-11-16 Thread Anthony Scopatz
On Fri, Nov 16, 2012 at 9:02 AM, Jon Wilson  wrote:

> Hi all,
> I am trying to find the best way to make histograms from large data
> sets.  Up to now, I've been just loading entire columns into in-memory
> numpy arrays and making histograms from those.  However, I'm currently
> working on a handful of datasets where this is prohibitively memory
> intensive (causing an out-of-memory kernel panic on a shared machine
> that you have to open a ticket to have rebooted makes you a little
> gun-shy), so I am now exploring other options.
>
> I know that the Column object is rather nicely set up to act, in some
> circumstances, like a numpy ndarray.  So my first thought is to try just
> creating the histogram out of the Column object directly. This is,
> however, 1000x slower than loading it into memory and creating the
> histogram from the in-memory array.  Please see my test notebook at:
> http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html
>
> For such a small table, loading into memory is not an issue.  For larger
> tables, though, it is a problem, and I had hoped that pytables was
> optimized so that histogramming directly from disk would proceed no
> slower than loading into memory and histogramming. Is there some other
> way of accessing the column (or Array or CArray) data that will make
> faster histograms?
>

Hi Jon,

This is not surprising, since the column object itself is going to be
iterated over per row.  As you found, reading each row individually will be
prohibitively expensive compared to reading in all the data at once.

To do this the right way for data that is larger than system memory, you
need to read it in chunks.  Luckily, there are already tools in PyTables to
help you automate this process.  I would recommend that you use
expressions [1] or queries [2] to do your histogramming more efficiently.

Be Well
Anthony

1. http://pytables.github.com/usersguide/libref/expr_class.html
2.
http://pytables.github.com/usersguide/libref/structured_storage.html?#table-methods-querying
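
For instance, the "one query per bin" flavour of [2] might look roughly
like this (file and column names are made up; each in-kernel query streams
over the table without pulling the whole column into memory):

import numpy as np
import tables as tb

edges = np.linspace(-1.0, 1.0, 11)
with tb.open_file("data.h5", "r") as f:
    t = f.root.mytable
    counts = [sum(1 for _ in t.where("(x >= lo) & (x < hi)",
                                     condvars={"lo": lo, "hi": hi}))
              for lo, hi in zip(edges[:-1], edges[1:])]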



> Regards,
> Jon
>
>
>