Hi Elias,
On Tuesday 04 September 2007, [EMAIL PROTECTED] wrote:
> Hello again,
>
> This topic is of great interest to me as I have been attempting to
> tune the chunkshape parameter manually.
>
> After our last exchange, I took your suggestions and made all my
> index searches in-memory to get max speed. What I found was initially
> very surprising, but on reflection started to make sense: I actually
> had a greater bottleneck due to how I organized my data vs. how it
> was being used. To wit, I had a multidimensional array with a shape
> like this:
>
> {1020, 4, 15678, 3}
>
> but I was reading it -- with PyTables -- like so:
> >>> data = earrayObject[:,:,offset,:]
>
> With small arrays like {20, 4, 15678, 3} it is not so noticeable, but
> with the combination of large arrays and the default chunkshape, a
> lot of time was being spent slicing the array.
Mmmm, what do you mean by your 'default' chunkshape? Your application
chunkshape or a PyTables automatic chunkshape? You don't say which
'default' chunkshape you are using, but, in your example above and for
your kind of access pattern, a pretty optimal chunkshape would be
{20, 4, 1, 3}, because you only need to read one element of the third
dimension on each access, avoiding further unnecessary
reads/decompressions. However, having a chunksize in the third
dimension moderately larger than 1 could represent a good I/O balance.
See below.
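To make that concrete, here is a minimal sketch (using the PyTables 2.0
camelCase API) of how such a chunkshape could be set explicitly at
creation time. The file and node names, and which dimension is made
enlargeable, are just assumptions for illustration:

import numpy
import tables

# Minimal sketch: an array with the original {1020, 4, 15678, 3} layout,
# created with an explicit chunkshape of (20, 4, 1, 3).  The first
# dimension is assumed to be the enlargeable one; names are illustrative.
fileh = tables.openFile('chunked.h5', mode='w')
earr = fileh.createEArray('/', 'data', tables.Float64Atom(),
                          shape=(0, 4, 15678, 3),     # 0 -> enlargeable dim
                          expectedrows=1020,
                          chunkshape=(20, 4, 1, 3))
earr.append(numpy.zeros((20, 4, 15678, 3)))           # append 20 rows
# Reading one element of the third dimension now only touches the chunks
# that actually contain that element:
data = earr[:, :, 1234, :]
fileh.close()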
> With the switch to PyTables (from h5import) I was able to easily
> reorganize the data to be more efficient for how I was reading it,
> i.e.,
>
> >>> earrayObject.shape
>
> (15678L, 4L, 1020L, 3L)
>
> >>> data = earrayObject[offset,:,:,:]
In PyTables 2.0 you could also set the third dimension as the main one,
and the chunkshape would be computed optimally (I mean, for sparse
access along the main dim and reasonably fast appends).
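For instance, a sketch of what that could look like (again with the
2.0-era API; the names and the number of appended rows are made up):

import numpy
import tables

# Sketch: declare the third dimension as the main (enlargeable) one by
# setting it to 0 in the shape; PyTables then picks a chunkshape suited
# to sparse access along that dimension.  Names are illustrative only.
fileh = tables.openFile('maindim.h5', mode='w')
earr = fileh.createEArray('/', 'data', tables.Float64Atom(),
                          shape=(1020, 4, 0, 3),   # third dim is the main dim
                          expectedrows=15678)
earr.append(numpy.zeros((1020, 4, 100, 3)))        # grow along the main dim
print(earr.chunkshape)                             # the automatic choice
data = earr[:, :, 42, :]                           # one main-dim element
fileh.close()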
> It seems to me then, that chunkshape could be selected to also give
> optimal, or near-optimal performance. My problem now is that as I
> make the chunks smaller, I get better read performance (which is the
> goal), but write performance (not done very often) has slowed way
> down. I suppose this makes sense, as smaller chunks imply more
> I/O trips to the disk when writing the entire array.
That's correct.
> So are there any guidelines to balance reading vs writing performance
> with chunkshape? Right now I'm just trying 'sensible' chunkshapes and
> seeing what the result is. Currently, I'm leaning toward something
> like (32, 4, 256, 3). The truth is, only one row is ever read at a
> time, but the write time for (1, 4, 512, 3) is just too long. Is
> there an obvious flaw in my approach that I cannot see?
Not so obvious, because an optimal chunkshape depends largely on your
access pattern and whether you want to optimize reads, writes or get a
fair balance between them. So, your mileage may vary.
As a tip, it is always good to write a small benchmark and find the best
parameters for your case (I know that this takes time, and if you were
to write this in plain C, perhaps you would think twice about doing
this, but hey, you are using Python! ;). As an example, I've made such
a benchmark that times read/write operations on a scenario similar to
yours (see attached script).
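The attached prova.py is not reproduced here, but a rough sketch of this
kind of benchmark could look like the following. The shapes, the number
of rows and reads, and the filter settings are all guesses for
illustration, not the actual parameters of the attached script:

import time
import numpy
import tables

# Rough sketch of a chunkshape benchmark (not the attached prova.py):
# write NROWS rows sequentially along the main dimension, then do NREADS
# random single-row reads.  All sizes below are assumptions.
NROWS, NREADS = 2000, 1000

def bench(chunklen, filters):
    fileh = tables.openFile('bench-%d.h5' % chunklen, mode='w')
    earr = fileh.createEArray('/', 'data', tables.Float64Atom(),
                              shape=(0, 4, 1020, 3), filters=filters,
                              expectedrows=NROWS,
                              chunkshape=(chunklen, 4, 1020, 3))
    row = numpy.random.rand(1, 4, 1020, 3)
    t0 = time.time()
    for i in range(NROWS):
        earr.append(row)                      # sequential write
    fileh.flush()
    twrite = time.time() - t0
    offsets = numpy.random.randint(0, NROWS, NREADS)
    t0 = time.time()
    for off in offsets:
        data = earr[int(off), :, :, :]        # random sparse read (main dim)
    tread = time.time() - t0
    fileh.close()
    print('e%d. write: %.3f s, %d reads: %.3f s'
          % (chunklen, twrite, NREADS, tread))

filters = tables.Filters(complevel=1, complib='zlib', shuffle=True)
for chunklen in (1, 5, 10):
    bench(chunklen, filters)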
This benchmark selects a chunksize of 1 (labeled as 'e1'), 5 ('e5') and
10 ('e10') for the main dimension and measures the times for doing a
sequential write and random sparse reads (along the main dimension).
Here are the results when using the zlib (and shuffle) compressor:
************** Writes ************
e1. Time took for writing: 7.567
e5. Time took for writing: 2.361
e10. Time took for writing: 1.873
************** Reads *************
e1. Time took for 1000 reads: 0.588
e5. Time took for 1000 reads: 0.669
e10. Time took for 1000 reads: 0.755
So, using a chunksize of 1 in the main dim is optimal for random reads
(as expected), but it takes a lot of time for writes. A size of 10
offers the best writing times but poor read times. In this case, 5 seems
to represent a reasonably good write/read balance.
If you want better speed but still want to keep using compression, the
LZO compressor performs very well in this scenario. Here are the times
for LZO (and shuffle):
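Switching compressors is just a matter of the Filters argument at
creation time. A small sketch (whether 'lzo' is available depends on
how your PyTables/HDF5 were built; complevel=1 and the shapes are
assumptions):

import tables

# Sketch: the three filter settings discussed here.
zlib_shuffle = tables.Filters(complevel=1, complib='zlib', shuffle=True)
lzo_shuffle  = tables.Filters(complevel=1, complib='lzo',  shuffle=True)
no_compress  = tables.Filters(complevel=0)    # compression disabled

fileh = tables.openFile('lzo.h5', mode='w')
earr = fileh.createEArray('/', 'data', tables.Float64Atom(),
                          shape=(0, 4, 1020, 3), filters=lzo_shuffle,
                          chunkshape=(5, 4, 1020, 3))
fileh.close()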
************** Writes ************
e1. Time took for writing: 4.847
e5. Time took for writing: 1.602
e10. Time took for writing: 1.281
************** Reads *************
e1. Time took for 1000 reads: 0.532
e5. Time took for 1000 reads: 0.568
e10. Time took for 1000 reads: 0.611
which represents up to a 50% speed-up for writes and up to 18% faster
sparse reads.
Finally, removing compression completely might seem the best bet for
optimizing reads, but this can get tricky (and it actually does).
The times when disabling compression are:
************** Writes ************
e1. Time took for writing: 4.337
e5. Time took for writing: 1.428
e10. Time took for writing: 1.076
************** Reads *************
e1. Time took for 1000 reads: 0.751
e5. Time took for 1000 reads: 2.979
e10. Time took for 1000 reads: 0.605
i.e., for writes there is a clear win, but reads generally perform slower
(especially for chunksize 5, which is extremely slow, although I don't
know exactly why).
> Also, should I avoid ptrepack, or is there a switch that will
> preserve my carefully chosen chunkshapes? I have the same situation
> as Gabriel in that I don't know what the final number of rows my
> EArray will have (it's now the third dimension that is the
> extensible axis) and I just take the default, expectedrows=1000.
Well, if you want to preserve your carefully tuned chunkshape, then you
shouldn't use ptrepack, as it is meant to re-calculate the chunkshape in
order to adapt to general use, which may not coincide with your
specific needs (as is generally the case when you want extremely
fine-tuned chunkshape parameters).
Mmm, I'm thinking that perhaps adding a 'chunkshape' argument to
Leaf.copy() would be a good thing for those users who want to
explicitly set their own chunkshape on the destination leaf. I'll add
a ticket so that we don't forget about this.
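Just to illustrate the idea (this argument does not exist yet; the
signature below is purely hypothetical):

# Purely hypothetical: what the proposed 'chunkshape' argument to
# Leaf.copy() might look like once (and if) it gets implemented.
newarr = earr.copy('/', 'data_rechunked', chunkshape=(1, 4, 1020, 3))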
Hope that helps,
--
>0,0< Francesc Altet http://www.carabos.com/
V V Cárabos Coop. V. Enjoy Data
"-"
Attachment: prova.py
