Unfortunately I cannot do that since it is company data. I wrote a simple
script that queries a file twice:
import sys
import time
import tables

f = tables.openFile(sys.argv[1])

# Time the query once.
start = time.time()
data = f.root.table.readWhere('field1==2912')
print 'time: %.1f' % (time.time() - start)
print 'nr items: %i' % len(data)

# Run the same query again; the leading space makes the condition
# string different, presumably bypassing any caching keyed on that
# string while the operating system's disk cache stays warm.
start = time.time()
data = f.root.table.readWhere(' field1==2912')
print 'time: %.1f' % (time.time() - start)
print 'nr items: %i' % len(data)

f.close()
Now I created the same file with lzo, blosc, and zlib compression, each with
two chunkshapes ("large" meaning chunkshape = (3971,), "small" meaning
chunkshape = (248,)).
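For reference, here is a minimal sketch of how such files can be produced
(the actual creation code was not posted, so the source file name and the
copy() keyword arguments are assumptions based on the PyTables 2.x API):

import tables

src = tables.openFile(r'c:\Devel\data\bigdata.hdf5')  # source name assumed
for complib in ('lzo', 'blosc', 'zlib'):
    for label, chunkshape in (('small', (248,)), ('large', (3971,))):
        out = r'c:\Devel\data\bigdata_%s_%s.hdf5' % (complib, label)
        dst = tables.openFile(out, 'w')
        # Copy the table, applying the codec and forcing the chunkshape.
        filters = tables.Filters(complevel=1, complib=complib, shuffle=True)
        src.root.table.copy(dst.root, 'table',
                            filters=filters, chunkshape=chunkshape)
        dst.close()
src.close()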
I ran the script for each file twice (to detect any operating system file
buffering). Results:
C:\Devel\work>test_query c:\Devel\data\bigdata_lzo_small.hdf5
time: 31.8
nr items: 20678
time: 5.9
nr items: 20678
C:\Devel\work>test_query c:\Devel\data\bigdata_lzo_small.hdf5
time: 5.8
nr items: 20678
time: 5.9
nr items: 20678
C:\Devel\work>test_query c:\Devel\data\bigdata_lzo_large.hdf5
time: 25.2
nr items: 20678
time: 16.2
nr items: 20678
C:\Devel\work>test_query c:\Devel\data\bigdata_lzo_large.hdf5
time: 16.0
nr items: 20678
time: 16.5
nr items: 20678
C:\Devel\work>test_query c:\Devel\data\bigdata_blosc_small.hdf5
time: 46.2
nr items: 20678
time: 4.2
nr items: 20678
C:\Devel\work>test_query c:\Devel\data\bigdata_blosc_small.hdf5
time: 4.4
nr items: 20678
time: 4.3
nr items: 20678
C:\Devel\work>test_query c:\Devel\data\bigdata_blosc_large.hdf5
time: 47.9
nr items: 20678
time: 5.3
nr items: 20678
C:\Devel\work>test_query c:\Devel\data\bigdata_blosc_large.hdf5
time: 5.0
nr items: 20678
time: 5.7
nr items: 20678
C:\Devel\work>test_query c:\Devel\data\bigdata_zlib_small.hdf5
time: 11.7
nr items: 20678
time: 10.3
nr items: 20678
C:\Devel\work>test_query c:\Devel\data\bigdata_zlib_small.hdf5
time: 10.3
nr items: 20678
time: 9.9
nr items: 20678
C:\Devel\work>test_query c:\Devel\data\bigdata_zlib_large.hdf5
time: 24.5
nr items: 20678
time: 24.4
nr items: 20678
C:\Devel\work>test_query c:\Devel\data\bigdata_zlib_large.hdf5
time: 19.7
nr items: 20678
time: 24.7
nr items: 20678
So a small chunkshape is generally better, and blosc is the slowest on the
very first query but the fastest after that. Could this be an operating
system caching effect? The blosc files are also much larger, perhaps that
plays a role: the lzo and zlib files are 180-240 MB, while the blosc files
are 1.1 GB.
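Those sizes can be double-checked from Python (a trivial sketch, using the
file paths from the runs above):

import os

for complib in ('lzo', 'blosc', 'zlib'):
    for label in ('small', 'large'):
        path = r'c:\Devel\data\bigdata_%s_%s.hdf5' % (complib, label)
        # Report the on-disk size in MB.
        print '%s: %.0f MB' % (path, os.path.getsize(path) / 1e6)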
Koert
-----Original Message-----
From: Francesc Alted [mailto:[email protected]]
Sent: September 27 2010 13:05
To: [email protected]
Subject: Re: [Pytables-users] trying blosc instead of zlib for compression,
but very slow
On Monday 27 September 2010 17:50:55, Koert Kuipers wrote:
> Hi all,
>
> I have a table that looks like this:
>
> /table (Table(17801755,), shuffle, zlib(1)) ''
> description := {
> "field1": UInt32Col(shape=(), dflt=0, pos=0),
> "field2": UInt32Col(shape=(), dflt=0, pos=1),
> "field3": Float64Col(shape=(), dflt=0.0, pos=2),
> "field4": Float32Col(shape=(), dflt=0.0, pos=3),
> "field5": Float32Col(shape=(), dflt=0.0, pos=4),
> "field6": Float32Col(shape=(), dflt=0.0, pos=5),
> "field7": Float32Col(shape=(), dflt=0.0, pos=6),
> "field8": Float32Col(shape=(), dflt=0.0, pos=7),
> "field9": Float32Col(shape=(), dflt=0.0, pos=8),
> "field10": Float32Col(shape=(), dflt=0.0, pos=9),
> "field11": UInt16Col(shape=(), dflt=0, pos=10),
> "field12": UInt16Col(shape=(), dflt=0, pos=11),
> "field13": UInt16Col(shape=(), dflt=0, pos=12),
> "field14": UInt16Col(shape=(), dflt=0, pos=13),
> "field15": Float64Col(shape=(), dflt=0.0, pos=14),
> "field16": Float32Col(shape=(), dflt=0.0, pos=15),
> "field17": UInt16Col(shape=(), dflt=0, pos=16)}
> byteorder := 'little'
> chunkshape := (248,)
>
> when I run a query on it this is the result:
> >>> start = time.time(); data = f.root.table.readWhere('field1==2912')
> >>> print time.time() - start
> 11.0780000687
> >>> len(data)
> 20678
>
> I wanted to speed up this sort of querying, so I created a new table
> with blosc compression and copied the data over (a sketch of this step
> follows the quoted message). My old table had expectedrows = 1000000,
> but since there turned out to be a lot more data, I also raised
> expectedrows to 10000000.
>
> /table1 (Table(17801755,), shuffle, blosc(1)) ''
> description := {
> "field1": UInt32Col(shape=(), dflt=0, pos=0),
> "field2": UInt32Col(shape=(), dflt=0, pos=1),
> "field3": Float64Col(shape=(), dflt=0.0, pos=2),
> "field4": Float32Col(shape=(), dflt=0.0, pos=3),
> "field5": Float32Col(shape=(), dflt=0.0, pos=4),
> "field6": Float32Col(shape=(), dflt=0.0, pos=5),
> "field7": Float32Col(shape=(), dflt=0.0, pos=6),
> "field8": Float32Col(shape=(), dflt=0.0, pos=7),
> "field9": Float32Col(shape=(), dflt=0.0, pos=8),
> "field10": Float32Col(shape=(), dflt=0.0, pos=9),
> "field11": UInt16Col(shape=(), dflt=0, pos=10),
> "field12": UInt16Col(shape=(), dflt=0, pos=11),
> "field13": UInt16Col(shape=(), dflt=0, pos=12),
> "field14": UInt16Col(shape=(), dflt=0, pos=13),
> "field15": Float64Col(shape=(), dflt=0.0, pos=14),
> "field16": Float32Col(shape=(), dflt=0.0, pos=15),
> "field17": UInt16Col(shape=(), dflt=0, pos=16)}
> byteorder := 'little'
> chunkshape := (3971,)
>
> >>> start = time.time(); data = f.root.table1.readWhere('field1==2912')
> >>> print time.time() - start
> 115.51699996
> >>> len(data)
> 20678
>
> Not exactly what I expected! I am obviously doing something wrong.
> Any suggestions? Thanks, Koert
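A minimal sketch of the copy step described in the quoted message, assuming
the PyTables 2.x API (the actual code was not posted; the file name, the
chunked append, and the reuse of the source table's description are
assumptions):

import tables

f = tables.openFile('bigdata.hdf5', 'a')  # file name assumed
src = f.root.table
filters = tables.Filters(complevel=1, complib='blosc', shuffle=True)
# Create the new table with the larger expectedrows; PyTables derives
# the default chunkshape from expectedrows, which is where the larger
# (3971,) chunkshape comes from.
dst = f.createTable(f.root, 'table1', src.description,
                    filters=filters, expectedrows=10000000)
# Append the rows in chunks rather than one by one.
step = 100000
for start in xrange(0, src.nrows, step):
    dst.append(src.read(start, min(start + step, src.nrows)))
dst.flush()
f.close()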
Certainly surprising. Can you put your datafile in a public place so that I
can experiment with it?
Thanks,
--
Francesc Alted