I've been finding PyTables useful for organizing big genomics data (e.g.
storing and querying ~200 GB of all-vs-all UniParc Smith-Waterman hits from
UniProt).
One thing has surprised me a little.
I was comparing the efficiency of querying a small table that stores an index
(integer) and an integer value against storing the values in a plain array.
I am finding the array option about 10x faster when selecting the index of
a particular integer value. I would have guessed that the in-kernel selection
would be faster than reading out the entire array and then using
numpy.where().
Is this expected, or can I do something to make the table selection faster?
In this case I am fine with the array option, so this is just for future
reference.
e.g.

Option 1:

from tables import IsDescription, UInt8Col, UInt32Col

class testTable(IsDescription):
    index = UInt8Col(pos=0)
    id = UInt32Col(pos=1)

h5_file.createTable(group, 'test1', testTable, expectedrows=5000)

def fxn1(group, id):
    """
    Retrieve matching index values from the PyTables table.
    """
    return [x['index'] for x in group.test1.where("id == %s" % id)]

#########################

Option 2:

z = numpy.array([id1, id2, ...])
h5_file.createArray(group, 'test2', z)

def fxn2(group, id):
    """
    Retrieve matching positions from the PyTables array.
    About 10x faster than selecting from the table!
    """
    return numpy.where(group.test2.listarr == id)[0]

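For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two options could be timed side by side. It is not the code I actually ran. It assumes PyTables 3.x, where the calls are spelled open_file, create_group, create_table and create_array. The file name, the made-up ids, and the helper names build_file, query_table and query_array are hypothetical. The index column is widened to UInt32Col so it can hold positions up to 5000, and the array is read explicitly with read().

```python
import timeit

import numpy
import tables


class TestTable(tables.IsDescription):
    # Assumption: index widened from UInt8Col so it can hold row positions.
    index = tables.UInt32Col(pos=0)
    id = tables.UInt32Col(pos=1)


def build_file(n_rows=5000):
    """Create an in-memory HDF5 file storing the same ids as a table and an array."""
    # H5FD_CORE with no backing store keeps everything in memory (nothing on disk).
    h5_file = tables.open_file("sketch.h5", mode="w", driver="H5FD_CORE",
                               driver_core_backing_store=0)
    group = h5_file.create_group("/", "demo")
    # Made-up ids: n_rows - 1 down to 0, so id v sits at position n_rows - 1 - v.
    ids = numpy.arange(n_rows, dtype="uint32")[::-1].copy()

    table = h5_file.create_table(group, "test1", TestTable, expectedrows=n_rows)
    row = table.row
    for position, value in enumerate(ids):
        row["index"] = position
        row["id"] = value
        row.append()
    table.flush()

    h5_file.create_array(group, "test2", ids)
    return h5_file, group


def query_table(group, wanted):
    """Option 1: in-kernel selection on the table."""
    return [r["index"] for r in group.test1.where("id == %d" % wanted)]


def query_array(group, wanted):
    """Option 2: read the whole array, then search it with numpy.where()."""
    return numpy.where(group.test2.read() == wanted)[0]


if __name__ == "__main__":
    h5_file, group = build_file()
    for name, fxn in (("table", query_table), ("array", query_array)):
        seconds = timeit.timeit(lambda: fxn(group, 42), number=200)
        print("%s: %.4f s for 200 queries" % (name, seconds))
    h5_file.close()
```

With a table this small, fixed per-query overhead may dominate, so the ratio this prints need not match the 10x I saw.
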
Thanks,
Rich