On Saturday 12 December 2009 18:45:33 Ernesto wrote:
> Dear Francesc,
>
> thank you for your reply. I'll try to better explain my problem using
> real examples of data and code.
[clip]
I've been doing some benchmarks based on your requirements, and my conclusion
is that the implementation of variable-length types in HDF5 is not very
efficient, especially with the HDF5 1.8.x series (see [1]). So you should avoid
using VLArrays for saving small arrays: they fit better in table fields.
Given this, a possible solution is to distinguish between small and large
strings (in your case). Small strings can be saved in a Table field, while
larger ones are output to a VLArray. You will then have to add another
field to the table specifying where the data is (for example, -1 could mean "in
this table" and any non-negative value "the index in the VLArray"). You may
want to experiment to find the optimal threshold that separates
'small' strings from 'large' ones, but anything between 128 and 1024 should work
fine.
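To make the convention concrete, here is a minimal sketch of the routing
decision described above (plain Python, independent of PyTables; the helper
name route_string is hypothetical, not part of any library):

```python
# Sketch of the small/large routing convention described above.
# route_string() is a hypothetical helper, not part of PyTables.
BREAK_POINT = 256  # threshold separating 'small' from 'large' strings

def route_string(s, vlarray_nrows):
    """Decide where a string goes.

    Returns (where, table_value):
      where == -1  -> string stored in the table row itself
      where >= 0   -> index of the row appended to the VLArray
    """
    if len(s) < BREAK_POINT:
        return -1, s          # small: keep it in the table field
    return vlarray_nrows, ""  # large: table field left empty, data in VLArray

# A short string stays in the table...
print(route_string("ACGT" * 2, 0))   # -> (-1, 'ACGTACGT')
# ...while a long one is routed to the VLArray (here, to row 5).
print(route_string("A" * 1000, 5))   # -> (5, '')
```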
I'm adding the script that I've been using for my own benchmarking. Notice
that if your optimal break-point (threshold) is too large (say, > 10000
bytes), then this partition is not going to work well, but chances are that
your scenario will fit here easily. If not, one can think of a finer
partition, but let's start with this one.
[1] http://mail.hdfgroup.org/pipermail/hdf-forum_hdfgroup.org/2009-December/002298.html
Cheers,
--
Francesc Alted
import tables as t
import numpy as np

LEN_INPUT = int(1e6)
#BREAK_POINT = 1024
BREAK_POINT = 256

map_ = {0: 'A', 1: 'C', 2: 'G', 3: 'T'}

def create_input(len):
    return "".join([map_[i] for i in np.random.random_integers(0, 3, size=len)])

def get_short_string(len):
    # Exponentially distributed lengths, so most strings are short
    a = np.random.standard_exponential(len)
    b = np.array(a*100, dtype='i4')
    for l in b:
        yield "".join([map_[i] for i in np.random.random_integers(0, 3, size=l)])
def create_file(fname, verbose):
    class NucSeq(t.IsDescription):
        id = t.Int32Col(pos=1)                       # integer
        where = t.Int32Col(pos=2)
        gnuc = t.StringCol(1, pos=3)                 # 1-character string
        sstring = t.StringCol(BREAK_POINT-1, pos=4)  # short strings

    fileh = t.openFile(fname, mode="w")
    root = fileh.root
    group = fileh.createGroup(root, "newgroup")
    tableNuc = fileh.createTable(group, 'tableNuc', NucSeq, "tableNuc",
                                 t.Filters(1, complib='lzo'),
                                 expectedrows=LEN_INPUT)
    nucseq = tableNuc.row
    vlarray = fileh.createVLArray(root, 'vlarray', t.VLStringAtom(),
                                  "vlarray test")
    gen_sstring = get_short_string(LEN_INPUT)
    for x, j in enumerate(create_input(LEN_INPUT)):
        sstring = gen_sstring.next()
        nucseq['id'] = x
        nucseq['gnuc'] = j
        if len(sstring) < BREAK_POINT:
            nucseq['where'] = -1              # saved locally in this table
            nucseq['sstring'] = sstring
        else:
            if verbose:
                print "saving to vlarray!", len(sstring)
            nucseq['where'] = vlarray.nrows   # row in external VLArray
            vlarray.append(sstring)
        nucseq.append()
    fileh.close()
if __name__ == "__main__":
    import sys, os
    import getopt

    usage = """usage: %s [-f] [-v] filename\n""" % sys.argv[0]
    try:
        opts, pargs = getopt.getopt(sys.argv[1:], 'fv')
    except:
        sys.stderr.write(usage)
        sys.exit(1)
    doprofile = False
    verbose = False
    for option in opts:
        if option[0] == '-f':
            doprofile = True
        elif option[0] == '-v':
            verbose = True
    fname = pargs[0]
    if doprofile:
        import pstats
        import cProfile as prof
        prof.run('create_file(fname, verbose)', 'gataca.prof')
        stats = pstats.Stats('gataca.prof')
        stats.strip_dirs()
        stats.sort_stats('time', 'calls')
        if verbose:
            stats.print_stats()
        else:
            stats.print_stats(20)
    else:
        create_file(fname, verbose)
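For completeness, reading the data back follows the same convention. Here is a
sketch of the lookup logic, using plain Python lists to stand in for the table
and the VLArray (with PyTables you would iterate the real tableNuc and index
the real VLArray in exactly the same way; fetch_sstring is a hypothetical
helper):

```python
# Sketch of the read-back logic: 'rows' stands in for tableNuc and
# 'vlarray' for the external VLArray.
def fetch_sstring(row, vlarray):
    if row['where'] == -1:
        return row['sstring']        # stored inline in the table
    return vlarray[row['where']]     # stored externally in the VLArray

rows = [
    {'where': -1, 'sstring': 'ACGT'},   # small string, kept in the table
    {'where': 0,  'sstring': ''},       # large string, row 0 of the VLArray
]
vlarray = ['A' * 1000]

for row in rows:
    s = fetch_sstring(row, vlarray)
    print(len(s))   # -> 4, then 1000
```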
_______________________________________________
Pytables-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pytables-users