
I am writing a lot of data (close to 122GB) to an HDF5 file using PyTables.
The whole run takes close to 10 hours, which includes querying the database
and then writing the query result to the file. When I timed the entire
execution, I found that getting the data from the database takes about as
long as writing it to the HDF5 file. Here is a small snippet (P.S.: the
execution times noted below are not for the 122GB data set, but for a small
subset of close to 10GB):

import tables as tb
from datetime import datetime

class ContactClass(tb.IsDescription):
    name = tb.StringCol(4200)
    address = tb.StringCol(4200)
    emailAddr = tb.StringCol(180)
    phone = tb.StringCol(256)

h5File = tb.openFile(<file name>, mode="a", title="Contacts")
t = h5File.createTable(h5File.root, 'ContactClass', ContactClass,
                       filters=tb.Filters(5, 'blosc'), expectedrows=77806938)

resultSet = get data from database
currRow = t.row
print("Before appending data: %s" % str(datetime.now()))
for attribute in resultSet:
    # Fill the row buffer from one result tuple, then queue it for writing.
    currRow['name'] = attribute[0]
    currRow['address'] = attribute[1]
    currRow['emailAddr'] = attribute[2]
    currRow['phone'] = attribute[3]
    currRow.append()
print("After done appending: %s" % str(datetime.now()))
# Push any rows still sitting in the I/O buffer to disk.
t.flush()
print("After done flushing: %s" % str(datetime.now()))

.. gives me:
Before appending data: 2013-04-11 10:42:39.903713
After done appending: 2013-04-11 11:04:10.002712
After done flushing: 2013-04-11 11:05:50.059893
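
For reference, the durations implied by these three timestamps can be worked
out with a few lines of standard-library Python. This is only arithmetic on
the values printed above; the variable names are illustrative.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied verbatim from the output above.
before_append = datetime.strptime("2013-04-11 10:42:39.903713", FMT)
after_append = datetime.strptime("2013-04-11 11:04:10.002712", FMT)
after_flush = datetime.strptime("2013-04-11 11:05:50.059893", FMT)

# Time spent in the row-by-row append loop, and in the final flush.
append_seconds = (after_append - before_append).total_seconds()
flush_seconds = (after_flush - after_append).total_seconds()

print("append loop: %.1f s (%.1f min)" % (append_seconds, append_seconds / 60))
print("flush:       %.1f s (%.1f min)" % (flush_seconds, flush_seconds / 60))
```

That is roughly 21.5 minutes in the append loop and about 1.7 minutes in the
flush for the 10GB subset.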
It seems like append() takes a lot of time. Any suggestions on how to
improve this?
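
One variant that may be worth timing against the row-by-row loop is to
collect rows into a NumPy structured array and hand each batch to
Table.append() in a single call. The sketch below is untested against the
real data and makes several assumptions: it uses the PyTables 3.x names
(open_file, create_table) rather than openFile/createTable, and
fake_result_set, write_in_batches, append_batch and batch_size are
hypothetical stand-ins, not part of PyTables or of the original script.
Whether it is actually faster here would have to be measured.

```python
import os
import tempfile

import numpy as np
import tables as tb


class ContactClass(tb.IsDescription):
    name = tb.StringCol(4200)
    address = tb.StringCol(4200)
    emailAddr = tb.StringCol(180)
    phone = tb.StringCol(256)


# Order of the values inside each result tuple (as in the original loop).
FIELDS = ("name", "address", "emailAddr", "phone")


def fake_result_set(n):
    """Hypothetical stand-in for the database cursor."""
    for i in range(n):
        yield tuple(("%s%d" % (f, i)).encode("ascii") for f in FIELDS)


def append_batch(table, batch):
    """Copy a list of tuples into a structured array and append it at once."""
    buf = np.zeros(len(batch), dtype=table.dtype)
    # Assign by field name so the table's own column order does not matter.
    for pos, field in enumerate(FIELDS):
        buf[field] = [row[pos] for row in batch]
    table.append(buf)


def write_in_batches(table, rows, batch_size=10000):
    """Append rows in chunks of batch_size, then flush once at the end."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            append_batch(table, batch)
            batch = []
    if batch:
        append_batch(table, batch)
    table.flush()


# Small self-contained demo on a temporary file.
path = os.path.join(tempfile.mkdtemp(), "contacts_sketch.h5")
with tb.open_file(path, mode="w", title="Contacts") as h5file:
    t = h5file.create_table(h5file.root, "ContactClass", ContactClass,
                            filters=tb.Filters(complevel=5, complib="blosc"),
                            expectedrows=1000)
    write_in_batches(t, fake_result_set(1000), batch_size=256)
    n_rows = t.nrows
    first_name = t[0]["name"]
print(n_rows, first_name)
```

With rows of about 8.8KB each (4200 + 4200 + 180 + 256 bytes), batch_size
also determines how much memory each batch buffer needs, so it would have to
be chosen with that in mind.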

Pytables-users mailing list
