Hi Anthony & All,

On 30 October 2012 23:31, Anthony Scopatz wrote:
> On Tue, Oct 30, 2012 at 6:20 PM, Andrea Gavana <[email protected]> wrote:
>>
>> Hi Anthony,
>>
>> On 30 October 2012 22:52, Anthony Scopatz wrote:
>> > Hi Andrea,
>> >
>> > Your problem is two fold.
>> >
>> > 1. Your timing wasn't reporting the time per data set, but rather the
>> > total time since writing all data sets. You need to put the start time
>> > in the loop to get the time per data set.
>> >
>> > 2. Your larger problem was that you were writing too many times.
>> > Generally it is faster to write fewer, bigger sets of data than
>> > performing a lot of small write operations. Since you had data set
>> > opening and writing in a doubly nested loop, it is not surprising that
>> > you were getting terrible performance. You were basically maximizing
>> > HDF5 overhead ;). Using slicing I removed the outermost loop and saw
>> > timings like the following:
>> >
>> > H5 file creation time: 7.406
>> >
>> > Saving results for table: 0.0105440616608
>> > Saving results for table: 0.0158948898315
>> > Saving results for table: 0.0164661407471
>> > Saving results for table: 0.00654292106628
>> > Saving results for table: 0.00676298141479
>> > Saving results for table: 0.00664114952087
>> > Saving results for table: 0.0066990852356
>> > Saving results for table: 0.00687289237976
>> > Saving results for table: 0.00664210319519
>> > Saving results for table: 0.0157809257507
>> > Saving results for table: 0.0141618251801
>> > Saving results for table: 0.00796294212341
>> >
>> > Please see the attached version, at around line 82. Additionally, if
>> > you need to focus on performance I would recommend reading the
>> > following (http://pytables.github.com/usersguide/optimization.html).
>> > PyTables can be blazingly fast when implemented correctly. I would
>> > highly recommend looking into compression.
>> >
>> > I hope this helps!
>>
>> Thank you for your answer; indeed, I was timing it wrongly (I really
>> need to go to sleep...). However, although I understand the need of
>> "writing fewer", I am not sure I can actually do it in my situation.
>> Let me explain:
>>
>> 1. I have a GUI which starts a number of parallel processes (up to 16,
>> depending on a user selection);
>> 2. These processes actually do the computation/simulations - so, if I
>> have 1,000 simulations to run and 8 parallel processes, each process
>> gets 125 simulations (each of which holds 1,200 "objects" with a 600x7
>> timeseries matrix per object).
>
> Well, you can at least change the order of the loops and see if that
> helps. That is, rather than doing:
>
> for i in xrange():
>     for p in table:
>
> Do the following instead:
>
> for p in table:
>     for i in xrange():
>
> I don't believe that this will help too much since you are still writing
> every element individually.
>
>> If I had to write out the results only at the end, it would mean for
>> me to find a way to share the 1,200 "objects" matrices in all the
>> parallel processes (and I am not sure if pytables is going to complain
>> when multiple concurrent processes try to access the same underlying
>> HDF5 file).
>
> Reading in parallel works pretty well. Writing causes more headaches
> but can be done.
>
>> Or I could create one HDF file per process, but given the nature of
>> the simulation I am running, every "object" in the 1,200 "objects"
>> pool would need to keep a reference to a 125x600x7 matrix (assuming
>> 1,000 simulations and 8 processes) around in memory *OR* I will need
>> to write the results to the HDF5 file for every simulation. Although
>> we have extremely powerful PCs at work, I am not sure it is the right
>> way to go...
>>
>> As always, I am open to all suggestions on how to improve my approach.
>
> My basic suggestion is to have all of your processes produce results
> which are then aggregated by a single master process.
> This master is the only one which has write access to the hdf5 file and
> will allow you to create larger arrays and minimize the number of writes
> that you do.
>
> You'll probably want to take a look at this example:
> https://github.com/PyTables/PyTables/blob/develop/examples/multiprocess_access_queues.py
>
> I think that there might be a page in the docs about it now too...
>
> But I think that this is the strategy that you want to pursue. Multiple
> compute processes, one write process.
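The "multiple compute processes, one write process" strategy quoted above can be sketched with only the standard library and numpy. This is a minimal illustration, not the PyTables example itself: the names worker/writer/run are mine, the 600x7 block shape comes from this thread, and the actual HDF5 write is only indicated in a comment so the sketch stays file-free.

```python
import multiprocessing as mp

import numpy as np

def worker(proc_id, n_sims, queue):
    # Compute process: runs its share of the simulations and ships the
    # finished block to the single writer through the queue.
    block = np.random.random(size=(n_sims, 600, 7))
    queue.put((proc_id, block))

def writer(n_procs, queue):
    # The only process with write access.  With a real HDF5 file open
    # here, this is where the few, large writes would happen, e.g.
    #   table.cols.results[...] = block; table.flush()
    blocks = [None] * n_procs
    for _ in range(n_procs):
        proc_id, block = queue.get()
        blocks[proc_id] = block
    return np.concatenate(blocks, axis=0)

def run(n_procs=4, sims_per_proc=25):
    queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(i, sims_per_proc, queue))
             for i in range(n_procs)]
    for p in procs:
        p.start()
    total = writer(n_procs, queue)  # drain the queue before join()
    for p in procs:
        p.join()
    return total

if __name__ == "__main__":
    print(run(2, 5).shape)  # (10, 600, 7)
```

Draining the queue before join() matters: a child that still has a large pending item in the queue's feeder buffer will not terminate, so the gets come first.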
Thank you for all your suggestions. I managed to slightly modify the
script you attached and I am also experimenting with compression.
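For reference, enabling compression amounts to passing a tables.Filters instance at table-creation time. A minimal sketch: the Toy description, complevel=5, and complib="zlib" are arbitrary stand-ins (blosc or lzo are often faster per the optimization guide), and the calls are spelled openFile/createTable in the 2.x series I am running.

```python
import os
import tempfile

import tables

class Toy(tables.IsDescription):
    # stand-in for the real record layout
    results = tables.Float64Col(shape=(4, 7))

def create_compressed(path):
    # complevel=5 / complib="zlib" are illustrative choices only
    filters = tables.Filters(complevel=5, complib="zlib")
    h5 = tables.open_file(path, mode="w")
    table = h5.create_table("/", "sims", Toy, filters=filters)
    level = table.filters.complevel  # filters are recorded on the leaf
    h5.close()
    return level

if __name__ == "__main__":
    fd, path = tempfile.mkstemp(suffix=".h5")
    os.close(fd)
    print(create_compressed(path))  # 5
    os.remove(path)
```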
However, in the newly attached script the underlying table is not
modified, i.e., this assignment:

for p in table:
    p['results'][:NUM_SIM, :, :] = numpy.random.random(size=(NUM_SIM, len(ALL_DATES), 7))

table.flush()

seems to be doing nothing (i.e., printing out the 'results' attribute
for an object class prints a matrix full of zeros instead of random
numbers...). Also, on my PC at work, the file creation time is
tremendously slow (76 seconds for 100 simulations - a 1.9 GB file).
In order to understand what's going on, I set back the number of
simulations to 10 (NUM_SIM=10), but still I am getting only zeros out
of the table. This is what my script is printing out:
H5 file creation time: 7.652
Saving results for table: 1.03400015831
Results (should be random...)
Object name : KB0001
Object results:
[[[ 0.  0.  0. ...,  0.  0.  0.]
  [ 0.  0.  0. ...,  0.  0.  0.]
  ...,
  [ 0.  0.  0. ...,  0.  0.  0.]]
 ...,
 [[ 0.  0.  0. ...,  0.  0.  0.]
  [ 0.  0.  0. ...,  0.  0.  0.]
  ...,
  [ 0.  0.  0. ...,  0.  0.  0.]]]
(remaining output elided: every entry printed is 0.)
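[Editor's note: a likely culprit, sketched below under the assumption that the script iterates with a bare "for p in table": that loop yields a Row accessor, and slice-assigning into p['results'] only touches an in-memory copy; PyTables persists an in-iteration change only via Row.update() inside table.iterrows() (or via a modify_column()-style call). A toy-sized sketch, using the modern open_file/create_table spelling (openFile/createTable in the 2.x series), with made-up shapes:]

```python
import os
import tempfile

import numpy as np
import tables

class Sim(tables.IsDescription):
    # toy stand-in for the real 600x7-per-simulation layout
    name = tables.StringCol(16)
    results = tables.Float64Col(shape=(3, 2))

def fill_and_update(path):
    h5 = tables.open_file(path, mode="w")
    table = h5.create_table("/", "sims", Sim)
    row = table.row
    for i in range(4):
        row["name"] = ("KB%04d" % (i + 1)).encode()
        row.append()              # 'results' starts out as zeros
    table.flush()
    # Assigning to the row during iteration is not enough on its own;
    # Row.update() is what writes the modified row back to the table.
    for row in table.iterrows():
        row["results"] = np.random.random((3, 2))
        row.update()
    table.flush()
    data = table.cols.results[:]  # read back all 'results' blocks
    h5.close()
    return data

if __name__ == "__main__":
    fd, path = tempfile.mkstemp(suffix=".h5")
    os.close(fd)
    print(fill_and_update(path).any())  # True: no longer all zeros
    os.remove(path)
```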
I am on Windows Vista, Python 2.7.2 64-bit from EPD 7.1-2, pytables
version '2.3b1.devpro'.
Any suggestion is really appreciated. Thank you in advance.
Andrea.
"Imagination Is The Only Weapon In The War Against Reality."
http://www.infinity77.net
# ------------------------------------------------------------- #
def ask_mailing_list_support(email):

    if mention_platform_and_version() and include_sample_app():
        send_message(email)
    else:
        install_malware()
        erase_hard_drives()
# ------------------------------------------------------------- #
pytables_test2.py
Description: Binary data
_______________________________________________
Pytables-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pytables-users
