Hi Anthony & All,

On 30 October 2012 23:31, Anthony Scopatz wrote:
> On Tue, Oct 30, 2012 at 6:20 PM, Andrea Gavana <[email protected]> wrote:
>>
>> Hi Anthony,
>>
>> On 30 October 2012 22:52, Anthony Scopatz wrote:
>> > Hi Andrea,
>> >
>> > Your problem is two fold.
>> >
>> > 1. Your timing wasn't reporting the time per data set, but rather the
>> > total time since writing all data sets. You need to put the start time
>> > in the loop to get the time per data set.
>> >
>> > 2. Your larger problem was that you were writing too many times.
>> > Generally it is faster to write fewer, bigger sets of data than
>> > performing a lot of small write operations. Since you had data set
>> > opening and writing in a doubly nested loop, it is not surprising that
>> > you were getting terrible performance. You were basically maximizing
>> > HDF5 overhead ;). Using slicing I removed the outermost loop and saw
>> > timings like the following:
>> >
>> > H5 file creation time: 7.406
>> >
>> > Saving results for table: 0.0105440616608
>> > Saving results for table: 0.0158948898315
>> > Saving results for table: 0.0164661407471
>> > Saving results for table: 0.00654292106628
>> > Saving results for table: 0.00676298141479
>> > Saving results for table: 0.00664114952087
>> > Saving results for table: 0.0066990852356
>> > Saving results for table: 0.00687289237976
>> > Saving results for table: 0.00664210319519
>> > Saving results for table: 0.0157809257507
>> > Saving results for table: 0.0141618251801
>> > Saving results for table: 0.00796294212341
>> >
>> > Please see the attached version, at around line 82. Additionally, if
>> > you need to focus on performance I would recommend reading the
>> > following (http://pytables.github.com/usersguide/optimization.html).
>> > PyTables can be blazingly fast when implemented correctly. I would
>> > highly recommend looking into compression.
>> >
>> > I hope this helps!
>>
>> Thank you for your answer; indeed, I was timing it wrongly (I really
>> need to go to sleep...). However, although I understand the need of
>> "writing fewer", I am not sure I can actually do it in my situation.
>> Let me explain:
>>
>> 1. I have a GUI which starts a number of parallel processes (up to 16,
>> depending on a user selection);
>> 2. These processes actually do the computation/simulations - so, if I
>> have 1,000 simulations to run and 8 parallel processes, each process
>> gets 125 simulations (each of which holds 1,200 "objects" with a 600x7
>> timeseries matrix per object).
>
> Well, you can at least change the order of the loops and see if that
> helps. That is, rather than doing:
>
> for i in xrange():
>     for p in table:
>
> Do the following instead:
>
> for p in table:
>     for i in xrange():
>
> I don't believe that this will help too much since you are still writing
> every element individually.
>
>> If I had to write out the results only at the end, it would mean for
>> me to find a way to share the 1,200 "objects" matrices in all the
>> parallel processes (and I am not sure if pytables is going to complain
>> when multiple concurrent processes try to access the same underlying
>> HDF5 file).
>
> Reading in parallel works pretty well. Writing causes more headaches
> but can be done.
>
>> Or I could create one HDF file per process, but given the nature of
>> the simulation I am running, every "object" in the 1,200 "objects"
>> pool would need to keep a reference to a 125x600x7 matrix (assuming
>> 1,000 simulations and 8 processes) around in memory *OR* I will need
>> to write the results to the HDF5 file for every simulation. Although
>> we have extremely powerful PCs at work, I am not sure it is the right
>> way to go...
>>
>> As always, I am open to all suggestions on how to improve my approach.
>
> My basic suggestion is to have all of your processes produce results
> which are then aggregated by a single master process.
> This master is the only one which has write access to the hdf5 file and
> will allow you to create larger arrays and minimize the number of writes
> that you do.
>
> You'll probably want to take a look at this example:
> https://github.com/PyTables/PyTables/blob/develop/examples/multiprocess_access_queues.py
>
> I think that there might be a page in the docs about it now too...
>
> But I think that this is the strategy that you want to pursue. Multiple
> compute processes, one write process.
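The "multiple compute processes, one write process" strategy quoted above can be sketched with only the standard library and numpy. This is a minimal illustration, not the PyTables example itself: the names worker/writer/run are mine, the 600x7 block shape comes from this thread, and the actual HDF5 write is only indicated in a comment so the sketch stays file-free.

```python
import multiprocessing as mp

import numpy as np

def worker(proc_id, n_sims, queue):
    # Compute process: runs its share of the simulations and ships the
    # finished block to the single writer through the queue.
    block = np.random.random(size=(n_sims, 600, 7))
    queue.put((proc_id, block))

def writer(n_procs, queue):
    # The only process with write access.  With a real HDF5 file open
    # here, this is where the few, large writes would happen, e.g.
    #   table.cols.results[...] = block; table.flush()
    blocks = [None] * n_procs
    for _ in range(n_procs):
        proc_id, block = queue.get()
        blocks[proc_id] = block
    return np.concatenate(blocks, axis=0)

def run(n_procs=4, sims_per_proc=25):
    queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(i, sims_per_proc, queue))
             for i in range(n_procs)]
    for p in procs:
        p.start()
    total = writer(n_procs, queue)  # drain the queue before join()
    for p in procs:
        p.join()
    return total

if __name__ == "__main__":
    print(run(2, 5).shape)  # (10, 600, 7)
```

Draining the queue before join() matters: a child that still has a large pending item in the queue's feeder buffer will not terminate, so the gets come first.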
Thank you for all your suggestions. I managed to slightly modify the
script you attached and I am also experimenting with compression.
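For reference, enabling compression amounts to passing a tables.Filters instance at table-creation time. A minimal sketch: the Toy description, complevel=5, and complib="zlib" are arbitrary stand-ins (blosc or lzo are often faster per the optimization guide), and the calls are spelled openFile/createTable in the 2.x series I am running.

```python
import os
import tempfile

import tables

class Toy(tables.IsDescription):
    # stand-in for the real record layout
    results = tables.Float64Col(shape=(4, 7))

def create_compressed(path):
    # complevel=5 / complib="zlib" are illustrative choices only
    filters = tables.Filters(complevel=5, complib="zlib")
    h5 = tables.open_file(path, mode="w")
    table = h5.create_table("/", "sims", Toy, filters=filters)
    level = table.filters.complevel  # filters are recorded on the leaf
    h5.close()
    return level

if __name__ == "__main__":
    fd, path = tempfile.mkstemp(suffix=".h5")
    os.close(fd)
    print(create_compressed(path))  # 5
    os.remove(path)
```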
However, in the newly attached script the underlying table is not
modified, i.e., this assignment:

for p in table:
    p['results'][:NUM_SIM, :, :] = numpy.random.random(size=(NUM_SIM, len(ALL_DATES), 7))

table.flush()

seems to be doing nothing (i.e., printing out the 'results' attribute
for an object class prints a matrix full of zeros instead of random
numbers...). Also, on my PC at work, the file creation time is
tremendously slow (76 seconds for 100 simulations - a 1.9 GB file).
In order to understand what's going on, I set back the number of
simulations to 10 (NUM_SIM=10), but still I am getting only zeros out
of the table. This is what my script is printing out:
H5 file creation time: 7.652
Saving results for table: 1.03400015831
Results (should be random...)
Object name : KB0001
Object results:
[[[ 0.  0.  0. ...,  0.  0.  0.]
  [ 0.  0.  0. ...,  0.  0.  0.]
  ...,
  [ 0.  0.  0. ...,  0.  0.  0.]]
 ...,
 [[ 0.  0.  0. ...,  0.  0.  0.]
  [ 0.  0.  0. ...,  0.  0.  0.]
  ...,
  [ 0.  0.  0. ...,  0.  0.  0.]]]
(remaining output elided: every entry printed is 0.)
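[Editor's note: a likely culprit, sketched below under the assumption that the script iterates with a bare "for p in table": that loop yields a Row accessor, and slice-assigning into p['results'] only touches an in-memory copy; PyTables persists an in-iteration change only via Row.update() inside table.iterrows() (or via a modify_column()-style call). A toy-sized sketch, using the modern open_file/create_table spelling (openFile/createTable in the 2.x series), with made-up shapes:]

```python
import os
import tempfile

import numpy as np
import tables

class Sim(tables.IsDescription):
    # toy stand-in for the real 600x7-per-simulation layout
    name = tables.StringCol(16)
    results = tables.Float64Col(shape=(3, 2))

def fill_and_update(path):
    h5 = tables.open_file(path, mode="w")
    table = h5.create_table("/", "sims", Sim)
    row = table.row
    for i in range(4):
        row["name"] = ("KB%04d" % (i + 1)).encode()
        row.append()              # 'results' starts out as zeros
    table.flush()
    # Assigning to the row during iteration is not enough on its own;
    # Row.update() is what writes the modified row back to the table.
    for row in table.iterrows():
        row["results"] = np.random.random((3, 2))
        row.update()
    table.flush()
    data = table.cols.results[:]  # read back all 'results' blocks
    h5.close()
    return data

if __name__ == "__main__":
    fd, path = tempfile.mkstemp(suffix=".h5")
    os.close(fd)
    print(fill_and_update(path).any())  # True: no longer all zeros
    os.remove(path)
```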
I am on Windows Vista, Python 2.7.2 64-bit from EPD 7.1-2, pytables
version '2.3b1.devpro'.
Any suggestion is really appreciated. Thank you in advance.
Andrea.
"Imagination Is The Only Weapon In The War Against Reality."
http://www.infinity77.net
# ------------------------------------------------------------- #
def ask_mailing_list_support(email):

    if mention_platform_and_version() and include_sample_app():
        send_message(email)
    else:
        install_malware()
        erase_hard_drives()
# ------------------------------------------------------------- #
pytables_test2.py
Description: Binary data
_______________________________________________
Pytables-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pytables-users
