Hi Francesc & All,

On 31 October 2012 14:13, Francesc Alted wrote:
> On 10/31/12 4:30 AM, Andrea Gavana wrote:
>> Thank you for all your suggestions. I managed to slightly modify the
>> script you attached and I am also experimenting with compression.
>> However, in the newly attached script the underlying table is not
>> modified, i.e., this assignment:
>>
>> for p in table:
>>      p['results'][:NUM_SIM, :, :] = numpy.random.random(size=(NUM_SIM,
>> len(ALL_DATES), 7))
>>      table.flush()
>
> For modifying row values you need to assign a complete row object.
> Something like:
>
> for i in range(len(table)):
>      myrow = table[i]
>      myrow['results'][:NUM_SIM, :, :] = numpy.random.random(size=(NUM_SIM, len(ALL_DATES), 7))
>      table[i] = myrow
>
> You may also use Table.modifyColumn() for better efficiency.  Look at
> the different modification methods here:
>
> http://pytables.github.com/usersguide/libref/structured_storage.html#table-methods-writing
>
> and experiment with them.

Thank you, I have tried different approaches and they all seem to run
more or less at the same speed (see below). I had to slightly modify
your code from:

table[i] = myrow

to:

table[i] = [myrow]

to avoid exceptions.
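For what it's worth, here is a minimal, self-contained sketch of the Table.modifyColumn() approach Francesc pointed me to (written with the snake_case names of newer PyTables releases; NUM_SIM, the dates dimension and the record layout are reduced placeholders, not the real ones from my script):

```python
import numpy as np
import tables

NUM_SIM, NUM_DATES, NUM_OBJ = 5, 10, 3  # reduced placeholder sizes

class Record(tables.IsDescription):
    name = tables.StringCol(16)
    results = tables.Float64Col(shape=(NUM_SIM, NUM_DATES, 7))

with tables.open_file("modify_demo.h5", "w") as h5:
    table = h5.create_table("/", "objects", Record)
    row = table.row
    for i in range(NUM_OBJ):
        row["name"] = "object_%d" % i
        row["results"] = np.zeros((NUM_SIM, NUM_DATES, 7))
        row.append()
    table.flush()

    # Overwrite the whole 'results' column in a single call,
    # instead of assigning row objects back one by one.
    new_data = np.random.random((NUM_OBJ, NUM_SIM, NUM_DATES, 7))
    table.modify_column(colname="results", column=new_data)
```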

In the newly attached file, I switched to blosc for compression (at
compression level 1) and ran a few sensitivity tests. By calling the
attached script as:

python pytables_test.py NUM_SIM

where "NUM_SIM" is an integer, I get the following timings and file sizes:

C:\MyProjects\Phaser\tests>python pytables_test.py 10
Number of simulations   : 10
H5 file creation time   : 0.879s
Saving results for table: 6.413s
H5 file size (MB)       : 193


C:\MyProjects\Phaser\tests>python pytables_test.py 100
Number of simulations   : 100
H5 file creation time   : 4.155s
Saving results for table: 86.326s
H5 file size (MB)       : 1935
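For reference, the blosc setting used for the timings above boils down to a Filters object along these lines (a minimal sketch; the file name, array name and shape are placeholders):

```python
import numpy as np
import tables

# Blosc compression at level 1, as used in the attached script
filters = tables.Filters(complevel=1, complib="blosc")

with tables.open_file("blosc_demo.h5", "w") as h5:
    # Any table or array created with these filters is compressed on write
    h5.create_carray("/", "data",
                     obj=np.random.random((100, 7)),
                     filters=filters)
```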


I don't think I will try the 1,000 simulations case :-) . I believe I
still don't understand what the best strategy would be for my problem.
I basically need to save all the simulation results for all the 1,200
"objects", each of which has a timeseries matrix of 600x7 size. In the
GUI I have, these 1,200 "objects" are grouped into multiple
categories, and multiple categories can reference the same "object",
i.e.:

Category_1: object_1, object_23, object_543, etc...
Category_2: object_23, object_100, object_543, etc...

So my idea was to save all the "objects" results to disk and, upon the
user's choice, build the category results "on the fly", i.e. by
looking up in the H5 file the "objects" belonging to that specific
category and summing their results over time (the 600 time-steps).
Maybe I would be better off with a 4D array (NUM_OBJECTS, NUM_SIM,
TSTEPS, 7) instead of a table, but then I would lose the ability to
reference the "objects" by their names...
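To make the "on the fly" idea concrete, this is roughly what I have in mind (a pure numpy sketch; the category mapping and the per-object arrays are made-up placeholders standing in for the data read back from the H5 file):

```python
import numpy as np

NUM_SIM, TSTEPS = 2, 600  # reduced placeholder sizes

# Placeholder: per-object results as they would be read from the H5 file
results = {name: np.ones((NUM_SIM, TSTEPS, 7))
           for name in ("object_1", "object_23", "object_100", "object_543")}

# Placeholder: the category -> objects mapping kept by the GUI
categories = {"Category_1": ["object_1", "object_23", "object_543"],
              "Category_2": ["object_23", "object_100", "object_543"]}

def category_results(category):
    """Sum the results of every object belonging to a category."""
    total = np.zeros((NUM_SIM, TSTEPS, 7))
    for name in categories[category]:
        total += results[name]
    return total

cat1 = category_results("Category_1")  # shape (NUM_SIM, TSTEPS, 7)
```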

I welcome in advance any suggestion on how to improve my thinking on
this matter. Thanks for all the answers I received.


Andrea.

"Imagination Is The Only Weapon In The War Against Reality."
http://www.infinity77.net

# ------------------------------------------------------------- #
def ask_mailing_list_support(email):

    if mention_platform_and_version() and include_sample_app():
        send_message(email)
    else:
        install_malware()
        erase_hard_drives()
# ------------------------------------------------------------- #

Attachment: pytables_test.py
Description: Binary data

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users