@Anthony:
Thanks for the quick reply.
I fixed my problem (I will get to that below), but first a correction:
I actually made a mistake in my previous mail.
My setup is the following: I have around 29 000 groups, and each of
these groups contains 5 result tables.
Each of these tables contains approx. 31k rows. From each of these
tables I try to retrieve the subset of rows that fulfills a specific
score criterion (score higher than a given threshold).
This is the pseudo code for my script:

h5_f = tables.openFile(hdf5_file, 'r+')
table = h5_f.createTable('/', 'top_results', TopResults,
                         'Top results from all results')
row = table.row
for group_info in h5_f.root.result_list:
    group_name = group_info['name']
    group = h5_f.getNode('/results/%s' % group_name)
    for result_type in ['result1', 'result2', 'result3',
                        'result4', 'result5']:
        res_table = h5_f.getNode(group, result_type)
        # retrieves the top results from this table
        results = _getTopResults(res_table, min_score)
        for result_row in results:
            assignValues(row, result_row)
            row.append()
table.flush()
table.cols.score.createIndex()
h5_f.flush()
h5_f.close()
So the first couple of thousand tables run really quickly, but then
the performance degrades to 1 table/second or even less.
After the script finished I checked the resulting table: it contained
around 51 million rows.
I fixed my problem in several ways:
1.) I split up the one huge table into 5 tables (one per result_type).
2.) I set NODE_CACHE_SLOTS=1024 in tables.openFile().
3.) I first retrieve all rows for a specific result_type and then
append them in one call via table.append().
With these changes the performance doesn't degrade at all, and memory
consumption is also reasonable.
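To illustrate fix 3.), here is a toy sketch of the batching idea that
runs without PyTables: plain lists stand in for the source tables and
the caching table, and `cache.extend(batch)` plays the role of a single
`table.append(batch)` call. The `score` field and the threshold value
are made up for the example:

```python
def get_top_results(rows, min_score):
    """Return all rows whose 'score' field exceeds min_score
    (stand-in for the per-table subset query)."""
    return [r for r in rows if r['score'] > min_score]

def build_cache(groups, min_score):
    """Collect the top results from every group in one batched pass."""
    cache = []
    for group in groups:
        batch = get_top_results(group, min_score)
        # One bulk append per source table instead of one call per row;
        # with PyTables this would be table.append(batch).
        cache.extend(batch)
    return cache

groups = [
    [{'score': 0.9}, {'score': 0.2}],
    [{'score': 0.5}, {'score': 0.95}],
]
print(build_cache(groups, 0.4))
# → [{'score': 0.9}, {'score': 0.5}, {'score': 0.95}]
```

The key point is that the per-call overhead is paid once per source
table rather than once per row.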
cheers
Ümit
P.S.: Sorry for writing this mail this way; somehow I didn't get your
response directly via mail.
On Mon, Jan 16, 2012 at 7:43 PM, Ümit Seren <[email protected]> wrote:
> I created an HDF5 file with PyTables which contains around 29 000
> tables with around 31k rows each.
> I am trying to create a caching table in the same hdf5 file which
> contains a subset of those 29 000 tables.
>
> I wrote a script which basically iterates through each of the 29 000
> tables retrieves a subset and then writes it to the caching table.
> Basically it goes through the subset and then adds the rows from the
> subset one by one to the caching table.
> The first couple of thousand tables run really quickly (around 5-8
> tables per second or so). However, the longer the script runs, the
> slower it becomes (down to 1 table per second).
>
> Does anyone know why this is the case? (LRU cache maybe?)
>
> Right now I write row by row using row.append().
> Is it faster to create the dataset in memory and then write it as a
> whole block to the table?
>
> thanks in advance
>
> Ümit
>> Yes. In general, the more you can read / write in one go, the better
>> performance is. There is overhead in both Python and HDF5 to the
>> method calls.
>> However, the 9x slowdown per call is a little disconcerting. Do you have
>> a demonstration script that you can share?
>> Be Well
>> Anthony
_______________________________________________
Pytables-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pytables-users