Re: [Pytables-users] Help with performance

Francesc Altet Tue, 16 May 2006 09:22:26 -0700

Pepe,

On Friday 12 May 2006 20:42, Pepe Barbe wrote:
> Hello,
>
> I am writing an app that processes some genetic information.
> I have already placed all my information in an H5 DB. Currently the
> information that I am interested in resides in a Table that contains a
> String for the Gene Name, a long string with the Gene Genetic Sequence and
> a 1x7 vector of type FloatCol. When I made the DB I made sure the
> entries would be indexed by the Name column.
>
> So, now I need to build some submatrices from the data, and to extract
> them I am using the following code:
>
> unlabeled_matrix=zeros( (num_genes,vec_len), dtype=float)
> count = 0
> for gene_name in instance.Unlabeled[0]:
>        iterator = table.where( table.cols._f_col('Name') == gene_name )
>        row = iterator.next()
>        unlabeled_matrix[count,:] = row['Expression']
>        count +=1
>
> Basically I am doing searches by name, and at least for this
> application I have made sure that I always search for existing entries
> and that there will be only one entry with that name. unlabeled_matrix
> is a numpy matrix and each row of this matrix contains one vector of
> the Expression column in the table.
>


Mmm... for general searches by name I think that the code above is
inadequate. I'd go with something like:

unlabeled_matrix=zeros( (num_genes,vec_len), dtype=float)
count = 0
for gene_name in instance.Unlabeled[0]:
    for row in table.where( table.cols.Name == gene_name ):
        unlabeled_matrix[count,:] = row['Expression']
        count +=1

Your approach works only because each gene_name appears exactly once
in the 'Name' column, but this second version would also work if
there were more matching entries.
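
To make the difference concrete, here is a minimal sketch in plain Python.
It does not use PyTables at all: the list `rows` and the helper `where()`
are hypothetical stand-ins for the table and for table.where(), assumed
only for illustration.

```python
# Hypothetical stand-in for a table query: yields every row whose
# 'Name' field matches. A real query would go through table.where(...).
def where(rows, name):
    return (row for row in rows if row["Name"] == name)

# Hypothetical stand-in for the table contents.
rows = [
    {"Name": "geneA", "Expression": [1.0, 2.0]},
    {"Name": "geneB", "Expression": [3.0, 4.0]},
    {"Name": "geneA", "Expression": [5.0, 6.0]},  # duplicate name
]

# Taking only the first result silently ignores the second geneA row.
first_only = [next(where(rows, "geneA"))["Expression"]]

# Looping over the iterator collects every matching row.
all_matches = [row["Expression"] for row in where(rows, "geneA")]
```

With unique names both forms give the same result; they only diverge
once a name appears more than once.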

> Besides the fact that this may not be the best way of accessing the
> database (some comments on this would be appreciated), I am getting
> performance that is lower than I was expecting (although I don't really
> know what I should be expecting), so I would appreciate any suggestions on
> how to approach this problem. For example, it takes between 3.5 and 4
> seconds to generate a 20x7 matrix made of 20 gene expression vectors.
> Besides this, I need to process another matrix that would be 3800x7, and
> I need to do this several times, which starts adding time. This
> is without counting some mathematical manipulations with the
> matrices and then storing them in the DB again.

Well, if you are going to search a lot by the 'Name' column, perhaps
you can load this column completely in memory as the keys of a dictionary,
with the corresponding row numbers as the values. Something like:

# Build the names dictionary:
namedict = {}
for row in table:
    namedict[row['Name']] = row.nrow
# Populate the unlabeled_matrix
unlabeled_matrix=zeros( (num_genes,vec_len), dtype=float)
count = 0
for gene_name in instance.Unlabeled[0]:
    unlabeled_matrix[count,:] = table.cols.Expression[namedict[gene_name]]
    count +=1

I haven't checked this directly, so it may not work as is, but I hope
you get the idea.
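
The same idea can be tried out without a table at hand. The following
is only a sketch in plain Python: the list `rows` and the list `wanted`
are hypothetical stand-ins for the table and for instance.Unlabeled[0],
and plain lists replace the numpy matrix.

```python
# Hypothetical stand-in for the table: one dict per row.
rows = [
    {"Name": "geneA", "Expression": [0.1, 0.2, 0.3]},
    {"Name": "geneB", "Expression": [0.4, 0.5, 0.6]},
    {"Name": "geneC", "Expression": [0.7, 0.8, 0.9]},
]

# One pass over the data: map each gene name to its row number.
# This assumes names are unique; a repeated name would keep only
# the last row number seen.
namedict = {row["Name"]: nrow for nrow, row in enumerate(rows)}

# Hypothetical list of names to extract, in the order wanted.
wanted = ["geneC", "geneA"]

# Each lookup is now a dictionary access plus one row read,
# instead of a fresh search of the 'Name' column per gene.
unlabeled_matrix = [rows[namedict[gene_name]]["Expression"]
                    for gene_name in wanted]
```

The cost of building the dictionary is paid once, so it should pay off
mainly when many lookups are made against the same table.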

HTH,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"



