Dear JP, On Tue, May 3, 2011 at 6:29 PM, JP <jeanpaul.ebe...@inhibox.com> wrote: > > Based on an rdkit post I read over the warm weekend I set myself to > have a look at my rdkit based queries (and ways to speed them up)... > > But first some details: > > Postgres: > PostgreSQL 9.0.3 on x86_64-unknown-linux-gnu, compiled by GCC gcc > (GCC) 4.5.2, 64-bit > > RDKit (DB Cartridge): > v. 0.20.0 >
The other important question is how much memory you have and what filesystem postgres is using for the database (local or network). > # of Molecules > 8,432,896 > > Database table (ligand) > Table "public.ligand" > Column | Type | Modifiers > ------------+-----------------------+----------------------------------------------------- > id | integer | not null default > nextval('ligand_id_seq'::regclass) > supplierid | character varying(50) | > smiles | text | > rdkitmol | mol | > pairbv | bfp | > torsionbv | bfp | > morganbv | bfp | > amw | real | > mollogp | real | > hba | integer | > hbd | integer | > atoms | integer | > hvyatoms | integer | > Indexes: > "ligand_pkey" PRIMARY KEY, btree (id) > "idx_ligand_morganbv" gist (morganbv) > "idx_ligand_pairbv" gist (pairbv) > "idx_ligand_rdkitmol" gist (rdkitmol) > "idx_ligand_torsionbv" gist (torsionbv) > > > I cannot explain why the following queries: > > db=# select count(*) from ligand where rdkitmol@>'c1cccc2c1nncc2' ; > count > ------- > 2942 > (1 row) > > Time: 193973.253 ms > db=# select count(*) from ligand where > morganbv%morganbv_fp('c1cccc2c1nncc2',2); > count > ------- > 8 > (1 row) > > Time: 400138.989 ms The performance is critically dependent on whether or not the indices are in memory. If you've just started the database or if there's been a lot of non-postgres activity on the machine since the last time you used it, it can take a long time to load the index from disk to memory. Once it has been loaded, things should go faster. My emolecules database requires about 1 GB for each of the indices: emolecules=# \di+ molidx; List of relations Schema | Name | Type | Owner | Table | Size | Description --------+--------+-------+----------+-------+---------+------------- public | molidx | index | glandrum | mols | 1049 MB | (1 row) emolecules=# \di+ mfp2idx; List of relations Schema | Name | Type | Owner | Table | Size | Description --------+---------+-------+----------+-------+--------+------------- public | mfp2idx | index | glandrum | fps | 970 MB | (1 row) > > Take so long... these are orders of magnitude larger than timings > reported in http://code.google.com/p/rdkit/wiki/DatabaseCreation2 > And my database in "only" roughly 50% larger (8M instead of the puny > 5M in emolecules). > > When I do an "explain" on these queries (to make sure the indices are > being used), I get: > > db=# explain select count(*) from ligand where rdkitmol@>'c1cccc2c1nncc2' ; > QUERY PLAN > ---------------------------------------------------------------------------------------------- > Aggregate (cost=34850.44..34850.45 rows=1 width=0) > -> Bitmap Heap Scan on ligand (cost=2667.36..34829.36 rows=8433 width=0) > Recheck Cond: (rdkitmol @> 'c1cc2c(nncc2)cc1'::mol) > -> Bitmap Index Scan on idx_ligand_rdkitmol > (cost=0.00..2665.25 rows=8433 width=0) > Index Cond: (rdkitmol @> 'c1cc2c(nncc2)cc1'::mol) > (5 rows) > > db=# explain select count(*) from ligand where > morganbv%morganbv_fp('c1cccc2c1nncc2',2); > QUERY PLAN > --------------------------------------------------------------------------------------------------------------------- > Aggregate (cost=33290.88..33290.89 rows=1 width=0) > -> Bitmap Heap Scan on ligand (cost=918.05..33269.79 rows=8433 width=0) > Recheck Cond: (morganbv % > '\\xe0ffffff00040000130000007e00108444d20e3c40042af90238d0080a0c3462c2'::bfp) > -> Bitmap Index Scan on idx_ligand_morganbv > (cost=0.00..915.94 rows=8433 width=0) > Index Cond: (morganbv % > '\\xe0ffffff00040000130000007e00108444d20e3c40042af90238d0080a0c3462c2'::bfp) > (5 rows) > > Looks good no? > Am I missing something? Or is this the fastest my search can go at? > Supposedly the fingerprints search is just doing some ~8M binary > operations no? Why does this take so long? > Ideas, anyone? This all looks fine. The experiments to try are: 1) do a second query and see if that's faster 2) try a smaller table (start with 1 or 2 million) and see how that performs Best Regards, -greg ------------------------------------------------------------------------------ WhatsUp Gold - Download Free Network Management Software The most intuitive, comprehensive, and cost-effective network management toolset available today. Delivers lowest initial acquisition cost and overall TCO of any competing solution. http://p.sf.net/sfu/whatsupgold-sd _______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss