Hi there all (and Greg, since he is likely to answer this), Based on an rdkit post I read over the warm weekend I set myself to have a look at my rdkit based queries (and ways to speed them up)...
But first some details: Postgres: PostgreSQL 9.0.3 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.5.2, 64-bit RDKit (DB Cartridge): v. 0.20.0 # of Molecules 8,432,896 Database table (ligand) Table "public.ligand" Column | Type | Modifiers ------------+-----------------------+----------------------------------------------------- id | integer | not null default nextval('ligand_id_seq'::regclass) supplierid | character varying(50) | smiles | text | rdkitmol | mol | pairbv | bfp | torsionbv | bfp | morganbv | bfp | amw | real | mollogp | real | hba | integer | hbd | integer | atoms | integer | hvyatoms | integer | Indexes: "ligand_pkey" PRIMARY KEY, btree (id) "idx_ligand_morganbv" gist (morganbv) "idx_ligand_pairbv" gist (pairbv) "idx_ligand_rdkitmol" gist (rdkitmol) "idx_ligand_torsionbv" gist (torsionbv) I cannot explain why the following queries: db=# select count(*) from ligand where rdkitmol@>'c1cccc2c1nncc2' ; count ------- 2942 (1 row) Time: 193973.253 ms db=# select count(*) from ligand where morganbv%morganbv_fp('c1cccc2c1nncc2',2); count ------- 8 (1 row) Time: 400138.989 ms Take so long... these are orders of magnitude larger than timings reported in http://code.google.com/p/rdkit/wiki/DatabaseCreation2 And my database in "only" roughly 50% larger (8M instead of the puny 5M in emolecules). When I do an "explain" on these queries (to make sure the indices are being used), I get: db=# explain select count(*) from ligand where rdkitmol@>'c1cccc2c1nncc2' ; QUERY PLAN ---------------------------------------------------------------------------------------------- Aggregate (cost=34850.44..34850.45 rows=1 width=0) -> Bitmap Heap Scan on ligand (cost=2667.36..34829.36 rows=8433 width=0) Recheck Cond: (rdkitmol @> 'c1cc2c(nncc2)cc1'::mol) -> Bitmap Index Scan on idx_ligand_rdkitmol (cost=0.00..2665.25 rows=8433 width=0) Index Cond: (rdkitmol @> 'c1cc2c(nncc2)cc1'::mol) (5 rows) db=# explain select count(*) from ligand where morganbv%morganbv_fp('c1cccc2c1nncc2',2); QUERY PLAN --------------------------------------------------------------------------------------------------------------------- Aggregate (cost=33290.88..33290.89 rows=1 width=0) -> Bitmap Heap Scan on ligand (cost=918.05..33269.79 rows=8433 width=0) Recheck Cond: (morganbv % '\\xe0ffffff00040000130000007e00108444d20e3c40042af90238d0080a0c3462c2'::bfp) -> Bitmap Index Scan on idx_ligand_morganbv (cost=0.00..915.94 rows=8433 width=0) Index Cond: (morganbv % '\\xe0ffffff00040000130000007e00108444d20e3c40042af90238d0080a0c3462c2'::bfp) (5 rows) Looks good no? Am I missing something? Or is this the fastest my search can go at? Supposedly the fingerprints search is just doing some ~8M binary operations no? Why does this take so long? Ideas, anyone? Many Thanks JP ------------------------------------------------------------------------------ WhatsUp Gold - Download Free Network Management Software The most intuitive, comprehensive, and cost-effective network management toolset available today. Delivers lowest initial acquisition cost and overall TCO of any competing solution. http://p.sf.net/sfu/whatsupgold-sd _______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss