Dear All,

I have recently been spending a bit more time with the RDKit cartridge, and 
have what is probably a very naïve question...
Having built some RDKit fingerprints for ChEMBL_18, I see the following 
behaviour (for clarification - 'ecfp4_bv' is the column in my rdk.fps table 
that has been generated using morganbv_fp(mol, 2)):


chembl_18=# \timing on
Timing is on.

chembl_18=# set rdkit.tanimoto_threshold=0.5;
SET
Time: 0.167 ms

chembl_18=# select chembl_id from rdk.fps where ecfp4_bv % 
morganbv_fp('c1nnccc1'::mol,2);
  chembl_id
-------------
CHEMBL15719
(1 row)

Time: 2033.348 ms

chembl_18=# select chembl_id from rdk.fps where tanimoto_sml(ecfp4_bv, 
morganbv_fp('c1nnccc1'::mol, 2)) > 0.5;
  chembl_id
-------------
CHEMBL15719
(1 row)

Time: 6843.605 ms


I can see that the query plans are different in the two cases, but I don't 
fully understand why - see below:

QUERY 1 (with explain analyze)
chembl_18=# explain analyze select chembl_id from rdk.fps where ecfp4_bv % 
morganbv_fp('c1nnccc1'::mol,2);
                                                                                
 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on fps  (cost=106.91..5298.31 rows=1352 width=13) (actual 
time=1774.986..1774.987 rows=1 loops=1)
   Recheck Cond: (ecfp4_bv % 
'\x00000000000000000100000000000000080000000000000000000000000000000000004200000000000482000000000400000000000000000000000000000000'::bfp)
   ->  Bitmap Index Scan on fps_ecfp4bv_idx  (cost=0.00..106.57 rows=1352 
width=0) (actual time=1774.969..1774.969 rows=1 loops=1)
         Index Cond: (ecfp4_bv % 
'\x00000000000000000100000000000000080000000000000000000000000000000000004200000000000482000000000400000000000000000000000000000000'::bfp)
Total runtime: 1775.035 ms
(5 rows)

Time: 1776.133 ms


QUERY 2 (with explain analyze)
chembl_18=# explain analyze select chembl_id from rdk.fps where 
tanimoto_sml(ecfp4_bv, morganbv_fp('c1nnccc1'::mol, 2)) > 0.5;
                                                                                
              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on fps  (cost=0.00..388808.17 rows=450793 width=13) (actual 
time=1278.115..6953.977 rows=1 loops=1)
   Filter: (tanimoto_sml(ecfp4_bv, 
'\x00000000000000000100000000000000080000000000000000000000000000000000004200000000000482000000000400000000000000000000000000000000'::bfp)
 > 0.5::double precision)
   Rows Removed by Filter: 1352377
Total runtime: 6954.010 ms
(4 rows)

Time: 6955.103 ms


It seems conceptually 'easier' to add the similarity value as part of the 
query, rather than setting it as a variable ahead of the query; but clearly I 
should be doing it the latter way for performance reasons.  So even if I don't 
fully understand why at the moment, am I correct in thinking that queries of 
this sort should always be run with the similarity operators (%, #)?  And if 
so, is the rdkit.tanimoto_threshold variable set at the level of the session, 
the user, or the database?

Kind regards

James

______________________________________________________________________
PLEASE READ: This email is confidential and may be privileged. It is intended 
for the named addressee(s) only and access to it by anyone else is 
unauthorised. If you are not an addressee, any disclosure or copying of the 
contents of this email or any action taken (or not taken) in reliance on it is 
unauthorised and may be unlawful. If you have received this email in error, 
please notify the sender or [email protected]. Email is not a secure 
method of communication and the Company cannot accept responsibility for the 
accuracy or completeness of this message or any attachment(s). Please check 
this email for virus infection for which the Company accepts no responsibility. 
If verification of this email is sought then please request a hard copy. Unless 
otherwise stated, any views or opinions presented are solely those of the 
author and do not represent those of the Company.

The Vernalis Group of Companies
100 Berkshire Place
Wharfedale Road
Winnersh, Berkshire
RG41 5RD, England
Tel: +44 (0)118 938 0000

To access trading company registration and address details, please go to the 
Vernalis website at www.vernalis.com and click on the "Company address and 
registration details" link at the bottom of the page..
______________________________________________________________________
------------------------------------------------------------------------------
Is your legacy SCM system holding you back? Join Perforce May 7 to find out:
• 3 signs your SCM is hindering your productivity
• Requirements for releasing software faster
• Expert tips and advice for migrating your SCM now
http://p.sf.net/sfu/perforce
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to