Dear All,
I have recently been spending a bit more time with the RDKit cartridge, and
have what is probably a very naïve question...
Having built some RDKit fingerprints for ChEMBL_18, I see the following
behaviour (for clarification - 'ecfp4_bv' is the column in my rdk.fps table
that has been generated using morganbv_fp(mol, 2)):
chembl_18=# \timing on
Timing is on.
chembl_18=# set rdkit.tanimoto_threshold=0.5;
SET
Time: 0.167 ms
chembl_18=# select chembl_id from rdk.fps where ecfp4_bv % morganbv_fp('c1nnccc1'::mol,2);
chembl_id
-------------
CHEMBL15719
(1 row)
Time: 2033.348 ms
chembl_18=# select chembl_id from rdk.fps where tanimoto_sml(ecfp4_bv, morganbv_fp('c1nnccc1'::mol, 2)) > 0.5;
chembl_id
-------------
CHEMBL15719
(1 row)
Time: 6843.605 ms
I can see that the query plans are different in the two cases, but I don't
fully understand why - see below:
QUERY 1 (with explain analyze)
chembl_18=# explain analyze select chembl_id from rdk.fps where ecfp4_bv % morganbv_fp('c1nnccc1'::mol,2);
QUERY PLAN
------------------------------------------------------------------------
 Bitmap Heap Scan on fps (cost=106.91..5298.31 rows=1352 width=13) (actual time=1774.986..1774.987 rows=1 loops=1)
   Recheck Cond: (ecfp4_bv % '\x00000000000000000100000000000000080000000000000000000000000000000000004200000000000482000000000400000000000000000000000000000000'::bfp)
   -> Bitmap Index Scan on fps_ecfp4bv_idx (cost=0.00..106.57 rows=1352 width=0) (actual time=1774.969..1774.969 rows=1 loops=1)
        Index Cond: (ecfp4_bv % '\x00000000000000000100000000000000080000000000000000000000000000000000004200000000000482000000000400000000000000000000000000000000'::bfp)
 Total runtime: 1775.035 ms
(5 rows)
Time: 1776.133 ms
QUERY 2 (with explain analyze)
chembl_18=# explain analyze select chembl_id from rdk.fps where tanimoto_sml(ecfp4_bv, morganbv_fp('c1nnccc1'::mol, 2)) > 0.5;
QUERY PLAN
------------------------------------------------------------------------
 Seq Scan on fps (cost=0.00..388808.17 rows=450793 width=13) (actual time=1278.115..6953.977 rows=1 loops=1)
   Filter: (tanimoto_sml(ecfp4_bv, '\x00000000000000000100000000000000080000000000000000000000000000000000004200000000000482000000000400000000000000000000000000000000'::bfp) > 0.5::double precision)
   Rows Removed by Filter: 1352377
 Total runtime: 6954.010 ms
(4 rows)
Time: 6955.103 ms
It seems conceptually 'easier' to put the similarity threshold in the query
itself rather than setting it as a variable beforehand, but clearly I should
be doing the latter for performance reasons: the operator form uses the index
(fps_ecfp4bv_idx), whereas the tanimoto_sml() form falls back to a sequential
scan over the whole table. So even if I don't fully understand why at the
moment, am I correct in thinking that queries of this sort should always be
run with the similarity operators (%, #)? And if so, is the
rdkit.tanimoto_threshold variable set at the level of the session, the user,
or the database?
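
To make clear what I mean by the second form of the query, here is a minimal pure-Python sketch of the comparison as I understand it: compute the Tanimoto similarity of each stored fingerprint against the query fingerprint and keep the rows above the threshold. This is only an illustration, not the cartridge's implementation. The fingerprints are made-up sets of on-bit positions, and the names `tanimoto`, `search`, `fps` and `query` are hypothetical.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    # Two empty fingerprints share nothing; treat their similarity as 0.
    return len(a & b) / union if union else 0.0


def search(fps: dict, query: frozenset, threshold: float) -> list:
    """Linear scan: score every row, keep those strictly above the threshold."""
    return [name for name, fp in fps.items() if tanimoto(fp, query) > threshold]


# Toy data, invented for illustration only (not real ChEMBL fingerprints).
fps = {
    "mol_a": frozenset({1, 2, 3, 4}),
    "mol_b": frozenset({1, 2, 9}),
    "mol_c": frozenset({7, 8}),
}
query = frozenset({1, 2, 3})

hits = search(fps, query, 0.5)
print(hits)
```

Like the sequential scan in my second plan, this scores every row. I assume the operator form avoids that by letting the index discard most rows before any similarity is calculated.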
Kind regards
James