---------- Forwarded message ----------
From: Markus Sitzmann <markus.sitzm...@gmail.com>
Date: Thu, May 8, 2014 at 3:14 PM
Subject: Re: [Rdkit-discuss] RDKit cartridge similarity search speeds(?)
To: James Davidson <j.david...@vernalis.com>


Hi James,

I would guess, in your second query, "morganbv_fp('c1nnccc1'::mol, 2)"
has to be calculated for each row you are scanning because from the
database's perspective the result is unpredictable (although it is
not), so it can not be optimized so easily. All of this is avoided in
your first query, the calculation is done once before the table scan
and then the actual index/table scan is a rather simple one.

Markus

On Thu, May 8, 2014 at 2:35 PM, James Davidson <j.david...@vernalis.com> wrote:
> Dear All,
>
>
>
> I have recently been spending a bit more time with the RDKit cartridge, and
> have what is probably a very naïve question…
>
> Having built some RDKit fingerprints for ChEMBL_18, I see the following
> behaviour (for clarification – ‘ecfp4_bv’ is the column in my rdk.fps table
> that has been generated using morganbv_fp(mol, 2)):
>
>
>
>
>
> chembl_18=# \timing on
>
> Timing is on.
>
>
>
> chembl_18=# set rdkit.tanimoto_threshold=0.5;
>
> SET
>
> Time: 0.167 ms
>
>
>
> chembl_18=# select chembl_id from rdk.fps where ecfp4_bv %
> morganbv_fp('c1nnccc1'::mol,2);
>
>   chembl_id
>
> -------------
>
> CHEMBL15719
>
> (1 row)
>
>
>
> Time: 2033.348 ms
>
>
>
> chembl_18=# select chembl_id from rdk.fps where tanimoto_sml(ecfp4_bv,
> morganbv_fp('c1nnccc1'::mol, 2)) > 0.5;
>
>   chembl_id
>
> -------------
>
> CHEMBL15719
>
> (1 row)
>
>
>
> Time: 6843.605 ms
>
>
>
>
>
> I can see that the query plans are different in the two cases, but I don’t
> fully understand why – see below:
>
>
>
> QUERY 1 (with explain analyze)
>
> chembl_18=# explain analyze select chembl_id from rdk.fps where ecfp4_bv %
> morganbv_fp('c1nnccc1'::mol,2);
>
>
> QUERY PLAN
>
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Bitmap Heap Scan on fps  (cost=106.91..5298.31 rows=1352 width=13) (actual
> time=1774.986..1774.987 rows=1 loops=1)
>
>    Recheck Cond: (ecfp4_bv %
> '\x00000000000000000100000000000000080000000000000000000000000000000000004200000000000482000000000400000000000000000000000000000000'::bfp)
>
>    ->  Bitmap Index Scan on fps_ecfp4bv_idx  (cost=0.00..106.57 rows=1352
> width=0) (actual time=1774.969..1774.969 rows=1 loops=1)
>
>          Index Cond: (ecfp4_bv %
> '\x00000000000000000100000000000000080000000000000000000000000000000000004200000000000482000000000400000000000000000000000000000000'::bfp)
>
> Total runtime: 1775.035 ms
>
> (5 rows)
>
>
>
> Time: 1776.133 ms
>
>
>
>
>
> QUERY 2 (with explain analyze)
>
> chembl_18=# explain analyze select chembl_id from rdk.fps where
> tanimoto_sml(ecfp4_bv, morganbv_fp('c1nnccc1'::mol, 2)) > 0.5;
>
>
> QUERY PLAN
>
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Seq Scan on fps  (cost=0.00..388808.17 rows=450793 width=13) (actual
> time=1278.115..6953.977 rows=1 loops=1)
>
>    Filter: (tanimoto_sml(ecfp4_bv,
> '\x00000000000000000100000000000000080000000000000000000000000000000000004200000000000482000000000400000000000000000000000000000000'::bfp)
>> 0.5::double precision)
>
>    Rows Removed by Filter: 1352377
>
> Total runtime: 6954.010 ms
>
> (4 rows)
>
>
>
> Time: 6955.103 ms
>
>
>
>
>
> It seems conceptually ‘easier’ to add the similarity value as part of the
> query, rather than setting it as a variable ahead of the query; but clearly
> I should be doing it the latter way for performance reasons.  So even if I
> don’t fully understand why at the moment, am I correct in thinking that
> queries of this sort should always be run with the similarity operators (%,
> #)?  And if so, is the rdkit.tanimoto_threshold variable set at the level of
> the session, the user, or the database?
>
>
>
> Kind regards
>
>
>
> James
>
>
> ______________________________________________________________________
> PLEASE READ: This email is confidential and may be privileged. It is
> intended for the named addressee(s) only and access to it by anyone else is
> unauthorised. If you are not an addressee, any disclosure or copying of the
> contents of this email or any action taken (or not taken) in reliance on it
> is unauthorised and may be unlawful. If you have received this email in
> error, please notify the sender or postmas...@vernalis.com. Email is not a
> secure method of communication and the Company cannot accept responsibility
> for the accuracy or completeness of this message or any attachment(s).
> Please check this email for virus infection for which the Company accepts no
> responsibility. If verification of this email is sought then please request
> a hard copy. Unless otherwise stated, any views or opinions presented are
> solely those of the author and do not represent those of the Company.
>
> The Vernalis Group of Companies
> 100 Berkshire Place
> Wharfedale Road
> Winnersh, Berkshire
> RG41 5RD, England
> Tel: +44 (0)118 938 0000
>
> To access trading company registration and address details, please go to the
> Vernalis website at www.vernalis.com and click on the "Company address and
> registration details" link at the bottom of the page..
> ______________________________________________________________________
>
> ------------------------------------------------------------------------------
> Is your legacy SCM system holding you back? Join Perforce May 7 to find out:
> &#149; 3 signs your SCM is hindering your productivity
> &#149; Requirements for releasing software faster
> &#149; Expert tips and advice for migrating your SCM now
> http://p.sf.net/sfu/perforce
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>

------------------------------------------------------------------------------
Is your legacy SCM system holding you back? Join Perforce May 7 to find out:
&#149; 3 signs your SCM is hindering your productivity
&#149; Requirements for releasing software faster
&#149; Expert tips and advice for migrating your SCM now
http://p.sf.net/sfu/perforce
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to