Hi James,

[fair warning before I start: we quickly hit the limits of my postgresql
expertise here]


On Thu, May 8, 2014 at 2:35 PM, James Davidson <[email protected]>wrote:

>  Dear All,
>
>
>
> I have recently been spending a bit more time with the RDKit cartridge,
> and have what is probably a very naïve question…
>
> Having built some RDKit fingerprints for ChEMBL_18, I see the following
> behaviour (for clarification – ‘ecfp4_bv’ is the column in my rdk.fps table
> that has been generated using morganbv_fp(mol, 2)):
>
>
>
>
>
> chembl_18=# \timing on
>
> Timing is on.
>
>
>
> chembl_18=# set rdkit.tanimoto_threshold=0.5;
>
> SET
>
> Time: 0.167 ms
>
>
>
> chembl_18=# select chembl_id from rdk.fps where ecfp4_bv %
> morganbv_fp('c1nnccc1'::mol,2);
>
>   chembl_id
>
> -------------
>
> CHEMBL15719
>
> (1 row)
>
>
>
> Time: 2033.348 ms
>
>
>
> chembl_18=# select chembl_id from rdk.fps where tanimoto_sml(ecfp4_bv,
> morganbv_fp('c1nnccc1'::mol, 2)) > 0.5;
>
>   chembl_id
>
> -------------
>
> CHEMBL15719
>
> (1 row)
>
>
>
> Time: 6843.605 ms
>
>
>
>
>
> I can see that the query plans are different in the two cases, but I don’t
> fully understand why – see below:
>
>
>
> *QUERY 1 (with explain analyze)*
>
> chembl_18=# explain analyze select chembl_id from rdk.fps where ecfp4_bv %
> morganbv_fp('c1nnccc1'::mol,2);
>
>
> QUERY PLAN
>
>
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Bitmap Heap Scan on fps  (cost=106.91..5298.31 rows=1352 width=13) (actual
> time=1774.986..1774.987 rows=1 loops=1)
>
>    Recheck Cond: (ecfp4_bv %
> '\x00000000000000000100000000000000080000000000000000000000000000000000004200000000000482000000000400000000000000000000000000000000'::bfp)
>
>    ->  Bitmap Index Scan on fps_ecfp4bv_idx  (cost=0.00..106.57 rows=1352
> width=0) (actual time=1774.969..1774.969 rows=1 loops=1)
>
>          Index Cond: (ecfp4_bv %
> '\x00000000000000000100000000000000080000000000000000000000000000000000004200000000000482000000000400000000000000000000000000000000'::bfp)
>
> Total runtime: 1775.035 ms
>
> (5 rows)
>
>
>
> Time: 1776.133 ms
>
>
>
>
>
> *QUERY 2 (with explain analyze)*
>
> chembl_18=# explain analyze select chembl_id from rdk.fps where
> tanimoto_sml(ecfp4_bv, morganbv_fp('c1nnccc1'::mol, 2)) > 0.5;
>
>
> QUERY PLAN
>
>
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Seq Scan on fps  (cost=0.00..388808.17 rows=450793 width=13) (actual
> time=1278.115..6953.977 rows=1 loops=1)
>
>    Filter: (tanimoto_sml(ecfp4_bv,
> '\x00000000000000000100000000000000080000000000000000000000000000000000004200000000000482000000000400000000000000000000000000000000'::bfp)
> > 0.5::double precision)
>
>    Rows Removed by Filter: 1352377
>
> Total runtime: 6954.010 ms
>
> (4 rows)
>
>
>
> Time: 6955.103 ms
>

What these are telling you is that the second query is not using the index:
it's a sequential scan, so it has to test all rows of the database. This
happens because the index is defined for the operator %, but not for the
function tanimoto_sml(). There may be an approach to get the index set up
using that function, but there we reach the limits of my expertise.


>  It seems conceptually ‘easier’ to add the similarity value as part of
> the query, rather than setting it as a variable ahead of the query; but
> clearly I should be doing it the latter way for performance reasons.  So
> even if I don’t fully understand why at the moment, am I correct in
> thinking that queries of this sort should always be run with the similarity
> operators (%, #)?  And if so, is the rdkit.tanimoto_threshold variable set
> at the level of the session, the user, or the database?
>
>
It's set at the session level.

When doing similarity searches, I find it generally helpful to also include
the <%> operator in an "order by" clause so that the results come back in
sorted order.
So instead of this;
chembl_17=# select molregno from rdk.fps where mfp2 %
morganbv_fp('Cc1ccc2nc(N(C)CC(=O)O)sc2c1');
 molregno
----------
   412312
   412302
   412310
   441378
   470082
   773946
   775269
   911501
  1015485
  1034321
  1040255
  1040496
  1042958
  1043871
  1044892
  1045663
  1047691
  1049393
(18 rows)

Time: 1042.310 ms

I do this:
chembl_17=# select molregno from rdk.fps where mfp2 %
morganbv_fp('Cc1ccc2nc(N(C)CC(=O)O)sc2c1')
                                                     order by
morganbv_fp('Cc1ccc2nc(N(C)CC(=O)O)sc2c1') <%> mfp2;
 molregno
----------
   412312
   470082
  1040255
   773946
  1044892
  1049393
  1040496
   441378
  1047691
  1042958
   412302
  1043871
   412310
  1045663
   911501
   775269
  1015485
  1034321
(18 rows)

Time: 1032.266 ms

Notice that this doesn't make things any slower.

It's nice to see the actual similarity values:

chembl_17=# select
molregno,tanimoto_sml(morganbv_fp('Cc1ccc2nc(N(C)CC(=O)O)sc2c1'),mfp2)
 from rdk.fps
                where mfp2 % morganbv_fp('Cc1ccc2nc(N(C)CC(=O)O)sc2c1')

                                              order by
morganbv_fp('Cc1ccc2nc(N(C)CC(=O)O)sc2c1') <%> mfp2;
 molregno |   tanimoto_sml
----------+-------------------
   412312 | 0.692307692307692
   470082 | 0.583333333333333
  1040255 | 0.571428571428571
   773946 | 0.549019607843137
  1044892 | 0.518518518518518
  1049393 | 0.517857142857143
  1040496 | 0.517857142857143
   441378 | 0.510204081632653
  1047691 | 0.509090909090909
  1042958 | 0.509090909090909
   412302 |               0.5
  1043871 |               0.5
   412310 |               0.5
  1045663 |               0.5
   911501 |               0.5
   775269 |               0.5
  1015485 |               0.5
  1034321 |               0.5
(18 rows)

Time: 1034.257 ms

This also doesn't slow things down, but it is now pretty repetitive, so I
typically define a function for it:

chembl_17=# create or replace function do_mfp2_search(text)
chembl_17-#       returns table(molregno integer, similarity double
precision) as
chembl_17-#     $$
chembl_17$#     select molregno,tanimoto_sml(morganbv_fp($1::mol),mfp2) as
similarity
chembl_17$#     from rdk.fps
chembl_17$#     where morganbv_fp($1::mol)%mfp2
chembl_17$#     order by morganbv_fp($1::mol)<%>mfp2;
chembl_17$#     $$ language sql volatile ;
CREATE FUNCTION
Time: 0.926 ms

chembl_17=# select * from do_mfp2_search('Cc1ccc2nc(N(C)CC(=O)O)sc2c1');
 molregno |    similarity
----------+-------------------
   412312 | 0.692307692307692
   470082 | 0.583333333333333
  1040255 | 0.571428571428571
   773946 | 0.549019607843137
  1044892 | 0.518518518518518
  1049393 | 0.517857142857143
  1040496 | 0.517857142857143
   441378 | 0.510204081632653
  1047691 | 0.509090909090909
  1042958 | 0.509090909090909
   412302 |               0.5
  1043871 |               0.5
   412310 |               0.5
  1045663 |               0.5
   911501 |               0.5
   775269 |               0.5
  1015485 |               0.5
  1034321 |               0.5
(18 rows)

Time: 1061.676 ms


Another feature that can be quite useful is a nearest-neighbor search;
which can be used to find the N nearest-neighbors while ignoring the
similarity threshold. Here's an example that starts by setting the
threshold higher and then demonstrates what a neighbor search returns:

chembl_17=# set rdkit.tanimoto_threshold = 0.6;
SET
Time: 0.214 ms
chembl_17=# select * from do_mfp2_search('Cc1ccc2nc(N(C)CC(=O)O)sc2c1');
 molregno |    similarity
----------+-------------------
   412312 | 0.692307692307692
(1 row)

Time: 1045.409 ms
chembl_17=# select
molregno,tanimoto_sml(morganbv_fp('Cc1ccc2nc(N(C)CC(=O)O)sc2c1'),mfp2)
 from rdk.fps
                order by morganbv_fp('Cc1ccc2nc(N(C)CC(=O)O)sc2c1') <%>
mfp2 limit 5;
 molregno |   tanimoto_sml
----------+-------------------
   412312 | 0.692307692307692
   470082 | 0.583333333333333
  1040255 | 0.571428571428571
   773946 | 0.549019607843137
  1044892 | 0.518518518518518
(5 rows)

Time: 1278.761 ms


One final advanced topic: if you are planning on making regular use of the
similarity features in the cartridge and are running on a linux system or
Mac I would recommend recompiling the cartridge with some optimizations for
tanimoto similarity. To do this, you need to edit the cartridge Makefile
from:
PG_CPPFLAGS = -I${BOOSTHOME} -I${RDKIT}/Code -DRDKITVER='"007200"'
${INCHIFLAGS} #-DUSE_BUILTIN_POPCOUNT -msse4.2

to:
PG_CPPFLAGS = -I${BOOSTHOME} -I${RDKIT}/Code -DRDKITVER='"007200"'
${INCHIFLAGS} -DUSE_BUILTIN_POPCOUNT -msse4.2

(I just removed a comment character here). This speeds the Tanimoto
calculation up a fair bit (it's still not nearly as fast as Andrew's
chemfp, but it's better than the default behavior).

Hope this helps,
-greg
------------------------------------------------------------------------------
Is your legacy SCM system holding you back? Join Perforce May 7 to find out:
&#149; 3 signs your SCM is hindering your productivity
&#149; Requirements for releasing software faster
&#149; Expert tips and advice for migrating your SCM now
http://p.sf.net/sfu/perforce
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to