Re: [Rdkit-discuss] RDKit cartridge similarity search speeds(?)

2014-05-09 Thread Greg Landrum
James,


On Fri, May 9, 2014 at 10:25 AM, James Davidson wrote:

>
> > One final advanced topic: if you are planning on making regular use of
> the
> > similarity features in the cartridge and are running on a linux system or
> > Mac I would recommend recompiling the cartridge with some optimizations
> for
> > tanimoto similarity. To do this, you need to edit the cartridge Makefile
> > from:
> > PG_CPPFLAGS = -I${BOOSTHOME} -I${RDKIT}/Code -DRDKITVER='"007200"'
> > ${INCHIFLAGS} #-DUSE_BUILTIN_POPCOUNT -msse4.2
> >
> > to:
> > PG_CPPFLAGS = -I${BOOSTHOME} -I${RDKIT}/Code -DRDKITVER='"007200"'
> > ${INCHIFLAGS} -DUSE_BUILTIN_POPCOUNT -msse4.2
> >
> > (I just removed a comment character here). This speeds the Tanimoto
> > calculation up a fair bit (it's still not nearly as fast as Andrew's
> > chemfp, but it's better than the default behavior).
>
> I'm on linux (Ubuntu), and have just re-built with the above
> recommendation.
> I'll see what the speeds look like afterwards (out of interest, I presume
> the timings in your examples were with this optimisation in place?).
>

I would have thought so, but it turns out that I was using a build on my
Mac without the optimization. It's about 20% faster for those sample
queries when I use the rebuilt version.


> Does this also affect dice?
>

It should, but, stupidly, it looks like it doesn't. I'll fix that.


> And final question - after rebuilding the cartridge, does the extension
> need to be dropped and then re-created in all databases; does postgreSQL
> server need restarting; or neither?
>

You do not need to drop the extension. This is just a change to the shared
library and doesn't affect the API. You just need to do a "make install" to
copy in the new shared lib and then restart the server.

-greg
--
Is your legacy SCM system holding you back? Join Perforce May 7 to find out:
• 3 signs your SCM is hindering your productivity
• Requirements for releasing software faster
• Expert tips and advice for migrating your SCM now
http://p.sf.net/sfu/perforce___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] RDKit cartridge similarity search speeds(?)

2014-05-09 Thread James Davidson
Hi Greg,

> What these are telling you is that the second query is not using the index:
> it's a sequential scan, so it has to test all rows of the database. This
> happens because the index is defined for the operator %, but not for the
> function tanimoto_sml(). There may be an approach to get the index set up
> using that function, but there we reach the limits of my expertise.

Well, I will stick to the recommended operator use then!


> One final advanced topic: if you are planning on making regular use of the
> similarity features in the cartridge and are running on a linux system or
> Mac I would recommend recompiling the cartridge with some optimizations for
> tanimoto similarity. To do this, you need to edit the cartridge Makefile
> from:
> PG_CPPFLAGS = -I${BOOSTHOME} -I${RDKIT}/Code -DRDKITVER='"007200"'
> ${INCHIFLAGS} #-DUSE_BUILTIN_POPCOUNT -msse4.2
> 
> to:
> PG_CPPFLAGS = -I${BOOSTHOME} -I${RDKIT}/Code -DRDKITVER='"007200"'
> ${INCHIFLAGS} -DUSE_BUILTIN_POPCOUNT -msse4.2
> 
> (I just removed a comment character here). This speeds the Tanimoto
> calculation up a fair bit (it's still not nearly as fast as Andrew's
> chemfp, but it's better than the default behavior).

I'm on linux (Ubuntu), and have just re-built with the above recommendation.
I'll see what the speeds look like afterwards (out of interest, I presume the 
timings in your examples were with this optimisation in place?).

Does this also affect dice?

And final question - after rebuilding the cartridge, does the extension need to 
be dropped and then re-created in all databases; does postgreSQL server need 
restarting; or neither?


> Hope this helps,
> -greg

It does - thanks!

Kind regards

James

__
PLEASE READ: This email is confidential and may be privileged. It is intended 
for the named addressee(s) only and access to it by anyone else is 
unauthorised. If you are not an addressee, any disclosure or copying of the 
contents of this email or any action taken (or not taken) in reliance on it is 
unauthorised and may be unlawful. If you have received this email in error, 
please notify the sender or postmas...@vernalis.com. Email is not a secure 
method of communication and the Company cannot accept responsibility for the 
accuracy or completeness of this message or any attachment(s). Please check 
this email for virus infection for which the Company accepts no responsibility. 
If verification of this email is sought then please request a hard copy. Unless 
otherwise stated, any views or opinions presented are solely those of the 
author and do not represent those of the Company.

The Vernalis Group of Companies
100 Berkshire Place
Wharfedale Road
Winnersh, Berkshire
RG41 5RD, England
Tel: +44 (0)118 938 

To access trading company registration and address details, please go to the 
Vernalis website at www.vernalis.com and click on the "Company address and 
registration details" link at the bottom of the page..
__
--
Is your legacy SCM system holding you back? Join Perforce May 7 to find out:
• 3 signs your SCM is hindering your productivity
• Requirements for releasing software faster
• Expert tips and advice for migrating your SCM now
http://p.sf.net/sfu/perforce
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] RDKit cartridge similarity search speeds(?)

2014-05-08 Thread Greg Landrum
Hi James,

[fair warning before I start: we quickly hit the limits of my postgresql
expertise here]


On Thu, May 8, 2014 at 2:35 PM, James Davidson wrote:

>  Dear All,
>
>
>
> I have recently been spending a bit more time with the RDKit cartridge,
> and have what is probably a very naïve question…
>
> Having built some RDKit fingerprints for ChEMBL_18, I see the following
> behaviour (for clarification – ‘ecfp4_bv’ is the column in my rdk.fps table
> that has been generated using morganbv_fp(mol, 2)):
>
>
>
>
>
> chembl_18=# \timing on
>
> Timing is on.
>
>
>
> chembl_18=# set rdkit.tanimoto_threshold=0.5;
>
> SET
>
> Time: 0.167 ms
>
>
>
> chembl_18=# select chembl_id from rdk.fps where ecfp4_bv %
> morganbv_fp('c1nnccc1'::mol,2);
>
>   chembl_id
>
> -
>
> CHEMBL15719
>
> (1 row)
>
>
>
> Time: 2033.348 ms
>
>
>
> chembl_18=# select chembl_id from rdk.fps where tanimoto_sml(ecfp4_bv,
> morganbv_fp('c1nnccc1'::mol, 2)) > 0.5;
>
>   chembl_id
>
> -
>
> CHEMBL15719
>
> (1 row)
>
>
>
> Time: 6843.605 ms
>
>
>
>
>
> I can see that the query plans are different in the two cases, but I don’t
> fully understand why – see below:
>
>
>
> *QUERY 1 (with explain analyze)*
>
> chembl_18=# explain analyze select chembl_id from rdk.fps where ecfp4_bv %
> morganbv_fp('c1nnccc1'::mol,2);
>
>
> QUERY PLAN
>
>
> 
>
> Bitmap Heap Scan on fps  (cost=106.91..5298.31 rows=1352 width=13) (actual
> time=1774.986..1774.987 rows=1 loops=1)
>
>Recheck Cond: (ecfp4_bv %
> '\x0100084200048204'::bfp)
>
>->  Bitmap Index Scan on fps_ecfp4bv_idx  (cost=0.00..106.57 rows=1352
> width=0) (actual time=1774.969..1774.969 rows=1 loops=1)
>
>  Index Cond: (ecfp4_bv %
> '\x0100084200048204'::bfp)
>
> Total runtime: 1775.035 ms
>
> (5 rows)
>
>
>
> Time: 1776.133 ms
>
>
>
>
>
> *QUERY 2 (with explain analyze)*
>
> chembl_18=# explain analyze select chembl_id from rdk.fps where
> tanimoto_sml(ecfp4_bv, morganbv_fp('c1nnccc1'::mol, 2)) > 0.5;
>
>
> QUERY PLAN
>
>
> ---
>
> Seq Scan on fps  (cost=0.00..388808.17 rows=450793 width=13) (actual
> time=1278.115..6953.977 rows=1 loops=1)
>
>Filter: (tanimoto_sml(ecfp4_bv,
> '\x0100084200048204'::bfp)
> > 0.5::double precision)
>
>Rows Removed by Filter: 1352377
>
> Total runtime: 6954.010 ms
>
> (4 rows)
>
>
>
> Time: 6955.103 ms
>

What these are telling you is that the second query is not using the index:
it's a sequential scan, so it has to test all rows of the database. This
happens because the index is defined for the operator %, but not for the
function tanimoto_sml(). There may be an approach to get the index set up
using that function, but there we reach the limits of my expertise.


>  It seems conceptually ‘easier’ to add the similarity value as part of
> the query, rather than setting it as a variable ahead of the query; but
> clearly I should be doing it the latter way for performance reasons.  So
> even if I don’t fully understand why at the moment, am I correct in
> thinking that queries of this sort should always be run with the similarity
> operators (%, #)?  And if so, is the rdkit.tanimoto_threshold variable set
> at the level of the session, the user, or the database?
>
>
It's set at the session level.

When doing similarity searches, I find it generally helpful to also include
the <%> operator in an "order by" clause so that the results come back in
sorted order.
So instead of this;
chembl_17=# select molregno from rdk.fps where mfp2 %
morganbv_fp('Cc1ccc2nc(N(C)CC(=O)O)sc2c1');
 molregno
--
   412312
   412302
   412310
   441378
   470082
   773946
   775269
   911501
  1015485
  1034321
  1040255
  1040496
  1042958
  1043871
  1044892
  1045663
  1047691
  1049393
(18 rows)

Time: 1042.310 ms

I do this:
chembl_17=# select molregno from rdk.fps where mfp2 %
morganbv_fp('Cc1ccc2nc(N(C)CC(=O)O)sc2c1')
 order by
morganbv_fp('Cc1ccc2nc(N(C)CC(=O)O)sc2c1') <%> mfp2;
 molregno
--
   412312
   470082
  1040255
   773946
  1044892
  1049393
  1040496
   441378
  1047691
  1042958
   412302
  1043871
   412310
  1045663
   911501
   775269
  1015485
  1034321
(18 rows)

Time: 1032.266 ms

Notice that this doesn't make things any slower.

It's nice to see the actual similarity values:

[Rdkit-discuss] RDKit cartridge similarity search speeds(?)

2014-05-08 Thread James Davidson
Dear All,

I have recently been spending a bit more time with the RDKit cartridge, and 
have what is probably a very naïve question...
Having built some RDKit fingerprints for ChEMBL_18, I see the following 
behaviour (for clarification - 'ecfp4_bv' is the column in my rdk.fps table 
that has been generated using morganbv_fp(mol, 2)):


chembl_18=# \timing on
Timing is on.

chembl_18=# set rdkit.tanimoto_threshold=0.5;
SET
Time: 0.167 ms

chembl_18=# select chembl_id from rdk.fps where ecfp4_bv % 
morganbv_fp('c1nnccc1'::mol,2);
  chembl_id
-
CHEMBL15719
(1 row)

Time: 2033.348 ms

chembl_18=# select chembl_id from rdk.fps where tanimoto_sml(ecfp4_bv, 
morganbv_fp('c1nnccc1'::mol, 2)) > 0.5;
  chembl_id
-
CHEMBL15719
(1 row)

Time: 6843.605 ms


I can see that the query plans are different in the two cases, but I don't 
fully understand why - see below:

QUERY 1 (with explain analyze)
chembl_18=# explain analyze select chembl_id from rdk.fps where ecfp4_bv % 
morganbv_fp('c1nnccc1'::mol,2);

 QUERY PLAN

Bitmap Heap Scan on fps  (cost=106.91..5298.31 rows=1352 width=13) (actual 
time=1774.986..1774.987 rows=1 loops=1)
   Recheck Cond: (ecfp4_bv % 
'\x0100084200048204'::bfp)
   ->  Bitmap Index Scan on fps_ecfp4bv_idx  (cost=0.00..106.57 rows=1352 
width=0) (actual time=1774.969..1774.969 rows=1 loops=1)
 Index Cond: (ecfp4_bv % 
'\x0100084200048204'::bfp)
Total runtime: 1775.035 ms
(5 rows)

Time: 1776.133 ms


QUERY 2 (with explain analyze)
chembl_18=# explain analyze select chembl_id from rdk.fps where 
tanimoto_sml(ecfp4_bv, morganbv_fp('c1nnccc1'::mol, 2)) > 0.5;

  QUERY PLAN
---
Seq Scan on fps  (cost=0.00..388808.17 rows=450793 width=13) (actual 
time=1278.115..6953.977 rows=1 loops=1)
   Filter: (tanimoto_sml(ecfp4_bv, 
'\x0100084200048204'::bfp)
 > 0.5::double precision)
   Rows Removed by Filter: 1352377
Total runtime: 6954.010 ms
(4 rows)

Time: 6955.103 ms


It seems conceptually 'easier' to add the similarity value as part of the 
query, rather than setting it as a variable ahead of the query; but clearly I 
should be doing it the latter way for performance reasons.  So even if I don't 
fully understand why at the moment, am I correct in thinking that queries of 
this sort should always be run with the similarity operators (%, #)?  And if 
so, is the rdkit.tanimoto_threshold variable set at the level of the session, 
the user, or the database?

Kind regards

James

__
PLEASE READ: This email is confidential and may be privileged. It is intended 
for the named addressee(s) only and access to it by anyone else is 
unauthorised. If you are not an addressee, any disclosure or copying of the 
contents of this email or any action taken (or not taken) in reliance on it is 
unauthorised and may be unlawful. If you have received this email in error, 
please notify the sender or postmas...@vernalis.com. Email is not a secure 
method of communication and the Company cannot accept responsibility for the 
accuracy or completeness of this message or any attachment(s). Please check 
this email for virus infection for which the Company accepts no responsibility. 
If verification of this email is sought then please request a hard copy. Unless 
otherwise stated, any views or opinions presented are solely those of the 
author and do not represent those of the Company.

The Vernalis Group of Companies
100 Berkshire Place
Wharfedale Road
Winnersh, Berkshire
RG41 5RD, England
Tel: +44 (0)118 938 

To access trading company registration and address details, please go to the 
Vernalis website at www.vernalis.com and click on the "Company address and 
registration details" link at the bottom of the page..
__--
Is your legacy SCM system holding you back? Join Perforce May 7 to find out:
• 3 signs your SCM is hindering your productivity
• Requirements for releasing software faster
• Expert tips and advice for migratin