Re: [Rdkit-discuss] RDKit cartridge similarity search speeds(?)
James, On Fri, May 9, 2014 at 10:25 AM, James Davidson wrote: > > > One final advanced topic: if you are planning on making regular use of > the > > similarity features in the cartridge and are running on a linux system or > > Mac I would recommend recompiling the cartridge with some optimizations > for > > tanimoto similarity. To do this, you need to edit the cartridge Makefile > > from: > > PG_CPPFLAGS = -I${BOOSTHOME} -I${RDKIT}/Code -DRDKITVER='"007200"' > > ${INCHIFLAGS} #-DUSE_BUILTIN_POPCOUNT -msse4.2 > > > > to: > > PG_CPPFLAGS = -I${BOOSTHOME} -I${RDKIT}/Code -DRDKITVER='"007200"' > > ${INCHIFLAGS} -DUSE_BUILTIN_POPCOUNT -msse4.2 > > > > (I just removed a comment character here). This speeds the Tanimoto > > calculation up a fair bit (it's still not nearly as fast as Andrew's > > chemfp, but it's better than the default behavior). > > I'm on linux (Ubuntu), and have just re-built with the above > recommendation. > I'll see what the speeds look like afterwards (out of interest, I presume > the timings in your examples were with this optimisation in place?). > I would have thought so, but it turns out that I was using a build on my Mac without the optimization. It's about 20% faster for those sample queries when I use the rebuilt version. > Does this also affect dice? > It should, but, stupidly, it looks like it doesn't. I'll fix that. > And final question - after rebuilding the cartridge, does the extension > need to be dropped and then re-created in all databases; does postgreSQL > server need restarting; or neither? > You do not need to drop the extension. This is just a change to the shared library and doesn't affect the API. You just need to do a "make install" to copy in the new shared lib and then restart the server. -greg -- Is your legacy SCM system holding you back? Join Perforce May 7 to find out: • 3 signs your SCM is hindering your productivity • Requirements for releasing software faster • Expert tips and advice for migrating your SCM now http://p.sf.net/sfu/perforce___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] RDKit cartridge similarity search speeds(?)
Hi Greg, > What these are telling you is that the second query is not using the index: > it's a sequential scan, so it has to test all rows of the database. This > happens because the index is defined for the operator %, but not for the > function tanimoto_sml(). There may be an approach to get the index set up > using that function, but there we reach the limits of my expertise. Well, I will stick to the recommended operator use then! > One final advanced topic: if you are planning on making regular use of the > similarity features in the cartridge and are running on a linux system or > Mac I would recommend recompiling the cartridge with some optimizations for > tanimoto similarity. To do this, you need to edit the cartridge Makefile > from: > PG_CPPFLAGS = -I${BOOSTHOME} -I${RDKIT}/Code -DRDKITVER='"007200"' > ${INCHIFLAGS} #-DUSE_BUILTIN_POPCOUNT -msse4.2 > > to: > PG_CPPFLAGS = -I${BOOSTHOME} -I${RDKIT}/Code -DRDKITVER='"007200"' > ${INCHIFLAGS} -DUSE_BUILTIN_POPCOUNT -msse4.2 > > (I just removed a comment character here). This speeds the Tanimoto > calculation up a fair bit (it's still not nearly as fast as Andrew's > chemfp, but it's better than the default behavior). I'm on linux (Ubuntu), and have just re-built with the above recommendation. I'll see what the speeds look like afterwards (out of interest, I presume the timings in your examples were with this optimisation in place?). Does this also affect dice? And final question - after rebuilding the cartridge, does the extension need to be dropped and then re-created in all databases; does postgreSQL server need restarting; or neither? > Hope this helps, > -greg It does - thanks! Kind regards James __ PLEASE READ: This email is confidential and may be privileged. It is intended for the named addressee(s) only and access to it by anyone else is unauthorised. If you are not an addressee, any disclosure or copying of the contents of this email or any action taken (or not taken) in reliance on it is unauthorised and may be unlawful. If you have received this email in error, please notify the sender or postmas...@vernalis.com. Email is not a secure method of communication and the Company cannot accept responsibility for the accuracy or completeness of this message or any attachment(s). Please check this email for virus infection for which the Company accepts no responsibility. If verification of this email is sought then please request a hard copy. Unless otherwise stated, any views or opinions presented are solely those of the author and do not represent those of the Company. The Vernalis Group of Companies 100 Berkshire Place Wharfedale Road Winnersh, Berkshire RG41 5RD, England Tel: +44 (0)118 938 To access trading company registration and address details, please go to the Vernalis website at www.vernalis.com and click on the "Company address and registration details" link at the bottom of the page.. __ -- Is your legacy SCM system holding you back? Join Perforce May 7 to find out: • 3 signs your SCM is hindering your productivity • Requirements for releasing software faster • Expert tips and advice for migrating your SCM now http://p.sf.net/sfu/perforce ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] RDKit cartridge similarity search speeds(?)
Hi James, [fair warning before I start: we quickly hit the limits of my postgresql expertise here] On Thu, May 8, 2014 at 2:35 PM, James Davidson wrote: > Dear All, > > > > I have recently been spending a bit more time with the RDKit cartridge, > and have what is probably a very naïve question… > > Having built some RDKit fingerprints for ChEMBL_18, I see the following > behaviour (for clarification – ‘ecfp4_bv’ is the column in my rdk.fps table > that has been generated using morganbv_fp(mol, 2)): > > > > > > chembl_18=# \timing on > > Timing is on. > > > > chembl_18=# set rdkit.tanimoto_threshold=0.5; > > SET > > Time: 0.167 ms > > > > chembl_18=# select chembl_id from rdk.fps where ecfp4_bv % > morganbv_fp('c1nnccc1'::mol,2); > > chembl_id > > - > > CHEMBL15719 > > (1 row) > > > > Time: 2033.348 ms > > > > chembl_18=# select chembl_id from rdk.fps where tanimoto_sml(ecfp4_bv, > morganbv_fp('c1nnccc1'::mol, 2)) > 0.5; > > chembl_id > > - > > CHEMBL15719 > > (1 row) > > > > Time: 6843.605 ms > > > > > > I can see that the query plans are different in the two cases, but I don’t > fully understand why – see below: > > > > *QUERY 1 (with explain analyze)* > > chembl_18=# explain analyze select chembl_id from rdk.fps where ecfp4_bv % > morganbv_fp('c1nnccc1'::mol,2); > > > QUERY PLAN > > > > > Bitmap Heap Scan on fps (cost=106.91..5298.31 rows=1352 width=13) (actual > time=1774.986..1774.987 rows=1 loops=1) > >Recheck Cond: (ecfp4_bv % > '\x0100084200048204'::bfp) > >-> Bitmap Index Scan on fps_ecfp4bv_idx (cost=0.00..106.57 rows=1352 > width=0) (actual time=1774.969..1774.969 rows=1 loops=1) > > Index Cond: (ecfp4_bv % > '\x0100084200048204'::bfp) > > Total runtime: 1775.035 ms > > (5 rows) > > > > Time: 1776.133 ms > > > > > > *QUERY 2 (with explain analyze)* > > chembl_18=# explain analyze select chembl_id from rdk.fps where > tanimoto_sml(ecfp4_bv, morganbv_fp('c1nnccc1'::mol, 2)) > 0.5; > > > QUERY PLAN > > > --- > > Seq Scan on fps (cost=0.00..388808.17 rows=450793 width=13) (actual > time=1278.115..6953.977 rows=1 loops=1) > >Filter: (tanimoto_sml(ecfp4_bv, > '\x0100084200048204'::bfp) > > 0.5::double precision) > >Rows Removed by Filter: 1352377 > > Total runtime: 6954.010 ms > > (4 rows) > > > > Time: 6955.103 ms > What these are telling you is that the second query is not using the index: it's a sequential scan, so it has to test all rows of the database. This happens because the index is defined for the operator %, but not for the function tanimoto_sml(). There may be an approach to get the index set up using that function, but there we reach the limits of my expertise. > It seems conceptually ‘easier’ to add the similarity value as part of > the query, rather than setting it as a variable ahead of the query; but > clearly I should be doing it the latter way for performance reasons. So > even if I don’t fully understand why at the moment, am I correct in > thinking that queries of this sort should always be run with the similarity > operators (%, #)? And if so, is the rdkit.tanimoto_threshold variable set > at the level of the session, the user, or the database? > > It's set at the session level. When doing similarity searches, I find it generally helpful to also include the <%> operator in an "order by" clause so that the results come back in sorted order. So instead of this; chembl_17=# select molregno from rdk.fps where mfp2 % morganbv_fp('Cc1ccc2nc(N(C)CC(=O)O)sc2c1'); molregno -- 412312 412302 412310 441378 470082 773946 775269 911501 1015485 1034321 1040255 1040496 1042958 1043871 1044892 1045663 1047691 1049393 (18 rows) Time: 1042.310 ms I do this: chembl_17=# select molregno from rdk.fps where mfp2 % morganbv_fp('Cc1ccc2nc(N(C)CC(=O)O)sc2c1') order by morganbv_fp('Cc1ccc2nc(N(C)CC(=O)O)sc2c1') <%> mfp2; molregno -- 412312 470082 1040255 773946 1044892 1049393 1040496 441378 1047691 1042958 412302 1043871 412310 1045663 911501 775269 1015485 1034321 (18 rows) Time: 1032.266 ms Notice that this doesn't make things any slower. It's nice to see the actual similarity values:
[Rdkit-discuss] RDKit cartridge similarity search speeds(?)
Dear All, I have recently been spending a bit more time with the RDKit cartridge, and have what is probably a very naïve question... Having built some RDKit fingerprints for ChEMBL_18, I see the following behaviour (for clarification - 'ecfp4_bv' is the column in my rdk.fps table that has been generated using morganbv_fp(mol, 2)): chembl_18=# \timing on Timing is on. chembl_18=# set rdkit.tanimoto_threshold=0.5; SET Time: 0.167 ms chembl_18=# select chembl_id from rdk.fps where ecfp4_bv % morganbv_fp('c1nnccc1'::mol,2); chembl_id - CHEMBL15719 (1 row) Time: 2033.348 ms chembl_18=# select chembl_id from rdk.fps where tanimoto_sml(ecfp4_bv, morganbv_fp('c1nnccc1'::mol, 2)) > 0.5; chembl_id - CHEMBL15719 (1 row) Time: 6843.605 ms I can see that the query plans are different in the two cases, but I don't fully understand why - see below: QUERY 1 (with explain analyze) chembl_18=# explain analyze select chembl_id from rdk.fps where ecfp4_bv % morganbv_fp('c1nnccc1'::mol,2); QUERY PLAN Bitmap Heap Scan on fps (cost=106.91..5298.31 rows=1352 width=13) (actual time=1774.986..1774.987 rows=1 loops=1) Recheck Cond: (ecfp4_bv % '\x0100084200048204'::bfp) -> Bitmap Index Scan on fps_ecfp4bv_idx (cost=0.00..106.57 rows=1352 width=0) (actual time=1774.969..1774.969 rows=1 loops=1) Index Cond: (ecfp4_bv % '\x0100084200048204'::bfp) Total runtime: 1775.035 ms (5 rows) Time: 1776.133 ms QUERY 2 (with explain analyze) chembl_18=# explain analyze select chembl_id from rdk.fps where tanimoto_sml(ecfp4_bv, morganbv_fp('c1nnccc1'::mol, 2)) > 0.5; QUERY PLAN --- Seq Scan on fps (cost=0.00..388808.17 rows=450793 width=13) (actual time=1278.115..6953.977 rows=1 loops=1) Filter: (tanimoto_sml(ecfp4_bv, '\x0100084200048204'::bfp) > 0.5::double precision) Rows Removed by Filter: 1352377 Total runtime: 6954.010 ms (4 rows) Time: 6955.103 ms It seems conceptually 'easier' to add the similarity value as part of the query, rather than setting it as a variable ahead of the query; but clearly I should be doing it the latter way for performance reasons. So even if I don't fully understand why at the moment, am I correct in thinking that queries of this sort should always be run with the similarity operators (%, #)? And if so, is the rdkit.tanimoto_threshold variable set at the level of the session, the user, or the database? Kind regards James __ PLEASE READ: This email is confidential and may be privileged. It is intended for the named addressee(s) only and access to it by anyone else is unauthorised. If you are not an addressee, any disclosure or copying of the contents of this email or any action taken (or not taken) in reliance on it is unauthorised and may be unlawful. If you have received this email in error, please notify the sender or postmas...@vernalis.com. Email is not a secure method of communication and the Company cannot accept responsibility for the accuracy or completeness of this message or any attachment(s). Please check this email for virus infection for which the Company accepts no responsibility. If verification of this email is sought then please request a hard copy. Unless otherwise stated, any views or opinions presented are solely those of the author and do not represent those of the Company. The Vernalis Group of Companies 100 Berkshire Place Wharfedale Road Winnersh, Berkshire RG41 5RD, England Tel: +44 (0)118 938 To access trading company registration and address details, please go to the Vernalis website at www.vernalis.com and click on the "Company address and registration details" link at the bottom of the page.. __-- Is your legacy SCM system holding you back? Join Perforce May 7 to find out: • 3 signs your SCM is hindering your productivity • Requirements for releasing software faster • Expert tips and advice for migratin