[Rdkit-discuss] tuning postgres for substructure search

2016-11-28 Thread Alexander Klenner-Bajaja
Dear all,

Has anyone experience with the postgrsql.conf and tuning its parameters for 
efficient sub structure search in collections of about 5-10 million compounds?

We have about 250GB of RAM and 40 CPUs at hand. Concurrency will be around 10 
simultaneous connections/queries so any memory parameter that is additive 
regarding number of parallel queries should be 1/10 of its max.  It`s mainly a 
question on how to share the available RAM among the various parameters.

I found the recommendations of Greg to increase "shared_bufferes" and 
"work_mem" as well as turning off "synchronous_commit" and "full_page_writes"  
but I`d really want to make best use of what I have available.

I played around a little with those values but I was hoping for someone who 
already has an optimized setup in terms of ratios between the different memory 
parameters.

I tested the same queries as shown in the documentation unfortunately we have a 
different collection (about 4x times as large) and I find the following results 
(repeated queries give almost identical results in terms of execution time)

>From the documentation:
select count(*) from mols where m@>'c12c1nncc2';
count
---
  1916
(1 row)
Time: 531.994 ms

select count(*) from mols where m@>'c1ccnc2c1nccn2';
count
---
  1020
(1 row)
Time: 419.250 ms

select count(*) from mols where m@>'c12c1ncs2' ;


select count(*) from mols where m@>'c12c1CNCCN2';
count
---
  3014
(1 row)
Time: 5717.531 ms


Just to see if possible:
select count(*) from mols where m@>'c1c1';
count
-
3466765
(1 row)
Time: 219125.331 ms

select count(*) from mols where m@>'C';
count
-
4999564
(1 row)

Time: 185236.779 ms


You can see, the last query returns the full collection in 3 minutes while 
benzene returns ¾ and takes about 4 minutes.

Thank you very much,

Alex


Best regards / Mit freundlichen Grüßen / Sincères salutations

Dr. Alexander Garvin Klenner-Bajaja
Administrator Requirements Engineering-Solution Design | Dir. 2.8.3.3
European Patent Office
Patentlaan 3-9 | 2288 EE Rijswijk | The Netherlands
Tel. +31(0)70340-1991
aklen...@epo.org
www.epo.org

Please consider the environment before printing this email.

--
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] smarts vs smiles database queries and explicit hydrogens

2016-11-23 Thread Alexander Klenner-Bajaja
Thank you both Greg & Markus – I`ll happily wait for it to appear in conda in 
the near future ☺

Alex

From: Markus Sitzmann [mailto:markus.sitzm...@gmail.com]
Sent: Wednesday, November 23, 2016 3:40 PM
To: Alexander Klenner-Bajaja
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] smarts vs smiles database queries and explicit 
hydrogens

If I understood Greg correctly, it will be in 2016.09 which isn't in conda just 
of yet, they are currently working on putting it there.

Markus
-
|  Markus Sitzmann
|  markus.sitzm...@gmail.com<mailto:markus.sitzm...@gmail.com>

On 23 Nov 2016, at 15:29, Alexander Klenner-Bajaja 
<aklen...@epo.org<mailto:aklen...@epo.org>> wrote:
Dear Greg,

Thank you very much, looking at the results that function was exactly what I 
was looking for – only I can’t find it in my updated anaconda installation.

“conda update rdkit” tells me I have the latest version 2016.03.4 and postgres 
tells me I have the 3.4 version of the RDKit extension

If I understand your blog post correctly it should be in 2016.03 version? What 
am I missing?


Best,

Alex



From: Greg Landrum [mailto:greg.land...@gmail.com]
Sent: Wednesday, November 23, 2016 11:42 AM
To: Alexander Klenner-Bajaja
Cc: 
rdkit-discuss@lists.sourceforge.net<mailto:rdkit-discuss@lists.sourceforge.net>
Subject: Re: [Rdkit-discuss] smarts vs smiles database queries and explicit 
hydrogens

Hi Alex,

The new version of the cartridge has some capabilities that, I think, address 
this.

There's a blog post about this: 
http://rdkit.blogspot.com/2016/07/tuning-substructure-queries-ii.html
but the short version is that you can do the kind of queries it seems like you 
want to do quite simply:

chembl_21=# select * from rdk.mols where 
m@>mol_adjust_query_properties('*c1ncccn1') limit 3;
 molregno |   m
--+---
   601707 | CCCc1nc(-c2ccc(F)cc2)oc1C(=O)NC(CC)CN1CCN(c2ncccn2)CC1
   289103 | CC1C(=N)/C(=N/Nc2ccc(S(=O)(=O)Nc3ncccn3)cc2)C(=O)C(C)C1=O
   607646 | 
CCNC(=O)[C@@H]1OC(n2cnc3c(NC(=O)Nc4ccc(S(=O)(=O)Nc5ncccn5)cc4)ncnc32)[C@@H](O)[C@H]1O
(3 rows)

chembl_21=# select * from rdk.mols where 
m@>mol_adjust_query_properties('*c1nc(*)ccn1') limit 3;
 molregno |   m
--+---
   158659 | CCNc1nccc(-c2c(-c3ccc(F)cc3)ncn2C2CCN(C)CC2)n1
   158743 | Nc1nccc(-c2c(-c3ccc(F)cc3)ncn2C2CCN(Cc3c3)CC2)n1
   158843 | CC1(C)CC(n2cnc(-c3ccc(F)cc3)c2-c2ccnc(N)n2)CC(C)(C)N1
(3 rows)

chembl_21=# select * from rdk.mols where 
m@>mol_adjust_query_properties('*c1nc(*)cc(*)n1') limit 3;
 molregno |m
--+--
   726443 | CN=C(S)NNc1nc(C)cc(C)n1
   561136 | 
C[C@H](Nc1cc(NC2CC2)nc(C(F)(F)F)n1)[C@@H](Cc1ccc(Cl)cc1)c1(Br)c1
   205784 | CCN(CC)C(=O)CSc1nc(N)cc(Cl)n1
(3 rows)

There's more detail in the blog post, but the default behavior is to convert 
dummies into generic query atoms and to constrain the substitution at any other 
*ring* position.

Best Regards,
-greg


On Wed, Nov 23, 2016 at 9:20 AM, Alexander Klenner-Bajaja 
<aklen...@epo.org<mailto:aklen...@epo.org>> wrote:
Hi all,

I am currently exploring the possibilities of the RDKit database cartridge for 
substructure search- I installed everything following the  tutorial from 
http://www.rdkit.org/docs/Install.html

Very nice tutorial  - worked perfectly fine.

Since we are exploring solutions for browser based gui searches I created a 
test page using Ketcher (http://lifescience.opensource.epam.com/ketcher/) which 
communicates with the database through PHP.

Ketcher returns a SMILES representation from the drawn molecule. The raw data 
of the molecules in the database are canonical SMILES created from RDKIT 
canonical SMILES from the rdkit KNIME node (they are text-mined from patents).

When doing substructure searches, as long as we query for well-defined 
compounds the results make sense – however looking at R1,…-groups things get a 
little odd.

I found a very old discussion on the mailing list from 2009 where this has been 
discussed and I understood from that dialog that when looking at SMILES with a 
“*” representation this is interpreted as a dummy atom and the same dummy atom 
is expected in the search space to produce a hit. While a SMARTS representation 
of the same string actually leads to the behaviour that “any atom” is matched 
at that position.

I ended up with the very cumbersome query, I am sure there are more elegant 
ways of doing this using ::qmol notation, but as I said I am currently 
exploring ☺

That’s the query (in PHP) in question for PostgreSQL:

$search_result = pg_query($dbconn, "select m from pat.mols where 
m@>mol_from_smarts

[Rdkit-discuss] smarts vs smiles database queries and explicit hydrogens

2016-11-23 Thread Alexander Klenner-Bajaja
Hi all,

I am currently exploring the possibilities of the RDKit database cartridge for 
substructure search- I installed everything following the  tutorial from 
http://www.rdkit.org/docs/Install.html

Very nice tutorial  - worked perfectly fine.

Since we are exploring solutions for browser based gui searches I created a 
test page using Ketcher (http://lifescience.opensource.epam.com/ketcher/) which 
communicates with the database through PHP.

Ketcher returns a SMILES representation from the drawn molecule. The raw data 
of the molecules in the database are canonical SMILES created from RDKIT 
canonical SMILES from the rdkit KNIME node (they are text-mined from patents).

When doing substructure searches, as long as we query for well-defined 
compounds the results make sense - however looking at R1,...-groups things get 
a little odd.

I found a very old discussion on the mailing list from 2009 where this has been 
discussed and I understood from that dialog that when looking at SMILES with a 
"*" representation this is interpreted as a dummy atom and the same dummy atom 
is expected in the search space to produce a hit. While a SMARTS representation 
of the same string actually leads to the behaviour that "any atom" is matched 
at that position.

I ended up with the very cumbersome query, I am sure there are more elegant 
ways of doing this using ::qmol notation, but as I said I am currently 
exploring :)

That's the query (in PHP) in question for PostgreSQL:

$search_result = pg_query($dbconn, "select m from pat.mols where 
m@>mol_from_smarts(mol_to_smiles(mol_from_smiles('".$_POST['smiles']. "'))) 
LIMIT 20;");

Extracting rdkit functionality leaves me with:

m@>mol_from_smarts(mol_to_smiles(mol_from_smiles('".$_POST['smiles']. "')))
and adding a smiles string to make it more readable:

m@>mol_from_smarts(mol_to_smiles(mol_from_smiles(' C([*])1=CC=CC=C1')))   (This 
is how Ketcher creates the smiles string, using explicit double bonds)

This query does actually work and returns structures that are correct (visually 
inspected a few examples)

The same query without all the molecule conversion methods does not return 
anything

m@>' C([*])1=CC=CC=C1'

I guess the reason for this is that the default interpretation is smiles and it 
is looking for actual dummy atoms in the database (there are none).

That's my first question: Is this assumption correct?

My next issue is a query with explicit hydrogens:

Using

"C([*])1=C([H])C([H])=C([H])C([H])=C1[H]"

as a query with the all the molecule conversion as shown above to make SMARTS 
happen, returns among others:

"C(C)1=CC=C(C)C=C1"

Which is correct for implicit hydrogens but not for explicit - so my guess is 
they are lost.

Can I enforce at query time against the cartridge to work with explicit 
hydrogens so that only molecules are found that have different substitutes at 
the "*" position?

I could not find a pre-defined function for that.

Thank you very much for any hints or solutions,

Best regards,

Alex



Best regards / Mit freundlichen Grüßen / Sincères salutations

Dr. Alexander Garvin Klenner-Bajaja
Administrator Requirements Engineering-Solution Design | Dir. 2.8.3.3
European Patent Office
Patentlaan 3-9 | 2288 EE Rijswijk | The Netherlands
Tel. +31(0)70340-1991
aklen...@epo.org
www.epo.org

Please consider the environment before printing this email.

--
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss