Hi all, I am currently exploring the possibilities of the RDKit database cartridge for substructure search. I installed everything following the tutorial at http://www.rdkit.org/docs/Install.html
Very nice tutorial, it worked perfectly. Since we are exploring solutions for browser-based GUI searches, I created a test page using Ketcher (http://lifescience.opensource.epam.com/ketcher/), which communicates with the database through PHP. Ketcher returns a SMILES representation of the drawn molecule. The raw data for the molecules in the database are canonical SMILES produced by the RDKit canonical SMILES node in KNIME (the structures are text-mined from patents).

When doing substructure searches, the results make sense as long as we query for well-defined compounds. With R-groups (R1, ...), however, things get a little odd. I found a very old discussion on the mailing list from 2009 on this topic. As I understood it, a "*" in a SMILES string is interpreted as a dummy atom, and the same dummy atom is expected in the search space to produce a hit, whereas the same string read as SMARTS matches "any atom" at that position.

I ended up with a very cumbersome query. I am sure there are more elegant ways of doing this using the ::qmol notation, but as I said, I am currently exploring :) This is the query in question (in PHP, for PostgreSQL):

    $search_result = pg_query($dbconn, "select m from pat.mols where m@>mol_from_smarts(mol_to_smiles(mol_from_smiles('".$_POST['smiles']. "'))) LIMIT 20;");

Extracting the RDKit functionality leaves me with:

    m@>mol_from_smarts(mol_to_smiles(mol_from_smiles('".$_POST['smiles']. "')))

and, with a SMILES string filled in to make it more readable:

    m@>mol_from_smarts(mol_to_smiles(mol_from_smiles(' C([*])1=CC=CC=C1')))

(This is how Ketcher writes the SMILES string, using explicit double bonds.) This query does work and returns correct structures (I visually inspected a few examples). The same query without all the molecule conversion functions returns nothing:

    m@>' C([*])1=CC=CC=C1'

I guess the reason is that the default interpretation is SMILES, so the search looks for actual dummy atoms in the database (there are none). That is my first question: is this assumption correct?

My next issue is a query with explicit hydrogens. Using

    C([*])1=C([H])C([H])=C([H])C([H])=C1[H]

as the query, with all the molecule conversions shown above to force SMARTS interpretation, returns among others:

    C(C)1=CC=C(C)C=C1

That is correct for implicit hydrogens but not for explicit ones, so my guess is that the explicit hydrogens are lost. Can I force the cartridge at query time to honor explicit hydrogens, so that only molecules that differ in the substituent at the "*" position are found? I could not find a predefined function for that.

Thank you very much for any hints or solutions,

Best regards,
Alex

Dr. Alexander Garvin Klenner-Bajaja
Administrator Requirements Engineering-Solution Design | Dir. 2.8.3.3
European Patent Office
Patentlaan 3-9 | 2288 EE Rijswijk | The Netherlands
Tel. +31(0)70340-1991
aklen...@epo.org
www.epo.org