Re: [Rdkit-discuss] Substructure search for an aldehyde returns ketones and acids

2021-07-21 Thread Greg Landrum
Yeah, this is exactly the case where using qmol_from_ctab() should help.

Below is a short example demonstrating this by querying my local ChEMBL
instance. Notice that the first form of the query, which uses
mol_from_ctab(), matches what you describe: the results include amides,
esters, etc. The second query, which uses qmol_from_ctab(), only returns
molecules which have an aldehyde.

I hope this helps,
-greg

chembl_28=# select * from rdk.mols where m@>mol_from_ctab('aldehyde query
  MJ192500

  4  3  0  0  0  0  0  0  0  0999 V2000
   -2.8123    1.5508    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5267    1.1383    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.2412    1.5508    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5267    0.3133    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  2  4  2  0  0  0  0
  2  3  1  0  0  0  0
M  END
') limit 5;
 molregno | m
----------+--------------------------------------------------------------
   310993 | O=C(NO)c1cc(CS(=O)(=O)c2ccc(Cl)cc2)on1
   310992 | O=C(NO)c1cc(CS(=O)(=O)c2(Cl)c2)on1
   318822 | CCC(NC(=O)C[C@H](N)C(=O)N1CCC[C@H]1C#N)c1c1
   310016 | O=C(CCNC(=O)c1c1)NC1CCN(Cc2ccc(Cl)cc2)C1
   319381 | CCOC(=O)/C=C/c1ccc(CN(C(=O)C2C2)c2(/C=C/C(=O)OC)c2)cc1
(5 rows)

chembl_28=# select * from rdk.mols where m@>qmol_from_ctab('aldehyde query
  MJ192500

  4  3  0  0  0  0  0  0  0  0999 V2000
   -2.8123    1.5508    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5267    1.1383    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.2412    1.5508    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5267    0.3133    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  2  4  2  0  0  0  0
  2  3  1  0  0  0  0
M  END
') limit 5;
 molregno | m
----------+--------------------------------------------------------------
   284772 | COC(=O)NC1[C@H](C)O[C@@H](O[C@H]2C/C=C(\C)[C@@H]3C=C[C@@H]4[C@
@H](O)[C@@H](C)C[C@H](C)[C@H]4[C@]3(C)/C(O)=C3\C(=O)O[C@]4(CC(C=O)=C[C@H
](OC(C)=O)[C@H]4/C=C\2C)C3=O)CC1(C)[N+](=O)[O-]
   284633 | COC(=O)NC1[C@H](C)O[C@@H](O[C@H]2C/C=C(\C)[C@@H]3C=C[C@@H]4[C@
@H](O[C@H]5O5)[C@@H](C)C[C@H](C)[C@H]4[C@]3(C)/C(O)=C3\C(=O)O[C@
]4(CC(C=O)=C[C@H](OC(C)=O)[C@H]4/C=C\2C)C3=O)CC1(C)[N+](=O)[O-]
   284865 | COC(=O)NC1[C@H](C)O[C@@H](O[C@H]2C/C=C(\C)[C@@H]3C=C[C@@H]4[C@
@H](OCc5ccc(OC)cc5)[C@@H](C)C[C@H](C)[C@H]4[C@]3(C)/C(O)=C3\C(=O)O[C@
]4(CC(C=O)=C[C@H](OC(C)=O)[C@H]4/C=C\2C)C3=O)CC1(C)[N+](=O)[O-]
   299586 | CC1(C)C2CC[C@]3(C)C(CC=C4C5CC(C)(C)[C@@H](OC(=O)c6c6)[C@H
](OC(=O)/C=C/c6c6)[C@]5(C=O)[C@H](O)C[C@]43C)[C@@]2(C)CC[C@@H]1O
   317613 | Cn1cncc1C=O
(5 rows)
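
The difference between the two queries comes down to the explicit H in the query CTAB. As a minimal sketch (plain Python, no RDKit required), the block below parses the V2000 mol block from this thread and reports which atoms are explicit hydrogens and what they are attached to. The column spacing of the atom lines is an assumption based on the standard V2000 fixed-width layout, and `parse_v2000` and `explicit_hydrogens` are hypothetical helper names, not part of RDKit or the cartridge.

```python
# Query CTAB from the thread, written in standard V2000 fixed-width columns
# (the exact column spacing is an assumption).
MOLBLOCK = """aldehyde query
  MJ192500

  4  3  0  0  0  0  0  0  0  0999 V2000
   -2.8123    1.5508    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5267    1.1383    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.2412    1.5508    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5267    0.3133    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  2  4  2  0  0  0  0
  2  3  1  0  0  0  0
M  END
"""


def parse_v2000(molblock):
    """Return (atom symbols, bonds) from a V2000 mol block."""
    lines = molblock.splitlines()
    # Counts line: atom count in columns 1-3, bond count in columns 4-6.
    n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])
    atom_lines = lines[4:4 + n_atoms]
    bond_lines = lines[4 + n_atoms:4 + n_atoms + n_bonds]
    # Atom symbol sits after the three 10-character coordinate fields.
    symbols = [line[31:34].strip() for line in atom_lines]
    # Bond line: first atom, second atom, bond order (1-based atom indices).
    bonds = [(int(l[0:3]), int(l[3:6]), int(l[6:9])) for l in bond_lines]
    return symbols, bonds


def explicit_hydrogens(symbols, bonds):
    """Map each explicit H (1-based index) to the symbol of its neighbour."""
    found = {}
    for a, b, _order in bonds:
        for h, other in ((a, b), (b, a)):
            if symbols[h - 1] == "H":
                found[h] = symbols[other - 1]
    return found


symbols, bonds = parse_v2000(MOLBLOCK)
print(symbols)                             # ['C', 'C', 'H', 'O']
print(explicit_hydrogens(symbols, bonds))  # {3: 'C'}
```

Atom 3 is the explicit H on the carbonyl carbon. As described above, qmol_from_ctab() uses that H to help define the query, which is what separates an aldehyde from an amide or ester here.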




Re: [Rdkit-discuss] Substructure search for an aldehyde returns ketones and acids

2021-07-20 Thread Webster Homer
I should have included the query. It looks like RDKit is ignoring the H atom.
The user put in an explicit H.
===MOL file after this
aldehyde query
  MJ192500

  4  3  0  0  0  0  0  0  0  0999 V2000
   -2.8123    1.5508    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5267    1.1383    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.2412    1.5508    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5267    0.3133    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  2  4  2  0  0  0  0
  2  3  1  0  0  0  0
M  END
=MOL file above this





This message and any attachment are confidential and may be privileged or 
otherwise protected from disclosure. If you are not the intended recipient, you 
must not copy this message or attachment or disclose the contents to any other 
person. If you have received this transmission in error, please notify the 
sender immediately and delete the message and any attachment from your system. 
Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not accept 
liability for any omissions or errors in this message which may arise as a 
result of E-Mail-transmission or for damages resulting from any unauthorized 
changes of the content of this message and any attachment thereto. Merck KGaA, 
Darmstadt, Germany and any of its subsidiaries do not guarantee that this 
message is free of viruses and does not accept liability for any damages caused 
by any virus transmitted therewith.



Click 
merckgroup.com/disclaimer<https://www.merckgroup.com/en/legal-disclaimer/mail-disclaimer.html>
 to access the German, French, Spanish, Portuguese, Turkish, Polish and Slovak 
versions of this disclaimer.



Please find our Privacy Statement information by clicking here 
merckgroup.com/en/privacy-statement.html<https://www.merckgroup.com/en/privacy-statement.html>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net<mailto:Rdkit-discuss@lists.sourceforge.net>
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss




Re: [Rdkit-discuss] Substructure search for an aldehyde returns ketones and acids

2021-07-16 Thread Greg Landrum
Hi Webster,

Without seeing an actual query I am inclined to believe that it’s not a
bug. The problem is more likely a query which has not been drawn explicitly
or an easily made mistake in the way the cartridge is being used.

Assuming that the aldehyde queries have been drawn with an explicit H atom
connected to the C (apologies for not showing this, I’m on my phone and
don’t have a sketcher available), you should be calling the cartridge
function qmol_from_ctab(), not mol_from_ctab(), before doing the query.
qmol_from_ctab() will use the H to help define the query.

If you’re doing this and still seeing incorrect search results, please
share a query and the way you’re doing the search and we can try to help
(or diagnose the bug if there is one)

Best,
-greg


On Fri, 16 Jul 2021 at 17:53, Webster Homer <
webster.ho...@milliporesigma.com> wrote:

> We use the RDKit PostgreSQL cartridge as our substructure searcher. When a
> user sketches an aldehyde and submits the mol file as the query, RDKit
> returns aldehydes, but also returns ketones and acids. Is this a bug?


[Rdkit-discuss] Substructure search for an aldehyde returns ketones and acids

2021-07-16 Thread Webster Homer
We use the RDKit PostgreSQL cartridge as our substructure searcher. When a user 
sketches an aldehyde and submits the mol file as the query, RDKit returns 
aldehydes, but also returns ketones and acids. Is this a bug?





Re: [Rdkit-discuss] Substructure search racemic compounds only

2021-03-17 Thread Ivan Tubert-Brohman
Hi Lauren,

SMARTS doesn't have a direct way of saying that an atom has no specified
chirality, but you can express that idea using recursive SMARTS. For example,

In [46]: racemic = Chem.MolFromSmiles('c12ccccc1cncc2NC(=O)C(CCO2)c1cc(Cl)ccc12')

In [47]: chiral1 = Chem.MolFromSmiles('c12ccccc1cncc2NC(=O)[C@H](CCO2)c1cc(Cl)ccc12')

In [48]: chiral2 = Chem.MolFromSmiles('c12ccccc1cncc2NC(=O)[C@@H](CCO2)c1cc(Cl)ccc12')

In [49]: [m.HasSubstructMatch(Chem.MolFromSmarts('c12ccccc1cncc2NC(=O)[CH;!$([@])](CC)c1cc(Cl)ccc1'), useChirality=True) for m in [racemic, chiral1, chiral2]]

Out[49]: [True, False, False]

where the atom [CH;!$([@])] means "a carbon with a hydrogen AND
not a chiral atom".

Hope this helps,
Ivan



[Rdkit-discuss] Substructure search racemic compounds only

2021-03-17 Thread Lauren Reid
Hi,

I would like to perform a substructure search in which a racemic chiral SMARTS 
finds only racemic compounds and not those that have specified stereochemistry, 
e.g. these compounds from the COVID moonshot project:


Does anyone know if there’s a way to specify this distinction in an rdkit 
substructure search?

Thanks,

Lauren 

Dr Lauren Reid
Computational Chemist / Developer
lauren.r...@medchemica.com
www.medchemica.com

Medchemica Ltd is a company registered in England and Wales with company number 
8162245.



Re: [Rdkit-discuss] Substructure search issue with aliphatic/aromatic bonds

2020-05-20 Thread theozh
Hi Paolo,

argh... I thought that if you set  params.sanitize=False  then you don't 
want sanitization.
Apparently, you do need it for the mols, skipping only the aromatization step.
I guess I'm slowly starting to understand...

I tried to find something similar to what you explained here and in your 
GitHub example in the RDKit documentation or on the web... without much success. 
Wouldn't this be an essential step in substructure search, or even a FAQ?

Well, now I took a larger list and I got as many hits as I expected. Happy end!

Thank you very much for your kind help!
Theo.



Re: [Rdkit-discuss] Substructure search issue with aliphatic/aromatic bonds

2020-05-20 Thread Paolo Tosco

Hi Theo,

that's because you omitted the sanitization step completely, so the 
molecule is missing crucial information for the SubstructureMatch to do 
a proper job.


If you put back sanitization, only leaving out the aromatization step, 
things work as expected.
Also, you do not need to create pattern again from SMILES, you can make 
a copy of the molecule that you have already created and sanitized using 
the Chem.Mol copy constructor.


from rdkit import Chem

smiles_strings = '''
N12N3C(CC4=CC=CC(NC=C2)=C14)=CC=C3
C12=CC=CC3=C1N(N4C=CC=C4C2)C=CN3
'''

# Skip the empty first line produced by the leading newline
smiles_list = smiles_strings.splitlines()[1:]
print(smiles_list)

# Parse without sanitization...
params = Chem.SmilesParserParams()
params.sanitize = False

mols = [Chem.MolFromSmiles(x, params) for x in smiles_list]
# ...then run every sanitization step except aromaticity perception
for m in mols:
    Chem.SanitizeMol(m, Chem.SANITIZE_ALL ^ Chem.SANITIZE_SETAROMATICITY)

# Copy the already-sanitized molecule instead of re-parsing the SMILES
pattern = Chem.Mol(mols[0])

query_params = Chem.AdjustQueryParameters()
query_params.makeBondsGeneric = True
query_params.aromatizeIfPossible = False
query_params.adjustDegree = False
query_params.adjustHeavyDegree = False
pattern_generic_bonds = Chem.AdjustQueryProperties(pattern, query_params)

matches = [idx for idx, m in enumerate(mols)
           if m.HasSubstructMatch(pattern_generic_bonds)]
print("{} of {}: {}".format(len(matches), len(smiles_list), matches))

$ python3 SubstructMatch2.py

['N12N3C(CC4=CC=CC(NC=C2)=C14)=CC=C3', 'C12=CC=CC3=C1N(N4C=CC=C4C2)C=CN3']
2 of 2: [0, 1]
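
The expression Chem.SANITIZE_ALL ^ Chem.SANITIZE_SETAROMATICITY in the script above is a bitmask operation: it takes the mask with every sanitization step switched on and flips the one bit for aromaticity perception. A minimal stdlib-only sketch of the same idea follows. The `SanitizeStep` class and its values are hypothetical stand-ins chosen for illustration, not RDKit's actual flag values.

```python
from enum import IntFlag


class SanitizeStep(IntFlag):
    """Hypothetical stand-in for RDKit's sanitize flags; values are illustrative only."""
    CLEANUP = 1
    PROPERTIES = 2
    KEKULIZE = 4
    SET_AROMATICITY = 8


# Analogue of Chem.SANITIZE_ALL: every step switched on.
ALL_STEPS = (SanitizeStep.CLEANUP | SanitizeStep.PROPERTIES
             | SanitizeStep.KEKULIZE | SanitizeStep.SET_AROMATICITY)

# XOR flips the SET_AROMATICITY bit; it was set in ALL_STEPS, so it is cleared.
without_aromaticity = ALL_STEPS ^ SanitizeStep.SET_AROMATICITY

print(SanitizeStep.SET_AROMATICITY in without_aromaticity)  # False
print(SanitizeStep.KEKULIZE in without_aromaticity)         # True
print(int(without_aromaticity))                             # 7
```

XOR only removes a flag that is already set, which is always the case when starting from an "all steps" mask.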

Cheers,
p.



Re: [Rdkit-discuss] Substructure search issue with aliphatic/aromatic bonds

2020-05-20 Thread theozh
Hi Paolo,

sorry, I made a typo (makeBondGeneric instead of makeBondsGeneric); that's why 
the bonds weren't UNSPECIFIED.
The following example seems to work fine now for these two SMILES: the first 
structure is found in the second one.

C12=CC=CN1NCCC2
and
C12C=CC=C(C=C3)C=1N3NCC2

However, there is another example where it still doesn't work with this code. 
See my code below.
The two SMILES

N12N3C(CC4=CC=CC(NC=C2)=C14)=CC=C3
and
C12=CC=CC3=C1N(N4C=CC=C4C2)C=CN3

actually describe the identical structure, but were drawn in a different way in 
ChemDraw. As a consequence the SMILES are different which shouldn't be a 
problem. But if I put these SMILES into the code below the first one won't 
match the second one and the other way around as well.
I must be doing something horribly wrong.
Do I have to canonicalize the SMILES first?
Isn't there a good tutorial on substructure search with RDKit and all its 
options and frequently asked questions and tons of examples?

best,
Theo.


### start of code
from rdkit import Chem

smiles_strings = '''
N12N3C(CC4=CC=CC(NC=C2)=C14)=CC=C3
C12=CC=CC3=C1N(N4C=CC=C4C2)C=CN3
'''

smiles_list = smiles_strings.splitlines()[1:]
print(smiles_list)

params = Chem.SmilesParserParams()
params.sanitize=False

mols = [Chem.MolFromSmiles(x,params) for x in smiles_list]

pattern = Chem.MolFromSmiles(smiles_list[0],params)

query_params = Chem.AdjustQueryParameters()
query_params.makeBondsGeneric = True
query_params.aromatizeIfPossible = False
query_params.adjustDegree = False
query_params.adjustHeavyDegree = False
pattern_generic_bonds = Chem.AdjustQueryProperties(pattern,query_params)

matches = [idx for idx,m in enumerate(mols) if 
m.HasSubstructMatch(pattern_generic_bonds)]
print("{} of {}: {}".format(len(matches),len(smiles_list),matches))
### end of code



Re: [Rdkit-discuss] Substructure search issue with aliphatic/aromatic bonds

2020-05-19 Thread Paolo Tosco

Hi Theo,

I don't think the RDKit version should make a difference; did you notice 
that rdmolops.AdjustQueryProperties() does not modify the molecule in 
place, but rather returns a modified copy?


pattern_generic_bonds = Chem.AdjustQueryProperties(pattern, query_params)

That might be the reason. Also, only pattern_generic_bonds will have 
UNSPECIFIED bonds, the mols will still have SINGLE and DOUBLE bonds.
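
A small sketch of the copy-versus-in-place point, assuming RDKit is installed. The SMILES `CC=O` is an arbitrary illustration that is not taken from this thread. The bond types of the original pattern are compared with those of the object returned by AdjustQueryProperties().

```python
from rdkit import Chem

# Arbitrary small molecule used only for illustration
pattern = Chem.MolFromSmiles("CC=O")

query_params = Chem.AdjustQueryParameters()
query_params.makeBondsGeneric = True

# The adjusted query is the return value; `pattern` itself is left untouched.
generic = Chem.AdjustQueryProperties(pattern, query_params)

# Bonds of the original pattern: still SINGLE and DOUBLE
print([str(b.GetBondType()) for b in pattern.GetBonds()])
# Bonds of the returned copy: made generic by the adjustment
print([str(b.GetBondType()) for b in generic.GetBonds()])
```

If the return value is discarded, the search runs against the unmodified pattern, which would explain still seeing SINGLE and DOUBLE bonds.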


Feel free to contact me off-list if you need help with the above.

Cheers,
p.



Re: [Rdkit-discuss] Substructure search issue with aliphatic/aromatic bonds

2020-05-19 Thread theozh
Hi Paolo,

thank you very much for your detailed answer.
I tried to reproduce your last suggestion (but I don't have Jupyter Notebook).
However, my bonds are still SINGLE and DOUBLE instead of UNSPECIFIED.
Does this maybe depend on the RDKit Version, I have 2019.03... ?

Maybe, I should update and need to investigate further.
Theo.




Re: [Rdkit-discuss] Substructure search issue with aliphatic/aromatic bonds

2020-05-19 Thread Paolo Tosco

Hi Theo,

the lack of match is due to different aromaticity flags on atoms and 
bonds in the larger molecule.


This gist provides some explanation and a possible solution:

https://gist.github.com/ptosco/e410e45278b94e8f047ff224193d7788

Cheers,
p.






[Rdkit-discuss] Substructure search issue with aliphatic/aromatic bonds

2020-05-19 Thread theozh
Dear RDKit-users,

I would like to do a very simple substructure search.
The chapter 3.5 "Substructure Searching" in RDKit Documentation (2019.09.1) is 
pretty short and doesn't point to a solution. So far, I've learned that you can 
create your search pattern via Chem.MolFromSmiles() or Chem.MolFromSmarts().

In the minimal example below, I want to use the first SMILES in the 
list as the search pattern. I expect 2 matches but I get either 1 or 0 matches. So, 
I'm doing something wrong. What am I missing?
Is it about implicit/explicit aromatic and aliphatic bonds or some 
explicit/implicit hydrogens?
How can I find the first structure in both SMILES?

thank you for any hints,
Theo.

### simple substructure search (but doesn't find what is expected)
from rdkit import Chem

smiles_strings = '''
C12=CC=CN1NCCC2
C12=CC=CC(C=C3)=C1N3NCC2
'''
smiles_list = smiles_strings.splitlines()[1:]
print(smiles_list)

pattern = Chem.MolFromSmiles(smiles_list[0])  # MolFromSmiles
matches = [x for x in smiles_list if 
Chem.MolFromSmiles(x).HasSubstructMatch(pattern)]
print(len(matches))   # result: 1, why not 2?

pattern = Chem.MolFromSmarts(smiles_list[0])  # MolFromSmarts
matches = [x for x in smiles_list if 
Chem.MolFromSmiles(x).HasSubstructMatch(pattern)]
print(len(matches))   # result: 0, why not 2?
### end of code




[Rdkit-discuss] Substructure search with RDKit

2020-03-01 Thread theozh
Dear RDKit-experts,

I'm using RDKit to search substructures in molecular structures.
I used Chem.MolFromSmiles() for my substructure search and was wondering why 
the substructure was not found in some structures.
On Chemistry.StackExchange I got a helpful hint. And now, I guess, I better 
understand the difference between SMILES and SMARTS.

The following example:
I guess I cannot attach images here. So, for a visualization please check 
(https://chemistry.stackexchange.com/q/128440/81125)
The first SMILES is searched in the other structures. You will find molecules 2 
and 4, but not 3 and 5.

Code:

### substructure search with RDKit
from rdkit import Chem

smiles_list = ['C12=CC=CC=C1C3=CC=CC4=C3C2=CC=C4',
               'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C6=C4C=CC=C6',
               'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C=C4',
               'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C6=C4C=C7CCCCC7=C6',
               'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C6=C4C=C7CC8=CC=CC=C8CC7=C6']

pattern = Chem.MolFromSmiles(smiles_list[0])
for idx, smiles in enumerate(smiles_list):
    m = Chem.MolFromSmiles(smiles)
    print("Structure {}: pattern found {}".format(idx+1, m.HasSubstructMatch(pattern)))
### end of code

Result:
Structure 1: pattern found True
Structure 2: pattern found True
Structure 3: pattern found False
Structure 4: pattern found True
Structure 5: pattern found False


The solution I have come up with so far is the following (see also 
https://chemistry.stackexchange.com/a/128453/81125):

Basically, if you convert the search SMILES 'C12=CC=CC=C1C3=CC=CC4=C3C2=CC=C4' 
to a mol and this mol back to SMILES via Chem.MolToSmiles(), you get 
c1ccc2c(c1)-c1cccc3cccc-2c13. If you create your search pattern from this via 
Chem.MolFromSmarts(), you will still not find structures 3 and 5, probably 
because of the explicitly defined single bonds. However, if you replace - by ~, 
you get smiles_1b: c1ccc2c(c1)~c1cccc3cccc~2c13. With this, you will also find 
structures 3 and 5.
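
[Editor's sketch] One surprise the plain string replacement could cause: str.replace('-', '~') also rewrites any '-' that sits inside a bracket atom, for example the charge in [O-]. The helper below is a hypothetical illustration, not part of RDKit; it only relaxes '-' bond symbols that occur outside square brackets. It works on the text alone and does not check that the result is a valid SMARTS.

```python
def relax_single_bonds(smiles: str) -> str:
    """Replace explicit single bonds ('-') with 'any bond' ('~').

    Characters inside [...] are left alone, so a charge such as [O-]
    is not turned into [O~].
    """
    out = []
    in_bracket = False
    for ch in smiles:
        if ch == '[':
            in_bracket = True
        elif ch == ']':
            in_bracket = False
        if ch == '-' and not in_bracket:
            out.append('~')   # explicit single bond outside a bracket atom
        else:
            out.append(ch)
    return ''.join(out)


# Canonical SMILES of the search structure, as quoted in the message above.
print(relax_single_bonds('c1ccc2c(c1)-c1cccc3cccc-2c13'))
# A made-up charged example: only the ring-ring bond should change.
print(relax_single_bonds('[O-]C(=O)c1ccccc1-c1ccccc1'))
```

The result can be passed to Chem.MolFromSmarts() in the same way as smiles_1b in the code that follows.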

Code: (I also added Benzene as structure 6 to have a non-match)

### substructure search with RDKit
from rdkit import Chem

smiles_list = ['C12=CC=CC=C1C3=CC=CC4=C3C2=CC=C4',
               'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C6=C4C=CC=C6',
               'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C=C4',
               'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C6=C4C=C7CCCCC7=C6',
               'C12=CC=CC=C1C3=CC=C4C5=C(C=CC2=C35)C6=C4C=C7CC8=CC=CC=C8CC7=C6',
               'c1ccccc1']

def search_structure(pattern):
    for idx, smiles in enumerate(smiles_list):
        m = Chem.MolFromSmiles(smiles)
        print("Structure {}: pattern found {}".format(idx+1, m.HasSubstructMatch(pattern)))

smiles_1a  = smiles_list[0]
pattern_1a = Chem.MolFromSmiles(smiles_1a)
smiles_1b  = Chem.MolToSmiles(pattern_1a).replace('-','~')   # replace bonds
pattern_1b = Chem.MolFromSmarts(smiles_1b)

print("\nSMILES 1a: {}".format(smiles_1a))
search_structure(pattern_1a)
print("\nSMILES 1b: {}".format(smiles_1b))
search_structure(pattern_1b)
### end of code

Result:

SMILES 1a: C12=CC=CC=C1C3=CC=CC4=C3C2=CC=C4
Structure 1: pattern found True
Structure 2: pattern found True
Structure 3: pattern found False
Structure 4: pattern found True
Structure 5: pattern found False
Structure 6: pattern found False

SMILES 1b: c1ccc2c(c1)~c1cccc3cccc~2c13
Structure 1: pattern found True
Structure 2: pattern found True
Structure 3: pattern found True
Structure 4: pattern found True
Structure 5: pattern found True
Structure 6: pattern found False


My question is now: is this the way to go or could this lead to other surprises 
or unexpected results?
Please excuse that I'm asking the same question in the WWW twice, but I guess 
this is the primary place to ask.

Best regards,
Theo.




Re: [Rdkit-discuss] Substructure search using RDKit PostgreSQL cartridge

2018-05-31 Thread Greg Landrum
That explains the problem. Glad everything is working.



Re: [Rdkit-discuss] Substructure search using RDKit PostgreSQL cartridge

2018-05-31 Thread Alfredo Quevedo

Hi Greg,

thank you for your feedback.

the tests you mentioned worked ok for me and both molecules are matched 
using the specified smiles. I found that the matching problem was really 
silly: I was expecting to match both molecules in the ChEMBL database I 
downloaded (i.e. CHEMBL1517804 and CHEMBL2442053), which are accessible 
through a search using the web interface of ChEMBL. However, for some 
reason compound CHEMBL2442053 is not present in the downloadable 
database (and is therefore obviously not matched).


best regards

Alfredo







Re: [Rdkit-discuss] Substructure search using RDKit PostgreSQL cartridge

2018-05-30 Thread Greg Landrum
Hi Alfredo,

I can't think of any reason this would be true based on the molecules you
provide.
Certainly each of the molecules has a substructure match:
chembl_23=# select 'CCc1ccc(-n2nc3ccc(NC(=O)c4ccc5c(c4)OCO5)cc3n2)cc1'::mol@>'c1c[nH]nn1'::mol;
 ?column?
----------
 t
(1 row)

chembl_23=# select 'COc1ncc(-c2ccc(N(Cc3ccsc3)C(=O)Cn3nnc4ccccc43)cc2)cn1'::mol@>'c1c[nH]nn1'::mol;
 ?column?
----------
 t
(1 row)

And if I put them in a small table, add an index, and search, I also get
the expected results:
chembl_23=# create temporary table twomols (smiles text,m  mol);
CREATE TABLE
chembl_23=# insert into twomols values
('CCc1ccc(-n2nc3ccc(NC(=O)c4ccc5c(c4)OCO5)cc3n2)cc1',
'CCc1ccc(-n2nc3ccc(NC(=O)c4ccc5c(c4)OCO5)cc3n2)cc1'::mol);
INSERT 0 1
chembl_23=# insert into twomols values
('COc1ncc(-c2ccc(N(Cc3ccsc3)C(=O)Cn3nnc4ccccc43)cc2)cn1',
'COc1ncc(-c2ccc(N(Cc3ccsc3)C(=O)Cn3nnc4ccccc43)cc2)cn1'::mol);
INSERT 0 1
chembl_23=# select smiles from twomols where m@>'c1c[nH]nn1'::mol;
                        smiles
-------------------------------------------------------
 CCc1ccc(-n2nc3ccc(NC(=O)c4ccc5c(c4)OCO5)cc3n2)cc1
 COc1ncc(-c2ccc(N(Cc3ccsc3)C(=O)Cn3nnc4ccccc43)cc2)cn1
(2 rows)

chembl_23=# create index tidx on twomols using gist(m);
CREATE INDEX
chembl_23=# select smiles from twomols where m@>'c1c[nH]nn1'::mol;
                        smiles
-------------------------------------------------------
 CCc1ccc(-n2nc3ccc(NC(=O)c4ccc5c(c4)OCO5)cc3n2)cc1
 COc1ncc(-c2ccc(N(Cc3ccsc3)C(=O)Cn3nnc4ccccc43)cc2)cn1
(2 rows)

Can you please check to see if this simple test works for you?
To do more detailed troubleshooting I will need to know which version of
the cartridge you are using and on which operating system.

Best,
-greg





[Rdkit-discuss] Substructure search using RDKit PostgreSQL cartridge

2018-05-29 Thread Alfredo Quevedo

Dear user,

I am trying to perform a substructure search using SMILES notation against 
the ChEMBL database I have already loaded into my PostgreSQL database. I 
am here providing two sample molecules in SMILES format as read by the 
RDKit cartridge into the database:


Molecule 1: CCc1ccc(-n2nc3ccc(NC(=O)c4ccc5c(c4)OCO5)cc3n2)cc1

Molecule 2: COc1ncc(-c2ccc(N(Cc3ccsc3)C(=O)Cn3nnc4ccccc43)cc2)cn1


Both molecules contain a triazole scaffold, and I am trying to select 
both compounds from the whole database using the following SMILES 
generated by RDKit for a triazole: ´c1c[nH]nn1´


My problem is that the search is only able to match molecule 1 but not 
molecule 2. What might be the problem? Since I am searching in a database 
of compounds previously processed with the RDKit cartridge, shouldn't the 
substructure match?


thanks in advance for the help

regards

Alfredo




Re: [Rdkit-discuss] Substructure search in database

2017-08-18 Thread 山崎広之
Dear Greg,

Thank you for the rapid response.

It seems to be very difficult for me, so I will consider how to do it.

Thank you very much.

Have a nice vacation.

Hiroyuki Yamasaki

2017年8月18日(金) 16:58 Greg Landrum :

> Hi Hiroyuki,
>
> On Fri, Aug 18, 2017 at 12:19 AM, 山崎広之  wrote:
>
>>
>> I use PostgreSQL with RDKit database cartridge.
>>
>> And, I want to know if I can use my structure fingerprint instead of
>> default fingerprint (Pattern fingerprints?) for substructure search.
>>
>
> That should be possible, but it does require making changes to the source
> code of the rdkit cartridge itself, so it's not something to try unless you
> are comfortable working in C++.
>
> The function that you need to modify is here:
>
> https://github.com/rdkit/rdkit/blob/master/Code/PgSQL/rdkit/adapter.cpp#L386
> specifically around this line:
>
> https://github.com/rdkit/rdkit/blob/master/Code/PgSQL/rdkit/adapter.cpp#L392
>
> You would also need to let the build system know where to find the library
> that has your new fingerprinting function (by editing the cartridge's
> CMakeLists.txt file).
>
> -greg
>
>


Re: [Rdkit-discuss] Substructure search in database

2017-08-18 Thread Greg Landrum
Hi Hiroyuki,

On Fri, Aug 18, 2017 at 12:19 AM, 山崎広之  wrote:

>
> I use PostgreSQL with RDKit database cartridge.
>
> And, I want to know if I can use my structure fingerprint instead of
> default fingerprint (Pattern fingerprints?) for substructure search.
>

That should be possible, but it does require making changes to the source
code of the rdkit cartridge itself, so it's not something to try unless you
are comfortable working in C++.

The function that you need to modify is here:
https://github.com/rdkit/rdkit/blob/master/Code/PgSQL/rdkit/adapter.cpp#L386
specifically around this line:
https://github.com/rdkit/rdkit/blob/master/Code/PgSQL/rdkit/adapter.cpp#L392

You would also need to let the build system know where to find the library
that has your new fingerprinting function (by editing the cartridge's
CMakeLists.txt file).

-greg


[Rdkit-discuss] Substructure search in database

2017-08-17 Thread 山崎広之
Dear all,

I use PostgreSQL with RDKit database cartridge.

And, I want to know if I can use my structure fingerprint instead of
default fingerprint (Pattern fingerprints?) for substructure search.

I appreciate for any advices.

Thanks.

Hiroyuki Yamasaki


[Rdkit-discuss] substructure search with brics framents on cdk2 cmpds fails partially

2016-08-04 Thread Markus Metz
Hello all:

I am trying to use the brics algorithm to fragment my compounds, filter the
fragments and try to group the original compounds by selected fragments.

As a test I used the cdk2 data set provided with RDKit.

Here is a sample code partly cannibalizing Greg's and others' example code:


This part creates and displays the fragments:
---
import pandas as pd
from rdkit import Chem
from rdkit.Chem import BRICS, Descriptors, PandasTools, rdMolDescriptors

df = PandasTools.LoadSDF('cdk2.sdf')
df.describe()

# collect the unique BRICS fragments (as SMILES strings) of all molecules
allfrags = set()

for i, rows in df.iterrows():
    mol = rows['ROMol']
    pieces = BRICS.BRICSDecompose(mol)
    allfrags.update(pieces)

# put the fragments into their own frame and add a molecule column
fragList = list(allfrags)
df1 = pd.Series(fragList)
df2 = df1.to_frame()
df2.columns = ['smiles']
PandasTools.AddMoleculeColumnToFrame(df2, smilesCol='smiles', molCol='ROMol')

# simple descriptors used for filtering
df2['NumRings'] = df2['ROMol'].map(rdMolDescriptors.CalcNumRings)
df2['RingAroms'] = df2['ROMol'].apply(lambda x: Descriptors.NumAromaticRings(x))
df2['HeavyAtoms'] = df2['ROMol'].apply(lambda x: Descriptors.HeavyAtomCount(x))

# keep fragments with >6 heavy atoms, at least one aromatic ring and >1 ring
df3 = df2[df2['HeavyAtoms'] > 6]
df4 = df3[df3['RingAroms'] > 0]
df5 = df4[df4['NumRings'] > 1]

PandasTools.FrameToGridImage(df5, column='ROMol')



This part removes the dummy atoms from smiles and tries to regenerate mol
objects:
---
import re
resultsList = pd.DataFrame()

with open('my_csv.csv', 'a') as f:

    for smi in df5['ROMol']:
        smi = Chem.MolToSmiles(smi)
        smi = re.sub(r"(\(\[\*\]\))", "", smi)
        smi = re.sub(r"(\[\*\])", "", smi)

        pattern = Chem.MolFromSmiles(smi)
---

This throws an error saying:
RDKIT Error: Can't kekulize mol

Do you know what is going on?
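
[Editor's sketch] The thread archived here does not include an answer, so the following is only one possible explanation, stated as an assumption. Removing a dummy atom from the SMILES text can leave an aromatic nitrogen without its substituent: a fragment such as '[9*]n1cccc1' becomes 'n1cccc1', which has no hydrogen on the nitrogen (it would need to be '[nH]1cccc1') and is the kind of string that cannot be kekulized. The fragment strings and the strip_dummies() helper below are made up for illustration. The pattern also accepts isotope-labelled dummies such as [16*], which the regular expressions in the message above would not match.

```python
import re

# A dummy atom with an optional isotope label, e.g. [*] or [16*].
DUMMY = r"\[\d*\*\]"


def strip_dummies(smiles: str) -> str:
    """Remove dummy atoms textually: first as branches '([16*])', then bare."""
    smiles = re.sub(r"\(" + DUMMY + r"\)", "", smiles)
    return re.sub(DUMMY, "", smiles)


# Hypothetical BRICS-style fragments, chosen only to show the effect.
for frag in ["[16*]c1ccccc1", "[9*]n1cccc1", "CC([1*])=O"]:
    print(frag, "->", strip_dummies(frag))
```

Printing the stripped strings before calling Chem.MolFromSmiles() should make it easy to see which fragment triggers the error.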

Many thanks in advance,
Markus


Re: [Rdkit-discuss] Substructure search

2016-04-25 Thread groberts
Hi Greg,

Thank you very much for your quick reply and taking the time to look 
into this.

As a crude work around, if I split the dot-disconnected string into 
individual and unique components then include in the where clause, the 
query returns the result rapidly:

select * from rdk.mols where m@>'O' and m@>'OS(O)(=O)=O' and 
m@>'O.O.O.O.O.O.O.O.O.OS(O)(=O)=O' limit 10;

I suppose this won't help in every case, but it helps.
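
[Editor's sketch] The workaround above can be generated instead of written by hand. The function below is a hypothetical helper: it splits a dot-disconnected query into its unique components and returns a WHERE clause with one containment test per component, followed by the full query. The %s placeholders assume a driver with psycopg2-style parameter binding, and the column name m is taken from the query above.

```python
def containment_clauses(query: str, column: str = "m"):
    """Return (where_sql, params) for a dot-disconnected SMILES query.

    One '@>' test is emitted per unique component, then one for the
    original query, mirroring the hand-written workaround.
    """
    components = []
    for part in query.split("."):
        if part and part not in components:   # keep first-seen order
            components.append(part)
    # A single-component query needs no extra pre-filter clauses.
    params = components if components == [query] else components + [query]
    where_sql = " and ".join(f"{column}@>%s" for _ in params)
    return where_sql, params


where_sql, params = containment_clauses("O.O.O.O.O.O.O.O.O.OS(O)(=O)=O")
print("select * from rdk.mols where " + where_sql + " limit 10;")
print(params)
```

The cheap single-component tests are listed first, but whether the planner evaluates them first is up to PostgreSQL, so this remains a heuristic.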

Best regards,
Greg



On 2016-04-24 04:47, Greg Landrum wrote:
> On Sun, Apr 24, 2016 at 11:28 AM, Greg Landrum
>  wrote:
> 
>> Here's my guess: The highly redundant query is getting hung up on
>> one large molecule where there are a large number of possible
>> matches. The substructure engine is taking a long time to determine
>> whether or not that particular molecule has a match. PostgreSQL can
>> only interrupt the query when that call returns (the substructure
>> engine itself has no built-in timeout). This one is easy, though
>> time consuming, to track down. I'll see if I can do so.
> 
>  And there it is. Ironically it is the first molecule in my chembl_20
> structure table:
> 
> chembl_20=# select * from rdk.mols limit 1;
>  molregno | m
> 
> --+---
> 23681 |
> O[C@H]1[C@H](O)[C@@H](O)[C@@H](O)[C@H](O[C@H]2[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)[C@@H]2O)[C@H]1O
> (1 row)
> 
> chembl_20=# select
> 'O[C@H]1[C@H](O)[C@@H](O)[C@@H](O)[C@H](O[C@H]2[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)[C@@H]2O)[C@H]1O'::mol@>'O.O.O.O.O.O.O.O.O.OS(O)(=O)=O';
> ERROR:  canceling statement due to statement timeout
> Time: 35996.985 ms
> 
> Here's the same thing from Python:
> 
> In [3]: m =
> Chem.MolFromSmiles('O[C@H]1[C@H](O)[C@@H](O)[C@@H](O)[C@H](O[C@H]2[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)[C@@H]2O)[C@H]1O')
> 
> In [4]: p = Chem.MolFromSmiles('O.O.O.O.O.O.O.O.O.OS(O)(=O)=O')
> 
> In [5]:
> t1=time.time();m.HasSubstructMatch(p);t2=time.time();print(t2-t1)
> 36.09873843193054
> 
> Here's the github issue: https://github.com/rdkit/rdkit/issues/880 [1]
> 
> So now my task is to figure out why this substructure query is taking
> so long (there's clearly something pathological going on here since
> that molecule doesn't have a single S in it) and to explore adding a
> timeout to the substructure searching code.
> 
> Thanks for reporting this!
> -greg
> 
> 
> 
> Links:
> --
> [1] https://github.com/rdkit/rdkit/issues/880




[Rdkit-discuss] Substructure search

2016-04-23 Thread groberts
Hello,

Very nice work on this project!

Sorry if this is a known issue.  I looked through the mailing lists and 
didn't see the same problem listed.

When I perform a substructure search using the postgres cartridge, >99% 
of the time it works perfectly and is incredibly fast.  Sometimes I 
encounter situations where the system never returns a result, even after 
many hours on a small dataset.  A good example is this:

select count(substance_id) from substance where 
rdkmol@>'Br'

(rdkmol is type mol with the index in place)

The only way to stop is by restarting postgres.

Interestingly though, the following returns the count rather quickly:

select count(substance_id) from substance where 
rdkmol@>'CCBr'

I've encountered other examples where repeated atoms or components, such 
as the O's in the example below cause the same problem:

select count(substance_id) from substance where 
rdkmol@>'O.O.O.O.O.O.O.O.O.O.OS(O)(=O)=O'

I'd like to be able to run this on an internal webserver.  When the 
query hangs, the cpu is at ~100%.  Unfortunately, setting the postgres 
statement_timeout parameter does not help in this case.

Any suggestions on how to improve the query or how to kill it after a 
certain amount of time without restarting postgres?

Thanks a lot,

Greg









[Rdkit-discuss] Substructure search test cases

2014-01-15 Thread Gianluca Sforna
Before I go digging the repository, can anyone tell me if the test
suite includes stuff for the postgres cartridge?
I am particularly interested in comparing results and performance with
a custom solution I have here.

-- 
Gianluca Sforna

http://morefedora.blogspot.com
http://identi.ca/giallu - http://twitter.com/giallu



Re: [Rdkit-discuss] Substructure search test cases

2014-01-15 Thread Greg Landrum
Gianluca,

On Wed, Jan 15, 2014 at 4:57 AM, Gianluca Sforna gia...@gmail.com wrote:

> Before I go digging the repository, can anyone tell me if the test
> suite includes stuff for the postgres cartridge?
> I am particularly interested in comparing results and performance with
> a custom solution I have here.


There is a test suite for the cartridge that includes looking at
substructure search. It's a very limited set of query cases that is
primarily focussed on ensuring that searches using the fingerprint index
return the same results as searches without the fingerprint index. It's
probably too limited to use for much else.
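
[Editor's sketch] The kind of check described above (a fingerprint-screened search must return the same hits as an unscreened one) can be illustrated with a toy model. Everything here is a stand-in: strings play the role of molecules, substring containment plays the role of substructure matching, and the fingerprint sets one bit per character bigram. None of it is cartridge code.

```python
def fingerprint(text: str, nbits: int = 64) -> int:
    """Toy fingerprint: one bit per character bigram, folded into nbits."""
    fp = 0
    for a, b in zip(text, text[1:]):
        fp |= 1 << ((ord(a) * 31 + ord(b)) % nbits)
    return fp


def all_probe_bits_match(probe: int, ref: int) -> bool:
    """True if every bit set in probe is also set in ref."""
    return (probe & ref) == probe


def brute_force(query, targets):
    """Reference search: run the full match on every target."""
    return [t for t in targets if query in t]


def screened(query, targets):
    """Screen with fingerprints first, then confirm with the full match."""
    qfp = fingerprint(query)
    return [t for t in targets
            if all_probe_bits_match(qfp, fingerprint(t)) and query in t]


targets = ["c1ccccc1O", "CCO", "CCBr", "OS(O)(=O)=O"]
for q in ["CC", "O", "Br", "S(O)", "c1cc"]:
    # The screen may only discard non-matches, so both lists must agree.
    assert screened(q, targets) == brute_force(q, targets)
print("screened and unscreened searches agree")
```

The screen is safe in this toy model because every bigram of a matching query also occurs in the target; a real test would compare indexed and sequential-scan results on the same table.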

I have some test sets that I use for testing substructure search accuracy
and performance. These two blog posts discuss the sets and some work with
them:
http://rdkit.blogspot.ch/2013/11/fingerprint-based-substructure.html
http://rdkit.blogspot.ch/2013/11/substructure-fingerprints-and-cartridge.html

The datasets for those posts, like all the data sets from the blog, are
on github:
https://github.com/greglandrum/rdkit_blog/tree/master/data

If you're interested in SMARTS-based test cases, then the best query data
set I know of is the one that Andrew Dalke put together:
https://bitbucket.org/dalke/sqc
Andrew's set includes queries from a few different sources.

-greg


Re: [Rdkit-discuss] substructure search with fingerprints in C++

2013-06-10 Thread Gonzalo Colmenarejo-Sanchez
Thanks a lot, Greg, this is extremely helpful.

From: Greg Landrum [mailto:greg.land...@gmail.com]
Sent: 10 June 2013 05:39
To: Gonzalo Colmenarejo-Sanchez
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] substructure search with fingerprints in C++



On Sun, Jun 9, 2013 at 7:33 PM, Gonzalo Colmenarejo-Sanchez 
gonzalo.2.colmenar...@gsk.com wrote:
I see. Are these what you call layered fingerprints? How do they differ from 
the Daylight-like fingerprints?

No, the pattern fingerprints use a different approach that I haven't yet 
written a reasonable description of. That's on my ToDo list.

Looking forward to the C++ sample code.

It's attached. The layout of the files isn't really great since I struggled 
with the file i/o stuff, but this should at least demonstrate the idea. 
Hopefully you're better at C++ file i/o than I am and can make something more 
useful out of this.

-greg


Re: [Rdkit-discuss] substructure search with fingerprints in C++

2013-06-09 Thread Gonzalo Colmenarejo-Sanchez
Yes, C++ code examples for preprocessed molecules and fingerprints would be 
extremely helpful too.

By the way, if the query is a SMARTS like e.g. c1aaccc1 (representing several 
substructures), what fingerprint is exposed to AllProbeBitsMatch, the union of 
all the possible fingerprints, all the possible fingerprints sequentially, etc?

Thanks a lot for your help,

Gonzalo
From: Greg Landrum [mailto:greg.land...@gmail.com]
Sent: 09 June 2013 06:33
To: Gonzalo Colmenarejo-Sanchez
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] substructure search with fingerprints in C++

Hi Gonzalo,

On Sat, Jun 8, 2013 at 10:31 AM, Gonzalo Colmenarejo-Sanchez 
gonzalo.2.colmenar...@gsk.com wrote:

Could anyone provide some advice about how to run (fast but approximate) 
substructure searches with fingerprints using C++? I have a large set of SMILES 
for molecules and a relatively large set of SMILES/SMARTS for substructures.


Sorry, I meant to do this last weekend but it ended up slipping my mind.

The attached file demonstrates how to use fingerprints for substructure 
screening.

The usual required caveat: the substructure fingerprints are quite efficient 
for molecules that don't contain query features, but their efficacy drops as 
query features are introduced.

If you're going to be processing the same molecules/queries repeatedly (i.e. 
multiple runs on the same sets), it probably makes sense to use the RDKit's 
serialization code to save the pre-processed molecules and fingerprints. If 
applicable, let me know and I can generate some sample code for that too.

Hope this helps,
-greg




Re: [Rdkit-discuss] substructure search with fingerprints in C++

2013-06-09 Thread Gonzalo Colmenarejo-Sanchez
I see. Are these what you call layered fingerprints? How do they differ from 
the Daylight-like fingerprints?

Looking forward to the C++ sample code.

Thank you very much for all your help.

Gonzalo

From: Greg Landrum [mailto:greg.land...@gmail.com]
Sent: 09 June 2013 17:29
To: Gonzalo Colmenarejo-Sanchez
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] substructure search with fingerprints in C++


On Sun, Jun 9, 2013 at 12:29 PM, Gonzalo Colmenarejo-Sanchez 
gonzalo.2.colmenar...@gsk.com wrote:
Yes, C++ code examples for preprocessed molecules and fingerprints would be 
extremely helpful too.

I'll put one together and send it along. I don't normally do file i/o from C++, 
so it's taking me longer than I expected to get it working.

By the way, if the query is a SMARTS like e.g. c1aaccc1 (representing several 
substructures), what fingerprint is exposed to AllProbeBitsMatch, the union of 
all the possible fingerprints, all the possible fingerprints sequentially, etc?

It's a single fingerprint. The code essentially doesn't include substructures 
in the fingerprint that include query features. This means that the FPs are not 
incredibly efficient if you have query molecules that include a high density of 
query features.

Here's an example showing what happens with an extremely simple case.

Start with a simple molecule:

In [21]: list(Chem.PatternFingerprint(Chem.MolFromSmiles('CC')).GetOnBits())
Out[21]: [429, 778, 1022]

This matches one substructure query pattern [*]~[*] twice, so it sets three 
bits: one bit for each match and one for the fact that the match is CC.

Constructing the same molecule from SMARTS gives the same result; the 
fingerprinter knows how to deal sensibly with the implicit queries in SMARTS:

In [22]: list(Chem.PatternFingerprint(Chem.MolFromSmarts('CC')).GetOnBits())
Out[22]: [429, 778, 1022]

But as soon as I add a query feature, I lose a bit:

In [23]: list(Chem.PatternFingerprint(Chem.MolFromSmarts('C[A]')).GetOnBits())
Out[23]: [429, 1022]

This still matches [*]~[*] twice, but since the match involves a query 
feature, there's no bit set for the match itself.

If I make the match asymmetric, I get four bits:

In [24]: list(Chem.PatternFingerprint(Chem.MolFromSmarts('CO')).GetOnBits())
Out[24]: [54, 429, 759, 1022]

This matches [*]~[*] twice, but OC and CO generate different bits.
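
The screening step itself comes down to a subset test on the on-bits: a query can only match a molecule if every bit set in the query fingerprint is also set in the molecule fingerprint. Here is a minimal pure-Python sketch of that test using the on-bit lists shown above. The helper name all_probe_bits_match is made up for illustration; it only mimics the check that AllProbeBitsMatch performs and is not the RDKit implementation.

```python
def all_probe_bits_match(probe_bits, ref_bits):
    """Return True if every on-bit of the probe is also on in the reference."""
    return set(probe_bits) <= set(ref_bits)

# On-bit lists copied from the session above.
cc_bits = [429, 778, 1022]        # 'CC'
c_any_bits = [429, 1022]          # 'C[A]'
co_bits = [54, 429, 759, 1022]    # 'CO'

print(all_probe_bits_match(c_any_bits, co_bits))  # True: C[A] passes the screen for CO
print(all_probe_bits_match(cc_bits, co_bits))     # False: bit 778 is missing, CC is rejected
print(all_probe_bits_match(c_any_bits, cc_bits))  # True: C[A] passes the screen for CC
```

Passing the screen only means that a match is possible; the real substructure search still has to confirm it.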

Make sense?
-greg



Re: [Rdkit-discuss] substructure search with fingerprints in C++

2013-06-09 Thread Greg Landrum
On Sun, Jun 9, 2013 at 7:33 PM, Gonzalo Colmenarejo-Sanchez 
gonzalo.2.colmenar...@gsk.com wrote:

  I see. Are these what you call “layered” fingerprints? How do they
 differ from the Daylight-like fingerprints?


No, the pattern fingerprints use a different approach that I haven't yet
written a reasonable description of. That's on my ToDo list.


 Looking forward to the C++ sample code.


It's attached. The layout of the files isn't really great since I struggled
with the file i/o stuff, but this should at least demonstrate the idea.
Hopefully you're better at C++ file i/o than I am and can make something
more useful out of this.

-greg
// $Id$
//
//  Copyright (C) 2008-2011 Greg Landrum
//   @@ All Rights Reserved @@
//  This file is part of the RDKit.
//  The contents are covered by the terms of the BSD license
//  which is included in the file license.txt, found at the root
//  of the RDKit source tree.
//
/*  Can be built with:
   g++ -o fingerprint_screen.exe fingerprint_screen.cpp -I$RDBASE/Code -I$RDBASE/Extern \
   -L$RDBASE/lib -lFileParsers -lSmilesParse -lFingerprints \
   -lSubstructMatch -lGraphMol -lDataStructs -lRDGeometryLib -lRDGeneral
*/

#include <RDGeneral/Invariant.h>
#include <DataStructs/BitVects.h>
#include <DataStructs/BitOps.h>
#include <GraphMol/RDKitBase.h>
#include <GraphMol/MolPickler.h>
#include <GraphMol/SmilesParse/SmilesParse.h>
#include <GraphMol/SmilesParse/SmilesWrite.h>
#include <GraphMol/Substruct/SubstructMatch.h>
#include <GraphMol/Depictor/RDDepictor.h>
#include <GraphMol/FileParsers/MolSupplier.h>
#include <GraphMol/Fingerprints/Fingerprints.h>


#include <RDGeneral/RDLog.h>
#include <boost/foreach.hpp>
#include <boost/shared_ptr.hpp>
#include <cstdlib>
#include <string>
#include <vector>
#include <algorithm>
#include <iostream>
#include <fstream>
#include <RDGeneral/StreamOps.h>

using namespace RDKit;

typedef boost::shared_ptr<ExplicitBitVect> EBV_SPTR;

void ReadMols(std::vector<ROMOL_SPTR> &mols,
              std::vector<ROMOL_SPTR> &queries){
  //
  //   Read molecules
  //
  std::string rdbase = getenv("RDBASE");
  std::string sdname = rdbase + "/Regress/Data/mols.1000.sdf";
  std::string qname = rdbase + "/Regress/Data/queries.txt";
  SDMolSupplier msuppl(sdname);
  SmilesMolSupplier qsuppl(qname," ",0,-1,false);
  BOOST_LOG(rdInfoLog)<<"loading mols: "<<std::endl;
  while(!msuppl.atEnd()){
    ROMol *m=msuppl.next();
    if(!m) continue;   // skip records that could not be parsed
    ROMOL_SPTR mp(m);
    mols.push_back(mp);
  }
  BOOST_LOG(rdInfoLog)<<"loading queries: "<<std::endl;
  while(!qsuppl.atEnd()){
    ROMol *m=qsuppl.next();
    if(!m) continue;
    ROMOL_SPTR mp(m);
    queries.push_back(mp);
  }
}

void BuildFps(const std::vector<ROMOL_SPTR> &mols,
              std::vector<EBV_SPTR> &mol_fps){
  //
  //   Construct fingerprints
  //
  BOOST_FOREACH(ROMOL_SPTR mp,mols){
    ExplicitBitVect *fp=PatternFingerprintMol(*mp);
    EBV_SPTR fpp(fp);
    mol_fps.push_back(fpp);
  }
}

void FPScreen(const std::vector<ROMOL_SPTR> &mols,
              const std::vector<ROMOL_SPTR> &queries,
              const std::vector<EBV_SPTR> &mol_fps,
              const std::vector<EBV_SPTR> &query_fps)
{

  //
  //   substructure searches
  //
  unsigned int nMatches=0;
  for(unsigned int i=0;i<mols.size();++i){
    ROMOL_SPTR mp=mols[i];
    EBV_SPTR mfp=mol_fps[i];
    for(unsigned int j=0;j<queries.size();++j){
      // fingerprint screen: skip the pair unless every query bit is
      // also set in the molecule fingerprint
      EBV_SPTR qfp=query_fps[j];
      if(!AllProbeBitsMatch(*qfp,*mfp)) continue;

      // molecule substructure search:
      MatchVectType mv;
      ROMOL_SPTR qp=queries[j];
      if(SubstructMatch(*mp,*qp,mv)) ++nMatches;
    }
  }
  BOOST_LOG(rdInfoLog)<<" num matches: "<<nMatches<<std::endl;
}

void WriteData(const std::vector<ROMOL_SPTR> &mols,
               const std::vector<EBV_SPTR> &mol_fps,
               std::string filen){
  // molecules go to a binary file: a count followed by one pickle per molecule
  std::ofstream molStream((filen+"mols.bin").c_str(),std::ios_base::binary|std::ios_base::out);
  unsigned int sz=mols.size();
  streamWrite(molStream,sz);
  for(unsigned int i=0;i<mols.size();++i){
    MolPickler::pickleMol(*(mols[i]),molStream);
  }
  // fingerprints go to a text file, one FPS-encoded line per molecule
  std::ofstream fpStream((filen+"fps.bin").c_str());
  for(unsigned int i=0;i<mols.size();++i){
    fpStream<<BitVectToFPSText(*mol_fps[i]);
    fpStream<<"\n";
  }
}
void ReadData(std::vector<ROMOL_SPTR> &mols,
              std::vector<EBV_SPTR> &mol_fps,
              std::string filen){
  mols.clear();
  mol_fps.clear();
  std::ifstream molStream((filen+"mols.bin").c_str(),std::ios_base::binary|std::ios_base::in);
  unsigned int nMols;
  streamRead(molStream,nMols);
  for(unsigned int i=0;i<nMols;++i){
    ROMol *nMol=new ROMol();
    MolPickler::molFromPickle(molStream,nMol);
    ROMOL_SPTR mp(nMol);
    mols.push_back(mp);
  }
  std::ifstream fpStream((filen+"fps.bin").c_str());
  for(unsigned int i=0;i<nMols;++i){
    std::string pkl;
    std::getline(fpStream,pkl);
    ExplicitBitVect *bv=new 

[Rdkit-discuss] substructure search with fingerprints in C++

2013-06-08 Thread Gonzalo Colmenarejo-Sanchez
Hi all,

Could anyone provide some advice about how to run (fast but approximate) 
substructure searches with fingerprints using C++? I have a large set of SMILES 
for molecules and a relatively large set of SMILES/SMARTS for substructures.

Thanks a lot,

Gonzalo



Re: [Rdkit-discuss] substructure search with fingerprints

2013-05-29 Thread Gonzalo Colmenarejo-Sanchez
Sorry, I should have said that I use C++. I search a bunch of substructures 
(sometimes SMILES, sometimes SMARTS) against a lot of molecules.

Thanks a lot,

Gonzalo

From: Greg Landrum [mailto:greg.land...@gmail.com]
Sent: 29 May 2013 05:41
To: Gonzalo Colmenarejo-Sanchez
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] substructure search with fingerprints

Hi Gonzalo,

On Tue, May 28, 2013 at 5:00 PM, Gonzalo Colmenarejo-Sanchez 
gonzalo.2.colmenar...@gsk.com wrote:

What's the best way of doing fast (approximate) substructure searches in RDKit 
using fingerprints? I'm a bit confused about this topic. Any advice would be 
really appreciated.


The answer depends on what you want to do.

If you have one or more molecules and a single query and you want to know if 
the query matches any of the molecules, the fastest approach is just to do the 
substructure search (the time required to generate the fingerprints is larger 
than the time to do the individual search).

If you have a set of molecules you would like to search through using multiple 
queries or a set that is relatively static that you'd be searching through more 
than once, you have a variety of options. I'm going to run through some of the 
options from Python. If you want to do the same thing in C++ or Java, I can 
provide a separate answer for that.

-
1) Install postgresql and the RDKit postgresql cartridge and use that to do the 
searches. This is heavyweight, but gets you something that's flexible, 
relatively easy to use, and quite suited for dealing with millions of molecules.

-
2) Give Riccardo's Chemicallite a try:
http://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg03077.html
This cartridge for sqlite is still in development, but the early results 
that Riccardo shows look quite promising.

-
3) Using the pandas integration in the new version of the RDKit, you can easily 
work with sets of molecules and do efficient substructure searches:
In [47]: from rdkit.Chem import PandasTools

In [48]: df = PandasTools.LoadSDF('lopac_pubchem_28March07.sdf',includeFingerprints=True)

In [49]: len(df)
Out[49]: 1232

In [50]: q = Chem.MolFromSmiles('c1nnccc1')

In [51]: subset = df[df['ROMol'] >= q]

In [52]: len(subset)
Out[52]: 6

If you want to use this set of molecules in later python sessions, you can save 
the dataframe using python's pickle module.

Needless to say, you'll need to have pandas installed (but it's great to have 
installed anyway).


-
4) If you want to avoid installing anything extra, you can do the book-keeping 
and fingerprint tracking yourself with something like this:

In [63]: ms = [x for x in Chem.SDMolSupplier('lopac_pubchem_28March07.sdf') if x is not None]

In [64]: fps = [Chem.PatternFingerprint(x) for x in ms]

In [65]: def sss(ms,fps,q):
   ....:     res=[]
   ....:     qfp = Chem.PatternFingerprint(q)
   ....:     for i,fp in enumerate(fps):
   ....:         if DataStructs.AllProbeBitsMatch(qfp,fp):
   ....:             if ms[i].HasSubstructMatch(q):
   ....:                 res.append(ms[i])
   ....:     return res
   ....:

In [66]: subset=sss(ms,fps,Chem.MolFromSmiles('c1nnccc1'))

In [67]: len(subset)
Out[67]: 6

You can pickle the lists ms and fps together to use them in later python 
sessions.


Note that solutions 3) and 4) need to have all the molecules and fingerprints 
in memory at the same time, so dealing with large numbers of molecules this way 
will not be particularly efficient unless you have a *lot* of memory.


Does that help?
-greg




[Rdkit-discuss] substructure search with fingerprints

2013-05-28 Thread Gonzalo Colmenarejo-Sanchez
Hi,

What's the best way of doing fast (approximate) substructure searches in RDKit 
using fingerprints? I'm a bit confused about this topic. Any advice would be 
really appreciated.

Thanks a lot,

Gonzalo



Re: [Rdkit-discuss] substructure search with fingerprints

2013-05-28 Thread Greg Landrum
Hi Gonzalo,

On Tue, May 28, 2013 at 5:00 PM, Gonzalo Colmenarejo-Sanchez 
gonzalo.2.colmenar...@gsk.com wrote:



 What’s the best way of doing fast (approximate) substructure searches in
 RDKit using fingerprints? I’m a bit confused about this topic. Any advice
 would be really appreciated.


The answer depends on what you want to do.

If you have one or more molecules and a single query and you want to know
if the query matches any of the molecules, the fastest approach is just to do
the substructure search (the time required to generate the fingerprints is
larger than the time to do the individual search).

If you have a set of molecules you would like to search through using
multiple queries or a set that is relatively static that you'd be searching
through more than once, you have a variety of options. I'm going to run
through some of the options from Python. If you want to do the same thing
in C++ or Java, I can provide a separate answer for that.

-
1) Install postgresql and the RDKit postgresql cartridge and use that to do
the searches. This is heavyweight, but gets you something that's flexible,
relatively easy to use, and quite suited for dealing with millions of
molecules.

-
2) Give Riccardo's Chemicallite a try:
http://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg03077.html
This cartridge for sqlite is still in development, but the early results
that Riccardo shows look quite promising.

-
3) Using the pandas integration in the new version of the RDKit, you can
easily work with sets of molecules and do efficient substructure searches:
In [47]: from rdkit.Chem import PandasTools

In [48]: df = PandasTools.LoadSDF('lopac_pubchem_28March07.sdf',includeFingerprints=True)

In [49]: len(df)
Out[49]: 1232

In [50]: q = Chem.MolFromSmiles('c1nnccc1')

In [51]: subset = df[df['ROMol'] >= q]

In [52]: len(subset)
Out[52]: 6

If you want to use this set of molecules in later python sessions, you can
save the dataframe using python's pickle module.

Needless to say, you'll need to have pandas installed (but it's great to
have installed anyway).


-
4) If you want to avoid installing anything extra, you can do the
book-keeping and fingerprint tracking yourself with something like this:

In [63]: ms = [x for x in Chem.SDMolSupplier('lopac_pubchem_28March07.sdf') if x is not None]

In [64]: fps = [Chem.PatternFingerprint(x) for x in ms]

In [65]: def sss(ms,fps,q):
   ....:     res=[]
   ....:     qfp = Chem.PatternFingerprint(q)
   ....:     for i,fp in enumerate(fps):
   ....:         if DataStructs.AllProbeBitsMatch(qfp,fp):
   ....:             if ms[i].HasSubstructMatch(q):
   ....:                 res.append(ms[i])
   ....:     return res
   ....:

In [66]: subset=sss(ms,fps,Chem.MolFromSmiles('c1nnccc1'))

In [67]: len(subset)
Out[67]: 6

You can pickle the lists ms and fps together to use them in later python
sessions.
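
A minimal sketch of that pickling step, assuming the two lists are saved together as one tuple. The helper names and the file name are hypothetical, and stand-in data is used so that the snippet runs without RDKit installed; in a real session you would pass the ms and fps lists built above.

```python
import os
import pickle
import tempfile

def save_search_data(ms, fps, path):
    """Write the molecule list and the fingerprint list to one pickle file."""
    with open(path, "wb") as fh:
        pickle.dump((ms, fps), fh, protocol=pickle.HIGHEST_PROTOCOL)

def load_search_data(path):
    """Read back the (ms, fps) pair written by save_search_data."""
    with open(path, "rb") as fh:
        ms, fps = pickle.load(fh)
    return ms, fps

# Stand-in data: strings instead of molecules, sets of on-bits instead of
# bit vectors. Only the save/load round trip is being illustrated here.
ms_demo = ["mol-0", "mol-1"]
fps_demo = [{429, 778, 1022}, {54, 429, 759, 1022}]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "lopac_search.pkl")  # hypothetical file name
    save_search_data(ms_demo, fps_demo, path)
    ms_back, fps_back = load_search_data(path)

print(ms_back == ms_demo and fps_back == fps_demo)  # True
```

Saving both lists in one file keeps the molecules and their fingerprints in the same order, which the sss() function above relies on.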


Note that solutions 3) and 4) need to have all the molecules and
fingerprints in memory at the same time, so dealing with large numbers of
molecules this way will not be particularly efficient unless you have a
*lot* of memory.


Does that help?
-greg


[Rdkit-discuss] Substructure search paper

2013-04-11 Thread Quentin Delettre

Hi,

I plan to use substructure search for around 1500 molecules versus 3000 
small fragments, stored in SDF with 3D coordinates. The goal is to 
identify and investigate which fragments are found most and least often 
in the set.


I am quite new to the field, and this is an opportunity to compare programs 
and libraries that can do that. Can you point me to some papers 
that discuss and compare the available tools?


Thanks

*Quentin Delettre*, Ingénieur Bioinformaticien
Institut de Pharmacologie moléculaire et Cellulaire
UMR6097 -- CNRS
660 Route des Lucioles
06560 Valbonne
phone: +33-(0)4-93-95-77-31
fax : +33-(0)4-93-95-77-08
email : delet...@ipmc.cnrs.fr



Re: [Rdkit-discuss] Substructure search paper

2013-04-11 Thread Andrew Dalke
On Apr 11, 2013, at 10:10 AM, Quentin Delettre wrote:
 I plan to use substructure search for around 1500 molecules versus 3000 small 
 fragments ..
 I am quite new in the field and it's the occasion to compare programs and 
 libraries
 that can do that. Can you provide me some links to papers that discuss and 
 compare
 available tools ?

Pretty much every cheminformatics toolkit can do what you want.

You most likely want to start with the Chemistry Toolkit Rosetta, specifically:

  http://ctr.wikia.com/wiki/Calculate_TPSA
  http://ctr.wikia.com/wiki/Unique_SMARTS_matches_against_a_SMILES_string

For information about SMARTS, see
  http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html
and the demo page at
  http://www.daylight.com/daycgi_tutorials/depictmatch.cgi

Cheers,

Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] Substructure search paper

2013-04-11 Thread Quentin Delettre
Hi Andrew,

Thanks for the CTR links; I already knew about them but forgot to check them.
Thanks for the Daylight links as well!

I was more concerned about algorithms/implementation, pitfalls that 
could happen and performance.

Le 11/04/2013 12:00, Andrew Dalke a écrit :
 On Apr 11, 2013, at 10:10 AM, Quentin Delettre wrote:
 I plan to use substructure search for around 1500 molecules versus 3000 
 small fragments ..
 I am quite new in the field and it's the occasion to compare programs and 
 libraries
 that can do that. Can you provide me some links to papers that discuss and 
 compare
 available tools ?
 Pretty much every cheminformatics toolkit can do what you want.

 You most likely want to start with the Chemistry Toolkit Rosetta, 
 specifically:

http://ctr.wikia.com/wiki/Calculate_TPSA
http://ctr.wikia.com/wiki/Unique_SMARTS_matches_against_a_SMILES_string

 For information about SMARTS, see
http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html
 and the demo page at
http://www.daylight.com/daycgi_tutorials/depictmatch.cgi

 Cheers,

   Andrew
   da...@dalkescientific.com





Re: [Rdkit-discuss] Substructure search paper

2013-04-11 Thread Andrew Dalke
On Apr 11, 2013, at 1:39 PM, Quentin Delettre wrote:
 I was more concerned about algorithms/implementation, pitfalls that 
 could happen and performance.

There are none. Pretty much every cheminformatics toolkit can do
what you want.

The toolkits I know of use either the Ullmann algorithm or the VF2
algorithm. Most use VF2, and that transition occurred some 5 years ago.

There are toolkit variations. In one benchmark I measured a factor of 5x
between RDKit and OEChem. There are also special cases where one toolkit
might be a lot better than another; Roger Sayle pointed out that there's
a huge variation for matching the radioactive isotopes.

However, those won't be a problem in the scenario you described. You may
have to wait a few minutes longer for one toolkit than the other, but
RDKit does about 100,000 matches per second (for a simple match), so
could finish your task in less time than you've spent on this email
thread.
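
As a back-of-the-envelope check of that estimate (taking the sizes from the original question and treating the 100,000 matches per second figure as a rough rate for simple matches):

```python
molecules = 1500
fragments = 3000
matches_per_second = 100_000  # rough rate quoted above for simple matches

total_matches = molecules * fragments        # every fragment against every molecule
seconds = total_matches / matches_per_second

print(total_matches, seconds)  # 4500000 45.0
```

So even with no fingerprint screening at all, the full all-against-all comparison is well under a minute of matching time at that rate.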

Cheers,

Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] Substructure search

2009-11-07 Thread Greg Landrum
Combining two answers into one:

On Fri, Nov 6, 2009 at 7:59 AM, Evgueni Kolossov ekolos...@gmail.com wrote:
 Hi Greg,

 Yes, this is the solution I have been thinking about as well, but there are 2 problems:
 1. It will slow down the mapping process, which is slow already
 2. What atom to use for replacement?

I'm not sure I understand what you mean about slowing down the mapping
process. If you replace the dummies in your fragments with query
atoms, as I proposed in the sample code in my earlier message, the
substructure search should not be substantially slower. The
replacement itself also won't take that long, unless you really have a
*lot* of fragments.


On Fri, Nov 6, 2009 at 9:03 AM, Evgueni Kolossov ekolos...@gmail.com wrote:

 I think you should distinguish between dummy atoms and connection points -
 for fragments it is connection points we are talking about.

The code doesn't understand anything about connection points... it
just has atoms. Dummy atoms are atoms with atomic number zero. The
substructure matching code applied to normal Atoms (i.e. not
QueryAtoms) compares two atoms by checking to see if their atomic
numbers match, so dummies match dummies. Additionally, when isotopes
are specified, it checks that the specified isotopes match.
QueryAtoms, on the other hand, allow client code to specify the
function that's used for matching. The example I provided showed how
to use a function that matches any atom; which I think is what you are
looking for.
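
The matching rules described above can be written down as a toy model. This is only a sketch of the logic in plain Python, not RDKit code: ToyAtom, atom_matches and the matcher field are made-up names, and an isotope of 0 is assumed to mean "not specified".

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class ToyAtom:
    atomic_num: int
    isotope: int = 0                                  # 0 means "not specified"
    matcher: Optional[Callable[[Any], bool]] = None   # set only on query atoms

def atom_matches(query, target):
    """Decide whether a query-side atom matches a target atom."""
    if query.matcher is not None:
        # Query atom: the client-supplied function decides.
        return query.matcher(target)
    # Normal atom: atomic numbers must agree ...
    if query.atomic_num != target.atomic_num:
        return False
    # ... and a specified isotope must agree as well.
    return query.isotope == 0 or query.isotope == target.isotope

dummy = ToyAtom(atomic_num=0)
carbon = ToyAtom(atomic_num=6)
any_atom = ToyAtom(atomic_num=0, matcher=lambda atom: True)

print(atom_matches(dummy, dummy))      # True: dummies match dummies
print(atom_matches(dummy, carbon))     # False: atomic numbers differ
print(atom_matches(any_atom, carbon))  # True: the query function accepts any atom
```

Replacing a plain dummy with a "match anything" query atom is what turns a connection point into a wildcard.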

 So it is supposed
 to ignore this atom (but not the bond!) during the matching process. Maybe just add
 another bool flag to let the user select different behavior?

The substructure matching uses atoms and bonds, and returns the
results as lists of atom indices; how (and why) would you propose to
ignore an atom but not a bond?

-greg



Re: [Rdkit-discuss] Substructure search

2009-11-07 Thread Evgueni Kolossov
Thanks Greg,

I have calculated that using this replacement slows things down by about
30%, which is significant for big datasets.

> The substructure matching uses atoms and bonds, and returns the
> results as lists of atom indices; how (and why) would you propose to
> ignore an atom but not a bond?
I mean take the bond into account as it is, but use "match any" for the dummy atom.

Regards,
Evgueni


2009/11/7 Greg Landrum greg.land...@gmail.com

 Combining two answers into one:

 On Fri, Nov 6, 2009 at 7:59 AM, Evgueni Kolossov ekolos...@gmail.com
 wrote:
  Hi Greg,
 
  Yes, this is the solution I have been thinking about as well, but there are 2
 problems:
  1. It will slow down the mapping process, which is slow already
  2. What atom to use for replacement?

 I'm not sure I understand what you mean about slowing down the mapping
 process. If you replace the dummies in your fragments with query
 atoms, as I proposed in the sample code in my earlier message, the
 substructure search should not be substantially slower. The
 replacement itself also won't take that long, unless you really have a
 *lot* of fragments.


 On Fri, Nov 6, 2009 at 9:03 AM, Evgueni Kolossov ekolos...@gmail.com
 wrote:
 
  I think you should distinguish between dummy atoms and connection points
 -
  for fragments it is connection points we are talking about.

 The code doesn't understand anything about connection points... it
 just has atoms. Dummy atoms are atoms with atomic number zero. The
 substructure matching code applied to normal Atoms (i.e. not
 QueryAtoms) compares two atoms by checking to see if their atomic
 numbers match, so dummies match dummies. Additionally, when isotopes
 are specified, it checks that the specified isotopes match.
 QueryAtoms, on the other hand, allow client code to specify the
 function that's used for matching. The example I provided showed how
 to use a function that matches any atom; which I think is what you are
 looking for.

  So it is supposed
  to ignore this atom (but not the bond!) during the matching process. Maybe just
 add
  another bool flag to let the user select different behavior?

 The substructure matching uses atoms and bonds, and returns the
 results as lists of atom indices; how (and why) would you propose to
 ignore an atom but not a bond?

 -greg



Re: [Rdkit-discuss] Substructure search

2009-11-07 Thread Greg Landrum
On Sat, Nov 7, 2009 at 12:35 PM, Evgueni Kolossov ekolos...@gmail.com wrote:

 I have calculated that using this replacement slows things down by about
 30%, which is significant for big datasets.

Agreed, that's a huge difference. How does it come about? Where is the
time being spent?

-greg



Re: [Rdkit-discuss] Substructure search

2009-11-07 Thread Greg Landrum
On Sat, Nov 7, 2009 at 3:44 PM, Evgueni Kolossov ekolos...@gmail.com wrote:

 I have not done full profiling; this came just from the difference in
 time with and without Replace Dummy.

Are you doing the replace dummy for each fragment every time before
you do a search, or do you do it just once?

I would guess that replacing the dummy atoms shouldn't take very long
at all, and then doing the searches should also be reasonably quick.
One complication might be that having the query atoms will return a
lot more matches than the non-query dummies; this will naturally take
longer.

-greg



Re: [Rdkit-discuss] Substructure search

2009-11-07 Thread Greg Landrum
On Sat, Nov 7, 2009 at 5:43 PM, Evgueni Kolossov ekolos...@gmail.com wrote:

  are you doing the replace dummy for each fragment every time before
  you do a search or do you do it just once?

 I am iterating through all the structures and all the fragments:
 so for each structure do
    for each fragment (and need to replace dummy here)

 probably can do it another way:
 for each fragment do
     for each structure

 In this case will need to do it only once for each fragment

yes, I imagine that will help a lot.

or:
for each fragment do: replace dummy atom
for each structure do
   for each fragment do something
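
The reordering suggested above can be sketched in plain Python. Everything here is a stand-in: `replace_dummies` is a hypothetical placeholder for the real dummy-to-QueryAtom replacement, the structures and fragments are plain strings, and "do something" is reduced to recording the pair. The point is only how often the replacement runs.

```python
from collections import Counter

calls = Counter()  # counts how often the (hypothetical) replacement runs


def replace_dummies(fragment: str) -> str:
    """Hypothetical stand-in for the dummy-to-QueryAtom replacement."""
    calls["replace_dummies"] += 1
    return fragment.replace("*", "[any]")


def search_replacing_every_time(structures, fragments):
    """Original loop order: the replacement runs inside the inner loop."""
    pairs = []
    for structure in structures:
        for fragment in fragments:
            pairs.append((structure, replace_dummies(fragment)))
    return pairs


def search_replacing_once(structures, fragments):
    """Suggested order: prepare every fragment once, then loop as before."""
    queries = [replace_dummies(fragment) for fragment in fragments]
    pairs = []
    for structure in structures:
        for query in queries:
            pairs.append((structure, query))
    return pairs


structures = ["mol_a", "mol_b", "mol_c"]
fragments = ["c1cccnc1*", "*CC"]

calls.clear()
slow = search_replacing_every_time(structures, fragments)
calls_every_time = calls["replace_dummies"]

calls.clear()
fast = search_replacing_once(structures, fragments)
calls_once = calls["replace_dummies"]

print(calls_every_time, calls_once)  # 6 2
print(slow == fast)                  # True
```

With N structures and M fragments the replacement runs N*M times in the first form and M times in the second, while the pairs visited stay the same.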

-greg




Re: [Rdkit-discuss] Substructure search

2009-11-06 Thread Evgueni Kolossov
Greg,

I think you should distinguish between dummy atoms and connection points;
for fragments it is connection points we are talking about. So, it is supposed
to ignore this atom (but not the bond!) during the matching process. Maybe just
add another bool flag to let the user select different behavior?

Regards,
Evgueni




Re: [Rdkit-discuss] Substructure search

2009-11-05 Thread Evgueni Kolossov
Hi Greg,

Yes, this is the solution I have been thinking about as well, but there are 2
problems:
1. It will slow down the mapping process, which is slow already
2. What atom to use for replacement?

What if I just remove this atom(s)?

Regards,
Evgueni

2009/11/6 Greg Landrum greg.land...@gmail.com

 On Wed, Nov 4, 2009 at 7:54 PM, Greg Landrum greg.land...@gmail.com wrote:

  On Wed, Nov 4, 2009 at 3:26 PM, Evgueni Kolossov ekolos...@gmail.com wrote:

  I found that SubstructMatch would not work if query is a fragment (with *
  atoms).
  Can you suggest solution for this problem?
 
  That's a bug. Dummy atoms (things with atomic number zero) that do not
  have an isotope specification should match anything. If you have a
  sourceforge account, please enter the bug, otherwise let me know and I
  will enter it.

 After going back through the code and thinking about this for a while
 I'm going to change my original answer: it's not a bug that standard
 dummy atoms only match other dummy atoms. When I saw the * in the
 original message I started thinking about the QueryAtoms produced by a
 * in SMARTS, which definitely should (and do) match other dummies.
 The behavior with standard Atoms is useful for things like flagging
 attachment points of R groups on a scaffold. Here's an example:

 [5]>>> f = Chem.MolFromSmiles('c1cccnc1*')

 [6]>>> p = Chem.MolFromSmarts('c1cccnc1*')

 [9]>>> m = Chem.MolFromSmiles('c1ccc(C)nc1*')

 Matching using f, which has dummy Atoms only gives one match:
 [10]>>> m.GetSubstructMatches(f)
 Out[10]: ((0, 1, 2, 3, 5, 6, 7),)

 But matching using p, which has a QueryAtom built from * matches twice:
 [11]>>> m.GetSubstructMatches(p)
 Out[11]: ((0, 1, 2, 3, 5, 6, 7), (2, 1, 0, 6, 5, 3, 4))
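
Why the query atom yields a second match can be seen in a toy backtracking matcher written in plain Python. This is only a sketch, not RDKit's algorithm: the molecules are hand-written atom and bond lists using the atom order of the SMILES above, aromaticity and bond orders are ignored, and `find_matches`, `dummy_rule`, and `wildcard_rule` are hypothetical names.

```python
def find_matches(query_atoms, query_bonds, target_atoms, target_bonds, atom_ok):
    """Return every mapping of query atoms onto target atoms (toy backtracking)."""
    target_bonded = {frozenset(bond) for bond in target_bonds}
    query_nbrs = {i: set() for i in range(len(query_atoms))}
    for a, b in query_bonds:
        query_nbrs[a].add(b)
        query_nbrs[b].add(a)
    matches = []

    def extend(mapping):
        q = len(mapping)  # index of the next query atom to place
        if q == len(query_atoms):
            matches.append(tuple(mapping))
            return
        for t in range(len(target_atoms)):
            if t in mapping or not atom_ok(query_atoms[q], target_atoms[t]):
                continue
            # Every already-placed query neighbour must be bonded to t.
            if all(frozenset((t, mapping[n])) in target_bonded
                   for n in query_nbrs[q] if n < q):
                extend(mapping + [t])

    extend([])
    return matches


# c1cccnc1* : six-membered ring with one N, plus a dummy on atom 5
fragment_atoms = ["C", "C", "C", "C", "N", "C", "*"]
fragment_bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (5, 6)]

# c1ccc(C)nc1* : the same ring with a methyl on atom 3 and a dummy on atom 6
molecule_atoms = ["C", "C", "C", "C", "C", "N", "C", "*"]
molecule_bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (3, 5), (5, 6), (6, 0), (6, 7)]


def dummy_rule(query, target):
    """Plain atoms: symbols must be equal, so '*' only matches '*'."""
    return query == target


def wildcard_rule(query, target):
    """Query atoms: '*' matches any atom, everything else must be equal."""
    return query == "*" or query == target


with_dummy = find_matches(fragment_atoms, fragment_bonds,
                          molecule_atoms, molecule_bonds, dummy_rule)
with_wildcard = find_matches(fragment_atoms, fragment_bonds,
                             molecule_atoms, molecule_bonds, wildcard_rule)
print(with_dummy)
print(with_wildcard)
```

In this toy the dummy rule finds only the mapping that places the fragment's `*` on the molecule's `*` (atom 7), while the wildcard rule also finds the flipped ring placement whose `*` lands on the methyl carbon (atom 4), mirroring the one-versus-two matches shown above.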

 For your use case, I'd suggest replacing the dummies in your fragments
 with QueryAtoms that have the appropriate query, something like this
 (not tested):

 //-
 #include <GraphMol/RDKitQueries.h>

 void replaceDummies(RWMol *frag){
   QueryAtom *qat = new QueryAtom();
   qat->setQuery(makeAtomNullQuery());
   for(unsigned int i=0;i<frag->getNumAtoms();++i){
     if(frag->getAtomWithIdx(i)->getAtomicNum()==0){
       frag->replaceAtom(i,qat);
     }
   }
   delete qat;
 }
 //-

 I hope this helps,
 -greg



Re: [Rdkit-discuss] Substructure search

2009-11-04 Thread Greg Landrum
Hi Evgueni,

On Wed, Nov 4, 2009 at 3:26 PM, Evgueni Kolossov ekolos...@gmail.com wrote:

 I found that SubstructMatch would not work if query is a fragment (with *
 atoms).
 Can you suggest solution for this problem?

That's a bug. Dummy atoms (things with atomic number zero) that do not
have an isotope specification should match anything. If you have a
sourceforge account, please enter the bug, otherwise let me know and I
will enter it.

Thanks for finding the problem.

-greg
