Re: [Rdkit-discuss] Question matching substructures from SMARTS with explicit hydrogens

David Cosgrove Mon, 07 Mar 2022 14:36:56 -0800

Glad it works for you. As Greg pointed out to someone else today, it’s
marginally more efficient to do [#6] than [C,c] and likewise for nitrogen.
But it’s always a trade-off between speed and legibility/maintainability.
If speed is of the essence and you’re running on millions of compounds it
might be worth trying.
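
A minimal sketch of the equivalence, assuming RDKit is installed. The test molecules are arbitrary and chosen only for illustration; the sketch checks that the two spellings hit the same atoms and does not measure speed.

```python
from rdkit import Chem

# Two ways of writing "any carbon, aliphatic or aromatic".
verbose = Chem.MolFromSmarts('[C,c]')
compact = Chem.MolFromSmarts('[#6]')

# Arbitrary test molecules (an assumption, not taken from the thread).
smiles = ['c1ccccc1', 'C1CCCCC1', 'CCN', 'c1ccncc1']

results = {}
for smi in smiles:
    mol = Chem.MolFromSmiles(smi)
    hits_verbose = sorted(mol.GetSubstructMatches(verbose))
    hits_compact = sorted(mol.GetSubstructMatches(compact))
    results[smi] = (hits_verbose, hits_compact)
    # True means both queries matched exactly the same atoms.
    print(smi, hits_verbose == hits_compact)
```

Whether the difference in speed is noticeable would have to be timed on a realistic compound set.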


On Mon, 7 Mar 2022 at 20:45, Adam Moyer <[email protected]> wrote:

> Ahh! Thank you so much, to both of you.
>
> Yes, the different meaning of H in the various contexts was tripping me up.
>
> Also, DescribeQuery() was definitely a function that I needed for
> debugging this solo. Thank you. I will keep that in mind in the future.
>
> I found that this SMARTS (S4) is exactly what I
> needed: '[C,c]1(Cl)[C,c][C,c]([N,n,C,c,#1])[C,c]([C,c])[C,c]([#1])[C,c]1'.
>
> Cheers,
> Adam
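
A hedged sketch of how S4 behaves on the two ligands. The SMILES strings for 6CI and PCT are assumptions, since the thread never gives them. Because S4 uses [#1], it can only match once the hydrogens are real atoms in the molecule, which is what Chem.AddHs does.

```python
from rdkit import Chem

# S4 as quoted in the message above.
s4 = Chem.MolFromSmarts(
    '[C,c]1(Cl)[C,c][C,c]([N,n,C,c,#1])[C,c]([C,c])[C,c]([#1])[C,c]1'
)

# Assumed SMILES for the two ligands (not given in the thread).
ligands = {
    '6CI': 'Clc1ccc2cc[nH]c2c1',  # 6-chloroindole
    'PCT': 'Cc1ccc(Cl)cc1',       # para-chlorotoluene
}

matched = {}
for name, smi in ligands.items():
    mol = Chem.MolFromSmiles(smi)  # hydrogens are implicit here
    mol_h = Chem.AddHs(mol)        # hydrogens become explicit atoms
    matched[name] = (mol.HasSubstructMatch(s4), mol_h.HasSubstructMatch(s4))
    # First value: match without explicit Hs; second: match with them.
    print(name, matched[name])
```

With mol_h.GetSubstructMatches(s4) the matched atom indices, hydrogens included, are returned in query order, which is the form an alignment step would typically need.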
>
> On Tue, Mar 1, 2022 at 4:32 AM Ivan Tubert-Brohman <
> [email protected]> wrote:
>
>> A minor correction: [H] by itself *is* valid and means a hydrogen atom.
>> The Daylight docs say as much in section 4.1. But in other contexts it
>> means a hydrogen count, so to be safe, always using #1 to mean a hydrogen
>> atom can be a good practice.
>>
>> If you are ever in doubt about how RDKit is interpreting a SMARTS,
>> I recommend making use of the DescribeQuery function which provides a tree
>> representation of a query atom or bond. For example (comments added):
>>
>> >>> from rdkit import Chem
>> >>> mol = Chem.MolFromSmarts('[H][N,H][N,#1]')
>>
>> >>> print(mol.GetAtomWithIdx(0).DescribeQuery())  # [H]
>> AtomAtomicNum 1 = val  # [H] interpreted as a hydrogen atom
>>
>> >>> print(mol.GetAtomWithIdx(1).DescribeQuery())  # [N,H]
>> AtomOr
>>   AtomType 7 = val
>>   AtomHCount 1 = val  # H interpreted as a hydrogen count
>> # Overall query atom means "an aliphatic nitrogen OR (any atom with one
>> # hydrogen)"!
>>
>> >>> print(mol.GetAtomWithIdx(2).DescribeQuery())  # [N,#1]
>> AtomOr
>>   AtomType 7 = val
>>   AtomAtomicNum 1 = val  # "#1" is atomic number, therefore a hydrogen atom.
>> # Overall query atom means "an aliphatic nitrogen OR a hydrogen"
>>
>> One non-obvious convention in the DescribeQuery output is that AtomType
>> implies aliphatic when the value is a normal atomic number, or aromatic if
>> the atomic number is offset by 1000. For example, [n] is "AtomType 1007".
>>
>> Hope you find this approach useful in the future.
>>
>> Ivan
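
The difference between H as a count and #1 as an atom can also be checked by matching, as in this small sketch. Isopropanol is an arbitrary test molecule (an assumption, not from the thread), picked because it has no nitrogen, so any hit must come from the hydrogen part of each query.

```python
from rdkit import Chem

# Isopropanol from SMILES: no nitrogen, and every hydrogen is implicit.
mol = Chem.MolFromSmiles('CC(C)O')

h_count = Chem.MolFromSmarts('[N,H]')   # aliphatic N OR one attached hydrogen
h_atom = Chem.MolFromSmarts('[N,#1]')   # aliphatic N OR a hydrogen atom

# The H count query hits the CH carbon and the OH oxygen.
count_hits = mol.GetSubstructMatches(h_count)

# The atom query finds nothing until the hydrogens are made explicit.
atom_hits = mol.GetSubstructMatches(h_atom)
atom_hits_h = Chem.AddHs(mol).GetSubstructMatches(h_atom)

print(count_hits, atom_hits, len(atom_hits_h))
```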
>>
>> On Tue, Mar 1, 2022 at 6:33 AM David Cosgrove <[email protected]>
>> wrote:
>>
>>> Hi Adam
>>> There are a number of issues here.  The key one, I think, is a
>>> misunderstanding about the meaning of H in SMARTS.  It means "a single
>>> attached hydrogen", and is a qualifier for another atom, it cannot be used
>>> by itself.  So [*H] is valid, [H] isn't.  See the table at
>>> https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html.  If you
>>> want to refer to an explicit hydrogen, you have to use [#1].  However, that
>>> will only match an explicit hydrogen in the molecule, not an implicit one.
>>> Thus c[#1] doesn't match anything in c1ccccc1.  If you have read in a
>>> molecule from a molfile, for example, that has explicit hydrogens then you
>>> will be ok.
>>>
>>> Further to that, your SMARTS strings, at least as they have appeared in
>>> gmail, which may have garbled them, are incorrect.  In S1, the parentheses
>>> round [N,n,H] make it a substituent, so it will not match the indole
>>> nitrogen.  Also, it would probably be better as [N,n;H], which would be
>>> read as "(aliphatic nitrogen OR aromatic nitrogen) AND 1 attached
>>> hydrogen."  The [N,n,H] will match a methylated indole nitrogen which I
>>> imagine is not what you want. Similar remarks apply to S2.  A SMARTS that
>>> matches both 6CI and PCT
>>> is [C,c]1(Cl)[C,c][C,c;H][C,c]([C,c])[C,c;H][C,c]1, but that won't match
>>> the H atoms themselves if you want to use them in the overlay, and it also
>>> won't work in the aliphatic case of, for example, ClC1CCC(C)CC1 because
>>> there the carbon atoms have 2 attached hydrogens.   If you really do want
>>> it to match aliphatic cases as well, then you will need something
>>> like 
>>> [C,c]1(Cl)[$([CH2]),$([cH])][$([CH2]),$([cH])][C,c]([C,c])[$([CH2]),$([cH])][$([CH2]),$([cH])]1
>>> which is quite a mouthful.  The carbons at the 2,3,5 and 6 positions on the
>>> ring are specified as either [CH2] or [cH].
>>>
>>> Jupyter notebook can be really useful for debugging SMARTS patterns like
>>> this.  The one I used was variations of
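>>> ```
>>> from rdkit import Chem
>>> from rdkit.Chem.Draw import IPythonConsole  # inline drawing; may be needed for highlighting
>>> from IPython.display import SVG
>>> # 6-chloro-1-methylindole as the test molecule
>>> mol = Chem.MolFromSmiles('C1=CC(=CC2=C1C=CN2C)Cl')
>>> qmol = Chem.MolFromSmarts('[C,c]1(Cl)[C,c][C,c][C,c]([C,c])[C,c][C,c]1')
>>> # Print the atom indices of every match
>>> print(mol.GetSubstructMatches(qmol))
>>> mol  # last expression in the cell, so the notebook renders the molecule
>>> ```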
>>> which prints the numbers of the matching atoms and also draws the
>>> molecule with the match highlighted.
>>> Regards,
>>> Dave
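
A sketch of the hydrogen count point made above, using the two SMARTS from that message. The PCT SMILES is an assumption; ClC1CCC(C)CC1 is the aliphatic example from the text. Only these two molecules are tested here.

```python
from rdkit import Chem

# Both patterns are copied from the message above.
one_h = Chem.MolFromSmarts(
    '[C,c]1(Cl)[C,c][C,c;H][C,c]([C,c])[C,c;H][C,c]1'
)
ch2_or_ch = Chem.MolFromSmarts(
    '[C,c]1(Cl)[$([CH2]),$([cH])][$([CH2]),$([cH])][C,c]([C,c])'
    '[$([CH2]),$([cH])][$([CH2]),$([cH])]1'
)

aromatic = Chem.MolFromSmiles('Cc1ccc(Cl)cc1')    # PCT (assumed SMILES)
aliphatic = Chem.MolFromSmiles('ClC1CCC(C)CC1')   # example from the text

# Each entry is (matches the aromatic case, matches the aliphatic case).
results = {
    'one_h': (aromatic.HasSubstructMatch(one_h),
              aliphatic.HasSubstructMatch(one_h)),
    'ch2_or_ch': (aromatic.HasSubstructMatch(ch2_or_ch),
                  aliphatic.HasSubstructMatch(ch2_or_ch)),
}
for name, flags in results.items():
    print(name, flags)
```

The [C,c;H] form fails on the cyclohexane because H means exactly one attached hydrogen, while the ring CH2 carbons carry two.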
>>>
>>>
>>> On Tue, Mar 1, 2022 at 1:43 AM Adam Moyer <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I have a baffling case where I am trying to match substructures on two
>>>> ligands for the goal of aligning them.
>>>>
>>>> I have two ligands; one is a 6-chloroindole (6CI) and the other is a
>>>> para-chloro toluene (PCT).
>>>>
>>>> I am attempting to use the following SMARTS (S1) to match
>>>> them: '[C,c]1(Cl)[C,c][C,c]*([N,n,H])*[C,c]([C,c,H])[C,c]([H])[C,c]1'.
>>>> For some reason S1 only finds a match in 6CI.
>>>>
>>>> When I use the following SMARTS (S2) I only match to PCT as expected:
>>>> '[C,c]1(Cl)[C,c][C,c]*([H])*[C,c]([C,c,H])[C,c]([H])[C,c]1'.
>>>>
>>>> How can S1 not match PCT? S1 is strictly a superset of S2 because I am
>>>> using the "or" operation. Do I have a misunderstanding of how explicit
>>>> hydrogens work in RDKit/SMARTS?
>>>>
>>>> Lastly when I use the last SMARTS (S3) I am able to match to both, but
>>>> I cannot use that smarts due to other requirements in my
>>>> project: '[C,c]1(Cl)[C,c][C,c][C,c]([C,c,H])[C,c]([H])[C,c]1'
>>>>
>>>> Thanks!
>>>> Adam
>>>> _______________________________________________
>>>> Rdkit-discuss mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>>
>>>
>>>
>>> --
>>> David Cosgrove
>>> Freelance computational chemistry and chemoinformatics developer
>>> http://cozchemix.co.uk
>>>
>>>
>> --
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk