from:"Chenyang Shi"

[Rdkit-discuss] delete a substructure

2017-03-05 Thread Chenyang Shi

Hi everyone,

I am new to RDKit but I am already impressed by its vibrant community. I
have a question about deleting a substructure. The RDKit
documentation has this snippet of code showing how to delete a
substructure:

>>> from rdkit import Chem
>>> from rdkit.Chem import AllChem
>>> m = Chem.MolFromSmiles("CC(=O)O")
>>> patt = Chem.MolFromSmarts("C(=O)[OH]")
>>> rm = AllChem.DeleteSubstructs(m, patt)
>>> Chem.MolToSmiles(rm)
'C'

This block of code first loads the molecule CH3COOH from its SMILES string,
then defines the substructure to be deleted, COOH, as a SMARTS pattern. After
the final line, the program outputs 'C' in SMILES form.

I want to develop a method for counting the functional groups in a
molecule. In the CH3COOH case, I can count the --CH3 and --COOH groups
using their respective SMARTS patterns with no problem. However, when the
molecule becomes more complicated, I would prefer to delete each substructure
once it has been counted before moving on to the next SMARTS search. In the
current case, after finding the --COOH group and deleting it, the leftover is
'C', which is essentially CH4 instead of --CH3, so I cannot proceed to
search for --CH3 with its SMARTS pattern ([CH3;A;X4!R]).

Is there any way to work around this?
Thanks,
Chenyang
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] delete a substructure

2017-03-05 Thread Chenyang Shi

Hi Greg,

Thanks for the prompt reply. I did try "GetSubstructMatches()" and it returns
the correct number of substructures for CH3COOH. The potential problem with
this approach is that, as the molecule gets more complicated, it may
count certain functional groups more than once. For
example, an --OH (alcohol) match will likely also be found inside --COOH. A
safer way, in my mind, is to remove each substructure once it has been counted.

Greg, you mentioned the "chemical reaction functionality". Could you show me a
demo script that uses it, with CH3COOH as an example? I will definitely delve
into the manual to learn more, but reading your code will be a good start.

Thanks,
Chenyang



On Sun, Mar 5, 2017 at 10:15 PM, Greg Landrum 
wrote:

> Hi Chenyang,
>
> If you're really interested in counting the number of times the
> substructure appears, you can do that much quicker with
> `GetSubstructMatches()`:
>
> In [1]: from rdkit import Chem
> In [2]: m = Chem.MolFromSmiles('CC(C)CCO')
> In [3]: len(m.GetSubstructMatches(Chem.MolFromSmarts('[CH3;X4]')))
> Out[3]: 2
>
> Is that sufficient, or do you actually want to sequentially remove all of
> the groups in your list?
>
> If you actually want to remove them, you are probably better off using the
> chemical reaction functionality instead of DeleteSubstructs(), which
> recalculates the number of implicit Hs on atoms after each call.
>
> -greg

Re: [Rdkit-discuss] delete a substructure

2017-03-06 Thread Chenyang Shi

Hongbin and Greg,
Thank you both for kind suggestions. I will try both approaches and report
my progress later.
Best,
Chenyang

On Monday, March 6, 2017, Greg Landrum  wrote:

> The solution that Hongbin proposes to the double-counting problem is a
> good one. Just be sure to sort your substructure queries in the right order
> so that the more complex ones come first.
>
> Another thing you might think about is making your queries more specific.
> For example, as you pointed out "[OH]" is very general and matches parts of
> carboxylic acids and a number of other functional groups. The RDKit has a
> set of fairly well tested (though certainly not perfect) functional group
> definitions in $RDBASE/Data/Functional_Group_Hierarchy.txt. The alcohol
> definition from there looks like this:
> [O;H1;$(O-!@[#6;!$(C=!@[O,N,S])])]
>
>
> -greg
>
>
> On Mon, Mar 6, 2017 at 7:20 AM, 杨弘宾 wrote:
>
>> Hi, Chenyang,
>> You don't need to delete the substructure from the molecule. Just
>> check whether the mapped atoms have already been matched. For example:
>>
>> from rdkit import Chem
>> m = Chem.MolFromSmiles('CC(=O)O')
>> OH = Chem.MolFromSmarts('[OH]')
>> COOH = Chem.MolFromSmarts('C(O)=O')
>>
>> m.GetSubstructMatches(OH)
>> # ((3,),)
>> m.GetSubstructMatches(COOH)
>> # ((1, 3, 2),)
>>
>> Since atom "3" has already been matched, it should be ignored.
>> So you can create a "set" to record the matched atoms and avoid
>> counting them twice.
>>
>> --
>> Hongbin Yang 杨弘宾

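Hongbin's bookkeeping idea, combined with Greg's advice to try the more complex queries first, could be sketched roughly as follows. This is an illustration rather than code from the thread: `count_groups` and `count_groups_in_smiles` are hypothetical helper names, and the caller is assumed to list the groups from most to least specific. RDKit is imported only inside the second helper, so the counting logic itself can be tried on plain tuples.

```python
def count_groups(matches_by_group):
    """Count matches per group, skipping any match that reuses a claimed atom.

    matches_by_group: list of (name, matches) pairs, most specific group first.
    matches: tuple of atom-index tuples, as GetSubstructMatches() returns.
    """
    claimed = set()  # atom indices already assigned to an earlier match
    counts = {}
    for name, matches in matches_by_group:
        counts[name] = 0
        for match in matches:
            if claimed.isdisjoint(match):
                claimed.update(match)
                counts[name] += 1
    return counts


def count_groups_in_smiles(smiles, smarts_by_group):
    """Run each SMARTS (most specific first) on the molecule, then count."""
    from rdkit import Chem  # imported here so count_groups() works without RDKit

    mol = Chem.MolFromSmiles(smiles)
    found = [(name, mol.GetSubstructMatches(Chem.MolFromSmarts(smarts)))
             for name, smarts in smarts_by_group]
    return count_groups(found)


# Acetic acid, using the match tuples quoted in Hongbin's message.
acetic_acid = [("COOH", ((1, 3, 2),)), ("OH", ((3,),))]
print(count_groups(acetic_acid))
```

Listing "OH" before "COOH" would instead claim atom 3 for the alcohol and drop the acid, which is why the order of the queries matters.
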
Re: [Rdkit-discuss] delete a substructure

2017-03-09 Thread Chenyang Shi

Thanks Hongbin and Pavel for the suggestions. I am now confident that the
approach Hongbin proposed for removing duplicate counts is a robust one. Now I
need to revisit and recheck all my SMARTS definitions.

One last question: do you have a convenient online or local
reference for looking up SMARTS patterns?
Greg mentioned $RDBASE/Data/Functional_Group_Hierarchy.txt, which comes
with the RDKit installation.
Brian suggested the Daylight website,
http://www.daylight.com/dayhtml_tutorials/languages/smarts/smarts_examples.html,
which is a good place as well.

Best,
Chenyang

On Thu, Mar 9, 2017 at 1:09 AM, 杨弘宾  wrote:

> Hi Chenyang,
>
> Your issue was caused by the definition of "-OH (phenol)", I think. If
> you define this pattern as "cO", atom *3* will be matched as well, since it is
> the aromatic carbon bonded to the oxygen. I guess you want to match
> only the oxygen, with the restriction that it is bonded to an aromatic carbon.
> So the SMARTS should be "[$(Oc)]", which means an oxygen in the
> environment "bonded to an aromatic carbon".
>
> from rdkit import Chem
> m = Chem.MolFromSmiles('CC1=CC(=C(C=C1)C(=O)O)O')
> m.GetSubstructMatches(Chem.MolFromSmarts('[$(Oc)]'))
> # ((10,),)
>
> Then only atom *10* will be matched and it won't interfere with other
> counts.
>
> Reference: http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html,
> section 4.4
>
> --
> Hongbin Yang
>
>
> *From:* Chenyang Shi 
> *Date:* 2017-03-09 01:32
> *To:* Greg Landrum 
> *CC:* rdkit-discuss ; 杨弘宾
> 
> *Subject:* Re: [Rdkit-discuss] delete a substructure
> Dear Hongbin,
>
> I tried your method on a molecule, 4-methylsalicylic acid
> (CC1=CC(=C(C=C1)C(=O)O)O). I looped through all the groups defined in the
> Joback method (as SMARTS) and used m.GetSubstructMatches to print out all
> atom positions. The result is summarized in the table.
>
> We can see there are duplicate counts coming from the COOH group. As
> suggested by Hongbin, we can remove duplicates by looking at the atom
> positions: in this case, ((9),), ((7,8,),), ((7,),), and ((8,),) are
> subsets of ((7,8,9)) from -COOH, so we can indeed get rid of these duplicates.
> However, I also noticed that atom (3,) from the =C< (ring) group is also part
> of -OH (phenol) ((10,3),). If we apply the same algorithm to remove
> duplicates, the =C< (ring) group will be counted only twice instead of three
> times.
>
> Greg, you mentioned that as an alternative I can delete substructures using
> the chemical reaction method. I would greatly appreciate it if you could show
> me (or point me to) a simple example, perhaps on a simple molecule. I
> find myself at a loss when browsing the manual, and I would also like to try
> that direction.
>
> Thanks,
> Chenyang
>
>
> [image: Inline image 1]
>

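The clash Chenyang describes, where atom 3 belongs to both the phenol match and the ring =C< match, can be spotted mechanically before any counting is done. The sketch below is only an illustration: `shared_atoms` is a hypothetical helper, the tuples (7, 8, 9) and (10, 3) are the ones quoted in the message, and the ring-carbon tuples (1,), (3,), (4,) are an assumption read off the atom order of the SMILES.

```python
from collections import defaultdict


def shared_atoms(matches_by_group):
    """Map each atom index claimed by more than one group to those groups."""
    owners = defaultdict(set)
    for name, matches in matches_by_group.items():
        for match in matches:
            for atom in match:
                owners[atom].add(name)
    return {atom: sorted(names)
            for atom, names in sorted(owners.items())
            if len(names) > 1}


# 4-methylsalicylic acid with the phenol pattern matching two atoms ("cO"-style).
too_broad = {
    "COOH": ((7, 8, 9),),
    "phenol OH": ((10, 3),),
    "ring =C<": ((1,), (3,), (4,)),
}
# The same molecule with the phenol pattern anchored on the oxygen, as "[$(Oc)]" does.
anchored = {**too_broad, "phenol OH": ((10,),)}

print(shared_atoms(too_broad))  # {3: ['phenol OH', 'ring =C<']}
print(shared_atoms(anchored))   # {}
```

An empty result means every atom is claimed by at most one group, so the counts cannot interfere with each other.
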
Re: [Rdkit-discuss] delete a substructure

2017-03-09 Thread Chenyang Shi

Thank you Chris. I found that one too; it is quite convenient to visualize
both SMARTS and SMILES strings.

On Thu, Mar 9, 2017 at 11:28 AM, Chris Swain  wrote:

> I use SMARTSviewer at Univ of Hamburg
>
> http://www.zbh.uni-hamburg.de/en/bioinformatics-server.html
>
> Chris

Re: [Rdkit-discuss] segmentation fault 11

2017-10-27 Thread Chenyang Shi

Hi Greg,

Thank you for the quick response; it worked, both for RDKit and for the JRgui
program I wrote (which uses RDKit).
The error message seems a bit odd, but it is good to know a way around it.

Best,
Chenyang

On Fri, Oct 27, 2017 at 10:54 PM, Greg Landrum 
wrote:

> Hi Chenyang,
>
> This looks like the breakage caused by conda v4.3.27. There's some more
> information here:
> https://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg07325.html
>
> Best,
> -greg
>
>
> On Sat, Oct 28, 2017 at 5:27 AM, Chenyang Shi  wrote:
>
>> Hi Everyone,
>>
>> I am writing to report a possible bug in RDKit on macOS.
>> I have a program that uses Chem from rdkit. The program works fine on
>> Linux and Windows, but I had a hard time on macOS. I thought I
>> might be doing something wrong myself until I ran a test on a clean macOS
>> system on someone else's Mac (10.12.4 Sierra; mine is 10.10.3 Yosemite). On
>> the Sierra computer, I downloaded and installed Anaconda, then ran conda
>> install rdkit as instructed. After I ran source activate my-rdkit-env, my
>> program jrgui.py failed to run. A plain import rdkit worked fine; however,
>> from rdkit import Chem reported the error "segmentation fault 11".
>>
>> All of this is captured in the screenshots attached to this email. I am not
>> sure whether this is a bug or something else. Do you have a hint about what
>> is going wrong?
>>
>> Thanks,
>> Chenyang

[Rdkit-discuss] SMARTS for =C=, #CH, #C-

2017-11-08 Thread Chenyang Shi

Dear RDKitters,

I have a question regarding SMARTS codes for three simple functional
groups, these are =C=, #CH and #C-. I am new to SMARTS/SMILES. I indeed
tried to guess their codes. Here are my guesses:

=C= : [CH0;A;X2;!R](=[$(*)])=[$(*)]

#CH : [CH1;A;X2;!R]#[$(*)]

#C- :  [CH0;A;X2;!R]#[$(*)]

I checked these SMARTS at
http://smartsview.zbh.uni-hamburg.de/smartsview/calculate?method=get; they
all seem to make sense.

For example, the webpage prints out following messages:

=C=: it says "aliphatic C with 0 further total connections, with 0 further
hydrogen, not in a ring".

#CH: "aliphatic C with 0 further total connections, with 1 further
hydrogen, not in a ring".

#C-: "aliphatic C with 1 further total connections, with 0 further
hydrogen, not in a ring".

However, when I search subgroups using these SMARTS, I had problems.

For example, if I search "C=C=O" using "[CH0;A;X2;!R](=[$(*)])=[$(*)]",
>>> from rdkit import Chem
>>> m = Chem.MolFromSmiles('C=C=O')
>>> m.GetSubstructMatches(Chem.MolFromSmarts("[CH0;A;X2;!R](=[$(*)])=[$(*)]"))
((1, 0, 2),)

it prints out atomic positions 1, 0, 2--three positions. But I would expect
only one position for the Carbon in the middle.

Similarly, if I search "C#C" using "[CH1;A;X2;!R]#[$(*)]",
>>> from rdkit import Chem
>>> m = Chem.MolFromSmiles('C#C')
>>> m.GetSubstructMatches(Chem.MolFromSmarts("[CH1;A;X2;!R]#[$(*)]"))
((0, 1),)
I would expect two separate positions such as (0,), (1,), indicating that there
are two triple-bonded carbons (each with a hydrogen).


Then, if I search "CC#CC" using "[CH0;A;X2;!R]#[$(*)]",
>>> from rdkit import Chem
>>> m = Chem.MolFromSmiles('CC#CC')
>>> m.GetSubstructMatches(Chem.MolFromSmarts("[CH0;A;X2;!R]#[$(*)]"))
((1, 2),)
Again, I would expect two separate positions such as (1,), (2,), indicating
two triple-bonded carbons.

I think the problem might be that my SMARTS for these three groups are not
specific enough. I would appreciate everyone's help on this.

Cheers,
Chenyang

Re: [Rdkit-discuss] SMARTS for =C=, #CH, #C-

2017-11-08 Thread Chenyang Shi

Dear Andy,

Thank you for the quick and thorough email. I find it very instructive,
although I need to read it a couple more times to digest it.

Cheers,
Chenyang

On Wed, Nov 8, 2017 at 2:27 PM, Andrew Dalke 
wrote:

> On Nov 8, 2017, at 21:00, Chenyang Shi  wrote:
> > =C= : [CH0;A;X2;!R](=[$(*)])=[$(*)]
>
> The recursive SMARTS notation, which is the term inside of the [$(...)],
> finds a match for the entire pattern and returns the first atom in that
> pattern.
>
> > For example, if I search "C=C=O" using "[CH0;A;X2;!R](=[$(*)])=[$(*)]",
> > >>> from rdkit import Chem
> > >>> m = Chem.MolFromSmiles('C=C=O')
> > >>> m.GetSubstructMatches(Chem.MolFromSmarts("[CH0;A;X2;!R](=[$(*)])=[$(*)]"))
> > ((1, 0, 2),)
> >
> > it prints out atomic positions 1, 0, 2--three positions. But I would
> > expect only one position for the Carbon in the middle.
>
> The $(*) finds the pattern, which is a "*" and in this case the terminal
> carbons, and returns it. The substructure search returns 3 positions
> because the first is [CH0;A;X2;!R], the second is the first atom of "*",
> and the third is the first atom of the other "*".
>
> If you only want the first atom of the entire pattern, then put the entire
> pattern in a recursive SMARTS, as in:
>
>   [$([CH0;A;X2;!R](=*)=*)]
>
> >>> pat = Chem.MolFromSmarts("[$([CH0;A;X2;!R](=*)=*)]")
> >>> mol = Chem.MolFromSmiles('C=C=O')
> >>> mol.GetSubstructMatches(pat)
> ((1,),)
>
> > Similarly, if I search "C#C" using "[CH1;A;X2;!R]#[$(*)]",
> > >>> from rdkit import Chem
> > >>> m = Chem.MolFromSmiles('C#C')
> > >>> m.GetSubstructMatches(Chem.MolFromSmarts("[CH1;A;X2;!R]#[$(*)]"))
> > ((0, 1),)
> > I would expect two separate positions such as (0,), (1,), indicating
> > there are two carbon triple bonds (with an hydrogen).
>
> Since you are only looking for a single atom, try putting the entire
> pattern in a recursive SMARTS, as in
>
>   [$([CH1;A;X2;!R]#*)]
>
> >>> mol = Chem.MolFromSmiles("C#C")
> >>> pat = Chem.MolFromSmarts("[$([CH1;A;X2;!R]#*)]")
> >>> mol.GetSubstructMatches(pat)
> ((0,), (1,))
>
>
> > Then if I search "CC#CC" using "[CH0;A;X2;!R]#[$(*)]",
>
> I believe you want "[$([CH0;A;X2;!R]#*)]"
>
> Thank you for your clear description of what you expected.
>
> Cheers,
>
> Andrew
> da...@dalkescientific.com

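Andrew's rewrite follows one mechanical rule: put the whole pattern inside a single recursive SMARTS so that only its first atom is reported. A small sketch of that rule is given below; `anchor_on_first_atom` and `matched_atoms` are hypothetical helper names, and replacing a neighbouring `[$(*)]` with a plain `*` is assumed to be safe because both accept any atom. RDKit is imported only inside the second helper.

```python
def anchor_on_first_atom(smarts):
    """Wrap a SMARTS in [$(...)] so that a match reports only its first atom."""
    # A neighbour written as [$(*)] accepts any atom, just like a plain *.
    inner = smarts.strip().replace("[$(*)]", "*")
    return "[$(" + inner + ")]"


def matched_atoms(smiles, smarts):
    """Match the anchored form of `smarts` against the molecule."""
    from rdkit import Chem  # imported here so the string helper works without RDKit

    mol = Chem.MolFromSmiles(smiles)
    patt = Chem.MolFromSmarts(anchor_on_first_atom(smarts))
    return mol.GetSubstructMatches(patt)


# The three patterns from the original question.
for original in ("[CH0;A;X2;!R](=[$(*)])=[$(*)]",
                 "[CH1;A;X2;!R]#[$(*)]",
                 "[CH0;A;X2;!R]#[$(*)]"):
    print(anchor_on_first_atom(original))
```

The printed patterns are the same three strings Andrew gives in his reply, so `matched_atoms("C#C", "[CH1;A;X2;!R]#[$(*)]")` should reproduce his result of one tuple per carbon.
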
[Rdkit-discuss] SMARTS for Joback and Reid method

2017-11-08 Thread Chenyang Shi

Hi everyone,

I have recently been working on a project that implements the Joback method
using RDKit (https://en.wikipedia.org/wiki/Joback_method).

I believe the key to the success of this project is representing the 41
functional groups correctly as SMARTS patterns. I have compiled my
own patterns; see the attachment. I would appreciate it if you could review
them and let me know if you spot any errors.

I think building a robust/well-tested SMARTS database (though small in my
case) would be helpful to others and other projects.

Thank you,
Chenyang

PS: The ones highlighted red in the document are robust.


SMARTS.docx
Description: MS-Word 2007 document

Re: [Rdkit-discuss] SMARTS for Joback and Reid method

2017-11-09 Thread Chenyang Shi

Dear Emanuel,

Thank you for pointing at SMARTSviewer; it is a good place to check our
codes.

Cheers,
Chenyang

On Thu, Nov 9, 2017 at 4:44 AM, Emanuel Ehmki 
wrote:

> Dear Chenyang,
>
> at http://smartsview.zbh.uni-hamburg.de/ you will find a useful tool to
> visualize your SMARTS patterns and also check them for correctness.
>
> Best,
> Emanuel
>
> On Thu, Nov 9, 2017 at 00:51, Jason Biggs wrote:
>
>> Chenyang,
>> I haven't looked at your smarts strings yet, but I do have this list of
>> SMARTS strings for the joback method I compiled myself (for use here:
>> https://www.wolframalpha.com/input/?i=2,3-methano-5,6-dichloroindene&lk=3
>> ).
>>
>> Perhaps this can be of use.  If you spot any mistakes, please let me know
>>
>> Jason
>>
>> $JobackSubstructures={
>> {"Methyl","-CH3", "[CX4H3]"},
>> {"SecondaryAcyclic", "-CH2-", "[!R;CX4H2]"},
>> {"TertiaryAcyclic",">CH-", "[!R;CX4H]"},
>> {"QuaternaryAcyclic", ">C<", "[!R;CX4H0]"},
>> {"PrimaryAlkene", "=CH2", "[CX3H2]"},
>> {"SecondaryAlkeneAcyclic", "=CH-", "[!R;CX3H1;!$([CX3H1](=O))]"},
>> {"TertiaryAlkeneAcyclic", "=C<", "[$([!R;#6X3H0]);!$([!R;#6X3H0]=[#8])]"},
>> {"CumulativeAlkene", "=C=", "[$([CX2H0](=*)=*)]"},
>> {"TerminalAlkyne", "\[Congruent]CH","[$([CX2H1]#[!#7])]"},
>> {"InternalAlkyne","\[Congruent]C-","[$([CX2H0]#[!#7])]"},
>> {"SecondaryCyclic", "-CH2- (ring)", "[R;CX4H2]"},
>> {"TertiaryCyclic", ">CH- (ring)", "[R;CX4H]"},
>> {"QuaternaryCyclic", ">C< (ring)", "[R;CX4H0]"},
>> {"SecondaryAlkeneCyclic", "=CH- (ring)", "[R;CX3H1,cX3H1]"},
>> {"TertiaryAlkeneCyclic", "=C< (ring)","[$([R;#6X3H0]);!$([R;#6X3H0]=[#8])]"},
>> {"Fluoro", "-F", "[F]"},
>> {"Chloro", "-Cl", "[Cl]"},
>> {"Bromo", "-Br", "[Br]"},
>> {"Iodo", "-I", "[I]"},
>> {"Alcohol","-OH", "[OX2H;!$([OX2H]-[#6]=[O]);!$([OX2H]-a)]"},(* alcohol - not matching a carboxylic acid *)
>> {"Phenol","-OH", "[$([OX2H]-a)]"},
>> {"EtherAcyclic", "-O-", "[OX2H0;!R;!$([OX2H0]-[#6]=[#8])]"},
>> {"EtherCyclic", "-O- (ring)", "[#8X2H0;R;!$([#8X2H0]~[#6]=[#8])]"},
>> {"CarbonylAcyclic", ">C=O", "[$([CX3H0](=[OX1]));!$([CX3](=[OX1])-[OX2]);!R]=O"},
>> {"CarbonylCyclic", ">C=O (ring)","[$([#6X3H0](=[OX1]));!$([#6X3](=[#8X1])~[#8X2]);R]=O"},
>> {"Aldehyde","O=CH-","[CX3H1](=O)"},
>> {"CarboxylicAcid", "COOH", "[OX2H]-[C]=O"},
>> {"Ester", "-C(=O)O-", "[#6X3H0;!$([#6X3H0](~O)(~O)(~O))](=[#8X1])[#8X2H0]"},
>> {"OxygenDoubleBondOther", "=O", "[OX1H0;!$([OX1H0]~[#6X3]);!$([OX1H0]~[#7X3]~[#8])]"},
>> {"PrimaryAmino","NH2", "[NX3H2]"},
>> {"SecondaryAminoAcyclic",">NH", "[NX3H1;!R]"},
>> {"SecondaryAminoCyclic",">NH (ring)", "[#7X3H1;R]"},
>> {"TertiaryAmino", ">N-","[#7X3H0;!$([#7](~O)~O)]"}, (* Tertiary amine except nitro group *)
>> {"ImineCyclic","=N- (ring)","[#7X2H0;R]"},
>> {"ImineAcyclic","=N-","[#7X2H0;!R]"},
>> {"Aldimine", "=NH", "[#7X2H1]"},
>> {"Cyano", "-C\[Congruent]N","[#6X2]#[#7X1H0]"},
>> {"Nitro", "NO2", "[$([#7X3,#7X3+][!#8])](=[O])~[O-]"},
>> {"Thiol", "-SH", "[SX2H]"},
>> {"ThioetherAcyc

[Rdkit-discuss] convert a smiles file to a xyz file

2018-05-23 Thread Chenyang Shi

Hi Everyone,

I am looking for help with converting a SMILES file into a set of
atomic coordinates for the molecule, in XYZ format.
I saw an online service that can do the job (e.g.
http://www.cheminfo.org/Chemistry/Cheminformatics/FormatConverter/index.html),
but it is not convenient to use.

I am wondering how we can do this with RDKit code. A separate
question: is the molecular structure converted from SMILES the same
as the one taken from a crystal structure?

Many thanks!
Chenyang

Re: [Rdkit-discuss] convert a smiles file to a xyz file

2018-05-23 Thread Chenyang Shi

Thank you all. It seems OpenBabel supports this.

Here is a tutorial I found from Dr. Kulik's group that might be useful
http://hjkgrp.mit.edu/content/geometries-strings-smiles-and-openbabel.

On Wed, May 23, 2018 at 10:59 AM, Benjamin Bucior <
bbuc...@u.northwestern.edu> wrote:

> I'm not sure which flags the online tool uses, but it's based on Open
> Babel so you might have some success with that tool.
> http://open-babel.readthedocs.io/en/latest/3DStructureGen/Overview.html
>
> For a quick guess at the structure, an example with the command line tool
> is something like
> obabel -:"[O-]C(=O)c1ccc(cc1)C(=O)[O-]" --gen3D -O structure.xyz
>
> If your workflow is in Python, there are some make3D and addh (for
> hydrogens) convenience functions in the openbabel (or its pybel) package.
>
> As Dima mentioned, there's several challenges/nonuniqueness in going from
> SMILES to 3D.  Some of the conformer search tools in the link can help a
> little bit, but in general it's a tricky problem.
>
> Best,
> Ben
>
> On Wed, May 23, 2018 at 10:30 AM, Dimitri Maziuk via Rdkit-discuss <
> rdkit-discuss@lists.sourceforge.net> wrote:
>
>> On 5/23/2018 10:23 AM, Chenyang Shi wrote:
>>
>>> A separate question is that is the converted molecular structure from
>>> SMILES the same as that taken from a crystal structure?
>>>
>>
>> Provided there's no undefined/different stereochemistry on SMILES side,
>> no quirks with added protons, and so on and so forth... for a small simple
>> molecule... maybe.
>>
>> Dima

Re: [Rdkit-discuss] convert a smiles file to a xyz file

2018-05-24 Thread Chenyang Shi

Thank you Prof. Jensen. I will study the module.

Best,
Chenyang

On Thu, May 24, 2018 at 1:30 AM, Jan Halborg Jensen 
wrote:

> Have a look at write_xtb_input_file in this module:
> https://github.com/jensengroup/take_elementary_step/blob/master/write_input_files.py
>
> The xtb input is simply an xyz file with some additional lines below if
> the molecule is charged. You can simply remove those lines in the code.
>
> Best regards, Jan

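For the RDKit side of the original question, a minimal sketch might look like the following. It is not the code from the module Jan links to: `format_xyz` and `smiles_to_xyz` are hypothetical helpers, the seed value is arbitrary, and the geometry is a single embedded conformer with no further optimization. RDKit is imported only inside `smiles_to_xyz`, so the formatter can be checked on its own.

```python
def format_xyz(symbols, coords, comment=""):
    """Return XYZ text: atom count, a comment line, then one atom per line."""
    if len(symbols) != len(coords):
        raise ValueError("need one (x, y, z) triple per atom")
    lines = [str(len(symbols)), comment]
    for symbol, (x, y, z) in zip(symbols, coords):
        lines.append(f"{symbol:<2} {x:12.6f} {y:12.6f} {z:12.6f}")
    return "\n".join(lines) + "\n"


def smiles_to_xyz(smiles, seed=1234):
    """Embed one 3D conformer with RDKit and return it as XYZ text."""
    from rdkit import Chem  # imported here so format_xyz() works without RDKit
    from rdkit.Chem import AllChem

    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    if AllChem.EmbedMolecule(mol, randomSeed=seed) == -1:
        raise RuntimeError("3D embedding failed for " + smiles)
    conf = mol.GetConformer()
    symbols = [atom.GetSymbol() for atom in mol.GetAtoms()]
    coords = []
    for i in range(mol.GetNumAtoms()):
        pos = conf.GetAtomPosition(i)
        coords.append((pos.x, pos.y, pos.z))
    return format_xyz(symbols, coords, comment=smiles)


# Formatter demo with illustrative coordinates (not a computed geometry).
print(format_xyz(["H", "H"], [(0.0, 0.0, 0.0), (0.0, 0.0, 0.74)], "H2"))
```

As Dima's reply suggests, a conformer generated this way is one plausible geometry and should not be expected to coincide with a crystal structure.
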
[Rdkit-discuss] GETAWAY descriptors

2020-04-11 Thread Chenyang Shi

Hi everyone,

I would like to calculate the R3m descriptor, one of the family of GETAWAY
descriptors originally proposed in this paper:
https://pubs.acs.org/doi/pdf/10.1021/ci015504a .

I wrote the code below to calculate it, but it returns a
Python list of length 273. I also tested other molecules and always
ended up with a list of length 273. I am not sure what each of the 273 numbers
corresponds to and, in particular, which one is the R3m descriptor. Can
anyone help me understand it? Thank you.

Chenyang

from rdkit import Chem
from rdkit.Chem import AllChem, rdMolDescriptors

def return_getaway_descriptors(smiles):
    # Build the molecule, add explicit hydrogens, and embed one 3D conformer.
    mol = Chem.MolFromSmiles(smiles)
    mol_hydrogen = Chem.AddHs(mol)
    AllChem.EmbedMolecule(mol_hydrogen, randomSeed=1234)
    # GETAWAY descriptors are computed from the 3D coordinates.
    res = rdMolDescriptors.CalcGETAWAY(mol_hydrogen)
    return res, len(res)

chlorobenzene = 'C1=CC=C(C=C1)Cl'

print(return_getaway_descriptors(chlorobenzene))

Re: [Rdkit-discuss] [External] GETAWAY descriptors

2020-04-12 Thread Chenyang Shi

Thank you Guillaume, this is very helpful.

On Sun, Apr 12, 2020 at 3:11 AM Guillaume GODIN <
guillaume.go...@firmenich.com> wrote:

> Hello,
>
> /*
> std::vector<std::string>
> GETAWAYNAMES={"ITH","ISH","HIC","HGM","H0u","H1u","H2u","H3u","H4u","H5u","H6u","H7u","H8u","HTu",
> "HATS0u","HATS1u","HATS2u","HATS3u","HATS4u","HATS5u","HATS6u","HATS7u","HATS8u","HATSu","H0m","H1m","H2m","H3m","H4m","H5m",
> "H6m","H7m","H8m","HTm","HATS0m","HATS1m","HATS2m","HATS3m","HATS4m","HATS5m","HATS6m","HATS7m","HATS8m","HATSm","H0v","H1v",
> "H2v","H3v","H4v","H5v","H6v","H7v","H8v","HTv","HATS0v","HATS1v","HATS2v","HATS3v","HATS4v","HATS5v","HATS6v","HATS7v","HATS8v",
> "HATSv","H0e","H1e","H2e","H3e","H4e","H5e","H6e","H7e","H8e","HTe","HATS0e","HATS1e","HATS2e","HATS3e","HATS4e","HATS5e","HATS6e",
> "HATS7e","HATS8e","HATSe","H0p","H1p","H2p","H3p","H4p","H5p","H6p","H7p","H8p","HTp","HATS0p","HATS1p","HATS2p","HATS3p","HATS4p",
> "HATS5p","HATS6p","HATS7p","HATS8p","HATSp","H0i","H1i","H2i","H3i","H4i","H5i","H6i","H7i","H8i","HTi","HATS0i","HATS1i","HATS2i",
> "HATS3i","HATS4i","HATS5i","HATS6i","HATS7i","HATS8i","HATSi","H0s","H1s","H2s","H3s","H4s","H5s","H6s","H7s","H8s","HTs","HATS0s",
> "HATS1s","HATS2s","HATS3s","HATS4s","HATS5s","HATS6s","HATS7s","HATS8s","HATSs","RCON","RARS","REIG","R1u","R2u","R3u","R4u","R5u",
> "R6u","R7u","R8u","RTu","R1u+","R2u+","R3u+","R4u+","R5u+","R6u+","R7u+","R8u+","RTu+","R1m","R2m","R3m","R4m","R5m","R6m","R7m",
> "R8m","RTm","R1m+","R2m+","R3m+","R4m+","R5m+","R6m+","R7m+","R8m+","RTm+","R1v","R2v","R3v","R4v","R5v","R6v","R7v","R8v","RTv",
> "R1v+","R2v+","R3v+","R4v+","R5v+","R6v+","R7v+","R8v+","RTv+","R1e","R2e","R3e","R4e","R5e","R6e","R7e","R8e","RTe","R1e+","R2e+",
> "R3e+","R4e+","R5e+","R6e+","R7e+","R8e+","RTe+","R1p","R2p","R3p","R4p","R5p","R6p","R7p","R8p","RTp","R1p+","R2p+","R3p+","R4p+",
> "R5p+","R6p+","R7p+","R8p+","RTp+","R1i","R2i","R3i","R4i","R5i","R6i","R7i","R8i","RTi","R1i+","R2i+","R3i+","R4i+","R5i+","R6i+",
> "R7i+","R8i+","RTi+","R1s","R2s","R3s","R4s","R5s","R6s","R7s","R8s","RTs","R1s+","R2s+","R3s+","R4s+","R5s+","R6s+","R7s+","R8s+","RTs+"};
> */
>
> This is the list of descriptor names in GETAWAY.
>
> Best regards,
>
> guillaume
>
> From: Chenyang Shi
> Date: Saturday, 11 April 2020 at 23:32
> To: RDKit Discuss
> Subject: [External] [Rdkit-discuss] GETAWAY descriptors
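Going by the name list Guillaume posted, the position of R3m can be worked out without running RDKit. The sketch below rebuilds the 273 names from the pattern visible in that list and looks up the index. It assumes the values returned by `CalcGETAWAY` come back in the same order as the names, which the thread implies but does not spell out. `getaway_names` is a made-up helper name, not an RDKit function.

```python
# Weighting suffixes in the order they appear in the posted name list
WEIGHTS = ["u", "m", "v", "e", "p", "i", "s"]


def getaway_names():
    """Rebuild the GETAWAY descriptor names in the order of the posted list."""
    names = ["ITH", "ISH", "HIC", "HGM"]
    # H block: for each weighting, H0..H8, HT, HATS0..HATS8, HATS
    for w in WEIGHTS:
        names += [f"H{k}{w}" for k in range(9)] + [f"HT{w}"]
        names += [f"HATS{k}{w}" for k in range(9)] + [f"HATS{w}"]
    names += ["RCON", "RARS", "REIG"]
    # R block: for each weighting, R1..R8, RT, R1+..R8+, RT+
    for w in WEIGHTS:
        names += [f"R{k}{w}" for k in range(1, 9)] + [f"RT{w}"]
        names += [f"R{k}{w}+" for k in range(1, 9)] + [f"RT{w}+"]
    return names


GETAWAY_NAMES = getaway_names()

# Zero-based position of R3m in the list; evaluates to 167
R3M_INDEX = GETAWAY_NAMES.index("R3m")
```

Under that ordering assumption, `res[R3M_INDEX]` picks R3m out of the list returned by `CalcGETAWAY`, and `dict(zip(GETAWAY_NAMES, res))` gives all 273 values by name.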
