Re: [Rdkit-discuss] one flavor of MCS

2012-12-13 Thread Andrew Dalke
On Dec 13, 2012, at 3:32 PM, paul.czodrow...@merckgroup.com wrote:
>> I think I figured out a way around that via some post-processing.
> 
> Great!


> Now let's come to another question:
> How does one code the "complete-ring-only" variation?
> Can your code be adapated, or shall I do some post-processing?

Ideally the MCS code should be modified so that a user-defined
class can come in and replace the pruning and matching code.

Post-processing is definitely trickier for this one. I can get
what you want through a horrible bit of monkey-patching to
internal and unpublished parts of fmcs.

% python paul.py paul.smi --threshold 0.15 --complete-rings-only
frequency #atoms #bonds SMILES SMARTS
3 5 5 c1cncn1 [#6]:1:[#6]:[#7]:[#6]:[#7]:1
3 8 8 NCc1c1 [#7]-!@[#6]-!@[#6]:1:[#6]:[#6]:[#6]:[#6]:[#6]:1
2 10 10 CCNCc1c1 
[#6]-!@[#6]-!@[#7]-!@[#6]-!@[#6]:1:[#6]:[#6]:[#6]:[#6]:[#6]:1


import sys
import re
from rdkit import Chem
import hacked_fmcs


import argparse
parser = argparse.ArgumentParser("Find fragments")
parser.add_argument("filename",
help="SMILES filename")
parser.add_argument("--threshold", "-t", type=float, default=0.5,
help="minimum match threshold (default: 0.5)")
parser.add_argument("--min-num-atoms", type=int, default=2,
help="minimum number of atom for the match (default: 2)")
parser.add_argument("--min-num-bonds", type=int, default=1,
help="minimum number of bonds for the match (default: 1)")
parser.add_argument("--complete-rings-only", action="store_true",
help="don't match part ring in a fragment")

args = parser.parse_args()
if args.min_num_atoms < 2:
parser.error("--min-num-atoms must be at least 2")
if args.min_num_bonds < 1:
parser.error("--min-num-bonds must be at least 2")
if args.threshold <= 0.0:
parser.error("--threshold must be positive")

table = Chem.GetPeriodicTable()
_atomic_num_to_symbol = dict((str(i), table.GetElementSymbol(i)) for i in range(105))
atom_pat = re.compile("\[#(\d+)\]")
def atomic_num_to_symbol(m):
return _atomic_num_to_symbol[m.group(1)]

def smarts_to_unique_smiles_fragment(smarts):
fragment_smiles = atom_pat.sub(atomic_num_to_symbol, smarts)
fragment_smiles = fragment_smiles.replace("!@", "").replace("@", "")
mol = Chem.MolFromSmiles(fragment_smiles, sanitize=False)
# Correct the aromaticity information
for bond in mol.GetBonds():
if bond.GetIsAromatic():
bond.GetBeginAtom().SetIsAromatic(1)
bond.GetEndAtom().SetIsAromatic(1)
return Chem.MolToSmiles(mol)

## Hack in code to check the results of the complete_rings_only check

only_has_complete_rings = {}
old_check_complete_rings_only = hacked_fmcs.check_complete_rings_only

def new_check_complete_rings_only(smarts, subgraph, enumeration_mol):
result = old_check_complete_rings_only(smarts, subgraph, enumeration_mol)
only_has_complete_rings[smarts] = result
return result

hacked_fmcs.check_complete_rings_only = new_check_complete_rings_only

hacked_fmcs._maximize_options[("all-matches", False)] = (
hacked_fmcs.no_pruning, hacked_fmcs.SingleBestAtoms)
hacked_fmcs._maximize_options[("all-matches", True)] = (
hacked_fmcs.no_pruning, hacked_fmcs.SingleBestAtomsCompleteRingsOnly)

# Read in the input file

mols = []
for lineno, line in enumerate(open(args.filename)):
fields = line.split()
if not fields:
sys.stderr.write("Missing SMILES on line %d" % (lineno+1,))
continue
smiles = fields[0]
mol = Chem.MolFromSmiles(smiles)
if mol is None:
sys.stderr.write("Could not process SMILES %s on line %d\n" % (smiles, lineno+1))
continue
mols.append(mol)

mcs_result = hacked_fmcs.fmcs(mols,
  threshold=args.threshold,
  min_num_atoms=args.min_num_atoms,
  complete_rings_only=args.complete_rings_only,
  # fmcs doesn't implement min_num_bonds; use post-processing
  maximize="all-matches")

# Convert to unique fragment SMILES
matches = []
unique_fragment_smiles = set()
for smarts, is_match in mcs_result.matches.items():
if not is_match:  # remember, this is hacked code, and a bit ugly!
continue
if args.complete_rings_only:
if not only_has_complete_rings.get(smarts, False):
continue

fragment_smiles = smarts_to_unique_smiles_fragment(smarts)
if fragment_smiles in unique_fragment_smiles:
continue
unique_fragment_smiles.add(fragment_smiles)
pattern = Chem.MolFromSmarts(smarts)
if pattern.GetNumBonds() < args.min_num_bonds:
continue
matches.append( (pattern, smarts, fragment_smiles) )

def order_by_number_of_atoms(term):
return term[0].GetNumAtoms()

matches.sort(key=order_by_number_of_atoms)


print "frequency #atoms #bonds SMILES SMARTS"
for pattern, smarts

Re: [Rdkit-discuss] one flavor of MCS

2012-12-13 Thread Paul . Czodrowski
> > Regarding the issues you mentioned
>...
> >
> > - non-canonical SMARTS
> > - duplicates are not filtered out
>
> I think I figured out a way around that via some post-processing.

Great!

>
>
> >> Or do you mean the number of molecules which contain that structure,
> >> in which case "CC" exists in 3 of the structures.
>
> Okay. You want a (say) 15% threshold, and the threshold MCS isn't
> yet ported over to RDKit. It is in that hacked module I sent the
> other day.
>
> > Following up on the example from the above section:
> > "
> > CCNCc1c1
> > c1ccc(cc1)CN
> > CCNCc1c1
> > c1cnc[nH]
> > "
>
> BTW, that last SMILES isn't complete.


Yep, you are absolutely right!
Here comes a slightly updated version of small data set:
CCNCc1c1
NCc1c1
CCNCc1c1
c1c[nH]cn1
CCCn1ccnc1
Cn1cncc1CCc2c2


>
> > => I would be more than happy to finally have this output:
> > "
> > newflavorOfMCS   frequency
> > ##
> > c1ccc(cc1)CN 3
> > c1cnc[nH]1
> > "
>
> That script is attached. Here's an example of it in use:
>
> % python paul.py --min-num-bonds 7 paul.smi

Cool, the script does exactly the job! I'm really grateful to your help!

For the above described dataset, the "most common sub-structures" (aka
mcss) are found:
3 5 5 c1cncn1 [#6]:1:[#6]:[#7]:[#6]:[#7]:1
3 8 8 NCc1c1 [#7]-[#6]-[#6]:1:[#6]:[#6]:[#6]:[#6]:[#6]:1

=>
Now let's come to another question:
How does one code the "complete-ring-only" variation?
Can your code be adapated, or shall I do some post-processing?


Cheers & a big thanks!

Paul

> A non-hacked solution could hook into the information about
> "complete-ring-only". That would give you structures more like
> what a chemist would expect, though not the scaffolds that Christos
> and Peter suggested.

This message and any attachment are confidential and may be privileged or
otherwise protected from disclosure. If you are not the intended recipient,
you must not copy this message or attachment or disclose the contents to
any other person. If you have received this transmission in error, please
notify the sender immediately and delete the message and any attachment
from your system. Merck KGaA, Darmstadt, Germany and any of its
subsidiaries do not accept liability for any omissions or errors in this
message which may arise as a result of E-Mail-transmission or for damages
resulting from any unauthorized changes of the content of this message and
any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
subsidiaries do not guarantee that this message is free of viruses and does
not accept liability for any damages caused by any virus transmitted
therewith.

Click http://www.merckgroup.com/disclaimer to access the German, French,
Spanish and Portuguese versions of this disclaimer.


--
LogMeIn Rescue: Anywhere, Anytime Remote support for IT. Free Trial
Remotely access PCs and mobile devices and provide instant support
Improve your efficiency, and focus on delivering more value-add services
Discover what IT Professionals Know. Rescue delivers
http://p.sf.net/sfu/logmein_12329d2d
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] one flavor of MCS

2012-12-13 Thread Andrew Dalke
On Dec 13, 2012, at 9:18 AM, paul.czodrow...@merckgroup.com wrote:
> Regarding the issues you mentioned
   ...
> 
> - non-canonical SMARTS
> - duplicates are not filtered out

I think I figured out a way around that via some post-processing.


>> Or do you mean the number of molecules which contain that structure,
>> in which case "CC" exists in 3 of the structures.

Okay. You want a (say) 15% threshold, and the threshold MCS isn't
yet ported over to RDKit. It is in that hacked module I sent the
other day.

> Following up on the example from the above section:
> "
> CCNCc1c1
> c1ccc(cc1)CN
> CCNCc1c1
> c1cnc[nH]
> "

BTW, that last SMILES isn't complete.

> => I would be more than happy to finally have this output:
> "
> newflavorOfMCS   frequency
> ##
> c1ccc(cc1)CN 3
> c1cnc[nH]1
> "

That script is attached. Here's an example of it in use:

% python paul.py --min-num-bonds 7 paul.smi
[11:13:31] SMILES Parse Error: unclosed ring for input: c1cnc[nH]
Could not process SMILES c1cnc[nH] on line 4
frequency #atoms #bonds SMILES SMARTS
3 7 7 Cc1c1 [#6]:1:[#6]:[#6]:[#6](-[#6]):[#6]:[#6]:1
2 8 7 ccc(CNC)cc [#6]-[#7]-[#6]-[#6](:[#6]:[#6]):[#6]:[#6]
3 8 8 NCc1c1 [#7]-[#6]-[#6]:1:[#6]:[#6]:[#6]:[#6]:[#6]:1
3 8 7 ccCN [#6](:[#6]:[#6]):[#6]:[#6]:[#6]-[#6]-[#7]
2 8 7 (CNC)c [#6]-[#7]-[#6]-[#6](:[#6]):[#6]:[#6]:[#6]
3 8 7 c(CN)c [#6](:[#6]:[#6]:[#6](-[#6]-[#7]):[#6]):[#6]
2 8 7 cCNC [#6]-[#7]-[#6]-[#6]:[#6]:[#6]:[#6]:[#6]
2 8 7 ccc(c)CNCC [#6]-[#6]-[#7]-[#6]-[#6](:[#6]):[#6]:[#6]
2 8 7 CNCC [#6]-[#6]-[#7]-[#6]-[#6]:[#6]:[#6]:[#6]
3 8 7 (CN)cc [#7]-[#6]-[#6](:[#6]:[#6]):[#6]:[#6]:[#6]
2 9 8 c(CNC)c [#6]-[#7]-[#6]-[#6](:[#6]:[#6]:[#6]:[#6]):[#6]
2 9 8 ccc(CNCC)cc [#6]-[#6]-[#7]-[#6]-[#6](:[#6]:[#6]):[#6]:[#6]
2 9 8 (CNC)cc [#6]-[#7]-[#6]-[#6](:[#6]:[#6]):[#6]:[#6]:[#6]
2 9 8 (c)CNCC [#6]-[#6]-[#7]-[#6]-[#6](:[#6]:[#6]:[#6]):[#6]
2 9 9 CNCc1c1 [#6]-[#7]-[#6]-[#6]:1:[#6]:[#6]:[#6]:[#6]:[#6]:1
2 9 8 ccCNC [#6]-[#7]-[#6]-[#6]:[#6]:[#6]:[#6]:[#6]:[#6]
2 9 8 cCNCC [#6]-[#6]-[#7]-[#6]-[#6]:[#6]:[#6]:[#6]:[#6]
2 10 9 c(c)CNCC [#6]-[#6]-[#7]-[#6]-[#6](:[#6]):[#6]:[#6]:[#6]:[#6]
2 10 9 (cc)CNCC [#6]-[#6]-[#7]-[#6]-[#6](:[#6]:[#6]):[#6]:[#6]:[#6]
2 10 10 CCNCc1c1 [#6]-[#6]-[#7]-[#6]-[#6]:1:[#6]:[#6]:[#6]:[#6]:[#6]:1
2 10 9 ccCNCC [#6]-[#6]-[#7]-[#6]-[#6]:[#6]:[#6]:[#6]:[#6]:[#6]


You can see that 'c(CNC)c' and 'ccc(CNCC)cc' are rotations
around the same ring. That's a limitation of the current approach.

A non-hacked solution could hook into the information about
"complete-ring-only". That would give you structures more like
what a chemist would expect, though not the scaffolds that Christos
and Peter suggested.


import sys
import re
from rdkit import Chem
import hacked_fmcs

import argparse
parser = argparse.ArgumentParser("Find fragments")
parser.add_argument("filename",
help="SMILES filename")
parser.add_argument("--threshold", "-t", type=float, default=0.5,
help="minimum match threshold (default: 0.5)")
parser.add_argument("--min-num-atoms", type=int, default=2,
help="minimum number of atom for the match (default: 2)")
parser.add_argument("--min-num-bonds", type=int, default=1,
help="minimum number of bonds for the match (default: 1)")

args = parser.parse_args()
if args.min_num_atoms < 2:
parser.error("--min-num-atoms must be at least 2")
if args.min_num_bonds < 1:
parser.error("--min-num-bonds must be at least 2")
if args.threshold <= 0.0:
parser.error("--threshold must be positive")

table = Chem.GetPeriodicTable()
_atomic_num_to_symbol = dict((str(i), table.GetElementSymbol(i)) for i in range(105))
atom_pat = re.compile("\[#(\d+)\]")
def atomic_num_to_symbol(m):
return _atomic_num_to_symbol[m.group(1)]

def smarts_to_unique_smiles_fragment(smarts):
fragment_smiles = atom_pat.sub(atomic_num_to_symbol, smarts)
mol = Chem.MolFromSmiles(fragment_smiles, sanitize=False)
# Correct the aromaticity information
for bond in mol.GetBonds():
if bond.GetIsAromatic():
bond.GetBeginAtom().SetIsAromatic(1)
bond.GetEndAtom().SetIsAromatic(1)
return Chem.MolToSmiles(mol)

# Read in the input file
mols = []
for lineno, line in enumerate(open(args.filename)):
fields = line.split()
if not fields:
sys.stderr.write("Missing SMILES on line %d" % (lineno+1,))
continue
smiles = fields[0]
mol = Chem.MolFromSmiles(smiles)
if mol is None:
sys.stderr.write("Could not process SMILES %s on line %d\n" % (smiles, lineno+1))
continue
mols.append(mol)

mcs_result = hacked_fmcs.fmcs(mols,
  threshold=args.threshold,
  min_num_atoms=args.min_num_atoms,
  # fmcs doesn't implement min_num_bonds; use post-processing
  maximize="a

Re: [Rdkit-discuss] one flavor of MCS

2012-12-13 Thread Paul . Czodrowski
Dear Andrew,

thanks a lot for the quick hack and sorry for my late answer! I'm still
interested in that issue, but I had no access to coding facilities for
almost two days, what a shame!


>
> The current hooks in the MCS code don't make that possible. If you
> tweak the code a bit, I think you can get something working.
>
> % python hacked_fmcs.py sample_files/benzotriazole.sdf --maximize
all-matches
> [#6]:1:[#6]:[#6]:[#6]:2:[#7]:[#7]:[#7]:[#6]:2:[#6]:1 9 atoms 10
> bonds (complete search)
> == Successful matches ==
> [#7]:[#7]
> [#6]:[#6]
> [#6]:[#7]
> [#6]:[#6]:[#7]
> [#7]:[#7]:[#7]
> [#6]:[#6]:[#6]
> [#6]:[#7]:[#7]
> [#7]:[#7]:[#6]
>  ...
> [#6]:1:[#6]:[#6]:[#6]:2:[#7]:[#7]:[#7]:[#6]:2:[#6]:1
> [#6]:1:[#6]:[#6]:[#6](:[#6](:[#6]:1):[#7]:[#7]):[#7]
> [#6]:1:[#6]:[#6]:[#6](:[#6](:[#6]:1):[#7]):[#7]:[#7]
>
> I've attached both a patch against the current version of fmcs.py
> plus the full, hacked modification. It adds a new '.matches' field
> to the MCS result. The key is the SMARTS pattern tested, the value
> is if the pattern was in enough of the molecules. (In a non-hacked
> solution I would likely return only the valid matches.)
>


I tried your hacked script on the example you described below as well as on
a this small data set:
CCNCc1c1
c1ccc(cc1)CN
CCNCc1c1
c1cnc[nH]


Regarding the issues you mentioned
- incomplete, untested & awkward naming
- rather slow speed
- non-canonical SMARTS
- duplicates are not filtered out
=> these restrictions are OK for me!

> > In addition, I would like to output the frequency of the found
> > substructures
>
> That's much harder from fmcs. Then again, I don't know what you mean
> by frequency.
>
> Consider the compounds   CCOCC  CCNCC  CCPCCSCC
>
> The MCS is "CC". The "CC" exists twice (uniquely) in the first
> structure, twice in the second, and three times in the third.
>
> Is the frequency of "CC" 2 or 3?
>
> Or do you mean the number of molecules which contain that structure,
> in which case "CC" exists in 3 of the structures.

That's exactly what I meant.


>
>
> In either case, I don't think the right solution is to do this in
> fmcs. You have the SMARTS patterns and the molecules, so do a SMARTS
> match yourself and get the exact statistics you want.

Following up on that topic - and also taking into account the comments by
Peter and Christos - it become clearer to me that fmcs might be not
appropriate for that kind of question.

You may be interested how such a question came by my mind:
Table 6 of this publication sparked my interest:
http://dx.doi.org/10.1002/cmdc.20124
>From a conversation with Andreas Bender, it was explained to me that they
did a brute-force all-against-all MCS search in pipeline pilot. It's a
miracle for me how to code something similar in RDKit.

Following up on the example from the above section:
"
CCNCc1c1
c1ccc(cc1)CN
CCNCc1c1
c1cnc[nH]
"
=> I would be more than happy to finally have this output:
"
newflavorOfMCS   frequency
##
c1ccc(cc1)CN 3
c1cnc[nH]1
"


Cheers & Thanks for all your input so far!

Paul

>
> Cheers,
>
> Andrew
> da...@dalkescientific.com

This message and any attachment are confidential and may be privileged or
otherwise protected from disclosure. If you are not the intended recipient,
you must not copy this message or attachment or disclose the contents to
any other person. If you have received this transmission in error, please
notify the sender immediately and delete the message and any attachment
from your system. Merck KGaA, Darmstadt, Germany and any of its
subsidiaries do not accept liability for any omissions or errors in this
message which may arise as a result of E-Mail-transmission or for damages
resulting from any unauthorized changes of the content of this message and
any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
subsidiaries do not guarantee that this message is free of viruses and does
not accept liability for any damages caused by any virus transmitted
therewith.

Click http://www.merckgroup.com/disclaimer to access the German, French,
Spanish and Portuguese versions of this disclaimer.


--
LogMeIn Rescue: Anywhere, Anytime Remote support for IT. Free Trial
Remotely access PCs and mobile devices and provide instant support
Improve your efficiency, and focus on delivering more value-add services
Discover what IT Professionals Know. Rescue delivers
http://p.sf.net/sfu/logmein_12329d2d
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] one flavor of MCS

2012-12-11 Thread Paul . Czodrowski

Dear RDKitters,

given a data set of let's say 2000 compounds, how do I extract the most
common substructures rather than the maximum common substructures?
In addition, I would like to output the frequency of the found
substructures

E.g., the output would look like that
N(CCc1c1)C, frequency: 12
Nc1c1, frequency: 10

Is such an analysis only possible via a brut-force approach?
Or can one do it smarter?



Cheers & Thanks,
Paul

This message and any attachment are confidential and may be privileged or
otherwise protected from disclosure. If you are not the intended recipient,
you must not copy this message or attachment or disclose the contents to
any other person. If you have received this transmission in error, please
notify the sender immediately and delete the message and any attachment
from your system. Merck KGaA, Darmstadt, Germany and any of its
subsidiaries do not accept liability for any omissions or errors in this
message which may arise as a result of E-Mail-transmission or for damages
resulting from any unauthorized changes of the content of this message and
any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
subsidiaries do not guarantee that this message is free of viruses and does
not accept liability for any damages caused by any virus transmitted
therewith.

Click http://www.merckgroup.com/disclaimer to access the German, French,
Spanish and Portuguese versions of this disclaimer.


--
LogMeIn Rescue: Anywhere, Anytime Remote support for IT. Free Trial
Remotely access PCs and mobile devices and provide instant support
Improve your efficiency, and focus on delivering more value-add services
Discover what IT Professionals Know. Rescue delivers
http://p.sf.net/sfu/logmein_12329d2d
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss