Re: [Rdkit-discuss] Potential memory leak Ipc Descriptor

2015-10-15 Thread Michael Reutlinger
Hi Greg,

thanks for your swift response! I tried what you suggested and I did not
observe any increased memory consumption. I investigated further and
eventually identified an issue with numpy (in
Graphs.CharacteristicPolynomial) as the main cause of the memory problem.
Updating numpy to a recent version solved it.
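
For anyone who wants to check a suspected leak in a similar way, here is a minimal sketch using only the Python standard library. It is an assumption-laden illustration, not the procedure used in this thread: the helper name peak_rss_growth is hypothetical, the resource module is Unix-only, and ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.

```python
import resource


def peak_rss_growth(func, iterations):
    """Call func() repeatedly and return the growth in peak resident memory.

    Hypothetical helper for illustration. The unit of the result is
    platform dependent (kilobytes on Linux, bytes on macOS).
    """
    before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    for _ in range(iterations):
        func()
    after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return after - before


if __name__ == "__main__":
    # Replace the no-op with the call under suspicion.
    print(peak_rss_growth(lambda: None, 1000))
```

Passing a callable that wraps the suspected call (for example the descriptor calculation, or the distance-matrix call) and comparing the growth for different iteration counts gives a rough indication of whether memory keeps climbing.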

Best,
Michael

On Thu, Oct 15, 2015 at 6:36 AM, Greg Landrum <greg.land...@gmail.com>
wrote:

> Hi Michael,
>
> On Wed, Oct 14, 2015 at 7:06 PM, Michael Reutlinger <rd...@mulchi.de>
> wrote:
>
>>
>> I observed a memory leak while using the RDKit to calculate descriptors
>> for a large library of compounds.
>>
>> I tracked it down to the Ipc descriptor and it is reproducible with this
>> small script:
>>
>> from rdkit.ML.Descriptors import MoleculeDescriptors
>> from rdkit import Chem
>>
>> calculator = MoleculeDescriptors.MolecularDescriptorCalculator(['Ipc'])
>> for n in range(10):
>>     mol = Chem.MolFromSmiles('CC(C)Cc1ccc(cc1)C(C)C(=O)O')
>>     x = calculator.CalcDescriptors(mol)
>>     if not n % 100:
>>         print(n)
>>
>> I tested it on my Linux workstation (Redhat 6). The process memory
>> consumption increases to several hundred MB. Interestingly, I can't
>> reproduce it on my Mac running the latest OS.
>>
>
> I can't reproduce it on my Mac either. I'm on vacation and don't have
> access to my linux box, but I will see if I can reproduce it when I'm back
> next week. Which version(s) of python are you using on the machines?
>
>> My guess is that the leak is caused by getDistanceMatrix in MolOps.cpp.
>> Specifically, a missing delete for the distMat pointer (in the getDistanceMat
>> documentation is a note that the pointer should be deleted by the
>> caller). However, I am not a c++ programmer myself and this analysis might
>> not be the true cause.
>>
>
> The docs actually say that the pointer should *not* be deleted by the
> caller, but that's not relevant here anyway. The C++ object is copied into
> a new python numpy array object before being returned to the user.
>
>
>> I hope it is reproducible on other systems and easy to fix :-) If you
>> need additional information please let me know.
>>
>
> The simplest possible test would be to see if you get the same leak when
> you just call Chem.GetDistanceMatrix(mol,0) repeatedly.
>
> Best,
> -greg
>
>
--
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] Potential memory leak Ipc Descriptor

2015-10-14 Thread Michael Reutlinger
Hi,

I observed a memory leak while using the RDKit to calculate descriptors for
a large library of compounds.

I tracked it down to the Ipc descriptor and it is reproducible with this
small script:

from rdkit.ML.Descriptors import MoleculeDescriptors
from rdkit import Chem

calculator = MoleculeDescriptors.MolecularDescriptorCalculator(['Ipc'])
for n in range(10):
    mol = Chem.MolFromSmiles('CC(C)Cc1ccc(cc1)C(C)C(=O)O')
    x = calculator.CalcDescriptors(mol)
    if not n % 100:
        print(n)

I tested it on my Linux workstation (Redhat 6). The process memory
consumption increases to several hundred MB. Interestingly, I can't
reproduce it on my Mac running the latest OS.

My guess is that the leak is caused by getDistanceMatrix in MolOps.cpp.
Specifically, a missing delete for the distMat pointer (the getDistanceMat
documentation notes that the pointer should be deleted by the caller).
However, I am not a C++ programmer myself and this analysis might not be
the true cause.

I hope it is reproducible on other systems and easy to fix :-) If you need
additional information please let me know.

Best,
Michael


Re: [Rdkit-discuss] cis/trans directional bond and smiles strings in python

2015-10-14 Thread Michael Reutlinger
Dear all,

I realised that I tricked myself into believing that there was a more
general issue, which is not the case :-)

However, I also think that it would be consistent to deal with this in the
same way as a blank / tab etc. is handled currently, i.e. returning None and
printing a warning message to the console.
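
Until the parser itself does something like this, a guard on the caller's side is easy to write. The following is only a sketch under my own assumptions: reject_whitespace is a hypothetical helper, not part of RDKit, and it treats any whitespace character (blank, tab, newline) in the same way.

```python
import warnings


def reject_whitespace(smiles):
    """Return smiles unchanged, or None (after a warning) if it contains
    any whitespace character. Hypothetical helper, not an RDKit function.
    """
    if any(ch.isspace() for ch in smiles):
        warnings.warn("SMILES contains whitespace: %r" % smiles)
        return None
    return smiles


if __name__ == "__main__":
    # The first string contains a real newline, the second a backslash + n.
    print(reject_whitespace('C/C=C\n1nc(nn1)C'))
    print(reject_whitespace('C/C=C\\n1nc(nn1)C'))
```

The checked string can then be handed to MolFromSmiles only when the helper does not return None.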

Best,
Michael

On Wed, Oct 14, 2015 at 4:39 PM, Rocco Moretti <rmoretti...@gmail.com>
wrote:

> On Wed, Oct 14, 2015 at 12:00 AM, Greg Landrum <greg.land...@gmail.com>
> wrote:
>
>>
>> On Mon, Oct 12, 2015 at 10:52 PM, Michael Reutlinger <rd...@mulchi.de>
>> wrote:
>>
>>>
>>> However, I thought that it might be something that could be done by the
>>> toolkit to avoid errors that could go unnoticed for a long time :-)
>>>
>>
>
>> Though it's possible to modify the SMILES and SMARTS parsers to attempt a
>> bit of "do what I mean, not what I say" logic in cases like this, that
>> would be error prone and counter to the general functioning of the rest of
>> the toolkit.
>>
>
> Would raising an error (or warning) be appropriate here? The SMILES parser
> is getting a two line string, but is only using the first. If it was a
> space or a tab rather than a newline, you get a parse error. I'm not sure
> why a newline should be any different, given the function signature. (If it
> was a multi-molecule return value, then I could see parsing each line as a
> separate molecule. But as it's only a single molecule return, there
> probably should be some error/warning if not all the input is used.)
>
> Regards,
> -Rocco
>


[Rdkit-discuss] cis/trans directional bond and smiles strings in python

2015-10-12 Thread Michael Reutlinger
Hi all,

I just found an unexpected behaviour in the current RDKit. My input is a
perfectly valid smiles with explicitly specified double bond configuration.
Actually, similar smiles were obtained using the RDKit.

The problem is, when submitting the smiles string containing an \n to
MolFromSmiles only the part before the \n is used and the rest is
disregarded. The \ needs to be quoted to a \\ in order to work correctly.

Is this a desired / expected behaviour?

Best,
Michael

[image: Inline image 2]


Re: [Rdkit-discuss] cis/trans directional bond and smiles strings in python

2015-10-12 Thread Michael Reutlinger
Hi David,

thanks for your answer and yes, this seems to be the case.

It could be solved by either using raw strings or escaping with
smiles = smiles.encode('string-escape')
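
To make the difference concrete, here is a small standard-library illustration of the raw-string option (variable names are mine). Note that, as far as I know, the 'string-escape' codec exists only in Python 2, so the raw string or the doubled backslash is the portable choice.

```python
# Plain literal: the two characters \ and n are turned into one newline.
plain = 'C/C=C\n1nc(nn1)C'

# Raw literal: the backslash is kept as an ordinary character.
raw = r'C/C=C\n1nc(nn1)C'

# Doubling the backslash gives the same string as the raw literal.
escaped = 'C/C=C\\n1nc(nn1)C'

print(len(plain), len(raw))        # the raw string is one character longer
print(raw == escaped)              # both spellings denote the same string
print('\n' in plain, '\n' in raw)  # only the plain literal holds a newline
```

Only the raw or escaped form still contains the backslash that carries the double bond configuration.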

However, I thought that it might be something that could be done by the
toolkit to avoid errors that could go unnoticed for a long time :-)

Best,
Michael


On Mon, Oct 12, 2015 at 10:42 PM, David Hall <li...@cowsandmilk.net> wrote:

> That behavior appears to all be in python; as you’ve written it, your
> smiles string has a newline before rdkit ever sees it:
>
> >>> print 'C/C=C\n1nc(nn1)C'
> C/C=C
> 1nc(nn1)C
> >>> print 'C/C=C\\n1nc(nn1)C'
> C/C=C\n1nc(nn1)C
>
>
> On Oct 12, 2015, at 4:37 PM, Michael Reutlinger <rd...@mulchi.de> wrote:
>
> Hi all,
>
> I just found an unexpected behaviour in the current RDKit. My input is a
> perfectly valid smiles with explicitly specified double bond configuration.
> Actually, similar smiles were obtained using the RDKit.
>
> The problem is, when submitting the smiles string containing an \n to
> MolFromSmiles only the part before the \n is used and the rest is
> disregarded. The \ needs to be quoted to a \\ in order to work correctly.
>
> Is this a desired / expected behaviour?
>
> Best,
> Michael
>
> 
>
>
>
>


[Rdkit-discuss] AP / DP descriptors

2015-07-05 Thread Michael Reutlinger
Dear all,

I would like to use a machine learning method with the AP and DP
descriptors as described by Robert Sheridan.

AP descriptors are the 'atom pair' descriptors from Carhart et al. 1985 and
I think they are already available in RDKit.
DP 'donor−acceptor pair', called 'BP' in Kearsley et al. 1996, is a reduced
pharmacophore version of AP.

I would like to know if you think there is a straightforward way to use the
existing AP functionality (maybe using atomInvariants) to reproduce the
descriptor as described in Kearsley et al.?
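
To illustrate what I mean, here is a toolkit-independent sketch of the counting step only: every atom gets a reduced type label, and pairs are counted as (type, topological distance, type). This is an assumption on my side, not the published definition; the labels 'D' and 'A' are placeholders, the actual typing scheme of Kearsley et al. is not reproduced, and typed_pair_counts is a hypothetical helper. In practice the distance matrix could come from Chem.GetDistanceMatrix.

```python
from collections import Counter
from itertools import combinations


def typed_pair_counts(atom_types, dist):
    """Count (type_a, distance, type_b) triples over all atom pairs.

    atom_types: one label per atom, or None for atoms without a type.
    dist: square matrix of topological distances (number of bonds).
    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for i, j in combinations(range(len(atom_types)), 2):
        a, b = atom_types[i], atom_types[j]
        if a is None or b is None:
            continue  # untyped atoms do not contribute to any pair
        first, second = sorted((a, b))  # make the pair order-independent
        counts[(first, dist[i][j], second)] += 1
    return counts


if __name__ == "__main__":
    # Three-atom chain: donor, untyped atom, acceptor.
    types = ['D', None, 'A']
    dist = [[0, 1, 2],
            [1, 0, 1],
            [2, 1, 0]]
    print(typed_pair_counts(types, dist))
```

The open question is whether the existing AP code can be fed such reduced types directly instead of recounting the pairs by hand.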

Best,
Michael


Re: [Rdkit-discuss] SDF properties in case of error

2015-05-02 Thread Michael Reutlinger
Hi Greg,

thanks for your answer. I agree that the lighter-weight solution is
certainly also a possibility and would clearly solve my (and possibly
others') problem. Maybe a suppl.GetLastItemError() would then also be handy
to get the error messages that usually are only visible in the log.

But maybe something like an ErrorMol (as described in more detail by Andrew
Dalke) could potentially be more versatile. If an ErrorMol class is
inherited from Mol it could be processed in a standard way but one could
clearly differentiate this vehicle from an empty molecule. By having
different handlers, it would also be possible to add Exceptions in the
future, if people prefer having this behaviour :-)

However, both implementations would be a big improvement and could help to
avoid dealing with special cases somewhere else in the workflow, leading to
more robust workflows and eventually fewer errors.

Have a nice weekend,
Michael




On Sat, May 2, 2015 at 2:25 PM, Greg Landrum <greg.land...@gmail.com> wrote:

> Hi Michael,
>
> What you request is certainly possible, but it is a pretty fundamental
> change in the way the supplier (and mol file parser) works, so it would
> need some thought.
>
> One concern that immediately occurs to me is that you will not be able to
> tell which molecules from the input file were actually empty in the input
> and which were just empty because there was a problem parsing an input
> molecule.
>
> A possible alternative, more general and somewhat lighter weight, would be
> to ensure that you can always get the text of the last item parsed from a
> ForwardSDMolSupplier (a method like: suppl.GetLastItemText()); this would
> allow you to do whatever special error handling you are interested in doing
>
> -greg
>
>
> On Fri, May 1, 2015 at 12:01 AM, Michael Reutlinger <rd...@mulchi.de>
> wrote:
>
>> Hi all,
>>
>> I am currently working on a program which needs to process libraries of
>> large SDF files. One requirement is to always produce a valid output
>> including the molecule title/name or a specified property for referencing.
>>
>> Specifying sanitize=False with ForwardSDMolSupplier and calling
>> Chem.SanitizeMol afterwards with appropriate exception handling helps in
>> most cases to get the SD file properties and still detect errors in the
>> molecules to avoid importing rubbish.
>>
>> However, in some cases this does not help. E.g. when an unknown atom
>> (most of the time this is X) is found in the MolBlock the import fails with
>> a Post-condition Violation and None is yielded. This is fine to detect the
>> problem BUT it is impossible to get any information about the molecule
>> which failed.
>>
>> My question is if there is a way to get to the data even for those cases?
>> The files tend to be very big so accessing the molecule re-parsing it
>> line-by-line in python to get the name for a specific molecule number
>> (found by enumerating the supplier) is not really an option.
>>
>> What would be a good solution in my opinion is to create an empty
>> molecule with all sd properties, including _Name, in case of an error
>> instead of None. The actual error could then also be communicated into
>> python via an '_Error' property. With this it would still be possible to
>> continue processing of the file in a for loop, in contrast to raising an
>> Exception, and it is easy to check if the molecule is empty.
>> Maybe this behaviour could be activated via an option and the default
>> would be to return None, to not break any existing code.
>>
>> I am very keen on getting your view on this issue.
>>
>> Best regards,
>> Michael
>>






Re: [Rdkit-discuss] Rdkit-discuss Digest, Vol 91, Issue 1

2015-05-02 Thread Michael Reutlinger
Hi Andrew,

thanks for your helpful and detailed email. Your chemfp package is clearly
also an alternative to use. Let's see how the discussion evolves as I would
love if this could be part of the standard RDKit.

I completely agree that it would also be nice to have a convenient way to
get the error messages using something like the proposed mechanism.

Best,
Michael


 On May 1, 2015, at 12:01 AM, Michael Reutlinger wrote:
  However, in some cases this does not help. E.g. when an unknown atom
 (most of the time this is X) is found in the MolBlock the import fails with
 an Post-condition Violation and None is yielded. This is fine to detect the
 problem BUT it is impossible to get any information about the molecule
 which failed.
 As a backup solution, outside of RDKit, you might try my chemfp package,
 available from
 https://chem-fingerprints.googlecode.com/files/chemfp-1.1.tar.gz
 (Hmm, looks like I need to migrate that away from Google Code.)
 One of the internal functions [*] has a way to read individual SDF records
 as text:
   >>> for record in sdf_reader.open_sdf("tests/pubchem.sdf"):
   ...   print record.split("\n", 1)[0]
   ...
   9425004
   9425009
   9425012
   9425015
   9425018
   9425021
 If you use the bit of code in this email after my signature you can
 extract the tag/data pair from the record:
 >>> from chemfp import sdf_reader
 >>> for record in sdf_reader.open_sdf("tests/pubchem.sdf"):
 ...   id = record.partition("\n")[0]
 ...   tags = dict(get_sdf_tag_pairs(record))
 ...   print id, tags["PUBCHEM_OPENEYE_ISO_SMILES"]
 ...
 9425004 CC1=CC(=NN1CC(=O)NNC(=O)\C=C\C2=C(C=CC=C2Cl)F)C
 9425009 CC1=CC(=NN1CC(=O)NNC(=O)CCC2=NC(=NO2)C3=CC=CC=C3)C
 9425012 CCC1=NOC(=C1C(=O)NNC(=O)CN2C(=CC(=N2)C)C)C
 9425015 CC1=CC(=NN1CC(=O)NNC(=O)CCC(=O)C2=CC=C(C=C2)C3=CC=CC=C3)C
 9425018 CC1=CC(=NN1CC(=O)NNC(=O)C2=CC=CC=C2SCC(=O)N(C)C)C
   ...
 I also included a function called MolFromSDBlock which is like
 MolFromMolBlock except that it also copies over the tag data as
 properties. In that way you can get what you want from RDKit like this:

 >>> for record in sdf_reader.open_sdf("/Users/dalke/databases/chembl_14.sdf"):
 ...   mol = MolFromSDBlock(record)
 ...   if mol is None:
 ...     print "Could not process", dict(get_sdf_tag_pairs(record))["chembl_id"]
 ...   else:
 ...     print mol.GetProp("chembl_id"), mol.GetNumAtoms()
 ...
 CHEMBL438581 165
 CHEMBL155459 44
 CHEMBL154288 52
 CHEMBL443179 56
 CHEMBL443183 92
 CHEMBL443332 18
   ..
 CHEMBL265763 40
 [01:03:52] Explicit valence for atom # 0 B, 5, is greater than permitted
 Could not process CHEMBL268118
 CHEMBL265830 29
   ...
 I've also sketched out a solution which returns an empty molecule with the
 _Name, _Error, and properties set from the SD tag. There's only one
 line to comment out to get it, but I've not actually tested that code path.

 Be aware that I wasn't quite as experienced in how to parse SD files when
 I wrote code for chemfp-1.1 some 5 years ago. For example, you shouldn't
 have tag data with a line starting with a '>'.

 [*] By internal I mean that it's not documented and not part of the
 stable API. In fact, it has changed in more recent versions of chemfp,
 where similar functionality is now part of the stable API. However, those
 more recent versions, while still free/open source software, are a
 commercial product and cost money.
 Contact me if you are interested in purchasing a copy. :)
  My question is if there is a way to get to the data even for those
 cases? The files tend to be very big so accessing the molecule re-parsing
 it line-by-line in python to get the name for a specific molecule number
 (found by enumerating the supplier) is not really an option.
 My timing numbers for chemfp-1.1 had about the same performance as RDKit's
 own parser. In newer versions I fixed some of the corner cases, and rewrote
 the code in C for better performance.

  What would be a good solution in my opinion is to create an empty
 molecule with all sd properties, including _Name, in case of an error
 instead of None. The actual error could then also be communicated into
 python via an '_Error' property.
   ...
  Maybe this behaviour could be activated via an option and the default
 would be to return None, to not break any existing code.
 It would have to be via an option, for exactly the reason you highlighted.
 The option might look like:
    ForwardSDMolSupplier(..., onError=handler)
 The simplest is if handler is one of a handful of possible string values:
   - "None" to return None on failure; the current behavior
   - "ErrorMol" to return an error molecule like you describe
 Personally, I would love some easy way from the Python API to get access
 to the warning and error messages without having to intercept the log
 messages. I think that something like this is the way to get there.
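
For readers without chemfp at hand, the record-as-text fallback discussed above can be sketched with the standard library alone. This is a rough illustration and not chemfp's or RDKit's implementation: iter_sd_records and sd_title_and_tags are hypothetical helpers, and the code assumes records end with a '$$$$' line, data headers look like '> <TAG>', and values end at a blank line.

```python
def iter_sd_records(lines):
    """Yield each SD record as a list of lines, split on '$$$$' lines."""
    record = []
    for line in lines:
        line = line.rstrip('\n')
        if line.strip() == '$$$$':
            yield record
            record = []
        else:
            record.append(line)
    if any(entry.strip() for entry in record):
        yield record  # trailing record without a final '$$$$'


def sd_title_and_tags(record):
    """Return (title, {tag: value}) for one record without parsing atoms."""
    title = record[0].strip() if record else ''
    # Only look for data items after the end of the mol block, if present.
    i = 0
    for idx, line in enumerate(record):
        if line.startswith('M  END'):
            i = idx + 1
            break
    tags = {}
    while i < len(record):
        line = record[i]
        if line.startswith('>') and '<' in line:
            name = line[line.index('<') + 1:line.rindex('>')]
            i += 1
            value = []
            # The value runs until the next blank line.
            while i < len(record) and record[i].strip():
                value.append(record[i])
                i += 1
            tags[name] = '\n'.join(value)
        else:
            i += 1
    return title, tags
```

Each record could be joined back into text and handed to Chem.MolFromMolBlock; when that returns None, the title and tags are still available for reporting.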

[Rdkit-discuss] SDF properties in case of error

2015-04-30 Thread Michael Reutlinger
Hi all,

I am currently working on a program which needs to process libraries of
large SDF files. One requirement is to always produce a valid output
including the molecule title/name or a specified property for referencing.

Specifying sanitize=False with ForwardSDMolSupplier and calling
Chem.SanitizeMol afterwards with appropriate exception handling helps in
most cases to get the SD file properties and still detect errors in the
molecules to avoid importing rubbish.

However, in some cases this does not help. E.g. when an unknown atom (most
of the time this is X) is found in the MolBlock the import fails with a
Post-condition Violation and None is yielded. This is fine to detect the
problem BUT it is impossible to get any information about the molecule
which failed.

My question is if there is a way to get to the data even for those cases?
The files tend to be very big so accessing the molecule re-parsing it
line-by-line in python to get the name for a specific molecule number
(found by enumerating the supplier) is not really an option.

What would be a good solution in my opinion is to create an empty molecule
with all sd properties, including _Name, in case of an error instead of
None. The actual error could then also be communicated into python via an
'_Error' property. With this it would still be possible to continue
processing of the file in a for loop, in contrast to raising an Exception,
and it is easy to check if the molecule is empty.
Maybe this behaviour could be activated via an option and the default would
be to return None, to not break any existing code.

I am very keen on getting your view on this issue.

Best regards,
Michael


[Rdkit-discuss] nan in Gasteiger Charges

2015-04-20 Thread Michael Reutlinger
Dear all,

I noticed a problem with compounds containing sulfur hexafluoride and
similar groups: the Gasteiger charges are -nan for all atoms.

Here is an example:
In [23]: s = 'c1ccc(cc1)S(F)(F)(F)(F)F'

In [24]: AllChem.ComputeGasteigerCharges(mol)

In [25]: mol.GetAtomWithIdx(0).GetProp('_GasteigerCharge')
Out[25]: '-nan'
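
Since the value comes back as a string, a caller-side check is straightforward. This is just a sketch with a hypothetical helper name (charge_is_finite); it relies on Python's float() accepting spellings such as 'nan' and '-nan'.

```python
import math


def charge_is_finite(text):
    """Return True if a charge stored as text parses to a finite number."""
    try:
        value = float(text)
    except ValueError:
        return False  # not a number at all
    return math.isfinite(value)


if __name__ == "__main__":
    print(charge_is_finite('-nan'))
    print(charge_is_finite('0.05'))
```

Applying this to the _GasteigerCharge property of every atom would flag the affected molecules before the charges are used further.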

Best,
Michael


[Rdkit-discuss] MolFromMolBlock RadicalElectrons descriptor

2015-03-25 Thread Michael Reutlinger
Dear all,

I noticed another small issue when importing mol / sd data and calculating
RDKit descriptors.

There is a difference between direct import and smiles (or second mol-block
conversion):

Take Mercury as a simplified example: [Hg+]

If imported as smiles Descriptors.NumRadicalElectrons() reports 1.0

>>> smiMol = Chem.MolFromSmiles('[Hg+]')
>>> Descriptors.NumRadicalElectrons(smiMol)
1.0

If I use this mdl molfile:

mdl = """Mercury
  Mrv0541 03241519412D

  1  0  0  0  0  0            999 V2000
    5.2935   -7.2164    0.0000 Hg  0  3  0  0  0  0  0  0  0  0  0  0
M  CHG  1   1   1
M  END"""

>>> mdlMol = Chem.MolFromMolBlock(mdl)
>>> Descriptors.NumRadicalElectrons(mdlMol)
0.0

BUT if converted twice I get the same as with the Smiles input:

>>> mdlMol2 = Chem.MolFromMolBlock(Chem.MolToMolBlock(mdlMol))
>>> Descriptors.NumRadicalElectrons(mdlMol2)
1.0

The difference in the Mol file is that the valency in the ctab is set to
15 (0) when exported using MolToMolBlock.

I am wondering if there might be a standardization / assignment step
missing in the MolFromMolBlock when called with an external mol file as
input?

Best,
Michael