Re: [Rdkit-discuss] Building RDKit on Windows for pgAdmin (Postgres)

2022-04-14 Thread Paolo Tosco
Hi Charmaine,

my suggestion is to build starting from a conda environment. That will
significantly simplify your dependencies, since most packages are available
pre-built.
You can look into .azure-pipelines\vs_build_dll.yml for how to set up your
conda environment and for the cmake flags to use.
For Postgres, these are the cmake flags that I use to configure the
PostgreSQL DLL build:

-D RDK_BUILD_PGSQL=ON ^
-D RDK_PGSQL_STATIC=OFF ^
-D PostgreSQL_INCLUDE_DIR="C:/Program Files/PostgreSQL/14/include" ^
-D PostgreSQL_TYPE_INCLUDE_DIR="C:/Program Files/PostgreSQL/14/include/server" ^
-D PostgreSQL_LIBRARY="C:/Program Files/PostgreSQL/14/lib/postgres.lib" ^
..

Note that this is to build against Postgres installed using the Windows
installer from the PostgreSQL website, not the version provided by conda.

Cheers,
p.


On Thu, Apr 14, 2022 at 11:36 AM Charmaine Siu Man Chu <
charmaine@liverpoolchirochem.com> wrote:

> Hello,
>
>
>
> I’ve been trying to build RDKit on Windows so that I can get the RDKit
> extension in pgAdmin (Postgres) but I’ve been unsuccessful.
>
> I’ve tried to follow the instruction on
> https://www.rdkit.org/docs/Install.html to build RDKit and encountered
> several problems.
>
> Some of them I manage to resolve myself but can someone help on this?
>
>
>
> Here is what I’ve tried:
>
>
>
> To start off, I’m using the Windows 11 64-bit OS and Postgres 14 was
> installed.
>
> I then went to install the following:
>
> Python 3.10 along with numpy and Pillow
>
> Visual Studio 2022 with Visual Studio Community 2022, Visual Studio Build
> Tools 2022
>
> Cmake 3.23.0
>
> Boost_1_78_0 using the .exe installer
>
> And downloaded and extracted the rdkit-Release_2022_03_1
>
>
>
> After installing the above, I ran the following in Command Prompt
>
> cmake -DRDK_BUILD_PYTHON_WRAPPERS=ON -DBOOST_ROOT=C:/locall/boost_1_78_0
> -DRDK_BUILD_INCHI_SUPPORT=ON -DRDK_BUILD_AVALON_SUPPORT=ON
> -DRDK_BUILD_PGSQL=ON -DPostgreSQL_ROOT="C:\Program Files\PostgreSQL\14"
> -G"Visual Studio 17 2022" ..
>
> came back with error:
>
> The following variants have been tried and rejected:
>
>
>
>   * boost_python310-vc143-mt-gd-x64-1_78.lib (shared, default on Windows is
>
>   static, set Boost_USE_STATIC_LIBS=OFF to override)
>
>
>
>   * boost_python310-vc143-mt-x64-1_78.lib (shared, default on Windows is
>
> static, set Boost_USE_STATIC_LIBS=OFF to override)
>
>
>
> I then edited the command line to
>
> cmake -DRDK_BUILD_PYTHON_WRAPPERS=ON -DBOOST_ROOT=C:/locall/boost_1_78_0
> -DBoost_USE_STATIC_LIBS=OFF -DRDK_BUILD_INCHI_SUPPORT=ON
> -DRDK_BUILD_AVALON_SUPPORT=ON -DRDK_BUILD_PGSQL=ON
> -DPostgreSQL_ROOT="C:\Program Files\PostgreSQL\14" -G"Visual Studio 17
> 2022" ..
>
> and this time the following error occurred
>
> Could NOT find Eigen3 (missing: EIGEN3_INCLUDE_DIR EIGEN3_VERSION_OK)
> (Required is at least version "2.91.0")
>
> CMake Error at C:/Program
> Files/CMake/share/cmake-3.23/Modules/ExternalProject.cmake:2540 (message):
>
>   error: could not find git for clone of Eigen
>
>
>
> So I downloaded and extracted Eigen 3.4.0 and edited the command line
> to
>
> cmake -DRDK_BUILD_PYTHON_WRAPPERS=ON -DBOOST_ROOT=C:/locall/boost_1_78_0
> -DBoost_USE_STATIC_LIBS=OFF -DEIGEN3_INCLUDE_DIR="C:/eigen-3.4.0"
> -DRDK_BUILD_INCHI_SUPPORT=ON -DRDK_BUILD_AVALON_SUPPORT=ON
> -DRDK_BUILD_PGSQL=ON -DPostgreSQL_ROOT="C:\Program Files\PostgreSQL\14"
> -G"Visual Studio 17 2022" ..
>
> which gave the following error
>
> CMake Error at C:/Program
> Files/CMake/share/cmake-3.23/Modules/FindPackageHandleStandardArgs.cmake:230
> (message):
>
> Could NOT find Freetype (missing: FREETYPE_LIBRARY FREETYPE_INCLUDE_DIRS)
>
>
>
> So again I went to obtain freetype-2.12.0 and build it following the
> instructions on
> https://bobhowto.wordpress.com/2020/11/10/build-freetype-on-windows-10-using-visual-studio-2017/,
> using VS 2022 instead, and edit the command line to
>
> cmake -DRDK_BUILD_PYTHON_WRAPPERS=ON -DBOOST_ROOT=C:/local/boost_1_78_0
> -DRDK_BUILD_INCHI_SUPPORT=ON -DRDK_BUILD_AVALON_SUPPORT=ON
> -DBoost_USE_STATIC_LIBS=OFF -DEIGEN3_INCLUDE_DIR="C:/eigen-3.4.0"
> -DFREETYPE_INCLUDE_DIRS="C:/freetype-2.12.0/include"
> -DFREETYPE_LIBRARY="C:/freetype-2.12.0/objs/freetype.lib"
> -DRDK_BUILD_PGSQL=ON -DPostgreSQL_ROOT="C:\Program Files\PostgreSQL\14"
> -G"Visual Studio 17 2022" ..
>
> and it managed to compile
>
> However, when running
>
> "C:\Program Files\Microsoft Visual
> Studio\2022\Community\MSBuild\Current\Bin\MSBuild.exe" /m:4
> /p:Configuration=Release INSTALL.vcxproj
>
> Gave the following error with 47 other warnings
>
> "C:\rdkit-Release_2022_03_1\build\INSTALL.vcxproj" (default target) (1) ->
>
>"C:\rdkit-Release_2022_03_1\build\ALL_BUILD.vcxproj" (default
> target) (3) ->
>
>
> "C:\rdkit-Release_2022_03_1\build\Code\GraphMol\Deprotect\Wrap\rdDeprotect.vcxproj"
> (default target) (20) ->
>
>
> "C:\rdkit-Release_2022_03_1\build\Code\GraphMol\ChemReactions\ChemReactions.vcxproj"
> (default target) (48) ->
>
>

Re: [Rdkit-discuss] how to report SDF records for which Chem.ForwardSDMolSupplier returns None?

2022-04-14 Thread Giovanni Tricarico
Thank you Ivan, that's great!
It does exactly what I wanted.
OK, I cannot read the properties for wrong records, but one could argue when 
the molecule is wrong there is no point reading the properties.

I only adjusted a couple of details (made sure that the file is read as text, 
not binary, otherwise no valid record is found; included the handling of 
gzipped files; added a writer to store the wrong records as they are read).
Something like this:

import gzip
import rdkit
from rdkit import Chem
import pickle

def read_record(fh):
    # Collect lines up to and including the '$$$$' record delimiter.
    lines = []
    for line in fh:
        lines.append(line)
        if line.rstrip() == '$$$$':
            return ''.join(lines)

def read_records(fh):
    while True:
        rec = read_record(fh)
        if rec is None:
            return
        yield rec

sup = Chem.SDMolSupplier()
d = dict()
mols_pickled = []
i = 0
with gzip.open('x.sdf.gz', 'rt') as fh, gzip.open('x_wrong.sdf.gz', 'wt') as fh_wrong:
    for rec in read_records(fh):
        sup.SetData(rec)
        mol = next(sup)
        if mol is None:
            # Keep the raw text of the failed record for later inspection.
            fh_wrong.write(rec)
        else:
            d[i] = mol.GetPropsAsDict()
            mols_pickled.append(pickle.dumps(mol))
        i += 1

After running this, there is a file with the wrong records (if any), the 
correct molecules are pickled and stored in a list, and their data are in a 
dictionary of dictionaries that can be converted for instance to a DataFrame.

I take your point of caution regarding whether it is safe or not to split 
records by detecting a line exactly equal to the string '$$$$'.
Not sure how much of a practical problem it really is.
I can report that I've just run the above script on an SDF with 1.5 M molecules 
coming from several sources, and it completed in 5 minutes without stumbling.
But yes, if someone is dead set on making this go wrong, it is possible, by 
just defining a property with exact value '$$$$' and writing it out to an SDF.
I've yet to come across such an occurrence, after several years of working with 
SDFs. So I'm not overly worried.
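
To make the edge case concrete, here is a small sketch in plain Python (no
RDKit needed). The two-record SD text and the split_naive helper are made up
for illustration; the first record carries a property whose value is the line
'$$$$', which the line-based splitter cannot tell apart from a real delimiter.

```python
import io

# Hypothetical SD file with two records; record 1 has a data item whose
# value is the literal line "$$$$".
SDF_TEXT = (
    "mol1\n  made-up\n\n"
    "  0  0  0  0  0  0  0  0  0  0999 V2000\n"
    "M  END\n"
    "> <NOTE>\n"
    "$$$$\n"
    "\n"
    "$$$$\n"
    "mol2\n  made-up\n\n"
    "  0  0  0  0  0  0  0  0  0  0999 V2000\n"
    "M  END\n"
    "$$$$\n"
)


def split_naive(fh):
    """Yield chunks, treating every line equal to '$$$$' as a delimiter."""
    lines = []
    for line in fh:
        lines.append(line)
        if line.rstrip() == "$$$$":
            yield "".join(lines)
            lines = []


records = list(split_naive(io.StringIO(SDF_TEXT)))
print(len(records))  # 3 chunks, although the file was meant to hold 2 records
```

The first record is cut short at the property value, and the second chunk is
just a blank line plus a delimiter, so every later record number is off by one.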

Thanks again for your input!
Giovanni

From: Ivan Tubert-Brohman 
Sent: 14 April 2022 12:58
To: Giovanni Tricarico 
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] how to report SDF records for which 
Chem.ForwardSDMolSupplier returns None?

How about splitting the file on lines consisting of "$$$$", and then parsing 
each record? If the parsing fails, you can write out the bad record for future 
inspection. (This addresses the basic use case, but not the "even better" one.)

Here's a proof of concept:

from rdkit import Chem

def read_record(fh):
    lines = []
    for line in fh:
        lines.append(line)
        if line.rstrip() == '$$$$':
            return ''.join(lines)

def read_records(fh):
    while True:
        rec = read_record(fh)
        if rec is None:
            return
        yield rec

sup = Chem.SDMolSupplier()
with open('x.sdf') as fh:
    for rec in read_records(fh):
        sup.SetData(rec)
        mol = next(sup)
        if mol is None:
            print("Bad record:\n", rec)
            continue
        print(mol.GetPropsAsDict())

I worry that this is not strictly correct, because what if the value of a 
property happens to be "$$$$"? But apparently RDKit's own SDMolSupplier is also 
confused by this (or maybe such values are forbidden by the file format and/or 
there's some escape mechanism? I haven't checked), so I don't feel nearly as 
bad about that.

Ivan

On Wed, Apr 13, 2022 at 4:29 PM Giovanni Tricarico 
<giovanni.tricar...@glpg.com> wrote:
Hello,
I am using rdkit to read data from SD files.

My goal is to extract both the molecules and their associated properties (which 
for our purposes are separate entities) from the SDF.
[For 100% clarity: by 'properties' I don't mean calculated properties or atom 
or bond properties, but the text properties that were saved in the SDF with 
each molecule, i.e. those that you get when you do mol.GetPropsAsDict() ].

After several tests I found that Chem.ForwardSDMolSupplier does what I need.

But there is an issue.
When Chem.ForwardSDMolSupplier decides that a molecule is not OK, i.e. when it 
says it is None, the SDF record is lost:
I cannot access its Props; I cannot save the failed SDF record for later 
inspection.
[Or at least, I don't know how to do it, hence this question].
At most I can collect the indices of the records that fail.

> Would anyone be able to suggest how to save to a text file (which an SDF 
> essentially already is) the SDF records for which Chem.ForwardSDMolSupplier 
> returns a None?
> Even better, could the properties associated to the failed molecules be read 
> independently? In theory the properties are in a separate part of the CTAB, 
> so even when the atoms, bonds, etc.

Re: [Rdkit-discuss] how to report SDF records for which Chem.ForwardSDMolSupplier returns None?

2022-04-14 Thread Andrew Dalke
On Apr 14, 2022, at 12:57, Ivan Tubert-Brohman wrote:
> How about splitting the file on lines consisting of "$$$$", and then parsing 
> each record? If the parsing fails, you can write out the bad record for 
> future inspection. (This addresses the basic use case, but not the "even 
> better" one.)

Yes, if you know your data is "clean", then you can do that.

I wrote an essay at
  
http://www.dalkescientific.com/writings/diary/archive/2020/09/18/handling_the_sdf_record_delimiter.html
about some of the ways that can cause problems.

They do occur in real-world data sets. And they do cause problems in some 
processing pipelines.

Public data sets like PubChem, ChEMBL, etc. don't have these problems. They 
occur mostly in in-house data sets, though even there a problem is not common.

> def read_record(fh):
>     lines = []
>     for line in fh:
>         lines.append(line)
>         if line.rstrip() == '$$$$':
>             return ''.join(lines)

See also 
https://baoilleach.blogspot.com/2020/05/python-patterns-for-processing-large.html
 .

The reasons I think there should be a low-level library for this sort of work 
are:

1) the edge cases are tricky to handle,

2) the simple readers like this are slow

3) I believe good error reporting needs things like the starting line number 
and/or starting byte position for the record. Implementing that is a bit tricky 
(and boring), and tracking that information in a compiled extension has a much 
lower overhead than doing it in Python.
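
For illustration only, point 3 can be sketched in pure Python as below. This
is not how chemfp or RDKit do it; the SDRecord type and the
read_records_with_location name are invented here. The reader works on a
binary file handle so that the byte offset is exact, and it still has the
delimiter edge cases and the speed problem from points 1 and 2.

```python
import io
from typing import BinaryIO, Iterator, NamedTuple


class SDRecord(NamedTuple):
    recno: int    # 1-based record number
    lineno: int   # 1-based line number of the record's first line
    offset: int   # byte offset of the record's first byte
    text: str     # decoded record text, including the delimiter line


def read_records_with_location(fh: BinaryIO, encoding: str = "utf-8") -> Iterator[SDRecord]:
    """Split a binary file handle on '$$$$' lines, remembering where each record starts."""
    recno, lineno, offset = 1, 1, 0
    start_lineno, start_offset = lineno, offset
    lines = []
    for raw in fh:
        lines.append(raw)
        lineno += 1
        offset += len(raw)
        if raw.rstrip() == b"$$$$":
            yield SDRecord(recno, start_lineno, start_offset,
                           b"".join(lines).decode(encoding))
            recno += 1
            start_lineno, start_offset = lineno, offset
            lines = []
    if lines:
        # Trailing text with no closing delimiter: report it rather than drop it.
        yield SDRecord(recno, start_lineno, start_offset,
                       b"".join(lines).decode(encoding))


data = b"a\nM  END\n$$$$\nb\nM  END\n$$$$\n"
records = list(read_records_with_location(io.BytesIO(data)))
for rec in records:
    print(rec.recno, rec.lineno, rec.offset)
```

The same function accepts a handle from gzip.open(path, 'rb'); in that case
the offsets refer to the uncompressed stream.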


Cheers,


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] how to report SDF records for which Chem.ForwardSDMolSupplier returns None?

2022-04-14 Thread Ivan Tubert-Brohman
How about splitting the file on lines consisting of "$$$$", and then
parsing each record? If the parsing fails, you can write out the bad record
for future inspection. (This addresses the basic use case, but not the
"even better" one.)

Here's a proof of concept:

from rdkit import Chem

def read_record(fh):
    lines = []
    for line in fh:
        lines.append(line)
        if line.rstrip() == '$$$$':
            return ''.join(lines)

def read_records(fh):
    while True:
        rec = read_record(fh)
        if rec is None:
            return
        yield rec

sup = Chem.SDMolSupplier()
with open('x.sdf') as fh:
    for rec in read_records(fh):
        sup.SetData(rec)
        mol = next(sup)
        if mol is None:
            print("Bad record:\n", rec)
            continue
        print(mol.GetPropsAsDict())

I worry that this is not strictly correct, because what if the value of a
property happens to be "$$$$"? But apparently RDKit's own SDMolSupplier is
also confused by this (or maybe such values are forbidden by the file
format and/or there's some escape mechanism? I haven't checked), so I don't
feel nearly as bad about that.
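
As for the "even better" case, one possible approach is to pull the data items
out of the raw record text without touching the connection table at all. The
sketch below is plain Python and only an assumption about what would be
acceptable: read_sdf_tags is a made-up helper, it takes the field name from the
angle brackets on each "> <NAME>" header line, keeps every value as a string
(no type conversion like GetPropsAsDict does), and stops a value at the first
blank line.

```python
import re

# A data header line starts with '>' and carries the field name in <...>.
_TAG_HEADER = re.compile(r"^>.*<([^>]+)>")


def read_sdf_tags(record: str) -> dict:
    """Return {field name: value} for the data items of one SD record."""
    lines = record.splitlines()
    # Data items follow the molfile, which ends at the "M  END" line.
    start = next((i + 1 for i, line in enumerate(lines)
                  if line.rstrip() == "M  END"), 0)
    props = {}
    name, value = None, []
    for line in lines[start:]:
        if name is None:
            match = _TAG_HEADER.match(line)
            if match:
                name, value = match.group(1), []
        elif line.strip() == "" or line.rstrip() == "$$$$":
            # A blank line (or the record delimiter) closes the current item.
            props[name] = "\n".join(value)
            name = None
        else:
            value.append(line)
    if name is not None:
        props[name] = "\n".join(value)
    return props


bad_record = (
    "bad\n  made-up\n\n"
    " this counts line is broken on purpose\n"
    "M  END\n"
    "> <ID>\nCHEBI:1\n\n"
    "> <Name>\nline1\nline2\n\n"
    "$$$$\n"
)
tags = read_sdf_tags(bad_record)
print(tags)
```

In the loop above this could be called on rec in the branch where mol is None,
so the properties of a failed record are still available.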

Ivan

On Wed, Apr 13, 2022 at 4:29 PM Giovanni Tricarico <
giovanni.tricar...@glpg.com> wrote:

> Hello,
>
> I am using rdkit to read data from SD files.
>
>
>
> My goal is to extract both the molecules and their associated properties
> (which for our purposes are separate entities) from the SDF.
>
> [For 100% clarity: by ‘properties’ I don’t mean calculated properties or
> atom or bond properties, but the text properties that were saved in the SDF
> with each molecule, i.e. those that you get when you do
> mol.GetPropsAsDict() ].
>
>
>
> After several tests I found that Chem.ForwardSDMolSupplier does what I
> need.
>
>
>
> But there is an issue.
>
> When Chem.ForwardSDMolSupplier decides that a molecule is not OK, i.e.
> when it says it is None, the SDF record is lost:
>
> I cannot access its Props; I cannot save the failed SDF record for later
> inspection.
>
> [Or at least, I don’t know how to do it, hence this question].
>
> At most I can collect the indices of the records that fail.
>
>
>
> > Would anyone be able to suggest how to save to a text file (which an SDF
> essentially already is) the SDF records for which
> Chem.ForwardSDMolSupplier returns a None?
>
> > Even better, could the properties associated to the failed molecules be
> read independently? In theory the properties are in a separate part of the
> CTAB, so even when the atoms, bonds, etc. have a problem, the properties
> might still be OK.
>
>
>
> (Note: PandasTools.LoadSDF gives the same issue, it does not even store
> in the DataFrame the records for which the molecule is None, and in any
> case it cannot be used with the kind of SDF’s I am handling, as it uses an
> enormous amount of memory for the molecules – hence the decision to use
> Chem.ForwardSDMolSupplier and pickle the molecules as soon as they are
> read).
>
>
>
> Thanks


Re: [Rdkit-discuss] how to report SDF records for which Chem.ForwardSDMolSupplier returns None?

2022-04-14 Thread Andrew Dalke
On Apr 14, 2022, at 09:16, Gyro Funch  wrote:
> I don't know the sdf format well, so please excuse my ignorance, but instead 
> of a custom parser, would it be possible to write a preprocessor to eliminate 
> the offending information? Perhaps something using regular expressions in 
> python, perl, sed, or awk?

The SDF format is too complicated to be parsed with a regular expression[1], 
and the failure modes often cannot be detected at the syntax level[2]. I 
suggest people may consider using chemfp for this [3].

[1] For example, in a V2000-formatted record, the number of atom records and 
the number of bond records are given by a repeat count. A traditional/formal 
regular expression does not support repeat counts where the count comes from 
the pattern matching itself. 

Most regular expression engines have more powerful capabilities than formal 
regular expressions, such as matching back-references to captured groups. 
However, few support using a backreference as a repeat count.

I wrote one that did, which would let you specify

(?P...)(?P...) and so on)
(?P(?P.{10})(?P.{10}) and so on){atom_count}
(?P(?P.{3})(?P.{3}) and so on){bond_count}

but in practice, defining the grammar through a regular expression grammar was 
decidedly not easy!
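
To illustrate why the repeat count is awkward for a regular expression but
easy for ordinary code, here is a minimal sketch in plain Python. It is not
the regex engine described above; split_v2000_blocks and the two-atom example
molblock are made up. It relies on the V2000 layout where the fourth line is
the counts line, with the atom count in columns 1-3 and the bond count in
columns 4-6.

```python
def split_v2000_blocks(molblock: str):
    """Return (atom_lines, bond_lines) of a V2000 molblock using the counts line."""
    lines = molblock.splitlines()
    counts = lines[3]  # header is three lines; the counts line is the fourth
    if "V2000" not in counts:
        raise ValueError("not a V2000 counts line")
    n_atoms = int(counts[0:3])  # columns 1-3
    n_bonds = int(counts[3:6])  # columns 4-6
    atoms = lines[4:4 + n_atoms]
    bonds = lines[4 + n_atoms:4 + n_atoms + n_bonds]
    if len(atoms) != n_atoms or len(bonds) != n_bonds:
        raise ValueError("record is shorter than its counts line claims")
    return atoms, bonds


# Hypothetical two-atom example (C-O, hydrogens omitted).
MOLBLOCK = (
    "example\n"
    "  made-up\n"
    "\n"
    "  2  1  0  0  0  0  0  0  0  0999 V2000\n"
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
    "    1.4000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n"
    "  1  2  1  0\n"
    "M  END\n"
)

atoms, bonds = split_v2000_blocks(MOLBLOCK)
symbols = [line[31:34].strip() for line in atoms]  # atom symbol, columns 32-34
print(symbols, bonds)
```

The counts read from one line decide how many of the following lines belong to
each block, which is exactly the "backreference as a repeat count" that most
regular expression engines lack.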

I've wanted to experiment with using WUFFS to make a low-level SDF parser 
library, see
  https://github.com/google/wuffs


[2] For example, RDKit by default rejects atoms where the valence is too high. 
Detecting this in filter code calls for reverse-engineering what RDKit already 
does.

[3] Chemfp is best known as a fingerprint generation and search program. 
However, there are a few use cases where I wanted to have access to the input 
record (e.g., to detect toolkit failures, or to add fingerprint data to the 
input record rather than round-tripping the SDF through a toolkit). I did this 
by writing my own SDF record reader (in the "text_toolkit"), writing a wrapper 
to the RDKit toolkit (in the "rdkit_toolkit"), and using an error handler which 
can decide how to handle errors (ignore, report, raise an exception, log, 
etc.). That error handler has access to location information, which includes 
the record number, the record text, the line number of the start of the record, 
and more.

Here's what it looks like for Giovanni's use case:


from chemfp import rdkit_toolkit as T
from chemfp import text_toolkit

filename = "/Users/dalke/databases/ChEBI_complete_3star.sdf.gz"


class ErrorHandler:
    def __init__(self):
        self.error_ids = []

    def error(self, msg, location, extra=None):
        record = location.record
        chebi_id = text_toolkit.get_sdf_tag(location.record, "ChEBI ID")
        print(f"!!! Error reading record {location.recno} with ID: {chebi_id!r}")
        print(f"at {location.where()}")
        self.error_ids.append(chebi_id)

errors = ErrorHandler()
count = 0
num_atoms = 0
for mol in T.read_molecules(filename, errors=errors):
    count += 1
    num_atoms += mol.GetNumAtoms()  # This is a RDMol.

print(f"Parsed {count} records ({num_atoms} atoms), skipped {len(errors.error_ids)}.")


This functionality is available in the pre-compiled version of chemfp for 
Linux-based OSes, available from https://chemfp.com/download/ . The default 
license agreement (that is, you can use it without a license key) lets you use 
it for any internal purpose.

If anyone is interested in working on a stand-alone SDF parsing library under a 
free software license, I can provide some pointers and feedback, and will 
contribute chemfp's SDF parser under the MIT license.


Andrew
da...@dalkescientific.com






[Rdkit-discuss] Building RDKit on Windows for pgAdmin (Postgres)

2022-04-14 Thread Charmaine Siu Man Chu
Hello,

I've been trying to build RDKit on Windows so that I can get the RDKit 
extension in pgAdmin (Postgres) but I've been unsuccessful.
I've tried to follow the instruction on https://www.rdkit.org/docs/Install.html 
to build RDKit and encountered several problems.
Some of them I manage to resolve myself but can someone help on this?

Here is what I've tried:

To start off, I'm using the Windows 11 64-bit OS and Postgres 14 was installed.
I then went to install the following:
Python 3.10 along with numpy and Pillow
Visual Studio 2022 with Visual Studio Community 2022, Visual Studio Build Tools 
2022
Cmake 3.23.0
Boost_1_78_0 using the .exe installer
And downloaded and extracted the rdkit-Release_2022_03_1

After installing the above, I ran the following in Command Prompt
cmake -DRDK_BUILD_PYTHON_WRAPPERS=ON -DBOOST_ROOT=C:/locall/boost_1_78_0 
-DRDK_BUILD_INCHI_SUPPORT=ON -DRDK_BUILD_AVALON_SUPPORT=ON -DRDK_BUILD_PGSQL=ON 
-DPostgreSQL_ROOT="C:\Program Files\PostgreSQL\14" -G"Visual Studio 17 2022" ..
came back with error:
The following variants have been tried and rejected:

  * boost_python310-vc143-mt-gd-x64-1_78.lib (shared, default on Windows is
  static, set Boost_USE_STATIC_LIBS=OFF to override)

  * boost_python310-vc143-mt-x64-1_78.lib (shared, default on Windows is
static, set Boost_USE_STATIC_LIBS=OFF to override)

I then edited the command line to
cmake -DRDK_BUILD_PYTHON_WRAPPERS=ON -DBOOST_ROOT=C:/locall/boost_1_78_0 
-DBoost_USE_STATIC_LIBS=OFF -DRDK_BUILD_INCHI_SUPPORT=ON 
-DRDK_BUILD_AVALON_SUPPORT=ON -DRDK_BUILD_PGSQL=ON 
-DPostgreSQL_ROOT="C:\Program Files\PostgreSQL\14" -G"Visual Studio 17 2022" ..
and this time the following error occurred
Could NOT find Eigen3 (missing: EIGEN3_INCLUDE_DIR EIGEN3_VERSION_OK) (Required 
is at least version "2.91.0")
CMake Error at C:/Program 
Files/CMake/share/cmake-3.23/Modules/ExternalProject.cmake:2540 (message):
  error: could not find git for clone of Eigen

So I downloaded and extracted Eigen 3.4.0 and edited the command line to
cmake -DRDK_BUILD_PYTHON_WRAPPERS=ON -DBOOST_ROOT=C:/locall/boost_1_78_0 
-DBoost_USE_STATIC_LIBS=OFF -DEIGEN3_INCLUDE_DIR="C:/eigen-3.4.0" 
-DRDK_BUILD_INCHI_SUPPORT=ON -DRDK_BUILD_AVALON_SUPPORT=ON -DRDK_BUILD_PGSQL=ON 
-DPostgreSQL_ROOT="C:\Program Files\PostgreSQL\14" -G"Visual Studio 17 2022" ..
which gave the following error
CMake Error at C:/Program 
Files/CMake/share/cmake-3.23/Modules/FindPackageHandleStandardArgs.cmake:230 
(message):
Could NOT find Freetype (missing: FREETYPE_LIBRARY FREETYPE_INCLUDE_DIRS)

So again I obtained freetype-2.12.0 and built it following the 
instructions on 
https://bobhowto.wordpress.com/2020/11/10/build-freetype-on-windows-10-using-visual-studio-2017/,
using VS 2022 instead, and edited the command line to
cmake -DRDK_BUILD_PYTHON_WRAPPERS=ON -DBOOST_ROOT=C:/local/boost_1_78_0 
-DRDK_BUILD_INCHI_SUPPORT=ON -DRDK_BUILD_AVALON_SUPPORT=ON 
-DBoost_USE_STATIC_LIBS=OFF -DEIGEN3_INCLUDE_DIR="C:/eigen-3.4.0" 
-DFREETYPE_INCLUDE_DIRS="C:/freetype-2.12.0/include" 
-DFREETYPE_LIBRARY="C:/freetype-2.12.0/objs/freetype.lib" -DRDK_BUILD_PGSQL=ON 
-DPostgreSQL_ROOT="C:\Program Files\PostgreSQL\14" -G"Visual Studio 17 2022" ..
and it managed to compile.
However, running
"C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Current\Bin\MSBuild.exe" /m:4 /p:Configuration=Release INSTALL.vcxproj
gave the following error, along with 47 warnings:
"C:\rdkit-Release_2022_03_1\build\INSTALL.vcxproj" (default target) (1) ->
"C:\rdkit-Release_2022_03_1\build\ALL_BUILD.vcxproj" (default target) (3) ->
"C:\rdkit-Release_2022_03_1\build\Code\GraphMol\Deprotect\Wrap\rdDeprotect.vcxproj" (default target) (20) ->
"C:\rdkit-Release_2022_03_1\build\Code\GraphMol\ChemReactions\ChemReactions.vcxproj" (default target) (48) ->
"C:\rdkit-Release_2022_03_1\build\Code\GraphMol\Descriptors\Descriptors.vcxproj" (default target) (55) ->
"C:\rdkit-Release_2022_03_1\build\Code\GraphMol\FileParsers\FileParsers.vcxproj" (default target) (56) ->
(ClCompile target) ->
  C:\rdkit-Release_2022_03_1\Code\GraphMol\FileParsers\PNGParser.cpp(25,10): fatal error C1083: Cannot open include file: 'zlib.h': No such file or directory [C:\rdkit-Release_2022_03_1\build\Code\GraphMol\FileParsers\FileParsers.vcxproj]

So I obtained zlib 1.2.12 and built it via
cmake -G"Visual Studio 17 2022" ..
"C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Current\Bin\MSBuild.exe" zlib.sln
copy /Y zconf.h ..
and compiled the Boost libraries via
b2 --prefix=C:\local\boost_1_78_0 -sZLIB_SOURCE="C:/zlib-1.2.12" -sZLIB_INCLUDE="C:/zlib-1.2.12" -sZLIB_LIBPATH="C:/zlib-1.2.12/build/Release" -sZLIB_BINARY="C:/zlib-1.2.12/build/Release/zlib.lib" --debug-configuration -d0 address-model=64 link=shared install
and changed the PATH in the environment variables to C:\local\boost_1_78_0\lib,
but running the cmake command and building INSTALL.vcxproj again gave the same error.

Re: [Rdkit-discuss] how to report SDF records for which Chem.ForwardSDMolSupplier returns None?

2022-04-14 Thread Gyro Funch

On 2022-04-14 08:12 AM, Giovanni Tricarico wrote:

Thank you Nils.

In fact I do want the sanitize + parse to happen, and I do some further checks 
on the molecules, too (ChEMBL pipeline etc).
The issue is that whatever does not pass the initial steps just completely 
disappears and cannot be reported or inspected in any way.

Indeed, making a custom SDF parser would be one option, as an SDF is just text, 
and rigidly 'structured' by its very definition; only, I was hoping someone had 
already written such a parser :)

For now I will just output the indices of the failed records; the user will 
then have to read them in another application for inspection.

Thanks
Giovanni

-Original Message-
From: Nils Weskamp 
Sent: 13 April 2022 22:55
To: Giovanni Tricarico ; 
rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] how to report SDF records for which 
Chem.ForwardSDMolSupplier returns None?


Hello Giovanni,

have you tried using the ForwardSDMolSupplier with sanitize = False and / or 
strictParsing = False ?

This should at least reduce the number of cases where molecules are not 
accepted. You would then have to sanitize the structures yourself afterwards 
and handle possible errors explicitly.

If that doesn't solve your problem, I would consider writing my own parser 
that just ignores everything looking like a CTAB.

Hope this helps,
Nils

Am 13.04.2022 um 18:15 schrieb Giovanni Tricarico:

Hello,

I am using rdkit to read data from SD files.

My goal is to extract both the molecules and their associated
properties (which for our purposes are separate entities) from the SDF.

[For 100% clarity: by 'properties' I don't mean calculated properties
or atom or bond properties, but the text properties that were saved in
the SDF with each molecule, i.e. those that you get when you do
mol.GetPropsAsDict() ].

After several tests I found that Chem.ForwardSDMolSupplier does what I need.

But there is an issue.

When Chem.ForwardSDMolSupplier decides that a molecule is not OK, i.e.
when it says it is None, the SDF record is lost:

I cannot access its Props; I cannot save the failed SDF record for
later inspection.

[Or at least, I don't know how to do it, hence this question].

At most I can collect the indices of the records that fail.

  > Would anyone be able to suggest how to save to a text file (which
an SDF essentially already is) the SDF records for which
Chem.ForwardSDMolSupplier returns a None?

  > Even better, could the properties associated to the failed
molecules be read independently? In theory the properties are in a
separate part of the CTAB, so even when the atoms, bonds, etc. have a
problem, the properties might still be OK.

(Note: PandasTools.LoadSDF gives the same issue, it does not even
store in the DataFrame the records for which the molecule is None, and
in any case it cannot be used with the kind of SDF's I am handling, as
it uses an enormous amount of memory for the molecules - hence the
decision to use Chem.ForwardSDMolSupplier and pickle the molecules as
soon as they are read).

Thanks




I don't know the sdf format well, so please excuse my ignorance, but 
instead of a custom parser, would it be possible to write a preprocessor 
to eliminate the offending information? Perhaps something using regular 
expressions in python, perl, sed, or awk?


Kind regards,
gyro


