Re: [Rdkit-discuss] GetSubstructMatches() and resonance structures

2014-10-31 Thread Paolo Tosco
Thanks to both of you for your replies. That's more or less what I was thinking 
of - I just wanted to make sure that nothing was already available before I 
started coding :-) I will get back to the list once I have something ready.

Cheers,
p.


 On 31 Oct 2014, at 05:17, Greg Landrum greg.land...@gmail.com wrote:
 
 The reply that Ling forwards has one approach to doing this.
 
 It's a bit easier for someone who is willing to do some C++ work.[1]
 
 One could imagine writing a function prepareForResonanceFormMatching(ROMol 
 m) (or some such thing) that would be applied to the *query* molecule that 
 does the following:
 - identifies the groups that need to be resonance-symmetrized
 - changes the resonance bonds to Query bonds that match single or double 
 (possibly also aromatic?)
 - neutralizes any charges on resonating atoms in the group. 
 
 The last step is important because the query C(O)O matches the molecule 
 C(O)[O-] twice, but the query C([O-])O only matches once:
 
 In [11]: Chem.MolFromSmiles('C(O)[O-]').GetSubstructMatches(Chem.MolFromSmiles('C(O)O'),uniquify=False)
 Out[11]: ((0, 1, 2), (0, 2, 1))
 
 In [12]: Chem.MolFromSmiles('C(O)[O-]').GetSubstructMatches(Chem.MolFromSmiles('C([O-])O'),uniquify=False)
 Out[12]: ((0, 2, 1),)
 
 I suspect such a function would be useful to multiple people.
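
A minimal sketch of a different, simpler route than the query-rewriting function described above: post-process the matches returned by GetSubstructMatches() by permuting target atoms that are known to be equivalent. The name expand_matches and the hand-supplied equivalent_groups argument are assumptions made for illustration, not part of the RDKit API. The sketch is pure Python and does not check that a permuted mapping is chemically valid.

```python
from itertools import permutations, product


def expand_matches(matches, equivalent_groups):
    """Return every match obtained by permuting equivalent target atoms.

    matches: tuples of target-atom indices, in the form returned by
        GetSubstructMatches().
    equivalent_groups: hypothetical, hand-supplied tuples of target-atom
        indices that are interchangeable by resonance (for example the two
        oxygens of a carboxylate).
    """
    expanded = []
    for match in matches:
        # Only permute groups whose atoms are all part of this match.
        active = [g for g in equivalent_groups if set(g) <= set(match)]
        per_group = [list(permutations(g)) for g in active]
        for combo in product(*per_group):
            # Build an old-index -> new-index map for this combination.
            mapping = {}
            for group, perm in zip(active, combo):
                mapping.update(zip(group, perm))
            candidate = tuple(mapping.get(idx, idx) for idx in match)
            if candidate not in expanded:
                expanded.append(candidate)
    return tuple(expanded)


# Formate, C([O-])=O: atoms 1 and 2 are the two oxygens.
print(expand_matches(((0, 1, 2),), [(1, 2)]))
# ((0, 1, 2), (0, 2, 1))
```

This only reorders atoms inside matches that were already found, so it cannot recover a match that the query missed entirely; rewriting the query, as outlined above, would still be needed for that.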
 
 For identifying the groups that are resonance symmetrized: though this could 
 be done using a set of particular patterns, it may be better to think about 
 doing it more generally by having it find resonance systems.[2] The flag 
 Bond.getIsConjugated(), set during sanitization, is probably useful for this. 
  
 
 -greg
 [1] well, to the extent that anything is ever easier in C++
 [2] this would allow finding the substructure matches within molecules like 
 C1=C(C)C=CC=CC=C1
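
The idea of finding resonance systems from the conjugation flag can be sketched as a connected-components search over conjugated bonds. This is only an illustration under stated assumptions: the function name conjugated_systems is hypothetical, and the input is a plain list of (begin, end, is_conjugated) tuples so that the sketch runs without RDKit. With RDKit, such tuples could be collected from bond.GetBeginAtomIdx(), bond.GetEndAtomIdx() and bond.GetIsConjugated() on a sanitized molecule. The bond list in the example is hand-written, not computed from a real molecule.

```python
def conjugated_systems(bonds):
    """Group atoms joined by conjugated bonds into connected systems.

    bonds: iterable of (begin_idx, end_idx, is_conjugated) tuples.
    Returns a sorted list of sorted atom-index lists, one per system.
    Atoms that take part in no conjugated bond are left out.
    """
    parent = {}

    def find(i):
        # Union-find lookup with path halving.
        parent.setdefault(i, i)
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for begin, end, is_conjugated in bonds:
        if is_conjugated:
            # Merge the two systems that this conjugated bond connects.
            parent[find(begin)] = find(end)

    systems = {}
    for atom in list(parent):
        systems.setdefault(find(atom), set()).add(atom)
    return sorted(sorted(s) for s in systems.values())


# Hand-written example: two conjugated systems separated by one
# non-conjugated bond between atoms 2 and 3.
example_bonds = [
    (0, 1, True),
    (0, 2, True),
    (2, 3, False),
    (3, 4, True),
    (4, 5, True),
]
print(conjugated_systems(example_bonds))
# [[0, 1, 2], [3, 4, 5]]
```

Each returned system is a candidate group for resonance symmetrization; deciding which bonds inside it to turn into query bonds is a separate step that this sketch does not attempt.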
 
 
 On Fri, Oct 31, 2014 at 2:09 AM, S.L. Chan slch...@yahoo.com wrote:
 Dear Paolo,
 
 I asked a very similar question last year. This is what Greg said.
 
 Ling
 
 Re: [Rdkit-discuss] atom equivalence for substructure matching
 
 From: Paolo Tosco paolo.to...@unito.it
 To: rdkit-discuss@lists.sourceforge.net 
 rdkit-discuss@lists.sourceforge.net 
 Sent: Thursday, October 30, 2014 4:26 PM
 Subject: [Rdkit-discuss] GetSubstructMatches() and resonance structures
 
 Dear all,
 
 The following code snippet compares two resonance structures of formate 
 anion:
 
 import rdkit
 from rdkit import Chem
 
 mol1=Chem.MolFromSmiles('C([O-])=O')
 mol2=Chem.MolFromSmiles('C(=O)[O-]')
 mol1.GetSubstructMatches(mol2, uniquify = False)
 ((0, 2, 1),)
 
 mol1.GetSubstructMatches(mol1, uniquify = False)
 ((0, 1, 2),)
 
 I would rather like to get, in both cases, the following output:
 ((0, 1, 2),(0, 2, 1))
 
 which would account for the carboxylate group symmetry due to resonance. 
 The same applies to amidinium, guanidinium, etc.
 
 Is that currently feasible within the RDKit API?
 
 Thanks in advance, cheers
 Paolo
 
 
 --
 ___
 Rdkit-discuss mailing list
 Rdkit-discuss@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
 
 
 


Re: [Rdkit-discuss] Export pandas DataFrame to xlsx with molecule images

2014-10-31 Thread Grégori Gerebtzoff

Hi Samo,

A few years ago I used the PHPExcel library to put images into an Excel 
file, and it was not necessary to use physical files.
Having a quick look at the library I found this class (probably the one 
I used): PHPExcel_Worksheet_MemoryDrawing (source code: 
https://github.com/clariondoor/PHPExcel/blob/master/Worksheet/MemoryDrawing.php)

The interesting bit:
public function __construct()
{
    // Initialise values
    $this->_imageResource     = null;
    $this->_renderingFunction = self::RENDERING_DEFAULT;
    $this->_mimeType          = self::MIMETYPE_DEFAULT;
    $this->_uniqueName        = md5(rand(0, ). time() . rand(0, ));

    // Initialize parent
    parent::__construct();
}

Thus I'm pretty sure you can use the same trick in python XlsxWriter 
(have a look at the _add_image_files function in packager.py), using a 
random file name and a bit stream to the image, as described here: 
http://xlsxwriter.readthedocs.org/en/latest/example_images_bytesio.html#ex-images-bytesio:


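from io import BytesIO

filename = 'python.png'
# Read the image into memory so that no file path is needed at insert time.
with open(filename, 'rb') as image_file:
    image_data = BytesIO(image_file.read())
# Write the byte stream image to a cell. The filename must be specified.
# `worksheet` is an existing XlsxWriter worksheet object.
worksheet.insert_image('B8', filename, {'image_data': image_data})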

At least it's worth a try!
Another trick I had to do both with PHPExcel and in VBA was to set the 
width of columns three times to make sure that it was actually correct. 
Don't ask me why... Just in case you face some width issues.


Good luck!

Grégori


On 30. 10. 14 16:49, Samo Turk wrote:

Hi rdkiters,

Due to popular demand I started to work on a function to export a pandas 
DataFrame to xlsx with molecule images embedded.
Because of the xlsx specifics the code is not optimal. The most 
annoying thing about this implementation is that it has to write all 
images to the hard drive before it packs them into the xlsx file (and 
deletes them at the end). I checked two Python xlsx libraries and both save 
images that way. If someone finds a better solution, please share it.


The dimensions of cells with images are not optimal because Excel is 
weird. :) From the xlsxwriter docs: "The width corresponds to the column 
width value that is specified in Excel. It is approximately equal to 
the length of a string in the default font of Calibri 11. 
Unfortunately, there is no way to specify “AutoFit” for a column in 
the Excel file format."


It crashes if the value of a cell is of the wrong type, so use 
df['value'].astype() to fix incorrectly assigned types.


Resulting files work nicely in Office 365 (standalone and web app), 
but for some reason don't work optimally with LibreOffice (after row 
~125 it stacks all images).


I made a pull request on GitHub: https://github.com/rdkit/rdkit/pull/371
Demo: 
http://nbviewer.ipython.org/github/Team-SKI/snippets/blob/master/IPython/rdkit_hackaton/XLSX%20export.ipynb
Demo xlsx file: 
https://github.com/Team-SKI/snippets/blob/master/IPython/rdkit_hackaton/demo.xlsx


Regards,
Samo




[Rdkit-discuss] Problems with Cartridge since Q3 2014 release

2014-10-31 Thread Daniel Moser
Hi all,

Since the new release I've been experiencing problems with exact structure 
search in the cartridge. If an index is defined on the mol column, exact 
structure search ( @= ) doesn't work (i.e. it yields no results). I tried it 
with RDKit compiled from source under CentOS 6.5 and with the RPMs from 
Gianluca Sforna for Fedora 20. In both cases PostgreSQL 9.3 was used. Can 
anyone confirm this, or am I missing something?

Here's what I've done (based on the emolecules example from the docs):

RDKit 2014.09 (not working):
#
[moe@localhost db]$ createdb emolecules
[moe@localhost db]$ psql -c 'create extension rdkit' emolecules
CREATE EXTENSION
[moe@localhost db]$ psql -c 'SELECT rdkit_version()' emolecules
 rdkit_version
---------------
 0.73.0
(1 row)

[moe@localhost db]$ wget http://downloads.emolecules.com/free/2014-10-01/version.smi.gz
[...]
2014-10-31 10:52:29 (1,08 MB/s) - 'version.smi.gz' saved [88871202/88871202]

[moe@localhost db]$ psql -c 'create table raw_data (id SERIAL, smiles text, emol_id integer, parent_id integer)' emolecules
CREATE TABLE
[moe@localhost db]$ zcat version.smi.gz | sed '1d; s/\\//g' | psql -c "copy raw_data (smiles,emol_id,parent_id) from stdin with delimiter ' '" emolecules
[moe@localhost db]$ psql emolecules
psql (9.3.5)
Type "help" for help.

emolecules=# SELECT * INTO mols FROM (SELECT id,mol_from_smiles(smiles::cstring) m FROM raw_data) tmp WHERE m IS NOT null LIMIT 1000;
SELECT 1000
emolecules=# SELECT * FROM mols WHERE m @= Mol_From_Smiles('CCOC(=N)CC');
 id  |     m
-----+------------
 383 | CCOC(=N)CC
(1 row)

emolecules=# CREATE INDEX molidx ON mols USING gist(m);
CREATE INDEX
emolecules=# SELECT * FROM mols WHERE m @= Mol_From_Smiles('CCOC(=N)CC');
 id | m
----+---
(0 rows)

emolecules=# DROP INDEX molidx; DROP TABLE mols; SELECT * INTO mols FROM (SELECT id,mol_from_smiles(smiles::cstring) m FROM raw_data) tmp WHERE m IS NOT null LIMIT 1000;  SELECT * FROM mols WHERE m @= Mol_From_Smiles('CCOC(=N)CC');
DROP INDEX
DROP TABLE
SELECT 1000
 id  |     m
-----+------------
 383 | CCOC(=N)CC
(1 row)
#


RDKit 2014.03 (working):
#
-bash-4.1$ createdb emolecules_test
-bash-4.1$ psql -c 'SELECT rdkit_version()' emolecules_test
 rdkit_version
---------------
 0.72.0
(1 row)

-bash-4.1$ psql -c 'create table raw_data (id SERIAL, smiles text, emol_id integer, parent_id integer)' emolecules_test
CREATE TABLE
-bash-4.1$ zcat version.smi.gz | sed '1d; s/\\//g' | psql -c "copy raw_data (smiles,emol_id,parent_id) from stdin with delimiter ' '" emolecules_test
-bash-4.1$ psql emolecules_test
psql (9.3.4)
Type "help" for help.

emolecules_test=# SELECT * INTO mols FROM (SELECT id,mol_from_smiles(smiles::cstring) m FROM raw_data) tmp WHERE m IS NOT null LIMIT 1000;
SELECT 1000
emolecules_test=# SELECT * FROM mols WHERE m @= Mol_From_Smiles('CCOC(=N)CC');
 id  |     m
-----+------------
 383 | CCOC(=N)CC
(1 row)

emolecules_test=# CREATE INDEX molidx ON mols USING gist(m);
CREATE INDEX
emolecules_test=# SELECT * FROM mols WHERE m @= Mol_From_Smiles('CCOC(=N)CC');
 id  |     m
-----+------------
 383 | CCOC(=N)CC
(1 row)

emolecules_test=# DROP INDEX molidx; DROP TABLE mols; SELECT * INTO mols FROM (SELECT id,mol_from_smiles(smiles::cstring) m FROM raw_data) tmp WHERE m IS NOT null LIMIT 1000;  SELECT * FROM mols WHERE m @= Mol_From_Smiles('CCOC(=N)CC');
DROP INDEX
DROP TABLE
SELECT 1000
 id  |     m
-----+------------
 383 | CCOC(=N)CC
(1 row)
#



Best,
Daniel



Re: [Rdkit-discuss] Problems with Cartridge since Q3 2014 release

2014-10-31 Thread Greg Landrum
Hi Daniel,

Looks like there is a difficult-to-reproduce bug in the interaction between
exact structure search and the index.
I will try to track it down.

In the meantime, it's probably good to be aware that the @= operator can,
even under the best of circumstances, behave somewhat oddly. This is a
consequence of the less-than-satisfactory way the RDKit currently implements
molecular equality. Some examples are below. If you want to search for
identical molecules, I'd suggest adding a SMILES column to your database
built using the mol_to_smiles(m) function, building an index on that, and
querying with molecules that have been run through mol_to_smiles(m). This is
a bit painful and adds an extra computation step, but it's more likely to
generate a correct result than relying on the molecular equality function.

Here are the examples:

contrib_regression=# select 'C[17OH]'::mol @= 'CO'::mol ;
 ?column?
----------
 f
(1 row)

contrib_regression=# select 'C[17OH]'::mol @= 'C[17OH]'::mol ;
 ?column?
----------
 t
(1 row)

contrib_regression=# select 'CO'::mol @= 'C[17OH]'::mol ;
 ?column?
----------
 f
(1 row)

On Fri, Oct 31, 2014 at 2:10 PM, Daniel Moser 
mo...@pharmchem.uni-frankfurt.de wrote:

 [...]