Re: [Rdkit-discuss] Pandas to Excel

2024-02-22 Thread Chris Swain via Rdkit-discuss
Hi Both,

Many thanks for your rapid response, much appreciated.

Cheers

Chris


___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Pandas to Excel

2024-02-22 Thread Taka Seri
Hi Chris,

I think you can do it with SaveXlsxFromFrame.
http://rdkit.org/docs/source/rdkit.Chem.PandasTools.html#SaveXlsxFromFrame

rdkit.Chem.PandasTools.SaveXlsxFromFrame(frame, outFile,
molCol='ROMol', size=(300, 300), formats=None)

Saves pandas DataFrame as a xlsx file with embedded images. molCol can be
either a single column label or a list of column labels. It maps numpy data
types to excel cell types:

  int, float -> number
  datetime -> datetime
  object -> string (limited to 32k characters - xlsx limitation)

The formats parameter can be optionally set to a dict of XlsxWriter formats
(https://xlsxwriter.readthedocs.io/format.html#format), e.g.:

  {'write_string': {'text_wrap': True}}

Currently supported keys for the formats dict are: 'write_string',
'write_number', 'write_datetime'.

Cells with compound images are a bit larger than images due to excel.
Column width weirdness explained (from xlsxwriter docs): The width
corresponds to the column width value that is specified in Excel. It is
approximately equal to the length of a string in the default font of
Calibri 11. Unfortunately, there is no way to specify “AutoFit” for a
column in the Excel file format. This feature is only available at runtime
from within Excel.
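
For illustration, here is a minimal sketch of how a formats dict might be
passed along. The helper name export_with_structures and the 'num_format'
entry are assumptions made for this example and are not taken from the RDKit
docs quoted above. The RDKit import is deferred into the function so that the
dict itself can be checked without RDKit or XlsxWriter installed.

```python
# Keys the quoted documentation lists as supported for the formats dict.
SUPPORTED_KEYS = {"write_string", "write_number", "write_datetime"}

# Each value is a dict of XlsxWriter format properties.
formats = {
    "write_string": {"text_wrap": True},      # taken from the docs above
    "write_number": {"num_format": "0.00"},   # assumption: illustrative choice
}


def export_with_structures(frame, out_file, mol_col="ROMol"):
    """Hypothetical helper: write `frame` to xlsx with molecule images."""
    # Lazy import: needs RDKit and XlsxWriter only when actually called.
    from rdkit.Chem import PandasTools

    unknown = set(formats) - SUPPORTED_KEYS
    if unknown:
        raise ValueError(f"unsupported format keys: {sorted(unknown)}")
    PandasTools.SaveXlsxFromFrame(
        frame, out_file, molCol=mol_col, size=(300, 300), formats=formats
    )
```

With a frame that has a molecule column, a call such as
export_with_structures(df, 'compounds.xlsx') should then produce the sheet;
the file name here is only a placeholder.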

Thanks,
Taka

On Thu, 22 Feb 2024 at 19:19, Chris Swain via Rdkit-discuss <
rdkit-discuss@lists.sourceforge.net> wrote:

> Hi,
>
> Is it possible to export from a Pandas data frame to Excel, inserting the
> structures as images in the excel sheet?
>
> Cheers
>
> Chris
>


[Rdkit-discuss] Pandas to Excel

2024-02-22 Thread Chris Swain via Rdkit-discuss
Hi,

Is it possible to export from a Pandas data frame to Excel, inserting the 
structures as images in the excel sheet?

Cheers

Chris



Re: [Rdkit-discuss] Pandas

2016-11-26 Thread Chris Swain
Search and add similarity to resulting data frame 

> On 27 Nov 2016, at 07:55, Greg Landrum  wrote:
> 
> 
> You don't know if what could be done as a single line?
> 
> -greg
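
One way to read the request is: compute the scores once, attach them as a
column, and filter on that column in the same expression. A rough sketch of
that idea follows. It assumes plain Python sets of on-bit indices as a
stand-in for RDKit bit vectors; the tanimoto helper and the toy data are
hypothetical. With real fingerprints the sims list would come from
DataStructs.BulkTanimotoSimilarity, as in the thread.

```python
import pandas as pd


def tanimoto(a, b):
    """Tanimoto coefficient of two sets of on-bit indices (stand-in helper)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Toy frame: 'mfp2' holds sets instead of RDKit bit vectors (assumption).
sdf = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "mfp2": [{1, 2, 3, 4}, {1, 2, 3}, {1, 2, 9, 10}, {7, 8}],
})
qry = {1, 2, 3, 4}

# Score every row once ...
sims = [tanimoto(qry, fp) for fp in sdf["mfp2"]]

# ... then add the score column and keep the hits in a single expression.
hits = sdf.assign(sim=sims).loc[lambda d: d["sim"] > 0.5]
```

The resulting frame keeps the similarity next to each hit, so it can be
sorted or exported without recomputing anything.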



Re: [Rdkit-discuss] Pandas

2016-11-26 Thread Greg Landrum
On Sun, Nov 27, 2016 at 8:45 AM, Chris Swain  wrote:

> I added the similarity scores
> by adding an extra line,
>
> sdf['sim']=DataStructs.BulkTanimotoSimilarity(ionised_fps,sdf['mfp2'])
>

Yes, that's what I would have done.


> I don’t know if it could be done in a single line?
>

You don't know if what could be done as a single line?

-greg





Re: [Rdkit-discuss] Pandas

2016-11-26 Thread Chris Swain
I added the similarity scores by adding an extra line:

sdf['sim']=DataStructs.BulkTanimotoSimilarity(ionised_fps,sdf['mfp2'])

I don’t know if it could be done in a single line?

Chris



Re: [Rdkit-discuss] Pandas

2016-11-26 Thread Chris Swain
This works very nicely. It would be nice to add the similarity scores to the 
resulting data frame.

Cheers,

Chris



Re: [Rdkit-discuss] Pandas

2016-11-25 Thread Greg Landrum
That's a good question.

I'm not a master of pandas indexing, but this seems to work:
In [5]: sdf['mfp2'] = [rdMolDescriptors.GetMorganFingerprintAsBitVect(x,2) for x in sdf['ROMol']]
In [8]: sims = DataStructs.BulkTanimotoSimilarity(qry,sdf['mfp2'])
In [13]: ids = [x for x,y in enumerate(sims) if y>0.5]
In [18]: ndf = sdf.iloc[ids]
In [19]: len(ndf)
Out[19]: 3

The question is whether or not that's actually faster.

In [21]: def filt1(sdf,qry):
    ...:     sims = DataStructs.BulkTanimotoSimilarity(qry,sdf['mfp2'])
    ...:     ids = [x for x,y in enumerate(sims) if y>0.5]
    ...:     return sdf.iloc[ids]
    ...:

In [22]: def filt2(sdf,qry):
    ...:     return sdf[sdf.apply(lambda x:DataStructs.TanimotoSimilarity(x['mfp2'],qry)>0.5,axis=1)]
    ...:

In [25]: %timeit filt1(sdf,qry)
1 loop, best of 3: 458 ms per loop
In [28]: %timeit filt2(sdf,qry)
1 loop, best of 3: 798 ms per loop

And it certainly is.

-greg
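
To make the two strategies easier to compare, here is a small pure-Python
sketch of both. It is not RDKit code: integers are used as stand-in bit
vectors, and popcount, tanimoto and bulk_tanimoto are hypothetical helpers
written for this example. It shows only that the bulk route and the
per-item route select the same entries. It does not reproduce the timings
above, which depend on how RDKit implements the bulk call.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(a, b):
    """Tanimoto coefficient of two integers treated as bit vectors."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def bulk_tanimoto(query, fps):
    """Stand-in for a bulk call: one query against a whole list."""
    return [tanimoto(query, fp) for fp in fps]


def filt1(fps, query, threshold=0.5):
    # Bulk route: score everything first, then collect matching positions.
    sims = bulk_tanimoto(query, fps)
    return [i for i, s in enumerate(sims) if s > threshold]


def filt2(fps, query, threshold=0.5):
    # Per-item route: one similarity call per entry inside the filter.
    return [i for i, fp in enumerate(fps) if tanimoto(fp, query) > threshold]


fps = [0b1111, 0b0111, 0b110011, 0b11000000]
query = 0b1111
print(filt1(fps, query), filt2(fps, query))
```

The positions returned by either function are what sdf.iloc[ids] consumes in
the session above.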





Re: [Rdkit-discuss] Pandas

2016-11-23 Thread Brian Kelley
Peter,
  If you have chemfp and can make a chemfp arena, RDKit now supports these
structures for reading and searching.  This, by far, is the fastest way I
know of similarity searching.  I believe that Greg's implementation is
compatible with chemfp 1.0 which is available on pypi:

https://pypi.python.org/pypi/chemfp/1.0

In my copious spare time, I've been trying to think of ways to embed this
directly in a pandas dataframe however, using them side by side is
certainly doable.

Cheers,
 Brian




Re: [Rdkit-discuss] Pandas

2016-11-23 Thread Peter Gedeck
Is it possible to use the bulk similarity searching functionality for
better performance instead of the list comprehension?

Best,

Peter


On Wed, Nov 23, 2016 at 9:11 AM Greg Landrum  wrote:

No worries.
This, and Anna's question about similarity searching and clustering
illustrate a great opportunity for a tutorial on fingerprints and
similarity searching.

-greg





On Wed, Nov 23, 2016 at 3:00 PM +0100, "Chris Swain"  wrote:

Thanks for this,

As a chemist who comes from the “cut and paste” school of scripting I’m
always concerned I’m asking something blindingly obvious

;-)

Chris

On 23 Nov 2016, at 12:36, Greg Landrum  wrote:

[including rdkit-discuss, because it's relevant there and I'm pretty sure
Chris won't mind and the real Pandas experts may have a better answer than
me.]

On Wed, Nov 23, 2016 at 9:51 AM, Chris Swain  wrote:


I quite like storing molecules and associated data in a data frame and I’ve
seen that it is possible to use rdkit for substructure searching. Is it
possible to also do similarity searching?


It's not built in since there are many possible fingerprints that could be
used.

It's not quite as convenient as the substructure search, but here's a
little demo of what you can do to filter based on similarity:

# Start by adding a fingerprint column:
In [18]: df['mfp2'] = [rdMolDescriptors.GetMorganFingerprintAsBitVect(x,2) for x in df['ROMol']]

# and now filter:
In [21]: ndf = df[df.apply(lambda x: DataStructs.TanimotoSimilarity(x['mfp2'],qry)>=0.7, axis=1)]

In [23]: len(df)
Out[23]: 1000
In [24]: len(ndf)
Out[24]: 2

-greg




Re: [Rdkit-discuss] Pandas

2016-11-23 Thread Greg Landrum
No worries.
This, and Anna's question about similarity searching and clustering 
illustrate a great opportunity for a tutorial on fingerprints and similarity 
searching.

-greg








Re: [Rdkit-discuss] Pandas dataframe manipulation

2016-03-11 Thread Paul Czodrowski
Maciek, thanks for the note via private message!
To all of you: here is the solution for simply skipping the entries in a column 
that combine a float with “>”:

pd.read_csv('test_mw_r2.csv', sep=';', converters={'r2': lambda x: np.nan if x 
and x[0] == '>' else x}).dropna(axis=0)
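
As a variation on the one-liner above, the converter can be written as a named
function that also returns floats for the values it keeps. This is only a
sketch: the function name censored_to_nan and the inline sample data are
made up for illustration and are not the contents of test_mw_r2.csv.

```python
import io
import math

import pandas as pd


def censored_to_nan(text):
    """Converter: '>...' or empty cells become NaN, anything else a float."""
    text = text.strip()
    if not text or text.startswith(">"):
        return math.nan
    return float(text)


# Hypothetical sample in the same layout (semicolon-separated, column 'r2').
csv_text = "mw;r2\n180.2;0.91\n250.3;>0.5\n310.4;0.77\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",
    converters={"r2": censored_to_nan},
).dropna(axis=0)  # drop the rows whose r2 was censored
```

Converters receive the raw cell text, which is why the function works on
strings and does the float conversion itself.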


Paul



Re: [Rdkit-discuss] Pandas dataframe manipulation

2016-03-11 Thread Maciek Wójcikowski
Hi Paul,

I would suggest:

   - assigning dtype of dataframe/column to str/np.object
   - cleaning up the IC50s
   - casting to float/int as dataframe.astype()

Or alternatively you could use "converters" argument:
pd.read_csv('filename.csv', converters={'ic50_colname': lambda x:
x.replace('>', '')})

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
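
The three steps suggested above could look roughly like the sketch below. The
column names, the sample data and the extra 'censored' flag column are
assumptions for illustration only. Note that stripping '>' keeps the number
but loses the information that it was a lower bound, which is why the flag is
stored first.

```python
import io

import pandas as pd

# Hypothetical sample with one censored IC50 value.
csv_text = "compound;ic50\nA;1.5\nB;>20\nC;0.3\n"

# 1. Read the column as text so nothing is guessed by the parser.
df = pd.read_csv(io.StringIO(csv_text), sep=";", dtype={"ic50": str})

# Remember which values were censored before cleaning (assumed extra step).
df["censored"] = df["ic50"].str.startswith(">")

# 2. Clean up the IC50 strings, 3. cast to float.
df["ic50"] = df["ic50"].str.replace(">", "", regex=False).astype(float)
```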


Pozdrawiam,  |  Best regards,
Maciek Wójcikowski
mac...@wojcikowski.pl



[Rdkit-discuss] Pandas dataframe manipulation

2016-03-11 Thread Paul Czodrowski
Dear RDKitter & Pandas-Dataframes heavy users,

please find below a question concerning the conversion of pandas dataframes:
df = pd.DataFrame({"item": ["a", "b", "c", "d", "e"], "row1": [1,2,3,">2",5], 
"row2":[0.1,0.2,0.3,0.4,0.5],"row3":["ab","cd","ed","gh","ij"]})
df_new = df[df[["row1"]].applymap(np.isreal).all(1)]

I would like to get rid of this nasty ">2" entry in "row1" => This works 
perfectly with the snippet above.

However, when I read in a CSV file containing similar data (see the attached 
CSV) => The conversion does not work: all values in the IC50 column are 
discarded and end up as "NaN".

What is going wrong?


Thanks & Cheers,
Paul



This message and any attachment are confidential and may be privileged or 
otherwise protected from disclosure. If you are not the intended recipient, you 
must not copy this message or attachment or disclose the contents to any other 
person. If you have received this transmission in error, please notify the 
sender immediately and delete the message and any attachment from your system. 
Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not accept 
liability for any omissions or errors in this message which may arise as a 
result of E-Mail-transmission or for damages resulting from any unauthorized 
changes of the content of this message and any attachment thereto. Merck KGaA, 
Darmstadt, Germany and any of its subsidiaries do not guarantee that this 
message is free of viruses and does not accept liability for any damages caused 
by any virus transmitted therewith.



Click http://www.merckgroup.com/disclaimer to access the German, French, 
Spanish and Portuguese versions of this disclaimer.


test_mw_r2.csv
Description: test_mw_r2.csv


Re: [Rdkit-discuss] pandas / sd-tags

2013-07-02 Thread Paul . Czodrowski
Dear Niko,

I was exactly looking for this functionality, great work!

A few follow-up questions:
* frame.set_index('_Name') did not work, although a name is set in the SD 
file.
* Is there a way to load only a specified list of SD tags? (I didn't 
find a names parameter for LoadSDF.)
* frame.head() and frame.describe() show a property ID, which is not present 
in my SD file. Where does it come from?
* frame.describe() does not show the basic statistics of the SD file.

Are the last three points due to the fact that PandasTools.LoadSDF has 
less functionality than pandas.read_table?


Cheers & big thanks again,
Paul


 
> Hi Paul,
> I am not sure whether it is easy to get the pandas read_table
> function to handle SD files. However, some basic functionality
> for this is already built into the PandasTools module.
> If you check the doctest header, there is a small example. Basically,
>
> frame = PandasTools.LoadSDF(sdfFile,smilesName='SMILES',molColName='Molecule',includeFingerprints=True)
>
> loads the data from an SD file into a data frame, such that every
> molecule entry corresponds to a row with the molecule in the column
> 'Molecule'. The specified SMILES column is generated automatically,
> and every SD property ends up in a column with the respective
> property name. Additionally, if a _Name property is set for
> the molecule, it is used as a row identifier. I assume this could
> be made customisable in the future.
> Is this something you could use?
>
> Kind regards,
> Niko
>
> On Jun 30, 2013, at 5:10 PM, paul.czodrow...@merckgroup.com wrote:
>
>> Dear RDKitters,
>>
>> I was wondering whether anyone has looked into reading an SD file
>> into a pandas data frame, with syntax similar to this:
>>
>> data = pd.read_table(open('whatever.smi','r'),header=None,names=['smiles','cas','mutagenic'])
>>
>> Ideally, names would be automatically set according to the SD tags.
>>
>>
>> Cheers & Thanks,
>> Paul




Re: [Rdkit-discuss] pandas / sd-tags

2013-07-02 Thread Nikolas Fechner
Hi Paul,
I'll answer directly below.

> Dear Niko,
>
> I was exactly looking for this functionality, great work!
>
> A few follow-up questions:
> * frame.set_index('_Name') did not work, although a name is set in the SD
> file.

The molecule name is stored in the column named by the optional idName 
parameter of LoadSDF. Its default value is 'ID', which might also answer the 
question of where that column comes from.
Try frame.set_index('ID') instead.
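
[Editor's note] A minimal sketch of this, using a small hand-built frame as a stand-in for the output of LoadSDF. The logP column and all values are made up for illustration:

```python
import pandas as pd

# Stand-in for a frame returned by PandasTools.LoadSDF with the default
# idName='ID': the molecule name from the SD file sits in the 'ID' column.
frame = pd.DataFrame({
    "ID": ["mol_1", "mol_2", "mol_3"],
    "logP": ["1.2", "0.4", "2.7"],  # hypothetical SD tag
})

# There is no '_Name' column, so set_index('_Name') raises a KeyError.
try:
    frame.set_index("_Name")
except KeyError as err:
    print("no such column:", err)

# Index the rows by molecule name via the 'ID' column instead.
indexed = frame.set_index("ID")
print(indexed.loc["mol_2", "logP"])
```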

> * Is there a way to load only a specified list of SD tags? (I didn't
> find a names parameter for LoadSDF.)

There is no such option at the moment, but this is something that should be 
quite easy to add.
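
[Editor's note] Until such an option exists, one possible workaround is to load everything and keep only the columns of interest afterwards. The sketch below uses a hand-built frame in place of a LoadSDF result; the tag names and values are hypothetical:

```python
import pandas as pd

# Stand-in for a LoadSDF result with one column per SD tag (names are made up).
frame = pd.DataFrame({
    "ID": ["mol_1", "mol_2"],
    "cas": ["cas_1", "cas_2"],
    "mutagenic": ["0", "1"],
    "vendor": ["A", "B"],
})

wanted = ["cas", "mutagenic", "not_in_file"]

# Keep the ID column plus those requested tags that are actually present.
present = [name for name in wanted if name in frame.columns]
subset = frame[["ID"] + present]
print(subset.columns.tolist())
```

Filtering the requested names against frame.columns first avoids a KeyError when a tag is missing from the file.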

> * frame.head() and frame.describe() show a property ID, which is not present
> in my SD file. Where does it come from?

See the first answer.

> * frame.describe() does not show the basic statistics of the SD file.

This happens when there are too many columns to fit the configured 
display width. The width can be adjusted as described here (the question 
also uses describe as its example): 
http://stackoverflow.com/questions/11707586/python-pandas-widen-output-display
Alternatively, if you are using the IPython notebook, there is also the option 
to force an HTML rendering of the full data frame (which is also what 
describe returns) by calling:

from IPython.display import HTML,display
display(HTML(frame.to_html()))
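
[Editor's note] For plain-text output, the display limits can also be lifted through pandas options. A small sketch with a made-up wide frame: display.max_columns and display.width are standard pandas options, and option_context restores the previous settings on exit.

```python
import numpy as np
import pandas as pd

# Made-up frame with more numeric columns than are usually printed in full.
frame = pd.DataFrame(
    np.arange(120, dtype=float).reshape(4, 30),
    columns=[f"prop_{i:02d}" for i in range(30)],
)

# Temporarily show all columns and allow wide lines.
with pd.option_context("display.max_columns", None, "display.width", 1000):
    text = repr(frame.describe())

print(text)
```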

 
> Are the last three points due to the fact that PandasTools.LoadSDF has
> less functionality than pandas.read_table?

The things you mentioned are all related to data frame behaviour, which should 
be the same regardless of how the data was originally loaded: the data frame 
object should not behave differently when it is constructed using PandasTools.

> Cheers & big thanks again,
> Paul

I hope this helps a bit.

Best regards,
Niko



Re: [Rdkit-discuss] pandas / sd-tags

2013-07-01 Thread Nikolas Fechner
Hi Paul,
I am not sure whether it is easy to get the pandas read_table function to 
handle SD files. However, some basic functionality for this is already 
built into the PandasTools module. If you check the doctest header, there is a 
small example. Basically, 

frame = PandasTools.LoadSDF(sdfFile,smilesName='SMILES',molColName='Molecule',includeFingerprints=True)

loads the data from an SD file into a data frame, such that every molecule entry 
corresponds to a row with the molecule in the column 'Molecule'. The specified 
SMILES column is generated automatically, and every SD property ends up in a 
column with the respective property name. Additionally, if a _Name property 
is set for the molecule, it is used as a row identifier. I assume this 
could be made customisable in the future.
Is this something you could use? 
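
[Editor's note] A self-contained sketch of such a call, assuming RDKit and pandas are installed. The two molecules, the my_tag property and the temporary file are made up for illustration, and fingerprints are left out to keep the frame small:

```python
import os
import tempfile

from rdkit import Chem
from rdkit.Chem import PandasTools

# Write a tiny SD file with a name and one SD tag per molecule (made-up data).
path = os.path.join(tempfile.mkdtemp(), "example.sdf")
writer = Chem.SDWriter(path)
for name, smiles, tag in [("ethanol", "CCO", "1"), ("benzene", "c1ccccc1", "2")]:
    mol = Chem.MolFromSmiles(smiles)
    mol.SetProp("_Name", name)
    mol.SetProp("my_tag", tag)
    writer.write(mol)
writer.close()

# One row per molecule; SD tags become columns, the name goes to 'ID'.
frame = PandasTools.LoadSDF(path, smilesName="SMILES", molColName="Molecule")
print(sorted(frame.columns))
```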

Kind regards,
Niko

On Jun 30, 2013, at 5:10 PM, paul.czodrow...@merckgroup.com wrote:

> Dear RDKitters,
>
> I was wondering whether anyone has looked into reading an SD file
> into a pandas data frame, with syntax similar to this:
>
> data = pd.read_table(open('whatever.smi','r'),header=None,names=['smiles','cas','mutagenic'])
>
> Ideally, names would be automatically set according to the SD tags.
>
>
> Cheers & Thanks,
> Paul
 
 