Re: [Rdkit-discuss] Pandas

Chris Swain Sat, 26 Nov 2016 23:47:28 -0800

I added the similarity scores by adding an extra line:

sdf['sim'] = DataStructs.BulkTanimotoSimilarity(ionised_fps, sdf['mfp2'])

I don’t know whether the scoring and the filtering could be done in a single line.

Chris
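
To make the two steps in this thread concrete (score every row against the query, then keep the rows above a threshold), here is a minimal sketch that does not use RDKit or pandas at all. It assumes each fingerprint can be represented as a Python set of on-bit indices; `tanimoto`, `bulk_tanimoto`, and the toy data are hypothetical names made up for illustration, not part of any library.

```python
# Stand-in for RDKit bit vectors: each fingerprint is the set of its "on" bit indices.
# This is NOT RDKit code; it only illustrates the arithmetic and the filtering pattern.


def tanimoto(a, b):
    """Tanimoto similarity: shared on-bits divided by on-bits set in either fingerprint."""
    union = len(a | b)
    # Returning 0.0 for two empty fingerprints is a choice made for this sketch.
    return len(a & b) / union if union else 0.0


def bulk_tanimoto(query, fingerprints):
    """Score one query against many fingerprints in a single call."""
    return [tanimoto(query, fp) for fp in fingerprints]


# Hypothetical toy data; the names and bit sets are invented for illustration.
rows = [
    {"name": "mol_a", "fp": {1, 2, 3, 4}},
    {"name": "mol_b", "fp": {1, 2, 3, 9}},
    {"name": "mol_c", "fp": {7, 8}},
]
query = {1, 2, 3, 4}

# Step 1: attach a score to every row (the analogue of sdf['sim'] = ...).
for row, sim in zip(rows, bulk_tanimoto(query, [r["fp"] for r in rows])):
    row["sim"] = sim

# Step 2: keep only the rows above the threshold (the analogue of sdf.iloc[ids]).
hits = [row for row in rows if row["sim"] > 0.5]
```

In pandas itself the score column and the filter could probably be chained, for example with `DataFrame.assign` followed by boolean indexing, but that is an untested suggestion rather than something shown in this thread.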

> On 26 Nov 2016, at 04:48, Greg Landrum <greg.land...@gmail.com> wrote:
> 
> That's a good question.
> 
> I'm not a master of pandas indexing, but this seems to work:
> In [5]: sdf['mfp2'] = [rdMolDescriptors.GetMorganFingerprintAsBitVect(x,2) for x in sdf['ROMol']]
> In [8]: sims = DataStructs.BulkTanimotoSimilarity(qry,sdf['mfp2'])
> In [13]: ids = [x for x,y in enumerate(sims) if y>0.5]
> In [18]: ndf = sdf.iloc[ids]
> In [19]: len(ndf)
> Out[19]: 3
> 
> The question is whether or not that's actually faster.
> 
> In [21]: def filt1(sdf,qry):
>     ...:     sims = DataStructs.BulkTanimotoSimilarity(qry,sdf['mfp2'])
>     ...:     ids = [x for x,y in enumerate(sims) if y>0.5]
>     ...:     return sdf.iloc[ids]
>     ...: 
> 
> In [22]: def filt2(sdf,qry):
>     ...:     return sdf[sdf.apply(lambda x: DataStructs.TanimotoSimilarity(x['mfp2'],qry)>0.5, axis=1)]
>     ...:     
> 
> In [25]: %timeit filt1(sdf,qry)
> 1 loop, best of 3: 458 ms per loop
> In [28]: %timeit filt2(sdf,qry)
> 1 loop, best of 3: 798 ms per loop
> 
> And it certainly is: the bulk version (filt1) takes 458 ms against 798 ms for the apply version (filt2).
> 
> -greg
> 
> 
> 
> On Wed, Nov 23, 2016 at 4:06 PM, Peter Gedeck <peter.ged...@gmail.com> wrote:
> Is it possible to use the bulk similarity searching functionality for better 
> performance instead of the list comprehension?
> 
> Best,
> 
> Peter
> 
> 
> On Wed, Nov 23, 2016 at 9:11 AM Greg Landrum <greg.land...@gmail.com> wrote:
> No worries.
> This, and Anna's question about similarity searching and clustering, 
> illustrate a great opportunity for a tutorial on fingerprints and similarity 
> searching.
> 
> -greg
> 
> 
> 
> 
> 
> On Wed, Nov 23, 2016 at 3:00 PM +0100, "Chris Swain" <sw...@mac.com> wrote:
> 
> Thanks for this,
> 
> As a chemist who comes from the “cut and paste” school of scripting, I’m 
> always concerned I’m asking something blindingly obvious.
> 
> ;-)
> 
> Chris
>> On 23 Nov 2016, at 12:36, Greg Landrum <greg.land...@gmail.com> wrote:
>> 
>> [including rdkit-discuss, because it's relevant there and I'm pretty sure 
>> Chris won't mind and the real Pandas experts may have a better answer than 
>> me.]
>> 
>> On Wed, Nov 23, 2016 at 9:51 AM, Chris Swain <sw...@mac.com> wrote:
>> 
>> I quite like storing molecules and associated data in a data frame, and I’ve 
>> seen that it is possible to use rdkit for substructure searching. Is it 
>> possible to also do similarity searching?
>> 
>> It's not built in since there are many possible fingerprints that could be 
>> used.
>> 
>> It's not quite as convenient as the substructure search, but here's a little 
>> demo of what you can do to filter based on similarity:
>> 
>> # Start by adding a fingerprint column:
>> In [18]: df['mfp2'] = [rdMolDescriptors.GetMorganFingerprintAsBitVect(x,2) for x in df['ROMol']]
>> 
>> # and now filter:
>> In [21]: ndf = df[df.apply(lambda x: DataStructs.TanimotoSimilarity(x['mfp2'],qry)>=0.7, axis=1)]
>> 
>> In [23]: len(df)
>> Out[23]: 1000
>> In [24]: len(ndf)
>> Out[24]: 2
>> 
>> -greg
>> 
> 
> ------------------------------------------------------------------------------
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>

