I added the similarity scores by adding an extra line:

sdf['sim'] = DataStructs.BulkTanimotoSimilarity(ionised_fps, sdf['mfp2'])

I don’t know if it could be done in a single line?

Chris

> On 26 Nov 2016, at 04:48, Greg Landrum <greg.land...@gmail.com> wrote:
>
> That's a good question.
>
> I'm not a master of pandas indexing, but this seems to work:
> In [5]: sdf['mfp2'] = [rdMolDescriptors.GetMorganFingerprintAsBitVect(x,2) for x in sdf['ROMol']]
> In [8]: sims = DataStructs.BulkTanimotoSimilarity(qry,sdf['mfp2'])
> In [13]: ids = [x for x,y in enumerate(sims) if y>0.5]
> In [18]: ndf = sdf.iloc[ids]
> In [19]: len(ndf)
> Out[19]: 3
>
> The question is whether or not that's actually faster.
>
> In [21]: def filt1(sdf,qry):
>     ...:     sims = DataStructs.BulkTanimotoSimilarity(qry,sdf['mfp2'])
>     ...:     ids = [x for x,y in enumerate(sims) if y>0.5]
>     ...:     return sdf.iloc[ids]
>     ...:
>
> In [22]: def filt2(sdf,qry):
>     ...:     return sdf[sdf.apply(lambda x:DataStructs.TanimotoSimilarity(x['mfp2'],qry)>0.5,axis=1)]
>     ...:
>
> In [25]: %timeit filt1(sdf,qry)
> 1 loop, best of 3: 458 ms per loop
> In [28]: %timeit filt2(sdf,qry)
> 1 loop, best of 3: 798 ms per loop
>
> And it certainly is.
>
> -greg
>
> On Wed, Nov 23, 2016 at 4:06 PM, Peter Gedeck <peter.ged...@gmail.com> wrote:
> Is it possible to use the bulk similarity searching functionality for
> better performance instead of the list comprehension?
>
> Best,
>
> Peter
>
> On Wed, Nov 23, 2016 at 9:11 AM Greg Landrum <greg.land...@gmail.com> wrote:
> No worries.
> This, and Anna's question about similarity searching and clustering,
> illustrate a great opportunity for a tutorial on fingerprints and
> similarity searching.
>
> -greg
>
> On Wed, Nov 23, 2016 at 3:00 PM +0100, "Chris Swain" <sw...@mac.com> wrote:
>
> Thanks for this,
>
> As a chemist who comes from the “cut and paste” school of scripting I’m
> always concerned I’m asking something blindingly obvious
>
> ;-)
>
> Chris
>> On 23 Nov 2016, at 12:36, Greg Landrum <greg.land...@gmail.com> wrote:
>>
>> [including rdkit-discuss, because it's relevant there and I'm pretty sure
>> Chris won't mind and the real Pandas experts may have a better answer
>> than me.]
>>
>> On Wed, Nov 23, 2016 at 9:51 AM, Chris Swain <sw...@mac.com> wrote:
>>
>> I quite like storing molecules and associated data in a data frame and
>> I’ve seen that it is possible to use rdkit for substructure searching;
>> is it possible to also do similarity searching?
>>
>> It's not built in since there are many possible fingerprints that could
>> be used.
>>
>> It's not quite as convenient as the substructure search, but here's a
>> little demo of what you can do to filter based on similarity:
>>
>> # Start by adding a fingerprint column:
>> In [18]: df['mfp2'] = [rdMolDescriptors.GetMorganFingerprintAsBitVect(x,2) for x in df['ROMol']]
>>
>> # and now filter:
>> In [21]: ndf = df[df.apply(lambda x: DataStructs.TanimotoSimilarity(x['mfp2'],qry)>=0.7, axis=1)]
>>
>> In [23]: len(df)
>> Out[23]: 1000
>> In [24]: len(ndf)
>> Out[24]: 2
>>
>> -greg
>>
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
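
For anyone reading along without RDKit to hand, the "score everything in bulk, then keep the row positions above a cutoff" pattern from filt1 in the quoted message can be sketched in plain Python. This is only an illustration under stated assumptions: fingerprints are modelled as sets of on-bit indices, the fingerprints themselves are made up, and tanimoto() and bulk_tanimoto() are hypothetical stand-ins for DataStructs.TanimotoSimilarity and DataStructs.BulkTanimotoSimilarity, not the RDKit implementations.

```python
# Plain-Python sketch (no RDKit, no pandas).
# Assumption: a fingerprint is modelled as a set of on-bit indices.
# tanimoto() and bulk_tanimoto() are hypothetical stand-ins for the
# RDKit functions named in the thread.

def tanimoto(a, b):
    """Tanimoto coefficient: |a & b| / |a | b| for two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def bulk_tanimoto(query, fps):
    """Score one query against every fingerprint, preserving order."""
    return [tanimoto(query, fp) for fp in fps]

# Made-up fingerprints, for illustration only.
fps = [{1, 2, 3, 4}, {1, 2, 3}, {7, 8, 9}, {1, 2, 3, 4, 5}]
qry = {1, 2, 3, 4}

# Step 1: compute all similarities in one pass.
sims = bulk_tanimoto(qry, fps)

# Step 2: keep the positions whose score passes the cutoff,
# mirroring "ids = [x for x,y in enumerate(sims) if y>0.5]".
ids = [i for i, s in enumerate(sims) if s > 0.5]

# Pair each surviving position with its score.
hits = [(i, sims[i]) for i in ids]
print(hits)
```

Because the scores come back in the same order as the fingerprints, the positions in ids line up with the data frame rows, which is why they can be handed straight to sdf.iloc[ids] in the quoted version.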