[Rdkit-discuss] Beta of RDKit knime nodes available
Dear all,

I announced this at Goslar but just realized I hadn't posted to the mailing list:

We've recently been doing some work with the guys at knime.com to develop some RDKit-based nodes that add basic cheminformatics functionality to knime. A beta version of these nodes is available in a zipped update site here:
http://labs.knime.org/update/org.rdkit.0.9.0.zip
You can install these directly into knime using its Update Manager.

Note that you do *not* need an RDKit install to use the knime nodes. They should work out of the box on 32 bit windows systems, 32 and 64 bit linux systems, and 64 bit mac systems (though here you will need to use a beta version of knime).

Current functionality includes:
- Conversion to/from RDKit molecules
- generation of canonical smiles
- fingerprinting
- substructure filtering
- chemical reactions

The plan is to polish these nodes over the next couple of weeks, maybe add one or two more pieces of key functionality, and have everything ready for a release in early December.

Please give the nodes a try and let me know what you think or if you have suggestions for improvements.

Many thanks to Thorsten and Bernd at knime.com who made this all possible.

Best Regards,
-greg

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] Beta of RDKit knime nodes available
Dear Greg (and, of course, Thorsten and Bernd!)

Great job on the Knime nodes! I have been giving these a go and am impressed (and excited about the future development!). A couple of observations / comments / questions:

1. I have observed that sometimes the FP node seems to generate blank fingerprints (it doesn't appear to be just the rendering, e.g. they are blank if I swap to the 'Bit Scratch' render as well). I have mainly been trying the default Morgan FPs, and find that if I reset the node and re-run, the FP is still blank. If, however, I swap the node to e.g. atompair, run, then swap back to Morgan, it seems to work... I am running knime 2.2.2 on Windows 32-bit.

2. The next point is probably down to cheminformatics / knime naivety, but I must confess I am struggling a little to cluster compounds based on the FP... I have used the 'Distance Matrix Calculate' node (with Tanimoto similarity) to get a matrix that can be used by the 'Hierarchical Clustering (DistMatrix)' or 'k-Medoids' nodes. However, both of these appear to perform VERY slowly for a set of ~ 4000 compounds. I also attempted to cluster on the fingerprints directly, using the Neighborgrams nodes, but must confess I am some way off understanding what I am doing! My limited experience of using the RDKit functionality to cluster compounds and e.g. select a representative set (based on the FP Tanimoto distances and the Murtagh clustering) was that it performed rather rapidly. Is there the intention to expose this functionality in knime (or is the functionality already there and I just don't know how)?

3. Any plans for Windows 64-bit support?

4. I would be interested to know what the team views as the next priorities: property calcs, 3D conformations, pharmacophores, rendering? So much great stuff to choose from! :-)

Kind regards

James

__
PLEASE READ: This email is confidential and may be privileged. It is intended for the named addressee(s) only and access to it by anyone else is unauthorised.
If you are not an addressee, any disclosure or copying of the contents of this email or any action taken (or not taken) in reliance on it is unauthorised and may be unlawful. If you have received this email in error, please notify the sender or postmas...@vernalis.com. Email is not a secure method of communication and the Company cannot accept responsibility for the accuracy or completeness of this message or any attachment(s). Please check this email for virus infection for which the Company accepts no responsibility. If verification of this email is sought then please request a hard copy. Unless otherwise stated, any views or opinions presented are solely those of the author and do not represent those of the Company.

The Vernalis Group of Companies
Oakdene Court
613 Reading Road
Winnersh, Berkshire
RG41 5UA.
Tel: +44 118 977 3133

To access trading company registration and address details, please go to the Vernalis website at www.vernalis.com and click on the "Company address and registration details" link at the bottom of the page.
__
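For readers who have not met the Tanimoto distance matrix mentioned in point 2, here is a minimal pure-Python sketch of what such a matrix contains. It is only an illustration: fingerprints are modelled as sets of on-bit indices, the function names are made up for this example, and neither the KNIME node nor the RDKit is claimed to work this way internally. The handling of two empty fingerprints is a convention assumed here, since the ratio is otherwise 0/0 (which is one reason blank fingerprints are worth catching before clustering).

```python
from itertools import combinations


def tanimoto_similarity(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        # Assumption for this sketch: two blank fingerprints count as identical.
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def tanimoto_distance_matrix(fps):
    """Full symmetric matrix of 1 - similarity, built from n*(n-1)/2 comparisons."""
    n = len(fps)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = 1.0 - tanimoto_similarity(fps[i], fps[j])
        dist[i][j] = dist[j][i] = d
    return dist


# Toy fingerprints (hypothetical on-bit indices, not real Morgan bits).
fps = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 4, 5}), frozenset({10, 11})]
matrix = tanimoto_distance_matrix(fps)
```

Building the matrix is quadratic in the number of compounds, so for ~4000 compounds it is roughly 8 million comparisons. That step alone is rarely the bottleneck; the clustering that consumes the matrix is.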
Re: [Rdkit-discuss] Beta of RDKit knime nodes available
Dear James,

On Wed, Nov 24, 2010 at 4:35 PM, James Davidson wrote:
>
> Great job on the Knime nodes! I have been giving these a go and am
> impressed (and excited about the future development!). A couple of
> observations / comments / questions:

Thanks!

>
> 1. I have observed that sometimes the FP node seems to generate blank
> fingerprints (doesn't appear to just be the rendering - eg blank if I swap
> to 'Bit Scratch' render as well). I have mainly been trying the default
> Morgan FPs, and find that if I reset the node and re-run, the FP is still
> blank. If, however, I swap the node to eg atompair, run, then swap back to
> Morgan - it seems to work... I am running on knime 2.2.2 on Windows 32-bit.

That's odd. I haven't seen anything like this, but I haven't spent a ton of time using the windows version. I'll try to see if I can reproduce it.

> 2. The next point is probably down to cheminformatics / knime naivety, but
> I must confess I am struggling a little to cluster compounds based on the
> FP... I have used the 'Distance Matrix Calculate' node (with Tanimoto
> similarity) to get a matrix that can be used by the 'Hierarchical Clustering
> (DistMatrix)' or 'k-Medoids' nodes. However, both of these appear to
> perform VERY slowly for a set of ~ 4000 compounds. I also attempted to
> cluster on the fingerprints directly, using the Neighborgrams nodes - but
> must confess I am some way off understanding what I am doing!

Hierarchical Clustering (DistMatrix) does, indeed, scale poorly. According to the docs it scales cubically in the number of rows... that's going to hurt when N=4000. The implementation the RDKit uses (adapted from some code by Murtagh) is pretty heavily optimized and behaves well for large datasets.

> My limited
> experience of using the RDKit functionality to cluster compounds and eg
> select a representative set (based on the FP Tanimoto distances and the
> Murtagh clustering) was that it performed rather rapidly. Is there the
> intention to expose this functionality in knime (or is the functionality
> already there and I just don't know how?)

It's not there yet, but it sure would be useful if the knime implementation were faster. I don't think it makes sense to use the RDKit implementation directly, but it may be possible to do a port of the Murtagh algorithm to java. Thorsten? What do you think?

>
> 3. Any plans for Windows 64-bit support?

I haven't had a 64bit windows machine set up for development work, so I've never even tested the RDKit under 64bit windows. I just got a new machine, which does have windows installed. I will see about getting a development environment on there and trying to build the RDKit, but I'm not going to make any promises there.

> 4. I would be interested to know what the team views as the next priorities
> - property calcs, 3D conformations, pharmacophores, rendering? So much
> great stuff to choose from! :-)

We're open to suggestions. In addition to what's already there, the initial release will contain at least an AddCoordinates node which can add either 2D coordinates (optionally aligned to a template) or a 3D conformation. If you have things that you'd really like to see, please pipe up.

Best Regards,
-greg
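To make the cubic scaling mentioned above concrete, the following is a deliberately naive agglomerative clustering sketch in Python. It is an assumption-laden illustration and is neither the KNIME node's code nor Murtagh's algorithm: the function name and the toy data are invented for this example. Each of the n-1 merge steps rescans the distances between all remaining clusters, which touches up to n^2/2 point pairs, so the total work grows roughly as n^3.

```python
def naive_agglomerative(dist, linkage=max):
    """Merge clusters bottom-up from a precomputed distance matrix.

    linkage=max gives complete linkage, linkage=min gives single linkage.
    Returns the merge history as (members_a, members_b, distance) tuples.
    """
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Full rescan of every cluster pair on every merge step: this is the
        # part that makes the naive approach cubic in the number of rows.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


# Toy data: four points on a line, distances are absolute differences.
points = [0, 1, 5, 6]
dist = [[abs(p - q) for q in points] for p in points]
complete = naive_agglomerative(dist, linkage=max)
single = naive_agglomerative(dist, linkage=min)
```

Optimized implementations generally avoid the full rescan by updating cluster distances incrementally and tracking nearest neighbours, which is where most of the speed difference tends to come from.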
Re: [Rdkit-discuss] Beta of RDKit knime nodes available
>> Great job on the Knime nodes! I have been giving these a go and am
>> > My limited
>> > experience of using the RDKit functionality to cluster compounds and eg
>> > select a representative set (based on the FP Tanimoto distances and the
>> > Murtagh clustering) was that it performed rather rapidly. Is there the
>> > intention to expose this functionality in knime (or is the functionality
>> > already there and I just don't know how?)
> It's not there yet, but it sure would be useful if the knime
> implementation were faster. I don't think it makes sense to use the
> RDKit implementation directly, but it may be possible to do a port of
> the Murtagh algorithm to java. Thorsten? What do you think?

I have to confess that I have never heard of the Murtagh algorithm, but it should be possible to port it to Java.

On the other hand, 4000 rows should not take that long in KNIME. How much time does it currently take?

Cheers,

Thorsten

--
Dr.-Ing. Thorsten Meinl                 room:  Z815
Nycomed Chair for Bioinformatics        fax:   +49 (0)7531 88-5132
and Information Mining                  phone: +49 (0)7531 88-5016
Box 712, 78457 Konstanz, Germany
Re: [Rdkit-discuss] Beta of RDKit knime nodes available
Hi Thorsten,

On Wed, Nov 24, 2010 at 9:41 PM, Thorsten Meinl wrote:
>>> Great job on the Knime nodes! I have been giving these a go and am
>>> > My limited
>>> > experience of using the RDKit functionality to cluster compounds and eg
>>> > select a representative set (based on the FP Tanimoto distances and the
>>> > Murtagh clustering) was that it performed rather rapidly. Is there the
>>> > intention to expose this functionality in knime (or is the functionality
>>> > already there and I just don't know how?)
>> It's not there yet, but it sure would be useful if the knime
>> implementation were faster. I don't think it makes sense to use the
>> RDKit implementation directly, but it may be possible to do a port of
>> the Murtagh algorithm to java. Thorsten? What do you think?
> I have to confess that I have never heard of the Murtagh algorithm but
> it should be possible to port it to Java.

There's a Fortran implementation here:
http://www.classification-society.org/csna/mda-sw/hc.f
It will probably make your eyes burn to read it, but it's at least short. :-)

> On the other hand, 4000 rows should not take that long in KNIME. How
> much time does it currently take?

I just did 1000 rows on my macbook. Assuming I'm reading the knime log correctly, that took about a minute.

-greg
Re: [Rdkit-discuss] Beta of RDKit knime nodes available
Hi Greg and Thorsten,

> Greg:
>
>> Thorsten:
>> On the other hand, 4000 rows should not take that long in KNIME. How
>> much time does it currently take?
>
> I just did 1000 rows on my macbook. Assuming I'm reading the knime log
> correctly, that took about a minute.

Thanks for testing this out, Greg. I must confess, I didn't wait for the hierarchical clustering to finish for the 4000! Going back and selecting a random 1000-molecule subset, I reproduce your result of ~ 1 min (I get 67 secs). If I then go to 2000, it takes 520 secs (roughly 8 times longer for twice the rows), so to me this looks like cubic complexity, which is what the documentation for the node states (this would mean > 1 hr for my original 4000...)

For completeness: this result was with the 'Hierarchical Clustering (DistMatrix)' node set with 'Tanimoto' similarity and 'Complete Linkage' for cluster comparison. Changing the comparison to 'Single Linkage' did not reduce the time.

Interestingly, the documentation for the 'standard' 'Hierarchical Clustering' (i.e. non-distance-matrix) node states that it operates with "n-squared complexity". I guess other clustering algorithms available in knime must scale better than cubically as well (k-means, fuzzy c-means?), but as far as I can see they don't currently operate on distance matrices (or directly on bit vectors). If they could, then this may be a solution; the alternative would be implementing the Murtagh algorithm (I am guessing the scaling is below cubic from my recollection of the speeds observed in rdkit).

Kind regards

James
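The scaling argument above can be checked with a few lines of arithmetic. This sketch assumes the run time follows a simple power law t ~ c * n**k and fits k from the two timings reported in the thread; two data points make for a very rough fit, so the result is indicative only.

```python
import math

# Timings reported in the thread for the Hierarchical Clustering (DistMatrix) node.
n1, t1 = 1000, 67.0   # rows, seconds
n2, t2 = 2000, 520.0  # rows, seconds

# If t ~ c * n**k, then k = log(t2 / t1) / log(n2 / n1).
k = math.log(t2 / t1) / math.log(n2 / n1)

# Extrapolate to the original 4000-compound set using the fitted exponent.
n3 = 4000
t3 = t2 * (n3 / n2) ** k

print(f"fitted exponent k = {k:.2f}")
print(f"estimated time for {n3} rows = {t3 / 60:.0f} min")
```

The fitted exponent comes out just under 3, and the extrapolated time for 4000 rows is a little over an hour, consistent with the "> 1 hr" estimate.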
Re: [Rdkit-discuss] Beta of RDKit knime nodes available
On 27.11.2010 08:39, James Davidson wrote:
> For completeness - this result was with the Hierarchical
> Clustering(DistMatrix) node set with 'Tanimoto' similarity and 'Complete
> Linkage' for cluster comparison. Changing the comparison to 'Single
> Linkage' did not reduce the time.

That is expected. The "linkage" only controls which distance between two clusters is used (the maximum, minimum or average), but you need to look at all distances in any case.

> Interestingly, the documentation for the 'standard' Hierarchical
> Clustering' (ie non-distance matrix) node states that it operates with
> "n-squared complexity".

Ooops. That is certainly wrong. It is the same algorithm. n^3 would be right.

> I guess other clustering algorithms available
> in knime must scale better than cubically as well (k-means, fuzzy
> c-means?) - but as far as I can see they don't currently operate on
> distance matrices (or directly on bit vectors).

There is a k-medoids that should work on distance matrices. The problem for k-means (and fuzzy c-means) is that you need the full coordinates in order to set the prototypes in each iteration. That doesn't work if you only have pairwise distances.

> If they could, then
> this may be a solution; or implementing the Murtagh algorithm (I am
> guessing the scaling is below cubic from my recollection of the speeds
> observed in rdkit).

Greg sent me a link to the publication. If I find some time, I will have a look at it.

Cheers,

Thorsten
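The point about prototypes can be illustrated with a small k-medoids sketch in Python. This is a simplified alternating scheme written for this note, not the KNIME k-Medoids node; the function names, the initial medoids and the toy data are all assumptions. The key detail is the update step: the new prototype is always one of the existing rows, chosen by summing entries of the distance matrix, whereas k-means would need coordinates to compute a mean.

```python
def assign(dist, medoids):
    """Group every row index under its closest medoid."""
    groups = {m: [] for m in medoids}
    for i in range(len(dist)):
        groups[min(medoids, key=lambda m: dist[i][m])].append(i)
    return groups


def k_medoids(dist, medoids, max_iter=100):
    """Alternate assignment and medoid update using only pairwise distances."""
    medoids = list(medoids)
    for _ in range(max_iter):
        groups = assign(dist, medoids)
        # Update: pick the member with the smallest total distance to the
        # other members. An empty group (possible with duplicate rows)
        # simply keeps its old medoid.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            if members else m
            for m, members in groups.items()
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(dist, medoids)


# Toy data: two obvious groups on a line, distances are absolute differences.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]
result = k_medoids(dist, medoids=[0, 3])
```

With these toy points the medoids move from rows 0 and 3 to rows 1 and 4, the central member of each group. Results in general depend on the initial medoids, so in practice one would try several starting choices.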