[Rdkit-discuss] Beta of RDKit knime nodes available

2010-11-13 Thread Greg Landrum
Dear all,

I announced this at Goslar but just realized I hadn't posted to the
mailing list:
We've recently been doing some work with the guys at knime.com to
develop some RDKit-based nodes that add basic cheminformatics
functionality to knime. A beta version of these nodes is available in
a zipped update site here:
http://labs.knime.org/update/org.rdkit.0.9.0.zip

You can install these directly into knime using its Update Manager.
Note that you do *not* need an RDKit install to use the knime nodes.
They should work out of the box on 32 bit windows systems, 32 and 64
bit linux systems, and 64 bit mac systems (though here you will need
to use a beta version of knime).

Current functionality includes:
- Conversion to/from RDKit molecules
- generation of canonical smiles
- fingerprinting
- substructure filtering
- chemical reactions

The plan is to polish these nodes over the next couple of weeks, maybe
add one or two more pieces of key functionality, and have everything
ready for a release in early December.

Please give the nodes a try and let me know what you think or if you
have suggestions for improvements.

Many thanks to Thorsten and Bernd at knime.com who made this all possible.

Best Regards,
-greg

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] Beta of RDKit knime nodes available

2010-11-24 Thread James Davidson
Dear Greg (and, of course, Thorsten and Bernd!)
 
Great job on the Knime nodes!  I have been giving these a go and am
impressed (and excited about the future development!).  A couple of
observations / comments / questions:
 
1.  I have observed that sometimes the FP node seems to generate blank
fingerprints (doesn't appear to just be the rendering - eg blank if I
swap to 'Bit Scratch' render as well).  I have mainly been trying the
default Morgan FPs, and find that if I reset the node and re-run, the FP
is still blank.  If, however, I swap the node to eg atompair, run, then
swap back to Morgan - it seems to work...  I am running on knime 2.2.2
on Windows 32-bit.
 
2.  The next point is probably down to cheminformatics / knime naivety,
but I must confess I am struggling a little to cluster compounds based
on the FP...   I have used the 'Distance Matrix Calculate' node (with
Tanimoto similarity) to get a matrix that can be used by the
'Hierarchical Clustering (DistMatrix)' or 'k-Medoids' nodes.  However,
both of these appear to perform VERY slowly for a set of ~ 4000
compounds.  I also attempted to cluster on the fingerprints directly,
using the Neighborgrams nodes - but must confess I am some way off
understanding what I am doing!  My limited experience of using the RDKit
functionality to cluster compounds and eg select a representative set
(based on the FP Tanimoto distances and the Murtagh clustering) was that
it performed rather rapidly.  Is there the intention to expose this
functionality in knime (or is the functionality already there and I just
don't know how?)
 
3.  Any plans for Windows 64-bit support?
 
4.  I would be interested to know what the team views as the next
priorities - property calcs, 3D conformations, pharmacophores,
rendering?  So much great stuff to choose from!  :-)
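
For reference, the distance matrix described in point 2 can be sketched in plain Python. This is only an illustration and not what the 'Distance Matrix Calculate' node does internally: fingerprints are represented here as sets of on-bit indices, the toy fingerprints are made up, and the value returned for two empty fingerprints is a convention chosen for this sketch.

```python
from itertools import combinations


def tanimoto_similarity(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention chosen here for two blank fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def distance_matrix(fps):
    """Symmetric matrix of Tanimoto distances (1 - similarity)."""
    n = len(fps)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = 1.0 - tanimoto_similarity(fps[i], fps[j])
        dist[i][j] = dist[j][i] = d
    return dist


# Hypothetical toy fingerprints, not real Morgan bits.
toy_fps = [
    {1, 5, 9, 12},
    {1, 5, 9, 13},
    {2, 6, 40},
    {2, 6, 41, 42},
]
dist = distance_matrix(toy_fps)
```

The matrix has n*(n-1)/2 independent entries, so building it is quadratic in the number of compounds; the clustering step that consumes it is a separate cost.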
 
Kind regards
 
James

__
PLEASE READ: This email is confidential and may be privileged. It is intended 
for the named addressee(s) only and access to it by anyone else is 
unauthorised. If you are not an addressee, any disclosure or copying of the 
contents of this email or any action taken (or not taken) in reliance on it is 
unauthorised and may be unlawful. If you have received this email in error, 
please notify the sender or postmas...@vernalis.com. Email is not a secure 
method of communication and the Company cannot accept responsibility for the 
accuracy or completeness of this message or any attachment(s). Please check 
this email for virus infection for which the Company accepts no responsibility. 
If verification of this email is sought then please request a hard copy. Unless 
otherwise stated, any views or opinions presented are solely those of the 
author and do not represent those of the Company.

The Vernalis Group of Companies
Oakdene Court
613 Reading Road
Winnersh, Berkshire
RG41 5UA.
Tel: +44 118 977 3133

To access trading company registration and address details, please go to the 
Vernalis website at www.vernalis.com and click on the "Company address and 
registration details" link at the bottom of the page..
__


Re: [Rdkit-discuss] Beta of RDKit knime nodes available

2010-11-24 Thread Greg Landrum
Dear James,

On Wed, Nov 24, 2010 at 4:35 PM, James Davidson wrote:
>
> Great job on the Knime nodes!  I have been giving these a go and am
> impressed (and excited about the future development!).  A couple of
> observations / comments / questions:

Thanks!

>
> 1.  I have observed that sometimes the FP node seems to generate blank
> fingerprints (doesn't appear to just be the rendering - eg blank if I swap
> to 'Bit Scratch' render as well).  I have mainly been trying the default
> Morgan FPs, and find that if I reset the node and re-run, the FP is still
> blank.  If, however, I swap the node to eg atompair, run, then swap back to
> Morgan - it seems to work...  I am running on knime 2.2.2 on Windows 32-bit.

That's odd. I haven't seen anything like this, but I haven't spent a
ton of time using the windows version. I'll try to see if I can
reproduce it.

> 2.  The next point is probably down to cheminformatics / knime naivety, but
> I must confess I am struggling a little to cluster compounds based on the
> FP...   I have used the 'Distance Matrix Calculate' node (with Tanimoto
> similarity) to get a matrix that can be used by the 'Hierarchical Clustering
> (DistMatrix)' or 'k-Medoids' nodes.  However, both of these appear to
> perform VERY slowly for a set of ~ 4000 compounds.  I also attempted to
> cluster on the fingerprints directly, using the Neighborgrams nodes - but
> must confess I am some way off understanding what I am doing!

Hierarchical Clustering (DistMatrix) does, indeed, scale poorly.
According to the docs it scales cubically in the number of rows...
that's going to hurt when N=4000. The implementation the RDKit uses
(adapted from some code by Murtagh) is pretty heavily optimized and
behaves well for large datasets.

> My limited
> experience of using the RDKit functionality to cluster compounds and eg
> select a representative set (based on the FP Tanimoto distances and the
> Murtagh clustering) was that it performed rather rapidly.  Is there the
> intention to expose this functionality in knime (or is the functionality
> already there and I just don't know how?)

It's not there yet, but it sure would be useful if the knime
implementation were faster. I don't think it makes sense to use the
RDKit implementation directly, but it may be possible to do a port of
the Murtagh algorithm to java.  Thorsten? What do you think?

>
> 3.  Any plans for Windows 64-bit support?

I haven't had a 64bit windows machine set up for development work, so
I've never even tested the RDKit under 64bit windows. I just got a new
machine, which does have windows installed. I will see about getting a
development environment on there and trying to build the RDKit, but
I'm not going to make any promises there.

> 4.  I would be interested to know what the team views as the next priorities
> - property calcs, 3D conformations, pharmacophores, rendering?  So much
> great stuff to choose from!  :-)

We're open to suggestions. In addition to what's already there, the
initial release will contain at least an AddCoordinates node which can
add either 2D coordinates (optionally aligned to a template) or a 3D
conformation. If you have things that you'd really like to see, please
pipe up.

Best Regards,
-greg



Re: [Rdkit-discuss] Beta of RDKit knime nodes available

2010-11-24 Thread Thorsten Meinl
>> > Great job on the Knime nodes!  I have been giving these a go and am [...]
>> > My limited
>> > experience of using the RDKit functionality to cluster compounds and eg
>> > select a representative set (based on the FP Tanimoto distances and the
>> > Murtagh clustering) was that it performed rather rapidly.  Is there the
>> > intention to expose this functionality in knime (or is the functionality
>> > already there and I just don't know how?)
> It's not there yet, but it sure would be useful if the knime
> implementation were faster. I don't think it makes sense to use the
> RDKit implementation directly, but it may be possible to do a port of
> the Murtagh algorithm to java.  Thorsten? What do you think?
I have to confess that I have never heard of the Murtagh algorithm but
it should be possible to port it to Java.
On the other hand, 4000 rows should not take that long in KNIME. How
much time does it currently take?

Cheers,

Thorsten

-- 
Dr.-Ing. Thorsten Meinl            room:  Z815
Nycomed Chair for Bioinformatics   fax:   +49 (0)7531 88-5132
and Information Mining             phone: +49 (0)7531 88-5016
Box 712, 78457 Konstanz, Germany



Re: [Rdkit-discuss] Beta of RDKit knime nodes available

2010-11-24 Thread Greg Landrum
Hi Thorsten,

On Wed, Nov 24, 2010 at 9:41 PM, Thorsten Meinl wrote:
>>> > Great job on the Knime nodes!  I have been giving these a go and am [...]
>>> > My limited
>>> > experience of using the RDKit functionality to cluster compounds and eg
>>> > select a representative set (based on the FP Tanimoto distances and the
>>> > Murtagh clustering) was that it performed rather rapidly.  Is there the
>>> > intention to expose this functionality in knime (or is the functionality
>>> > already there and I just don't know how?)
>> It's not there yet, but it sure would be useful if the knime
>> implementation were faster. I don't think it makes sense to use the
>> RDKit implementation directly, but it may be possible to do a port of
>> the Murtagh algorithm to java.  Thorsten? What do you think?
> I have to confess that I have never heard of the Murtagh algorithm but
> it should be possible to port it to Java.

There's a fortran implementation here:
http://www.classification-society.org/csna/mda-sw/hc.f
It will probably make your eyes burn to read it, but it's at least short. :-)

> On the other hand, 4000 rows should not take that long in KNIME. How
> much time does it currently take?

I just did 1000 rows on my macbook. Assuming I'm reading the knime log
correctly, that took about a minute.

-greg



Re: [Rdkit-discuss] Beta of RDKit knime nodes available

2010-11-26 Thread James Davidson

Hi Greg and Thorsten,


> Greg:
>
>> Thorsten:
>> On the other hand, 4000 rows should not take that long in KNIME. How
>> much time does it currently take?
>
> I just did 1000 rows on my macbook. Assuming I'm reading the knime log
> correctly, that took about a minute.


Thanks for testing this out, Greg.  I must confess, I didn't wait for
the hierarchical clustering to finish for the 4000!  Going back and
selecting a random 1000 molecule subset, I reproduce your result of ~ 1
min (I get 67 secs).  If I then go to 2000, it takes 520 secs - so to me
this looks like cubic complexity - which is what the documentation for
the node states (this would mean > 1 hr for my original 4000...)
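
The arithmetic behind that estimate can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, the two timings are the ones quoted above, and the 4000-row figure assumes the cubic scaling stated in the node documentation keeps holding.

```python
import math


def scaling_exponent(n1, t1, n2, t2):
    """Exponent p in t ~ n**p implied by two (size, time) measurements."""
    return math.log(t2 / t1) / math.log(n2 / n1)


def extrapolate(n_ref, t_ref, n_target, exponent):
    """Predicted time at n_target, assuming t ~ n**exponent."""
    return t_ref * (n_target / n_ref) ** exponent


# Measured: 1000 rows in 67 s, 2000 rows in 520 s.
p = scaling_exponent(1000, 67.0, 2000, 520.0)

# Assumption: exactly cubic scaling from 2000 to 4000 rows.
t_4000 = extrapolate(2000, 520.0, 4000, 3)

print(f"empirical exponent: {p:.2f}")
print(f"estimated time for 4000 rows: {t_4000 / 60:.0f} min")
```

Doubling the input multiplied the run time by about 7.8, close to the factor of 8 expected for cubic scaling, and another doubling gives 8 * 520 s = 4160 s, a little over an hour.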

For completeness - this result was with the Hierarchical
Clustering(DistMatrix) node set with 'Tanimoto' similarity and 'Complete
Linkage' for cluster comparison.  Changing the comparison to 'Single
Linkage' did not reduce the time.

Interestingly, the documentation for the 'standard' 'Hierarchical
Clustering' (ie non-distance matrix) node states that it operates with
"n-squared complexity".  I guess other clustering algorithms available
in knime must scale better than cubically as well (k-means, fuzzy
c-means?) - but as far as I can see they don't currently operate on
distance matrices (or directly on bit vectors).  If they could, then
this may be a solution; or implementing the Murtagh algorithm (I am
guessing the scaling is below cubic from my recollection of the speeds
observed in rdkit).

Kind regards

James



Re: [Rdkit-discuss] Beta of RDKit knime nodes available

2010-11-27 Thread Thorsten Meinl
On 27.11.2010 08:39, James Davidson wrote:
> For completeness - this result was with the Hierarchical
> Clustering(DistMatrix) node set with 'Tanimoto' similarity and 'Complete
> Linkage' for cluster comparison.  Changing the comparison to 'Single
> Linkage' did not reduce the time.
That is expected. The "linkage" only controls which distance between two
clusters is used (the maximum, minimum or average), but you need to look
at all distances in any case.
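
A minimal sketch of a naive agglomerative clustering on a distance matrix makes this concrete. It is written for illustration only and is not the code of the KNIME node; the function names and the toy data are assumptions. The linkage only selects the reducer (min, max or mean), while every pairwise distance between two clusters is still collected before it is applied.

```python
from itertools import combinations

# The linkage is just the function that reduces all pairwise distances
# between two clusters to a single number.
LINKAGES = {
    "single": min,
    "complete": max,
    "average": lambda ds: sum(ds) / len(ds),
}


def cluster_distance(dist, a, b, linkage):
    """Distance between clusters a and b (lists of row indices)."""
    pair_distances = [dist[i][j] for i in a for j in b]  # all pairs, always
    return LINKAGES[linkage](pair_distances)


def agglomerate(dist, n_clusters, linkage="complete"):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        def pair_cost(pair):
            a, b = pair
            return cluster_distance(dist, clusters[a], clusters[b], linkage)

        a, b = min(combinations(range(len(clusters)), 2), key=pair_cost)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters


# Hypothetical toy data: four points on a line.
positions = [0, 1, 5, 6]
dist = [[abs(p - q) for q in positions] for p in positions]

for name in LINKAGES:
    print(name, agglomerate(dist, 2, linkage=name))
```

In this naive form each merge step touches every pair of points that lie in different clusters, which is on the order of n^2 lookups, and there are n - 1 merge steps, so the total work grows roughly as n^3 whichever linkage is chosen.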


> Interestingly, the documentation for the 'standard' 'Hierarchical
> Clustering' (ie non-distance matrix) node states that it operates with
> "n-squared complexity".  
Ooops. That is certainly wrong. It is the same algorithm. n^3 would be
right.

> I guess other clustering algorithms available
> in knime must scale better than cubically as well (k-means, fuzzy
> c-means?) - but as far as I can see they don't currently operate on
> distance matrices (or directly on bit vectors).
There is a k-medoids that should work on distance matrices. The problem
for k-means (and fuzzy c-means) is that you need the full coordinates in
order to set the prototypes in each iteration. That doesn't work if you
only have pairwise distances.
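
The difference shows up in the update step. The following is a minimal k-medoids sketch, written for illustration and not taken from the KNIME node; the function names, the starting medoids and the toy data are assumptions. Each prototype is always one of the input rows, so both steps need nothing beyond lookups in the distance matrix, whereas k-means would have to average coordinates to place its prototypes.

```python
def assign(dist, medoids):
    """Map each medoid to the list of rows that are closest to it."""
    clusters = {m: [] for m in medoids}
    for i in range(len(dist)):
        nearest = min(medoids, key=lambda m: dist[i][m])
        clusters[nearest].append(i)
    return clusters


def k_medoids(dist, initial_medoids, max_iter=100):
    """Alternate assignment and medoid update until the medoids stop changing."""
    medoids = sorted(initial_medoids)
    for _ in range(max_iter):
        clusters = assign(dist, medoids)
        # New medoid: the member with the smallest total distance to the
        # other members. An empty cluster keeps its old medoid.
        updated = sorted(
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            if members else m
            for m, members in clusters.items()
        )
        if updated == medoids:
            break
        medoids = updated
    return assign(dist, medoids)


# Hypothetical toy data: two well-separated groups on a line.
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in positions] for p in positions]

result = k_medoids(dist, initial_medoids=[0, 1])
print(result)
```

Starting from the deliberately poor medoids 0 and 1, the sketch settles on the middle row of each group as its medoid.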


> If they could, then
> this may be a solution; or implementing the Murtagh algorithm (I am
> guessing the scaling is below cubic from my recollection of the speeds
> observed in rdkit).
Greg sent me a link to the publication. If I find some time, I will have
a look at it.

Cheers,

Thorsten

-- 
Dr.-Ing. Thorsten Meinl            room:  Z815
Nycomed Chair for Bioinformatics   fax:   +49 (0)7531 88-5132
and Information Mining             phone: +49 (0)7531 88-5016
Box 712, 78457 Konstanz, Germany
