On 2 Jul 2009, at 13:29, suyog wrote:
> Hi all,
> I have one .sdf file with thousands of molecules in it.
> I want to identify duplicate molecules (having similar structure)
> in the .sdf file using CDK.
Do you mean similar or identical (isomorphic graphs)?
The two cases require different code.
For identical molecules, which I think is what you mean, you could use
various strategies.
I believe the easiest is to iterate over the molecules with the
iterative SDF parser, generate a canonical SMILES for each molecule,
and put it in a set.
For each subsequent molecule you can then check whether its canonical
SMILES is already in the set and, if so, discard that molecule.
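
The set-based check described above can be sketched in plain Java. This is only a minimal illustration of the bookkeeping, not CDK code: the class name DuplicateFilter and the method keepFirstOccurrences are hypothetical, and the key function is a stand-in for whatever produces the canonical SMILES. With CDK, the items would come from the iterative SDF parser and the key from the SMILES generator; the exact class and method names depend on the CDK version and are not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class DuplicateFilter {

    /**
     * Keeps the first item seen for each canonical key and drops later
     * items with the same key. Input order is preserved.
     * (Hypothetical helper; canonicalKey stands in for a canonical SMILES generator.)
     */
    public static <T> List<T> keepFirstOccurrences(Iterable<T> items,
                                                   Function<T, String> canonicalKey) {
        Set<String> seen = new HashSet<>();
        List<T> unique = new ArrayList<>();
        for (T item : items) {
            String key = canonicalKey.apply(item);
            // Set.add returns false when the key is already in the set.
            if (seen.add(key)) {
                unique.add(item);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        // Stand-in data: strings assumed to be canonical SMILES already,
        // so the identity function serves as the key.
        List<String> smiles = Arrays.asList("CCO", "c1ccccc1", "CCO", "CC(=O)O");
        System.out.println(keepFirstOccurrences(smiles, s -> s));
    }
}
```

Note that the identity key used in main only catches textually equal strings; two different SMILES spellings of the same molecule are only recognised as duplicates if the key really is canonical.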
> How do I do this?
> Which classes should I use?
> If you have any sample code, please paste a link.
The CDK tests are a good start for example code.
http://cdk.svn.sourceforge.net/viewvc/cdk/cdk/trunk/src/test/org/openscience/cdk/
has the tests for the various classes you'll need, with example code.
Cheers,
Chris
--
Dr. Christoph Steinbeck
Head of Chemoinformatics and Metabolism
European Bioinformatics Institute (EBI)
Wellcome Trust Genome Campus
Hinxton, Cambridge CB10 1SD UK
Phone +44 1223 49 2640
What is man but that lofty spirit - that sense of enterprise.
... Kirk, "I, Mudd," stardate 4513.3..
------------------------------------------------------------------------------
_______________________________________________
Cdk-user mailing list
https://lists.sourceforge.net/lists/listinfo/cdk-user