Hi all,

Last year I proposed a file format for cheminformatics fingerprints.
A format without tools is rather useless, so I've written a set of tools which work on my flat-file/text format. It's nearing completion, and I'm looking for some friendly test users.

If you are interested, you can get the code from
  http://dalkescientific.com/chemfp-0.9.tar.gz

It's a set of Python programs with a C extension. It currently supports the following toolkits:
  - OpenBabel
  - RDKit
  - OEChem
plus being able to extract fingerprints from fields in an SD file, with special support for the PubChem substructure fingerprints.

Here's a couple of ways to generate fingerprints:

% echo "c1ccccc1O phenol" | ./ob2fps
#FPS1
#num_bits=1021
#software=OpenBabel/2.3.0
#type=OpenBabel-FP2/1
#date=2011-02-07T13:24:29
0000000000000000000002000000000000000000000000000000000000000000000000000000000000000008000000000000020000000000000000000000000008000000000000000000000002000000008000000000000040080000000000000000000000000002000000000000000000020000000000200800000000000000 phenol

The default is --FP2 with SMILES from stdin, but I can change that.

% ob2fps --MACCS Compound_09425001_09450000.sdf.gz | head
#FPS1
#num_bits=166
#software=OpenBabel/2.3.0
#type=OpenBabel-MACCS/2
#source=Compound_09425001_09450000.sdf.gz
#date=2011-02-07T13:25:04
000000000002080019cc44eacdec980baea378ef1f 9425004
000000002000082159d404eea9e8b80b8ea37eef1f 9425009
000000000000080159c404efa9e89a0b8eb3faef1b 9425012
000000000000082019c404ee89e8b80b8ea3ffef1f 9425015

Suppose I want to do a similarity search. I'll save the output to a compressed file:

% ob2fps Compound_09425001_09450000.sdf.gz -o Compound_09425001_09450000.fps.gz

That takes about 85 seconds (OpenBabel's index creation only takes 67 seconds).

% ls -l Compound_09425001_09450000.fps.gz
-rw-r--r-- 1 dalke staff 766355 Feb 7 13:07 Compound_09425001_09450000.fps.gz

The included "simsearch" program lets you specify k-nearest and a threshold. Here I'll search for the nearest 5.

% echo 'N#Cc1ccccc1C#N' | simsearch --in smi -k 5 Compound_09425001_09450000.fps.gz
#Simsearch/1
#num_bits=1021
#software=chemfp/0.9
#type=Tanimoto k=5 threshold=0.0
#target_source=Compound_09425001_09450000.sdf.gz
5 Record1 0.32258 9443135 0.32258 9443136 0.31667 9430485 0.31667 9430486 0.12389 9449997

Each line of the output (past the header) is:
  - the number of nearest neighbors found (N)
  - the title of the input structure
  - N alternating scores and target identifiers

By default simsearch takes a fingerprint file as input, but I can use the "--in" option to specify a different input format. In this case it's a SMILES file.

How did it know which fingerprints to generate? Simsearch opened the targets file, Compound_09425001_09450000.fps.gz, and read
  #type=OpenBabel-FP2/1
There's a table inside which describes how to match that fingerprint type to the right way to read structures and generate fingerprints of that type.

There's also a "-c" option which only counts the targets that are within a given threshold of each query.

% simsearch -c --threshold 0.6 -q ~/databases/pubchem/Compound_022225001_022250000.iso.smi.gz Compound_09425001_09450000.fps.gz
#Count/1
#num_bits=1021
#software=chemfp/0.9
#type=Count threshold=0.6
#query_source=/Users/dalke/databases/pubchem/Compound_022225001_022250000.iso.smi.gz
#target_source=Compound_09425001_09450000.sdf.gz
0 22225001
0 22225002
12 22225003
0 22225004
0 22225005
0 22225006
4 22225007
0 22225008
0 22225009
4 22225010
41 22225011

This says that 22225011 has 41 targets with at least 0.6 Tanimoto similarity.

The code is not finished. There are inputs which will break it, there are special cases I need to handle, there are missing command-line parameters, there are internal APIs I need to clean up, and while there are 421 unit tests, that's only about half of what's needed to really test the code.

However, I would really like feedback, so please take a look and let me know what you think.
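
For anyone who wants to see what these files contain before installing anything, here is a minimal sketch in plain Python of reading FPS-style text and doing the k-nearest and count searches described above. It is not chemfp's code and makes no attempt at speed. It assumes each record is a hex fingerprint, whitespace, then an identifier, as in the output shown earlier. The names parse_fps, tanimoto, knearest and count_hits are made up for illustration, and DEMO is toy 16-bit data, not real fingerprints.

```python
import heapq

def parse_fps(lines):
    """Split FPS-style text into (header dict, list of (id, fingerprint int))."""
    header, records = {}, []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header lines look like "#key=value"; "#FPS1" has no "=".
            key, _, value = line[1:].partition("=")
            header[key] = value
            continue
        # Assumed layout: hex fingerprint, whitespace, identifier.
        hex_fp, fp_id = line.split(None, 1)
        records.append((fp_id, int(hex_fp, 16)))
    return header, records

def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")

def tanimoto(a, b):
    """Tanimoto similarity: shared on-bits divided by total on-bits."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0

def knearest(query, records, k=5, threshold=0.0):
    """Return up to k (score, id) pairs with score >= threshold, best first."""
    hits = [(tanimoto(query, fp), fp_id) for fp_id, fp in records]
    hits = [hit for hit in hits if hit[0] >= threshold]
    return heapq.nlargest(k, hits, key=lambda hit: hit[0])

def count_hits(query, records, threshold):
    """Count the targets whose similarity to the query is >= threshold."""
    return sum(1 for _, fp in records if tanimoto(query, fp) >= threshold)

# Toy data (hypothetical, 16 bits wide) just to exercise the functions.
DEMO = ["#FPS1", "#num_bits=16", "00ff\ta", "0f0f\tb", "0000\tc"]

if __name__ == "__main__":
    header, records = parse_fps(DEMO)
    query = int("00ff", 16)
    print(header)
    print(knearest(query, records, k=2))
    print(count_hits(query, records, 0.3))
```

To try it on a real file, something like parse_fps(gzip.open(filename, "rt")) should work for the compressed .fps.gz output, again assuming the record layout above.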

Andrew
da...@dalkescientific.com