Hi all,

 Last year I proposed a file format for cheminformatics fingerprints.

A format without tools is rather useless, so I've written a set of tools which 
work on my flat-file/text format.

It's nearing completion, and I'm looking for some friendly test users. If you 
are interested, you can get the code from

 http://dalkescientific.com/chemfp-0.9.tar.gz

It's a set of Python programs with a C extension.

It currently supports the following toolkits:
 - OpenBabel
 - RDKit
 - OEChem
plus being able to extract fingerprints from fields in an SD file,
with special support for the PubChem substructure fingerprints.

Here are a couple of ways to generate fingerprints:


% echo "c1ccccc1O phenol" | ./ob2fps 
#FPS1
#num_bits=1021
#software=OpenBabel/2.3.0
#type=OpenBabel-FP2/1
#date=2011-02-07T13:24:29
0000000000000000000002000000000000000000000000000000000000000000000000000000000000000008000000000000020000000000000000000000000008000000000000000000000002000000008000000000000040080000000000000000000000000002000000000000000000020000000000200800000000000000
 phenol

The default is --FP2 with SMILES from stdin, but I can change that.


% ob2fps --MACCS Compound_09425001_09450000.sdf.gz | head
#FPS1
#num_bits=166
#software=OpenBabel/2.3.0
#type=OpenBabel-MACCS/2
#source=Compound_09425001_09450000.sdf.gz
#date=2011-02-07T13:25:04
000000000002080019cc44eacdec980baea378ef1f 9425004
000000002000082159d404eea9e8b80b8ea37eef1f 9425009
000000000000080159c404efa9e89a0b8eb3faef1b 9425012
000000000000082019c404ee89e8b80b8ea3ffef1f 9425015
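
For anyone who wants to read this output from their own scripts, here is a minimal Python sketch of a reader. It is not part of chemfp. The names read_fps and open_fps are made up for illustration, and it assumes what the examples above show: header lines start with "#" and mostly look like "#key=value", and each record is a hex-encoded fingerprint, then whitespace, then the identifier.

```python
import gzip
import io

def read_fps(lines):
    """Split FPS text into (header dict, list of (fingerprint bytes, id))."""
    header = {}
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header lines look like "#key=value"; the "#FPS1" magic line
            # has no "=" and is skipped here.
            key, sep, value = line[1:].partition("=")
            if sep:
                header[key] = value
            continue
        # Assumption: hex fingerprint, one space, then the identifier.
        hex_fp, _, identifier = line.partition(" ")
        records.append((bytes.fromhex(hex_fp), identifier.strip()))
    return header, records

def open_fps(path):
    """Open a plain or gzip-compressed FPS file as text (illustrative helper)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")

# Two records copied from the MACCS output above.
sample = """#FPS1
#num_bits=166
#type=OpenBabel-MACCS/2
000000000002080019cc44eacdec980baea378ef1f 9425004
000000002000082159d404eea9e8b80b8ea37eef1f 9425009
"""
header, records = read_fps(io.StringIO(sample))
```

To read a real file, pass open_fps("Compound_09425001_09450000.fps.gz") to read_fps instead of the in-memory sample.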

Suppose I want to do a similarity search. First I'll save the fingerprints to a
compressed file:

% ob2fps Compound_09425001_09450000.sdf.gz -o Compound_09425001_09450000.fps.gz

That takes about 85 seconds (OpenBabel's index creation only takes 67 seconds).

% ls -l Compound_09425001_09450000.fps.gz
-rw-r--r--  1 dalke  staff  766355 Feb  7 13:07 Compound_09425001_09450000.fps.gz


The included "simsearch" program lets you specify k-nearest and a threshold. 
Here I'll search for the nearest 5.



% echo 'N#Cc1ccccc1C#N' | simsearch --in smi -k 5 Compound_09425001_09450000.fps.gz
#Simsearch/1
#num_bits=1021
#software=chemfp/0.9
#type=Tanimoto k=5 threshold=0.0
#target_source=Compound_09425001_09450000.sdf.gz
5 Record1 0.32258 9443135 0.32258 9443136 0.31667 9430485 0.31667 9430486 0.12389 9449997

Each line of the output (past the header) is:
 number of nearest neighbors found = N
 title of the input structure
 and then N alternating scores and target identifiers
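
A result line in that layout can be picked apart with a few lines of Python. This is only a sketch: parse_simsearch_line is a hypothetical helper, and it assumes the fields are separated by whitespace and that neither titles nor identifiers contain whitespace.

```python
def parse_simsearch_line(line):
    """Return (query title, [(score, target id), ...]) for one result line."""
    fields = line.split()
    n = int(fields[0])      # number of nearest neighbors found
    title = fields[1]       # title of the input structure
    rest = fields[2:]       # N alternating scores and target identifiers
    if len(rest) != 2 * n:
        raise ValueError("expected %d score/id pairs, got %d fields" % (n, len(rest)))
    hits = [(float(rest[i]), rest[i + 1]) for i in range(0, len(rest), 2)]
    return title, hits

# The result line from the search above.
line = ("5 Record1 0.32258 9443135 0.32258 9443136 "
        "0.31667 9430485 0.31667 9430486 0.12389 9449997")
title, hits = parse_simsearch_line(line)
```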


By default simsearch takes a fingerprint file as input, but I can use the 
"--in" option to specify a different input format. In this case it's a SMILES 
file.

How did it know which fingerprints to generate? Simsearch opened the targets 
file, Compound_09425001_09450000.fps.gz, and read

#type=OpenBabel-FP2/1

There's a table inside which describes how to match that fingerprint type to 
the right way to read structures and generate fingerprints of that type.
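
To give a rough idea of that kind of lookup, here is a small Python sketch. It is not the real table inside chemfp, which may look quite different. TYPE_TO_COMMAND, read_type and command_for are hypothetical names, and the two entries only restate what the examples above show (--FP2 produces OpenBabel-FP2/1 and --MACCS produces OpenBabel-MACCS/2).

```python
# Hypothetical lookup table, for illustration only.
TYPE_TO_COMMAND = {
    "OpenBabel-FP2/1": ["ob2fps", "--FP2"],
    "OpenBabel-MACCS/2": ["ob2fps", "--MACCS"],
}

def read_type(header_lines):
    """Return the value of the "#type=" header line, or None if it is absent."""
    for line in header_lines:
        if not line.startswith("#"):
            break  # the header is over once the first record appears
        if line.startswith("#type="):
            return line[len("#type="):].strip()
    return None

def command_for(header_lines):
    """Pick the generator command matching the fingerprint type of the targets."""
    fp_type = read_type(header_lines)
    if fp_type not in TYPE_TO_COMMAND:
        raise KeyError("no known generator for fingerprint type %r" % (fp_type,))
    return list(TYPE_TO_COMMAND[fp_type])

command = command_for(["#FPS1\n", "#num_bits=1021\n", "#type=OpenBabel-FP2/1\n"])
```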


There's also a "-c" option, which only counts the targets that are within a 
given threshold of each query.

% simsearch -c --threshold 0.6 -q ~/databases/pubchem/Compound_022225001_022250000.iso.smi.gz Compound_09425001_09450000.fps.gz


#Count/1
#num_bits=1021
#software=chemfp/0.9
#type=Count threshold=0.6
#query_source=/Users/dalke/databases/pubchem/Compound_022225001_022250000.iso.smi.gz
#target_source=Compound_09425001_09450000.sdf.gz
0 22225001
0 22225002
12 22225003
0 22225004
0 22225005
0 22225006
4 22225007
0 22225008
0 22225009
4 22225010
41 22225011


This says that 22225011 has 41 targets with a Tanimoto similarity of at least 0.6.
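
For readers who want to see what the two search modes compute, here is a brute-force Python sketch. It is not how simsearch is implemented (the real code uses a C extension). The function names are made up, the fingerprints are assumed to be equal-length byte strings, and returning 0.0 when both fingerprints are empty is my assumption, not a documented rule.

```python
import heapq

def tanimoto(fp1, fp2):
    """Tanimoto similarity: bits set in both divided by bits set in either."""
    a = int.from_bytes(fp1, "big")
    b = int.from_bytes(fp2, "big")
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # assumed convention for two all-zero fingerprints
    return bin(a & b).count("1") / union

def count_within(query, targets, threshold):
    """Count mode: number of targets scoring at least `threshold`."""
    return sum(1 for fp, _ in targets if tanimoto(query, fp) >= threshold)

def k_nearest(query, targets, k, threshold=0.0):
    """k-nearest mode: the k best (score, id) pairs at or above `threshold`."""
    scored = ((tanimoto(query, fp), ident) for fp, ident in targets)
    kept = (hit for hit in scored if hit[0] >= threshold)
    return heapq.nlargest(k, kept, key=lambda hit: hit[0])

# The four MACCS records shown earlier, as (fingerprint bytes, id) pairs.
targets = [
    (bytes.fromhex("000000000002080019cc44eacdec980baea378ef1f"), "9425004"),
    (bytes.fromhex("000000002000082159d404eea9e8b80b8ea37eef1f"), "9425009"),
    (bytes.fromhex("000000000000080159c404efa9e89a0b8eb3faef1b"), "9425012"),
    (bytes.fromhex("000000000000082019c404ee89e8b80b8ea3ffef1f"), "9425015"),
]
query = targets[0][0]
nearest = k_nearest(query, targets, 2)
n_close = count_within(query, targets, 0.6)
```

The query here is simply the first target, so it is its own best hit with a score of 1.0.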

The code is not finished. There are inputs which will break it, special cases 
I need to handle, missing command-line parameters, and internal APIs I need to 
clean up. While there are 421 unit tests, that's only about half of what's 
needed to really test the code.

However, I would really like feedback, so please take a look and let me know 
what you think.



                                Andrew
                                da...@dalkescientific.com



_______________________________________________
OpenBabel-discuss mailing list
OpenBabel-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-discuss
