Hi All,

Some of you may have seen the blog post the other week on speeding up
AtomContainer's, if not it will explain what 'Ref' objects are and why
they're wanted:
http://efficientbits.blogspot.co.uk/2017/04/cdk-atomcontainers-are-slow-lets-fix.html

In short I'd like to introduce a public object to wrap AtomContainers for
faster read (i.e. adjacency) operations without breaking the current API.
The atom and bond properties can still be modified. As a public "blessed"
implementation I can optimize on the back-end for
cases where the a molecule is passed through several methods without any
structural changes. At the moment if a method needs to do many of adjacency
queries (e.g. getConnectedBondList) they're internally converted to a
int[][] each and every time. Sometimes I expose the int[][] APIs (e.g. ring
perception) but they're quite ugly and setting bond props is a pain.

I've done some more work on the abstraction and now wrap it up in an
IAtomContainer. I've kept it as read-only for now but have some neat tricks
to make the adjacency mutable I can use. Scaffold Hunter had/has some
internal patches for something similar (I'm guessing essentially Map
lookups) - here because we provide an AtomRef and BondRef you can avoid the
somewhat costly map lookup, for example:

   // cost ~O(|E|) in normal AtomContainer, poor - linear in num bonds
   for (IBond bond : mol.getConnectedBondList(mol.getAtom(0))) {}

   // cost ~O(1) via AtomContainerRef, better - but HashMap lookup
   for (IBond bond : mol.getConnectedBondList(mol.getAtom(0))) {}

   // cost O(1) via AtomRef, best - direct access, invokevirtual
   for (IBond bond : mol.getAtom(0).bonds()) {}

Essentially you can now do the following:

// example AtomTyping

 AtomContainerManipulator.preceiveAtomsAndConfigure(new
> AtomContainerRef(mol));



> // example Path Fingerprints

IFingerprinter fpr = new Fingerprinter();
> fpr.getFingerprint(new AtomContainerRef(mol));


Neither of these methods modify the connection table (read-only), for
AtomTyping we get a reasonable speed improvement whilst the Fingerprint we
see none [Why? - see footnote 1]. You get a huge speed improvement if you
re-use it multiple times (it's intended use as outlined above). Also once
in use I can update functions like the AtomTyper to leverage the faster API
and improve further still.

Ultimately I would like redesign the current API (getFirstAtom, getLastAtom
- urgh!) to be better but that's quite destructive, maybe v3.0....

Anyways please checkout the pull request to test and leave comments:
https://github.com/cdk/cdk/pull/278

- John

Here are the timings using AtomContainerRef on methods not optimized for it
(i.e. could go much faster).

*AtomTyping **(first 100k in ChEMBL 22)*
*SilentAtomContainer*
10000 661 ms @ 908265 mol per min
20000 950 ms @ 1263556 mol per min
30000 1273 ms @ 1414070 mol per min
40000 1514 ms @ 1585648 mol per min
50000 1924 ms @ 1559207 mol per min
60000 2119 ms @ 1698890 mol per min
70000 2286 ms @ 1837532 mol per min
80000 2475 ms @ 1939773 mol per min
90000 2740 ms @ 1970908 mol per min
100000 3087 ms @ 1943733 mol per min

*AtomContainerRef(**SilentAtomContainer**)*
10000 512 ms @ 1172463 mol per min
20000 753 ms @ 1593887 mol per min
30000 997 ms @ 1806203 mol per min
40000 1170 ms @ 2051013 mol per min
50000 1385 ms @ 2165530 mol per min
60000 1524 ms @ 2362910 mol per min
70000 1638 ms @ 2564230 mol per min
80000 1763 ms @ 2722008 mol per min
90000 1923 ms @ 2807711 mol per min
100000 2108 ms @ 2846276 mol per min

*Path Fingerprints **(first 100k in ChEMBL 22)*
*SilentAtomContainer*
10000 3433 ms @ 174786 mol per min
20000 5787 ms @ 207358 mol per min
30000 8413 ms @ 213951 mol per min
40000 10710 ms @ 224083 mol per min
50000 13515 ms @ 221969 mol per min
60000 15697 ms @ 229349 mol per min
70000 17616 ms @ 238423 mol per min
80000 19643 ms @ 244368 mol per min
90000 22165 ms @ 243628 mol per min
100000 24638 ms @ 243530 mol per min

*AtomContainerRef(**SilentAtomContainer**)*
10000 3441 ms @ 174358 mol per min
20000 5935 ms @ 202189 mol per min
30000 8547 ms @ 210611 mol per min
40000 10831 ms @ 221585 mol per min
50000 13500 ms @ 222220 mol per min
60000 15721 ms @ 228999 mol per min
70000 17670 ms @ 237696 mol per min
80000 19742 ms @ 243139 mol per min
90000 22243 ms @ 242771 mol per min
100000 24597 ms @ 243933 mol per min


Footnotes:
1 - basically they run the same elapsed time but bear in mind we're
including the time to convert to the AtomContainerRef first, so the
fingerprinting did run faster. The time doesn't improve because I patched
the Fingerprinter the other week to cache the Bonds already...
https://github.com/cdk/cdk/commit/8a3cbeed5dd814b48394a5e7b9b41c49ce974ca9#diff-2067e85b8af365bd0b0e45566c4b88c1R234
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Cdk-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/cdk-user

Reply via email to