Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

David Cosgrove Mon, 23 Apr 2018 09:29:45 -0700

Hi all,
I’ve just had the attached from Roger Sayle, which might be of interest.
Dave


Hi Andrew and Dave,

John (Mayfield) has just pointed me at the very interesting discussion raging
on sourceforge. Alas, I've no idea how to post/tweet/snapchat a reply, but
thought I'd at least contribute this small nugget of trivia: US5434796A
https://patents.google.com/patent/US5434796A
where Dave W. managed to patent any application of genetic algorithms to
optimizing an objective function of a molecular structure [which is
optimistically broad, and in its day may even have covered Andrew's
rev_eng_fp.py, which is very impressive by the way.] This is probably the
closest thing to a publication around Dave's work with GAs that was mentioned
on the thread (I think he only authored four or five papers in his lifetime).

Best wishes to the both of you.  Keep up the great discussion.

Cheers,
Roger
--
Roger Sayle, PhD.
CEO and founder
NextMove Software Limited
Registered in England No. 07588305
Registered Office: Innovation Centre (Unit 23), Cambridge Science Park,
Cambridge, CB4 0EY



On Mon, 23 Apr 2018 at 17:11, Brice Hoffmann <brice.hoffm...@iktos.com>
wrote:

> Hi,
> Another option is to use generative models that use fingerprints as input
> (e.g. https://arxiv.org/abs/1701.01329,
> https://pubs.acs.org/doi/10.1021/acs.molpharmaceut.7b00346). If you use
> the Tanimoto distance to a given fingerprint as the scoring function for
> the generated molecules, you can often retrieve the original compound.
> At Iktos we develop such methods and they work pretty well!
> Best regards,
> Brice
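
The scoring idea Brice describes can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Iktos's method or any library's API: the names `tanimoto`, `tanimoto_distance`, the toy bit sets and the candidate labels are all made up here, and a fingerprint is represented simply as a set of on-bit indices.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen for this sketch when both are empty
    return len(a & b) / union


def tanimoto_distance(a, b):
    """Distance form: 0.0 means the two fingerprints are identical."""
    return 1.0 - tanimoto(a, b)


# Hypothetical target fingerprint and candidate fingerprints (toy data).
target = {3, 17, 42, 99}
candidates = {
    "cand_a": {3, 17},
    "cand_b": {3, 17, 42, 99},
    "cand_c": {5, 6},
}

# A generator would be steered toward the candidate with the smallest distance.
best = min(candidates, key=lambda name: tanimoto_distance(candidates[name], target))
print(best, tanimoto_distance(candidates[best], target))
```

A distance of 0.0 only says the candidate reproduces the fingerprint; different molecules can share a fingerprint, which fits the "often" in the message above.
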
>
>
>
> 2018-04-23 16:18 GMT+02:00 Andrew Dalke <da...@dalkescientific.com>:
>
>> On Apr 23, 2018, at 14:54, Brian Cole <col...@gmail.com> wrote:
>> > Unfortunately it doesn't work on circular/ECFP-like fingerprints.
>>
>> To be fair, you didn't mention that was a requirement. ;)
>>
>> > It has the requirement that the fingerprint be a substructure
>> > fingerprint as you described.
>>
>> Could you elaborate on your goal?
>>
>> I used RDKitFingerprint because it was the easiest. It was something I
>> could do in a day to demonstrate that it is possible to reverse engineer
>> some fingerprints.
>>
>> I think it's possible to do something similar for circular fingerprints.
>> It would mean generating all possible subgraphs of a given radius, which is
>> doable for r=2 or r=3, and probably r=4. RDKit has a way to look at the
>> circular environment around a specific atom rather than the entire
>> fingerprint, so that can be used to generate a seed point. Once that's
>> found, it can be expanded to one of its neighbor atoms.
>>
>> Another problem is that the Morgan fingerprint algorithm really wants
>> sanitized structures, which I didn't need to worry about for the hash
>> fingerprints.
>>
>> Instead of a day of work, it's going to take a couple of weeks of work,
>> which requires time and money.
>>
>> My advice though is that it's surely possible to determine some structure
>> information from the circular fingerprint. If your use case says there
>> should be no information leak (other than what's possible by full
>> brute-force-enumeration) then don't exchange fingerprints.
>>
>> But leaking information is not really what I thought of by "reverse
>> engineer".
>>
>> For example, if I want to check if any of the Morgan fingerprints with
>> r=2 contain a phenol, I can ask RDKit to generate the fingerprint for r=2
>> using just the c(O)c as the fromAtoms. This gives:
>>
>>
>> % echo '*c1ccc(O)cc1 phenol' | rdkit2fps --from-atoms 3,4,5,6 --morgan
>> #FPS1
>> #num_bits=2048
>> #type=RDKit-Morgan/1 radius=2 fpSize=2048 useFeatures=0 useChirality=0 useBondTypes=1 fromAtoms=3,4,5,6
>> #software=RDKit/2016.09.3 chemfp/3.2.1
>> #date=2018-04-23T14:03:24
>> 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002000000000000800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000200000000000000100000000000000000000000000000000000000000000000000000000400000000000000000000000000000000004000000004000000000000000000000002000000000000000000000000000000000000000000
>>       phenol
>>
>>
>> I can then screen using that fingerprint to see which fingerprints match.
>>
>> Of the first 100,000 structures in ChEMBL, 2216 contain phenol, all of
>> them are detected by this screen, and there are no false positives.
>>
>> Poof - structural information leakage.
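
The screen Andrew describes is a containment test: a target passes if every bit set in the query fingerprint is also set in the target. Below is a minimal stdlib-only sketch of that test on hex-encoded fingerprints. It is not chemfp's implementation; the function name `contains`, the 16-bit toy fingerprints and the ids are invented for illustration.

```python
def contains(query_hex, target_hex):
    """Return True if every on-bit of the query is also on in the target.

    Both fingerprints are hex strings of the same length. Bit order does not
    matter for a subset test as long as both are decoded the same way.
    """
    if len(query_hex) != len(target_hex):
        raise ValueError("fingerprints must have the same length")
    q = int(query_hex, 16)
    t = int(target_hex, 16)
    return (q & t) == q


# Toy 16-bit fingerprints (hypothetical data, not real Morgan fingerprints).
query = "0a00"
targets = {"id1": "0e01", "id2": "0200", "id3": "0a00"}

hits = sorted(name for name, fp in targets.items() if contains(query, fp))
print(hits)
```

A hit only shows that the query bits are present. With folded, hashed fingerprints a bit collision could in principle produce a false positive, even though none turned up in the 100,000-structure check reported above.
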
>>
>> The code is at the bottom of this email. It depends on the commercial
>> version of chemfp.
>>
>>
>> > It seems the evolutionary/genetic algorithm approach is the current
>> > state-of-the-art for decoding circular/ECFP-like fingerprints.
>>
>> Dave Cosgrove mentioned Dave Weininger's GA work, which means it was with
>> Daylight hash fingerprints. I don't think we know that GAs have ever been
>> used to reverse engineer circular fingerprints.
>>
>>
>> > Historical question for you since you're the closest we have to a
>> > chem-informatician historian. :-) Why did these circular/ECFP fingerprints
>> > come into existence?
>>
>> I believe you are asking for
>> https://pubs.acs.org/doi/abs/10.1021/ci100050t .
>>
>>   Extended-connectivity fingerprints (ECFPs) are a novel class of topological
>>   fingerprints for molecular characterization. Historically, topological
>>   fingerprints were developed for substructure and similarity searching.
>>   ECFPs were developed specifically for structure−activity modeling.
>>
>> > my reading of the current literature is that tree/dendritic are
>> > statistically just as good at virtual screening as circular/ECFP:
>>
>> Yeah, I don't go there. I leave concepts like "just as good" or "better"
>> to people who have experimental data they can use for the comparison.
>>
>>
>>                                 Andrew
>>                                 da...@dalkescientific.com
>>
>> == Code to find which Morgan fingerprints contain a phenol substructure ==
>>
>> import chemfp
>> from chemfp import bitops, search
>>
>> arena = chemfp.load_fingerprints("chembl_23_morgan.fps", reorder=False)
>> print("Fingerprint type:", arena.metadata.type)
>>
>> # Want to find structures containing phenol
>>
>> # Adjust the fingerprint type to limit it to the given atoms
>> fptype = chemfp.get_fingerprint_type(arena.metadata.type + " fromAtoms=3,4,5,6")
>> query_fp = fptype.parse_molecule_fingerprint("*c1ccc(O)cc1", "smi")
>>
>> print("Query fingerprint:")
>> print(bitops.hex_encode(query_fp))
>> print()
>>
>> # Find the matching fingerprints
>> result = search.contains_fp(query_fp, arena)
>>
>> circular_ids = set(result.get_ids())
>>
>> # Search the first 100,000 structures
>> from rdkit import Chem
>> from chemfp import rdkit_toolkit as T
>>
>> pat = Chem.MolFromSmarts("*c1ccc(O)cc1")
>> all_ids = set()
>> exact_ids = set()
>> with T.read_molecules("/Users/dalke/databases/chembl_23.sdf.gz") as reader:
>>     for mol in reader:
>>         id = mol.GetProp("_Name")
>>         all_ids.add(id)
>>         if mol.HasSubstructMatch(pat):
>>             exact_ids.add(id)
>>         if len(all_ids) == 100000:
>>             break
>>
>> # limit the circular ids to only those checked
>> print("Full screen:", len(circular_ids))
>> circular_ids = circular_ids & all_ids
>> print("Relevant screen:", len(circular_ids))
>>
>> print("#correct:", len(exact_ids & circular_ids))
>> print("#false positives:", len(circular_ids - exact_ids))
>>
>> ## I get the following:
>> # Fingerprint type: RDKit-Morgan/1 radius=2 fpSize=2048 useFeatures=0 useChirality=0 useBondTypes=1
>> # Query fingerprint:
>> # 0000000000000000000000000000000000000000000000000000000000000000000000000000000
>> # 0000000000000000000000000000000000000000000000000000000000000000000000000000000
>> # 0000000000000000000000000000020000000000008000000000000000000000000000000000000
>> # 0000000000000000000000000000000000000000000000000000000000000000000000000000000
>> # 0000000000000200000000000000100000000000000000000000000000000000000000000000000
>> # 0000004000000000000000000000000000000000040000000040000000000000000000000020000
>> # 00000000000000000000000000000000000000
>> # Full screen: 31134
>> # Relevant screen: 2216
>> # #correct: 2216
>> # #false positives: 0
>>
>>
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
>
>
> --
> Brice HOFFMANN
> Senior Scientist,
> Molecular Modeling & Computational Chemistry
> iktos.ai
> 24 rue chaptal 75009 Paris
>
>
>
>
>
>
>
-- 
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk

