Hi Deepti,

For the Spark part, I'd say you simply generate all the fingerprints (locally or on the cluster) and store the resulting list of fingerprints as a pickle (.pkl) file. Then, when running your test, you simply load the pickle file into memory. With 15 GB of memory and 2 million molecules this should work fine, at least for a test. I have a simple web app that does exactly this, albeit with only about 200k molecules. It uses about 400 MB of RAM, most of which I assume comes from the fingerprints. Scaling that up linearly, the 2 million fingerprints would use only about 4 GB of RAM.
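The caching idea above can be sketched as follows. This is only a minimal illustration under stated assumptions: `load_or_build_fingerprints` and `toy_fingerprint` are hypothetical names, and the toy function is a stand-in for whatever RDKit fingerprint call is actually used, so that the sketch runs with the standard library alone.

```python
import os
import pickle
import tempfile


def load_or_build_fingerprints(cache_path, smiles_list, fingerprint_fn):
    """Return fingerprints for smiles_list, reusing a pickle cache if one exists."""
    if os.path.exists(cache_path):
        # Cache hit: skip fingerprint generation entirely.
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    # Cache miss: compute every fingerprint once, then persist the list.
    fingerprints = [fingerprint_fn(smiles) for smiles in smiles_list]
    with open(cache_path, "wb") as handle:
        pickle.dump(fingerprints, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return fingerprints


def toy_fingerprint(smiles):
    # Hypothetical stand-in, NOT a real chemical fingerprint: the set of
    # characters in the SMILES string. Swap in the real fingerprint call here.
    return frozenset(smiles)


if __name__ == "__main__":
    cache = os.path.join(tempfile.mkdtemp(), "fingerprints.pkl")
    smiles = ["CCO", "c1ccccc1"]
    first = load_or_build_fingerprints(cache, smiles, toy_fingerprint)   # builds
    second = load_or_build_fingerprints(cache, smiles, toy_fingerprint)  # loads
    print(first == second)
```

The first run pays the generation cost; every later run only pays for reading the file, which is the whole point of the suggestion.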
Still, it raises the question of what this would be used for, since this approach obviously doesn't scale and you would also need some way of storing the fingerprints on Spark. Also, if your goal is to run similarity searches over lots of fingerprints, I suggest you have a look at ChemFP.

Best regards,
Thomas

________________________________
From: Deepti Gupta via Rdkit-discuss <rdkit-discuss@lists.sourceforge.net>
Sent: Wednesday, 26 February 2020 09:46
To: rdkit-discuss@lists.sourceforge.net <rdkit-discuss@lists.sourceforge.net>; Tim Dudgeon <tdudgeon...@gmail.com>
Subject: Re: [Rdkit-discuss] RDkit in Python vs. on PostgreSQL?

Hi Tim,

Thank you! I'll be more detailed in my post, sorry about that.

As this was a PoC, I had a Spark cluster with 2 worker nodes with 4 vCPUs, a 500 GB disk, and 15 GB of memory on Google Cloud. I timed the response against 2 million data points consisting of ChEMBL IDs and SMILES structures.

Substructure search - 2 mins
Similarity search - 43 mins

The PostgreSQL DB was installed on a VM with 4 vCPUs, a 500 GB disk, and 15 GB of memory. I set shared_buffers = 2048MB in the postgresql.conf file.

Substructure search - within 5 secs
Similarity search - within 3 secs

I tried to store the converted molecules and fingerprints in a file to get better performance from the PySpark program, but was not able to do so.

Regards,
DA

On Wednesday, February 26, 2020, 12:57:43 AM GMT+5:30, Tim Dudgeon <tdudgeon...@gmail.com> wrote:

I think you need to explain what benchmarks you are running and what is really meant by "faster", and what hardware you are using (for Spark, how many nodes and how big; for PostgreSQL, what size server and what settings, especially the shared_buffers setting).
A very obvious critique of what you reported is that what you describe as "running in Python" includes generating the fingerprints for each molecule on the fly, whereas for "the cartridge" these are already calculated, so it will obviously be much faster (as the fingerprint generation dominates the compute).

Tim

On 25/02/2020 11:14, Deepti Gupta via Rdkit-discuss wrote:

Hi Gurus,

I'm absolutely new to the cheminformatics domain. I've been assigned a PoC in which I have to compare RDKit in Python and RDKit on PostgreSQL. I've installed both and am trying some hands-on exercises to understand the differences. What I've understood is that structure searches are slower in Python (Spark cluster) than in the PostgreSQL database. Please correct me if I'm wrong, as I'm a newbie in this and may be talking nonsense.

The similarity search using the functions below (example) -

Python methods -

    fps = FingerprintMols.FingerprintMol(Chem.MolFromSmiles(smile_structure, sanitize=False))
    similarity = DataStructs.TanimotoSimilarity(fps1,fps2)

takes too long (45 minutes) for a file with 2 million entries, while the same thing is very quick (seconds) on PostgreSQL.

Database functions -

    select count(*)
    from (
        select modality_id, m,
               tanimoto_sml(morganbv_fp(mol_from_smiles('CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O'::cstring)), mfp2) as similarity
        from fingerprints
        join mols using (modality_id)
    ) as fps
    where similarity between 0.45 and 0.50;

Does this mean that for production workloads one must always use a database cartridge, such as RDKit, BINGO, etc.?

Regards,
DA
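To illustrate Tim's point that the comparison itself is cheap once fingerprints already exist, here is a minimal sketch of a Tanimoto range search over precomputed fingerprints. It assumes fingerprints are stored as plain Python integers used as bit sets, which is a simplification and not RDKit's own bit-vector type; the function names are hypothetical. RDKit offers `DataStructs.BulkTanimotoSimilarity` for the equivalent scan over its own fingerprints.

```python
def popcount(bits):
    """Count the set bits in a non-negative integer."""
    return bin(bits).count("1")


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as integer bit sets."""
    union = popcount(a | b)
    if union == 0:
        # Two empty fingerprints: define the similarity as 0.0 here.
        return 0.0
    return popcount(a & b) / union


def search(query_fp, stored_fps, low, high):
    """Return (index, similarity) pairs with low <= similarity <= high.

    stored_fps is assumed to be precomputed, so no molecule parsing or
    fingerprint generation happens inside this loop.
    """
    hits = []
    for index, fp in enumerate(stored_fps):
        score = tanimoto(query_fp, fp)
        if low <= score <= high:
            hits.append((index, score))
    return hits
```

The inclusive bounds mirror the `between 0.45 and 0.50` filter in the SQL query. A fair benchmark would time only this kind of scan on the Python side, since the cartridge query also reads stored fingerprints rather than regenerating them.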
_______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss