Hi Tim,
Thank you!
I'll be more detailed in my post, sorry about that. As this was a PoC, I had a
Spark cluster on Google Cloud with 2 worker nodes, each with 4 vCPUs, 15 GB of
memory, and a 500 GB disk. I timed the response against 2 million data points
consisting of ChEMBL IDs and SMILES structures.
Substructure search - 2 mins
Similarity search - 43 mins
PostgreSQL was installed on a VM with 4 vCPUs, 15 GB of memory, and a 500 GB
disk. The value shared_buffers = 2048MB was set in the postgresql.conf file.
Substructure search - within 5 secs
Similarity search - within 3 secs
I tried to store the converted molecules and fingerprints in a file so the
PySpark program would not have to regenerate them and could get better
performance, but I was not able to get that working.
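Roughly, what I was aiming for was something like the sketch below: convert
each SMILES to a fingerprint once, serialise it as a bit string, and write the
result out (here to Parquet) so later searches can reuse it instead of
regenerating fingerprints on the fly. The input file and column names
(chembl_id, smiles) are just placeholders, not my actual schema.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    from rdkit import Chem
    from rdkit.Chem import AllChem

    spark = SparkSession.builder.appName("fp-precompute").getOrCreate()

    def smiles_to_fp(smiles):
        # Parse the SMILES and return a 2048-bit Morgan fingerprint as a
        # bit string, or None if the molecule cannot be parsed.
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return None
        return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048).ToBitString()

    fp_udf = udf(smiles_to_fp, StringType())

    # chembl_2m.csv with columns chembl_id, smiles is a placeholder input.
    df = spark.read.csv("chembl_2m.csv", header=True)
    df.withColumn("fp", fp_udf("smiles")) \
      .write.mode("overwrite").parquet("fingerprints.parquet")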
Regards,
DA
    On Wednesday, February 26, 2020, 12:57:43 AM GMT+5:30, Tim Dudgeon 
<[email protected]> wrote:  
 
  
I think you need to explain what benchmarks you are running and what is really
meant by "faster". And what hardware (for Spark, how many nodes and how big;
for PostgreSQL, what size server and what settings, especially the
shared_buffers setting).
 
 
A very obvious critique of what you reported is that what you describe as 
"running in Python" includes generating the fingerprints for each molecule on 
the fly, whereas for "the cartridge" these are already calculated, so will 
obviously be much faster (as the fingerprint generation dominates the compute).
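
For example, a more like-for-like Python comparison would be something along
the lines of the sketch below (file name and query structure are just
placeholders): build the fingerprints once up front and time only the
similarity step, which is roughly what the cartridge is doing against its
precomputed mfp2 column.

    import time
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    # Build the fingerprints once, outside the timed section.
    suppl = Chem.SmilesMolSupplier("molecules.smi", titleLine=False)
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048)
           for m in suppl if m is not None]

    query = AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles("CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O"), 2, nBits=2048)

    # Only the similarity calculation is timed, matching what the cartridge
    # does against its precomputed fingerprint column.
    start = time.time()
    scores = DataStructs.BulkTanimotoSimilarity(query, fps)
    print(len(scores), "comparisons in", time.time() - start, "seconds")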
 
Tim
 
 On 25/02/2020 11:14, Deepti Gupta via Rdkit-discuss wrote:
  
 
Hi Gurus,
I'm absolutely new to the cheminformatics domain. I've been assigned a PoC
where I have to compare RDKit in Python and RDKit on PostgreSQL. I've installed
both and am trying some hands-on exercises to understand the differences. What
I've understood is that structure searches are slower in Python (Spark cluster)
than in the PostgreSQL database. Please correct me if I'm wrong, as I'm a
newbie in this and may be talking silly.
The similarity search using the functions below (example) -

Python methods -

    from rdkit import Chem, DataStructs
    from rdkit.Chem.Fingerprints import FingerprintMols

    fps = FingerprintMols.FingerprintMol(Chem.MolFromSmiles(smile_structure, sanitize=False))
    # fps1 and fps2 are fingerprints generated as above
    similarity = DataStructs.TanimotoSimilarity(fps1, fps2)

takes too long (45 minutes) for a 2 million record file, while the same thing
is very quick (within seconds) on the PostgreSQL database.

Database functions -

    select count(*) from (
        select modality_id, m,
               tanimoto_sml(morganbv_fp(mol_from_smiles('CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O'::cstring)), mfp2) as similarity
        from fingerprints join mols using (modality_id)
    ) as fps
    where similarity between 0.45 and 0.50;
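
(To check my understanding, I think the closest like-for-like Python version of
that query would be something along the lines of the sketch below, using Morgan
fingerprints to match mfp2 and assuming the 2 million fingerprints have already
been built into a list fps; please correct me if this is not the idiomatic way.)

    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    # Query fingerprint, built the same way as the cartridge's morganbv_fp/mfp2
    # (bit size of 2048 is an assumption here).
    query_fp = AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles("CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O"), 2, nBits=2048)

    # fps is assumed to be the precomputed list of 2 million fingerprints.
    scores = DataStructs.BulkTanimotoSimilarity(query_fp, fps)
    count = sum(1 for s in scores if 0.45 <= s <= 0.50)
    print(count)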
Does this mean that for production workloads one should always use a database
cartridge (like RDKit, BINGO, etc.)?

Regards,
DA
  
  _______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss