Re: [Rdkit-discuss] General questions on RDKit cartridge performance.
Dear Sune,

On Tue, May 3, 2011 at 8:10 AM, wrote:
>
> After spending a few hours I now have RDKit running on my laptop on Ubuntu
> 10.10. All but two of the ctests pass and I am sure I can get the last ones
> to pass soon too.

Please feel free to post if you run into problems with this.

> In general the installation ran smoothly and the
> installation guide was easy to follow. Next steps are getting structures
> loaded into the database. I am already procrastinating other tasks to play
> with RDKit ;o).

Ah, I see I've infected someone else. :-)
Hopefully the time spent playing will be well spent and you'll find
that the RDKit makes some of the other tasks easier!

Best Regards,
-greg

--
WhatsUp Gold - Download Free Network Management Software
The most intuitive, comprehensive, and cost-effective network
management toolset available today. Delivers lowest initial
acquisition cost and overall TCO of any competing solution.
http://p.sf.net/sfu/whatsupgold-sd
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Speeding up database queries...
Hi JP,

On Wed, May 4, 2011 at 10:49 AM, JP wrote:
> Hi there Adrian,
>
> Why should splitting the table vertically make a difference?
> Am I not correct in thinking that would then require a join, which is
> expensive (especially on 8M rows)?

You would actually only be doing the join on the rows that match your
SSS/similarity query. As long as you have some kind of primary key and
build indices on that primary key the performance should be fine. This
is the layout I almost always use: molecules in one table, bit vect
fingerprints in a second, count-based fingerprints in a third.

> My FS is local, and my indices are quite large
>
> db=# \di+ idx_ligand_rdkitmol;
>                               List of relations
>  Schema |        Name         | Type  | Owner | Table  |  Size   | Description
> --------+---------------------+-------+-------+--------+---------+-------------
>  public | idx_ligand_rdkitmol | index | jpebe | ligand | 5030 MB |
> (1 row)
>
> db=# \di+ idx_ligand_morganbv;
>                               List of relations
>  Schema |        Name         | Type  | Owner | Table  |  Size   | Description
> --------+---------------------+-------+-------+--------+---------+-------------
>  public | idx_ligand_morganbv | index | jpebe | ligand | 1645 MB |
> (1 row)

yeah, those are huge.

> But Greg is right (!) running the query a second time resulted in much
> faster performance (11285.886ms as opposed to the original
> 193973.253ms)
> Of course if you change the smiles string, then nothing is cached and
> it takes ages again...

I actually wasn't proposing re-running the query to get the cached
results, that's cheating. :-) The idea is that once you have the index
in memory all queries will go faster. This is what the emolecules
example on the wiki shows.

It sounds like you've reached a database size where doing some real
performance tuning on the database machine is going to be required. I
don't have much experience with this, but there does seem to be a fair
amount of information out there on the web.
This page, in particular, looks helpful in explaining what the
configuration parameters are and providing suggestions for tuning
them:
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
shared_buffers and effective_cache_size look particularly relevant.

This kind of performance tuning information would be really useful to
collect, so if you don't end up getting totally frustrated, please do
share your findings.

Best,
-greg
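For readers following the tuning thread, here is a minimal sketch of the arithmetic behind those two settings. It assumes the commonly quoted starting points of roughly a quarter of RAM for shared_buffers and about half of RAM for effective_cache_size on a dedicated database machine. The fractions, the function names and the 16 GB machine are illustrative assumptions, not values from this thread; only the two index sizes (5030 MB and 1645 MB) come from JP's listing.

```python
# Rule-of-thumb starting points for two PostgreSQL memory settings.
# ASSUMPTION: the default fractions are generic starting values, not tuned optima.


def suggest_memory_settings(total_ram_mb, shared_fraction=0.25, cache_fraction=0.5):
    """Return suggested values (in MB) for a machine dedicated to PostgreSQL."""
    if total_ram_mb <= 0:
        raise ValueError("total_ram_mb must be positive")
    return {
        "shared_buffers": int(total_ram_mb * shared_fraction),
        "effective_cache_size": int(total_ram_mb * cache_fraction),
    }


def indices_fit(index_sizes_mb, total_ram_mb):
    """True if the summed index sizes are smaller than physical RAM."""
    return sum(index_sizes_mb) < total_ram_mb


def as_conf_lines(settings):
    """Format the settings as postgresql.conf style lines, sorted by name."""
    return [f"{name} = {value}MB" for name, value in sorted(settings.items())]


# Hypothetical 16 GB machine; the index sizes are the two reported by JP.
ram_mb = 16 * 1024
for line in as_conf_lines(suggest_memory_settings(ram_mb)):
    print(line)
print("indices fit in RAM:", indices_fit([5030, 1645], ram_mb))
```

The second helper only checks whether the indices could fit in physical memory at all, which is the precondition for the "once the index is in memory" behaviour described above; it says nothing about what else competes for that memory.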
[Rdkit-discuss] Reply: Re: rdBase.so
Hi Greg,

here we go:

ldd rdBase.so
"
        linux-vdso.so.1 =>  (0x7fffed3ff000)
        libRDGeneral.so.1 => /usr/local/rdkit//lib/libRDGeneral.so.1 (0x7f8aefb1a000)
        libRDBoost.so.1 => /usr/local/rdkit//lib/libRDBoost.so.1 (0x7f8aef7aa000)
        libboost_python-mt.so.3 => /usr/lib64/libboost_python-mt.so.3 (0x7f8aef548000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x7f8aef23e000)
        libm.so.6 => /lib64/libm.so.6 (0x7f8aeefb8000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7f8aeeda1000)
        libc.so.6 => /lib64/libc.so.6 (0x7f8aeea2f000)
        libutil.so.1 => /lib64/libutil.so.1 (0x7f8aee82b000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x7f8aee60f000)
        libdl.so.2 => /lib64/libdl.so.2 (0x7f8aee40b000)
        librt.so.1 => /lib64/librt.so.1 (0x7f8aee201000)
        /lib64/ld-linux-x86-64.so.2 (0x0031f200)
"

64-bit python? 32-bit RDBoost?

Cheers,
Paul

> Hi Paul,
>
> On Wed, May 4, 2011 at 3:44 PM, wrote:
> >
> > Dear folks,
> >
> > what could be the reason causing the following error:
> >
> > "
> > [GCC 4.3.2 20080917 (Red Hat 4.3.2-4)] on linux2
> > Type "help", "copyright", "credits" or "license" for more information.
> > >>> from rdkit import Chem
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in <module>
> >   File "/usr/local/rdkit/rdkit/Chem/__init__.py", line 18, in <module>
> >     from rdkit import rdBase
> > ImportError: /usr/local/rdkit/rdkit/rdBase.so: undefined symbol:
> > _ZN5boost6python9converter8registry6insertEPFPvP7_objectEPFvS5_PNS1_30rvalue_from_python_stage1_dataEENS0_9type_infoEPFPK11_typeobjectvE
> > "
>
> Could it be that it's finding the wrong version of the boost python
> library? running ldd on rdBase.so will answer this question for you.
>
> -greg

This message and any attachment are confidential and may be privileged or otherwise protected from disclosure. If you are not the intended recipient, you must not copy this message or attachment or disclose the contents to any other person.
If you have received this transmission in error, please notify the sender immediately and delete the message and any attachment from your system. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not accept liability for any omissions or errors in this message which may arise as a result of E-Mail-transmission or for damages resulting from any unauthorized changes of the content of this message and any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not guarantee that this message is free of viruses and does not accept liability for any damages caused by any virus transmitted therewith. Click http://disclaimer.merck.de to access the German, French, Spanish and Portuguese versions of this disclaimer.
Re: [Rdkit-discuss] rdBase.so
Hi Paul,

On Wed, May 4, 2011 at 3:44 PM, wrote:
>
> Dear folks,
>
> what could be the reason causing the following error:
>
> "
> [GCC 4.3.2 20080917 (Red Hat 4.3.2-4)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from rdkit import Chem
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/local/rdkit/rdkit/Chem/__init__.py", line 18, in <module>
>     from rdkit import rdBase
> ImportError: /usr/local/rdkit/rdkit/rdBase.so: undefined symbol:
> _ZN5boost6python9converter8registry6insertEPFPvP7_objectEPFvS5_PNS1_30rvalue_from_python_stage1_dataEENS0_9type_infoEPFPK11_typeobjectvE
> "

Could it be that it's finding the wrong version of the boost python
library? Running ldd on rdBase.so will answer this question for you.

-greg
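As a rough illustration of the ldd check suggested here, the sketch below parses ldd output and reports which boost_python library the dynamic linker resolved. The function names are hypothetical helpers, and the sample text is an abbreviated copy of the listing Paul posted in this thread. Whether the resolved library is the "wrong" one still has to be judged by comparing it with the Boost that RDKit was built against; the code only surfaces the path.

```python
import re
import subprocess

# Matches "name => /path (0xADDR)", "name =>  (0xADDR)" and "name => not found".
_LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(.*?)\s*(?:\(0x[0-9a-fA-F]+\))?\s*$")


def parse_ldd(output):
    """Map each library name to its resolved path, or None if unresolved."""
    resolved = {}
    for line in output.splitlines():
        match = _LDD_LINE.match(line)
        if not match:
            continue  # e.g. the bare dynamic loader line, which has no "=>"
        name, target = match.groups()
        resolved[name] = None if target in ("", "not found") else target
    return resolved


def boost_python_libs(output):
    """Return only the entries whose name starts with libboost_python."""
    return {n: p for n, p in parse_ldd(output).items()
            if n.startswith("libboost_python")}


def run_ldd(path):
    """Run ldd on a shared object and return its text output (Linux only)."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True, check=True)
    return result.stdout


# Abbreviated sample taken from the ldd listing posted in this thread.
SAMPLE = """\
    linux-vdso.so.1 =>  (0x7fffed3ff000)
    libRDBoost.so.1 => /usr/local/rdkit//lib/libRDBoost.so.1 (0x7f8aef7aa000)
    libboost_python-mt.so.3 => /usr/lib64/libboost_python-mt.so.3 (0x7f8aef548000)
    /lib64/ld-linux-x86-64.so.2 (0x0031f200)
"""

for name, path in boost_python_libs(SAMPLE).items():
    print(name, "->", path)
```

On a live system one would call `boost_python_libs(run_ldd("/usr/local/rdkit/rdkit/rdBase.so"))` instead of using the pasted sample.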
[Rdkit-discuss] rdBase.so
Dear folks,

what could be the reason causing the following error:

"
[GCC 4.3.2 20080917 (Red Hat 4.3.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from rdkit import Chem
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/rdkit/rdkit/Chem/__init__.py", line 18, in <module>
    from rdkit import rdBase
ImportError: /usr/local/rdkit/rdkit/rdBase.so: undefined symbol:
_ZN5boost6python9converter8registry6insertEPFPvP7_objectEPFvS5_PNS1_30rvalue_from_python_stage1_dataEENS0_9type_infoEPFPK11_typeobjectvE
"

Thanks,
Paul
Re: [Rdkit-discuss] Speeding up database queries...
Hi there Adrian,

Why should splitting the table vertically make a difference?
Am I not correct in thinking that would then require a join, which is
expensive (especially on 8M rows)?

My FS is local, and my indices are quite large

db=# \di+ idx_ligand_rdkitmol;
                              List of relations
 Schema |        Name         | Type  | Owner | Table  |  Size   | Description
--------+---------------------+-------+-------+--------+---------+-------------
 public | idx_ligand_rdkitmol | index | jpebe | ligand | 5030 MB |
(1 row)

db=# \di+ idx_ligand_morganbv;
                              List of relations
 Schema |        Name         | Type  | Owner | Table  |  Size   | Description
--------+---------------------+-------+-------+--------+---------+-------------
 public | idx_ligand_morganbv | index | jpebe | ligand | 1645 MB |
(1 row)

But Greg is right (!): running the query a second time resulted in much
faster performance (11285.886ms as opposed to the original
193973.253ms).
Of course if you change the smiles string, then nothing is cached and
it takes ages again...

Interested,
JP
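On the join question: Greg's reply in this thread points out that the join only touches the rows that survive the substructure or similarity filter, joined through the primary key. The sketch below illustrates that shape with SQLite from the Python standard library. It is a stand-in only: SQLite has no RDKit cartridge, the `tag` column is a hypothetical placeholder for "this fingerprint matches the query", and the table names `mols` and `fps` are borrowed from Greg's emolecules listing.

```python
import sqlite3


def build_demo(conn, n_rows=1000):
    """Create a vertically split layout: molecules in one table, fingerprints in another."""
    conn.execute("CREATE TABLE mols (id INTEGER PRIMARY KEY, smiles TEXT)")
    conn.execute("CREATE TABLE fps (id INTEGER PRIMARY KEY REFERENCES mols(id), tag TEXT)")
    conn.executemany(
        "INSERT INTO mols VALUES (?, ?)",
        ((i, f"mol-{i}") for i in range(n_rows)),
    )
    # ASSUMPTION: every 100th row "matches"; a real search would use the cartridge operators.
    conn.executemany(
        "INSERT INTO fps VALUES (?, ?)",
        ((i, "hit" if i % 100 == 0 else "miss") for i in range(n_rows)),
    )
    conn.execute("CREATE INDEX fps_tag_idx ON fps(tag)")


def matching_molecules(conn, tag):
    """Filter on the fingerprint table first, then join the few survivors by primary key."""
    query = (
        "SELECT mols.id, mols.smiles "
        "FROM fps JOIN mols ON mols.id = fps.id "
        "WHERE fps.tag = ? ORDER BY mols.id"
    )
    return conn.execute(query, (tag,)).fetchall()


conn = sqlite3.connect(":memory:")
build_demo(conn)
rows = matching_molecules(conn, "hit")
print(len(rows), "rows joined out of 1000")
```

The point of the layout is that the join cost scales with the number of hits, not with the size of the molecule table, provided both sides are keyed on the same indexed id.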
Re: [Rdkit-discuss] Speeding up database queries...
On Wed, May 4, 2011 at 04:26, Greg Landrum wrote:
> Dear JP,
>
> On Tue, May 3, 2011 at 6:29 PM, JP wrote:
>>
>> Based on an rdkit post I read over the warm weekend I set myself to
>> have a look at my rdkit based queries (and ways to speed them up)...
>>
>> But first some details:
>>
>> Postgres:
>> PostgreSQL 9.0.3 on x86_64-unknown-linux-gnu, compiled by GCC gcc
>> (GCC) 4.5.2, 64-bit
>>
>> RDKit (DB Cartridge):
>> v. 0.20.0
>>
>
> The other important question is how much memory you have and what
> filesystem postgres is using for the database (local or network).
>
>> # of Molecules
>> 8,432,896
>>
>> Database table (ligand)
>>                                   Table "public.ligand"
>>    Column   |         Type          |                      Modifiers
>> ------------+-----------------------+-----------------------------------------------------
>>  id         | integer               | not null default nextval('ligand_id_seq'::regclass)
>>  supplierid | character varying(50) |
>>  smiles     | text                  |
>>  rdkitmol   | mol                   |
>>  pairbv     | bfp                   |
>>  torsionbv  | bfp                   |
>>  morganbv   | bfp                   |
>>  amw        | real                  |
>>  mollogp    | real                  |
>>  hba        | integer               |
>>  hbd        | integer               |
>>  atoms      | integer               |
>>  hvyatoms   | integer               |
>> Indexes:
>>     "ligand_pkey" PRIMARY KEY, btree (id)
>>     "idx_ligand_morganbv" gist (morganbv)
>>     "idx_ligand_pairbv" gist (pairbv)
>>     "idx_ligand_rdkitmol" gist (rdkitmol)
>>     "idx_ligand_torsionbv" gist (torsionbv)
>>
>>
>> I cannot explain why the following queries:
>>
>> db=# select count(*) from ligand where rdkitmol@>'c12c1nncc2' ;
>>  count
>> -------
>>   2942
>> (1 row)
>>
>> Time: 193973.253 ms
>> db=# select count(*) from ligand where
>> morganbv%morganbv_fp('c12c1nncc2',2);
>>  count
>> -------
>>      8
>> (1 row)
>>
>> Time: 400138.989 ms
>
> The performance is critically dependent on whether or not the indices
> are in memory. If you've just started the database or if there's been
> a lot of non-postgres activity on the machine since the last time you
> used it, it can take a long time to load the index from disk to
> memory. Once it has been loaded, things should go faster.
>
> My emolecules database requires about 1 GB for each of the indices:
>
> emolecules=# \di+ molidx;
>                          List of relations
>  Schema |  Name  | Type  |  Owner   | Table |  Size   | Description
> --------+--------+-------+----------+-------+---------+-------------
>  public | molidx | index | glandrum | mols  | 1049 MB |
> (1 row)
> emolecules=# \di+ mfp2idx;
>                          List of relations
>  Schema |  Name   | Type  |  Owner   | Table |  Size  | Description
> --------+---------+-------+----------+-------+--------+-------------
>  public | mfp2idx | index | glandrum | fps   | 970 MB |
> (1 row)
>
>
>>
>> Take so long... these are orders of magnitude larger than timings
>> reported in http://code.google.com/p/rdkit/wiki/DatabaseCreation2
>> And my database is "only" roughly 50% larger (8M instead of the puny
>> 5M in emolecules).
>>
>> When I do an "explain" on these queries (to make sure the indices are
>> being used), I get:
>>
>> db=# explain select count(*) from ligand where rdkitmol@>'c12c1nncc2' ;
>>                                         QUERY PLAN
>> ------------------------------------------------------------------------------------------
>>  Aggregate  (cost=34850.44..34850.45 rows=1 width=0)
>>    ->  Bitmap Heap Scan on ligand  (cost=2667.36..34829.36 rows=8433 width=0)
>>          Recheck Cond: (rdkitmol @> 'c1cc2c(nncc2)cc1'::mol)
>>          ->  Bitmap Index Scan on idx_ligand_rdkitmol  (cost=0.00..2665.25 rows=8433 width=0)
>>                Index Cond: (rdkitmol @> 'c1cc2c(nncc2)cc1'::mol)
>> (5 rows)
>>
>> db=# explain select count(*) from ligand where
>> morganbv%morganbv_fp('c12c1nncc2',2);
>>                                         QUERY PLAN
>> ------------------------------------------------------------------------------------------
>>  Aggregate  (cost=33290.88..33290.89 rows=1 width=0)
>>    ->  Bitmap Heap Scan on ligand  (cost=918.05..33269.79 rows=8433 width=0)
>>          Recheck Cond: (morganbv % '\\xe0ff000413007e00108444d20e3c40042af90238d0080a0c3462c2'::bfp)
>>          ->  Bitmap Index Scan on idx_ligand_morganbv  (cost=0.00..915.94 rows=8433 width=0)
>>                Index Cond: (morganbv % '\\xe0ff000413007e00108444d20e3c40042af90238d0080a0c3462c2'::bfp)
>> (5 rows)
>>
>> Looks good no?
>> Am I missing something? Or is this the fastest my search can go?
>> Supposedly the fingerprints search is just doing some ~8M binary
>> operations no? Why does this take so long?
>> Ideas, anyone?
>
> This all looks fine.
>
> The experiments t