Re: [Rdkit-discuss] General questions on RDKit cartridge performance.

2011-05-04 Thread Greg Landrum
Dear Sune,

On Tue, May 3, 2011 at 8:10 AM,   wrote:
>
> After spending a few hours I now have RDKit running on my laptop on Ubuntu
> 10.10. All but two of the ctests pass and I am sure I can get the last ones
> to pass soon too.

Please feel free to post if you run into problems with this.

> In general the installation ran smoothly and the
> installation guide was easy to follow. Next steps are getting structures
> loaded into the database. I am already procrastinating other tasks to play
> with RDKit ;o).

Ah, I see I've infected someone else. :-)
Hopefully the time spent playing will be well spent and you'll find
that the RDKit makes some of the other tasks easier!

Best Regards,
-greg

--
WhatsUp Gold - Download Free Network Management Software
The most intuitive, comprehensive, and cost-effective network 
management toolset available today.  Delivers lowest initial 
acquisition cost and overall TCO of any competing solution.
http://p.sf.net/sfu/whatsupgold-sd
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Speeding up database queries...

2011-05-04 Thread Greg Landrum
Hi JP,

On Wed, May 4, 2011 at 10:49 AM, JP  wrote:
> Hi there Adrian,
>
> Why should splitting the table vertically make a difference?
> Am I not correct in thinking that would then require a join, which is
> expensive (especially on 8M rows) ?

You would actually only be doing the join on the rows that match your
SSS/similarity query. As long as you have some kind of primary key and
build indices on that primary key the performance should be fine. This
is the layout I almost always use: molecules in one table, bit vect
fingerprints in a second, count-based fingerprints in a third.
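
A minimal sketch of that layout, for illustration only. It uses Python's
built-in sqlite3 as a stand-in for PostgreSQL, and the table names (mols,
fps), the integer "bits" column, and the bitwise-AND filter are hypothetical
placeholders for the cartridge's mol/bfp columns and its @> and % operators.
The point is only the shape of the query: filter on the narrow fingerprint
table first, then join to the molecule table on the primary key for the
matching rows.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with molecules and fingerprints in separate tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE mols (id INTEGER PRIMARY KEY, smiles TEXT NOT NULL);
        CREATE TABLE fps  (id INTEGER PRIMARY KEY REFERENCES mols(id),
                           bits INTEGER NOT NULL);
        """
    )
    # Toy data: the integer "bits" values are made-up stand-ins for fingerprints.
    rows = [
        (1, "c1ccccc1", 0b0011),
        (2, "c1ccncc1", 0b0110),
        (3, "CCO", 0b1000),
    ]
    conn.executemany("INSERT INTO mols VALUES (?, ?)", [(i, s) for i, s, _ in rows])
    conn.executemany("INSERT INTO fps VALUES (?, ?)", [(i, b) for i, _, b in rows])
    return conn


def search(conn, query_bits):
    """Filter on the fps table, then join only the matching ids to mols."""
    sql = """
        SELECT m.id, m.smiles
        FROM fps AS f
        JOIN mols AS m ON m.id = f.id
        WHERE (f.bits & ?) = ?
        ORDER BY m.id
    """
    return conn.execute(sql, (query_bits, query_bits)).fetchall()
```

Calling search(build_demo_db(), 0b0010) returns the two rows whose toy
fingerprint contains that bit; the join touches only those ids.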

> My FS is local, and my indices are quite large
>
> db=#  \di+ idx_ligand_rdkitmol;
>                               List of relations
>  Schema |        Name         | Type  | Owner | Table  |  Size   | Description
> --------+---------------------+-------+-------+--------+---------+-------------
>  public | idx_ligand_rdkitmol | index | jpebe | ligand | 5030 MB |
> (1 row)
>
> db=#  \di+ idx_ligand_morganbv;
>                               List of relations
>  Schema |        Name         | Type  | Owner | Table  |  Size   | Description
> --------+---------------------+-------+-------+--------+---------+-------------
>  public | idx_ligand_morganbv | index | jpebe | ligand | 1645 MB |
> (1 row)

yeah, those are huge.

> But Greg is right (!) running the query a second time resulted in much
> faster performance (11285.886ms as opposed to the original
> 193973.253ms)
> Of course if you change the smiles string, then nothing is cached and
> it takes ages again...

I actually wasn't proposing re-running the query to get the cached
results, that's cheating. :-) The idea is that once you have the index
in memory all queries will go faster. This is what the emolecules
example on the wiki shows.

It sounds like you've reached a database size where doing some real
performance tuning on the database machine is going to be required. I
don't have much experience with this, but there does seem to be a fair
amount of information out there on the web. This page, in particular,
looks helpful in explaining what the configuration parameters are and
providing suggestions for tuning them:
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
shared_buffers and effective_cache_size look particularly relevant.
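
A small sketch of how one might derive starting values for those two
parameters and sanity-check index sizes against available memory. The
fractions used (about 1/4 of RAM for shared_buffers, about 1/2 for
effective_cache_size) are commonly cited rule-of-thumb starting points, not
tuned results, and the function names are made up for this example.

```python
def suggest_memory_settings(total_ram_mb):
    """Return rule-of-thumb starting values (assumed fractions, not tuned results)."""
    if total_ram_mb <= 0:
        raise ValueError("total_ram_mb must be positive")
    return {
        "shared_buffers": f"{total_ram_mb // 4}MB",        # assumption: ~25% of RAM
        "effective_cache_size": f"{total_ram_mb // 2}MB",  # assumption: ~50% of RAM
    }


def indices_fit_in_ram(index_sizes_mb, total_ram_mb):
    """Rough check: could all listed indices be held in memory at once?"""
    return sum(index_sizes_mb) <= total_ram_mb


def as_conf_lines(settings):
    """Format the settings as postgresql.conf-style lines."""
    return [f"{name} = {value}" for name, value in settings.items()]
```

For example, with the two index sizes reported above (5030 MB and 1645 MB),
indices_fit_in_ram([5030, 1645], 4096) is False for a hypothetical 4 GB
machine, which would be consistent with the indices being re-read from disk.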

This kind of performance tuning information would be really useful to
collect, so if you don't end up getting totally frustrated, please do
share your findings.

Best,
-greg



[Rdkit-discuss] Antwort: Re: rdBase.so

2011-05-04 Thread Paul . Czodrowski
Hi Greg,

here we go:

ldd rdBase.so

"
linux-vdso.so.1 =>  (0x7fffed3ff000)
libRDGeneral.so.1 => /usr/local/rdkit//lib/libRDGeneral.so.1 (0x7f8aefb1a000)
libRDBoost.so.1 => /usr/local/rdkit//lib/libRDBoost.so.1 (0x7f8aef7aa000)
libboost_python-mt.so.3 => /usr/lib64/libboost_python-mt.so.3 (0x7f8aef548000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x7f8aef23e000)
libm.so.6 => /lib64/libm.so.6 (0x7f8aeefb8000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7f8aeeda1000)
libc.so.6 => /lib64/libc.so.6 (0x7f8aeea2f000)
libutil.so.1 => /lib64/libutil.so.1 (0x7f8aee82b000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x7f8aee60f000)
libdl.so.2 => /lib64/libdl.so.2 (0x7f8aee40b000)
librt.so.1 => /lib64/librt.so.1 (0x7f8aee201000)
/lib64/ld-linux-x86-64.so.2 (0x0031f200)
"

64-bit python? 32-bit RDBoost?


Cheers,
Paul



> Hi Paul,
>
> On Wed, May 4, 2011 at 3:44 PM,   wrote:
> >
> > Dear folks,
> >
> > what could be the reason causing the following error:
> >
> > "
> > [GCC 4.3.2 20080917 (Red Hat 4.3.2-4)] on linux2
> > Type "help", "copyright", "credits" or "license" for more information.
> > >>> from rdkit import Chem
> > Traceback (most recent call last):
> >  File "<stdin>", line 1, in <module>
> >  File "/usr/local/rdkit/rdkit/Chem/__init__.py", line 18, in <module>
> >    from rdkit import rdBase
> > ImportError: /usr/local/rdkit/rdkit/rdBase.so: undefined symbol:
> > _ZN5boost6python9converter8registry6insertEPFPvP7_objectEPFvS5_PNS1_30rvalue_from_python_stage1_dataEENS0_9type_infoEPFPK11_typeobjectvE
> > "
>
> Could it be that it's finding the wrong version of the boost python
> library? running ldd on rdBase.so will answer this question for you.
>
> -greg


This message and any attachment are confidential and may be privileged or
otherwise protected from disclosure. If you are not the intended recipient,
you must not copy this message or attachment or disclose the contents to
any other person. If you have received this transmission in error, please
notify the sender immediately and delete the message and any attachment
from your system. Merck KGaA, Darmstadt, Germany and any of its
subsidiaries do not accept liability for any omissions or errors in this
message which may arise as a result of E-Mail-transmission or for damages
resulting from any unauthorized changes of the content of this message and
any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
subsidiaries do not guarantee that this message is free of viruses and does
not accept liability for any damages caused by any virus transmitted
therewith.

Click http://disclaimer.merck.de to access the German, French, Spanish and
Portuguese versions of this disclaimer.




Re: [Rdkit-discuss] rdBase.so

2011-05-04 Thread Greg Landrum
Hi Paul,

On Wed, May 4, 2011 at 3:44 PM,   wrote:
>
> Dear folks,
>
> what could be the reason causing the following error:
>
> "
> [GCC 4.3.2 20080917 (Red Hat 4.3.2-4)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from rdkit import Chem
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>  File "/usr/local/rdkit/rdkit/Chem/__init__.py", line 18, in <module>
>    from rdkit import rdBase
> ImportError: /usr/local/rdkit/rdkit/rdBase.so: undefined symbol:
> _ZN5boost6python9converter8registry6insertEPFPvP7_objectEPFvS5_PNS1_30rvalue_from_python_stage1_dataEENS0_9type_infoEPFPK11_typeobjectvE
> "

Could it be that it's finding the wrong version of the boost python
library? running ldd on rdBase.so will answer this question for you.
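
A minimal sketch of automating that check, assuming a Linux system with ldd
on the PATH. The helper names (run_ldd, parse_ldd, boost_python_libs) are
made up for this example; the parser only handles the usual
"name => path (address)" lines.

```python
import subprocess


def run_ldd(path):
    """Run ldd on a shared object and return its text output (Linux only)."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True, check=True)
    return result.stdout


def parse_ldd(output):
    """Map each library name to its resolved path, or None if it has no path."""
    libs = {}
    for line in output.splitlines():
        if "=>" not in line:
            continue  # e.g. the dynamic loader line, which has no "=>"
        name, _, rest = line.partition("=>")
        tokens = rest.split()
        # "not found" and address-only entries do not start with "/".
        path = tokens[0] if tokens and tokens[0].startswith("/") else None
        libs[name.strip()] = path
    return libs


def boost_python_libs(libs):
    """Return only the Boost.Python entries from a parse_ldd() result."""
    return {n: p for n, p in libs.items() if n.startswith("libboost_python")}
```

Something like boost_python_libs(parse_ldd(run_ldd("/usr/local/rdkit/rdkit/rdBase.so")))
then shows which Boost.Python library the module resolves to at load time,
which can be compared with the one the RDKit was built against.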

-greg



[Rdkit-discuss] rdBase.so

2011-05-04 Thread Paul . Czodrowski

Dear folks,

what could be the reason causing the following error:

"
[GCC 4.3.2 20080917 (Red Hat 4.3.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from rdkit import Chem
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/rdkit/rdkit/Chem/__init__.py", line 18, in <module>
    from rdkit import rdBase
ImportError: /usr/local/rdkit/rdkit/rdBase.so: undefined symbol:
_ZN5boost6python9converter8registry6insertEPFPvP7_objectEPFvS5_PNS1_30rvalue_from_python_stage1_dataEENS0_9type_infoEPFPK11_typeobjectvE
"


Thanks,
Paul



Re: [Rdkit-discuss] Speeding up database queries...

2011-05-04 Thread JP
Hi there Adrian,

Why should splitting the table vertically make a difference?
Am I not correct in thinking that would then require a join, which is
expensive (especially on 8M rows) ?

My FS is local, and my indices are quite large

db=#  \di+ idx_ligand_rdkitmol;
                              List of relations
 Schema |        Name         | Type  | Owner | Table  |  Size   | Description
--------+---------------------+-------+-------+--------+---------+-------------
 public | idx_ligand_rdkitmol | index | jpebe | ligand | 5030 MB |
(1 row)

db=#  \di+ idx_ligand_morganbv;
                              List of relations
 Schema |        Name         | Type  | Owner | Table  |  Size   | Description
--------+---------------------+-------+-------+--------+---------+-------------
 public | idx_ligand_morganbv | index | jpebe | ligand | 1645 MB |
(1 row)

But Greg is right (!) running the query a second time resulted in much
faster performance (11285.886ms as opposed to the original
193973.253ms)
Of course if you change the smiles string, then nothing is cached and
it takes ages again...

Interested,
JP



Re: [Rdkit-discuss] Speeding up database queries...

2011-05-04 Thread Adrian Schreyer
On Wed, May 4, 2011 at 04:26, Greg Landrum  wrote:
> Dear JP,
>
> On Tue, May 3, 2011 at 6:29 PM, JP  wrote:
>>
>> Based on an rdkit post I read over the warm weekend I set myself to
>> have a look at my rdkit based queries (and ways to speed them up)...
>>
>> But first some details:
>>
>> Postgres:
>> PostgreSQL 9.0.3 on x86_64-unknown-linux-gnu, compiled by GCC gcc
>> (GCC) 4.5.2, 64-bit
>>
>> RDKit (DB Cartridge):
>> v. 0.20.0
>>
>
> The other important question is how much memory you have and what
> filesystem postgres is using for the database (local or network).
>
>> # of Molecules
>> 8,432,896
>>
>> Database table (ligand)
>>                                  Table "public.ligand"
>>    Column   |         Type          |                      Modifiers
>> ------------+-----------------------+-----------------------------------------------------
>>  id         | integer               | not null default nextval('ligand_id_seq'::regclass)
>>  supplierid | character varying(50) |
>>  smiles     | text                  |
>>  rdkitmol   | mol                   |
>>  pairbv     | bfp                   |
>>  torsionbv  | bfp                   |
>>  morganbv   | bfp                   |
>>  amw        | real                  |
>>  mollogp    | real                  |
>>  hba        | integer               |
>>  hbd        | integer               |
>>  atoms      | integer               |
>>  hvyatoms   | integer               |
>> Indexes:
>>    "ligand_pkey" PRIMARY KEY, btree (id)
>>    "idx_ligand_morganbv" gist (morganbv)
>>    "idx_ligand_pairbv" gist (pairbv)
>>    "idx_ligand_rdkitmol" gist (rdkitmol)
>>    "idx_ligand_torsionbv" gist (torsionbv)
>>
>>
>> I cannot explain why the following queries:
>>
>> db=# select count(*) from ligand where rdkitmol@>'c12c1nncc2' ;
>>  count
>> -------
>>  2942
>> (1 row)
>>
>> Time: 193973.253 ms
>> db=# select count(*) from ligand where 
>> morganbv%morganbv_fp('c12c1nncc2',2);
>>  count
>> -------
>>     8
>> (1 row)
>>
>> Time: 400138.989 ms
>
> The performance is critically dependent on whether or not the indices
> are in memory. If you've just started the database or if there's been
> a lot of non-postgres activity on the machine since the last time you
> used it, it can take a long time to load the index from disk to
> memory. Once it has been loaded, things should go faster.
>
> My emolecules database requires about 1 GB for each of the indices:
>
> emolecules=# \di+ molidx;
>                         List of relations
>  Schema |  Name  | Type  |  Owner   | Table |  Size   | Description
> --------+--------+-------+----------+-------+---------+-------------
>  public | molidx | index | glandrum | mols  | 1049 MB |
> (1 row)
> emolecules=# \di+ mfp2idx;
>                         List of relations
>  Schema |  Name   | Type  |  Owner   | Table |  Size  | Description
> --------+---------+-------+----------+-------+--------+-------------
>  public | mfp2idx | index | glandrum | fps   | 970 MB |
> (1 row)
>
>
>>
>> Take so long... these are orders of magnitude larger than timings
>> reported in http://code.google.com/p/rdkit/wiki/DatabaseCreation2
>> And my database is "only" roughly 50% larger (8M instead of the puny
>> 5M in emolecules).
>>
>> When I do an "explain" on these queries (to make sure the indices are
>> being used), I get:
>>
>> db=# explain select count(*) from ligand where rdkitmol@>'c12c1nncc2' ;
>>                                          QUERY PLAN
>> --------------------------------------------------------------------------------
>>  Aggregate  (cost=34850.44..34850.45 rows=1 width=0)
>>   ->  Bitmap Heap Scan on ligand  (cost=2667.36..34829.36 rows=8433 width=0)
>>         Recheck Cond: (rdkitmol @> 'c1cc2c(nncc2)cc1'::mol)
>>         ->  Bitmap Index Scan on idx_ligand_rdkitmol
>> (cost=0.00..2665.25 rows=8433 width=0)
>>               Index Cond: (rdkitmol @> 'c1cc2c(nncc2)cc1'::mol)
>> (5 rows)
>>
>> db=# explain select count(*) from ligand where
>> morganbv%morganbv_fp('c12c1nncc2',2);
>>                                                     QUERY PLAN
>> --------------------------------------------------------------------------------
>>  Aggregate  (cost=33290.88..33290.89 rows=1 width=0)
>>   ->  Bitmap Heap Scan on ligand  (cost=918.05..33269.79 rows=8433 width=0)
>>         Recheck Cond: (morganbv %
>> '\\xe0ff000413007e00108444d20e3c40042af90238d0080a0c3462c2'::bfp)
>>         ->  Bitmap Index Scan on idx_ligand_morganbv
>> (cost=0.00..915.94 rows=8433 width=0)
>>               Index Cond: (morganbv %
>> '\\xe0ff000413007e00108444d20e3c40042af90238d0080a0c3462c2'::bfp)
>> (5 rows)
>>
>> Looks good no?
>> Am I missing something?  Or is this the fastest my search can go?
>> Supposedly the fingerprints search is just doing some ~8M binary
>> operations no?  Why does this take so long?
>> Ideas, anyone?
>
> This all looks fine.
>
> The experiments t