Re: Performance issues lately.

2010-11-15 Thread Doug Leavitt

 Jorgen,
I suggest that you use dtrace to get a better understanding of what is 
going on.


You can start with some pre-existing documented scripts from the dtrace 
toolkit

here:

http://hub.opensolaris.org/bin/view/Community+Group+dtrace/dtracetoolkit

The dtrace guide is here:

http://wikis.sun.com/display/DTrace/Documentation

There are many examples in the DTrace toolkit that should help sort out
what other processes or system resources are affecting the LDAP server's
performance in your specific situation.
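As one concrete starting point, a minimal D script in the spirit of the toolkit's profiling scripts might look like the sketch below. It samples the on-CPU user stacks of a target process (run as root, e.g. `dtrace -s stacks.d -p <pid of db_stat or slapd>`); the 997 Hz probe rate and the 30-second timeout are arbitrary illustrative choices.

```d
/*
 * Sketch: sample on-CPU user stacks of the target process.
 * Usage (as root):  dtrace -s stacks.d -p <pid>
 * Rate and timeout are illustrative, not tuned values.
 */
profile-997
/pid == $target/
{
        @[ustack()] = count();
}

tick-30s
{
        exit(0);
}
```

On exit, the aggregation prints the most frequently sampled stacks last, which usually points straight at where the time is going.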

Doug.


On 11/14/10 07:42 PM, Jorgen Lundman wrote:



Howard Chu wrote:


If it slows down after you wait a while, that means some other process
on the machine is using the system RAM and forcing the BDB data out of
the system cache. Find out what other program is hogging the memory,
it's obvious that BDB is not doing anything wrong on its own.


If I db_stat another large file, like dn2id.bdb, the subsequent 
id2entry.bdb will be slower. So maybe it is fighting itself.


However, since I am executing separate db_stat processes each time, the 
set_cachesize would have no chance to help improve things. I will have 
to try different values with slapd running.


Could be that I should investigate various Solaris-specific process 
limits as well. It is all 64-bit now, but per-process limits may still 
be interfering.






Performance issues lately.

2010-11-14 Thread Jorgen Lundman


Hello list,

Solaris 10u8
db-4.8.30.NC
openldap-2.4.23

So in the last week, simple queries have been really slow. For example:

# time /usr/local/bin/ldapsearch  -h 0 -D cn=admin,dc=company,dc=com -x -w pass 
-b ou=mail,dc=company,dc=com mail=testloo...@domain.com

1m21s

Of course mail is indexed. The server is dedicated to LDAP (a test server) so 
there are no other connections, not even a syncrepl connection.


I have checked the usual things, memory usage, BDB cache, and disk IO. All seem 
to be within reasonable limits. Then I went deeper.


I started trussing it, and it seems to send the query reply nearly right away, 
then sit around 200 seconds before sending a few more bytes and exiting.


I tried out db_stat on various DB files, and found it is also really slow to 
stat id2entry.bdb. For example:


# time /usr/local/BerkeleyDB.4.8/bin/db_stat -d id2entry.bdb
2m32s

So, trussing that call (including all calls in libdb-4.8):

19700/1...@1:   0.0001 -> libdb-4.8:__db_stat_print_pp(0x419920, 0x0, 0x412c30,
                          0x4146a0, 0xfd7fffdff880, 0xfd7fffdff824)
19700/1:        0.0009 time()                                   = 1289725450
19700/1:        0.0001 brk(0x0041EC20)                          = 0
19700/1:        0.0001 brk(0x00422C20)                          = 0
19700/1:        0.0001 open("/usr/share/lib/zoneinfo/Japan", O_RDONLY) = 4
19700/1:        0.0000 fstat(4, 0xFD7FFFDFF880)                 = 0
19700/1:        0.0001 read(4, 0xFD7FFEEA0880, 125)             = 125
                          (TZif timezone data elided)
19700/1:        0.0000 close(4)                                 = 0
19700/1:        0.0002 ioctl(1, TCGETA, 0xFD7FFFDFE8B0)         = 0
19700/1:        0.0000 fstat(1, 0xFD7FFFDFE840)                 = 0
19700/1:        0.0001 write(1, 0xFD7FFF006764, 36)             = 36
                          "Sun Nov 14 18:04:10 2010\tLocal time\n"
19700/1:      589.8794 write(1, 0xFD7FFF006764, 25)             = 25
                          "53162\tBtree magic number\n"
19700/1:        0.0046 write(1, 0xFD7FFF006764, 23)             = 23


Not entirely sure why a write to stdout (or rather, everything leading up to 
it) takes 589 seconds.

I have tried rebuilding the whole database with slapadd, but it made no 
difference. I put all DB files in /tmp on an x4540 (which has 32 GB RAM) and 
yet it still takes 2 minutes to run db_stat.



General db_stat -m output:


4GB Total cache size
8   Number of caches
8   Maximum number of caches
512MB   Pool individual cache size
0   Maximum memory-mapped file size
0   Maximum open file descriptors
0   Maximum sequential buffer writes
0   Sleep after writing maximum sequential buffers
0   Requested pages mapped into the process' address space
407M    Requested pages found in the cache (99%)
38  Requested pages not found in the cache
425405  Pages created in the cache
38  Pages read into the cache
425427  Pages written from the cache to the backing file
0   Clean pages forced from the cache
0   Dirty pages forced from the cache
0   Dirty pages written by trickle-sync thread
425427  Current total page count
425427  Current clean page count
0   Current dirty page count
524296  Number of hash buckets used for page location
4096    Assumed page size used
407M    Total number of times hash chains searched for a page (407904893)
32      The longest hash chain searched for a page
466M    Total number of hash chain entries checked for page (466861223)
0   The number of hash bucket locks that required waiting (0%)
0   The maximum number of times any hash bucket lock was waited for (0%)
0   The number of region locks that required waiting (0%)
0   The number of buffers frozen
0   The number of buffers thawed
0   The number of frozen buffers freed
425538  The number of page allocations
0   The number of hash buckets examined during allocations
0   The maximum number of hash buckets examined for an allocation
0   The number of pages examined during allocations
0   The max number of pages examined for an allocation
0   Threads waited on page I/O
0   The number of times a sync is interrupted


Pool File: id2entry.bdb
16384   Page size
0   Requested pages mapped into the process' address space
13M Requested pages found in the cache (99%)
2   Requested pages not found in the cache
163449  Pages created in the cache
2   Pages read into the cache
163451  Pages written from the cache to the backing file

Pool File: mail.bdb
4096    Page size
0   Requested pages mapped into the process' 

Re: Performance issues lately.

2010-11-14 Thread Jorgen Lundman


No real reason; I tried various different settings but to no real advantage.

Now I have:

Filesystem            size   used  avail  capacity  Mounted on
swap                   19G   7.7G    11G       42%  /tmp

# grep cache DB_CONFIG
set_cachesize 8 0 1

# time /usr/local/BerkeleyDB.4.8/bin/db_stat -d id2entry.bdb
real    6m6.099s

# time cp id2entry.bdb /dev/null
real    0m0.040s
(It's not on disk)

I thought to delete id2entry.bdb and use slapindex to re-generate it, but that 
appears not to be a supported feature: slapindex cannot run without a valid 
id2entry.bdb. This is why I tried slapcat, rm *, slapadd, but there was no 
difference in speed.


If I truss with -u *:* (all inter-library calls) I get no single large system 
call, just a lot of work somewhere (that does not call read/write etc.). Alas, 
the number of lines in the truss file is:

 6209933 /var/tmp/db_stat_truss


Lund

Quanah Gibson-Mount wrote:

--On Sunday, November 14, 2010 7:13 PM +0900 Jorgen Lundman
lund...@lundman.net wrote:


dbconfig set_cachesize 4 0 8


Why are you breaking your cache into segments? This has always had a
negative performance impact in all tests I've done, and stopped being
necessary to do with BDB 4.3 and later.
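A single-segment configuration along those lines would look like this in DB_CONFIG; the three set_cachesize arguments are gigabytes, bytes, and the number of cache segments, and the 4 GB size shown is just an example, not a recommendation for this workload:

```
# 4 GB BDB environment cache in one segment (gbytes bytes ncache)
set_cachesize 4 0 1
```

The environment cache region must be recreated (e.g. after a db_recover) for a DB_CONFIG change like this to take effect.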

--Quanah



--

Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc

Zimbra :: the leader in open source messaging and collaboration



--
Jorgen Lundman   | lund...@lundman.net
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo| +81 (0)90-5578-8500  (cell)
Japan| +81 (0)3 -3375-1767  (home)


Re: Performance issues lately.

2010-11-14 Thread Jorgen Lundman

truss is pretty much useless in this context. Most of BDB's activity is
thru memory-mapping, which involves no system calls for truss to trace.
You need an actual profile (e.g. using oprofile) to identify where the
time is going.



This is very true. But reach for the tools you have, even if it is a hammer. I 
guess Purify would be the Solaris equivalent, unless we find the problem also 
occurs on a Linux box.


truss has a simple profiler, but only for System Calls which do not help in this 
case:


Library:  Function        calls
libaio:   close           16
libc:     membar_exit     1633814
libc:     thr_self        653455
libc:     _lock_clear     326545
libc:     _lock_try       326545
libc:     memcpy          163311
libc:     memcmp          242
libc:     strcasecmp      195
libc:     free            77
sys totals:  .921 303 22
usr time:  21.669
elapsed:  829.890

 Now repeat the db_stat call and see how long it takes the 2nd time.

It does indeed speed up, if I do not wait too long between tests.

real    1m27.712s
real    0m29.696s
real    0m4.332s
real    0m4.452s

4 seconds is much nicer. So what you are saying is that BDB uses mmap, and 
operations inside this memory will trigger reads inside the kernel which do not 
show as libc syscalls. Rats. So it may be IO? I need to throw even more memory 
at it, and live with the increasing startup times?


How does the set_cachesize relate to the mmap usage?




--
Jorgen Lundman   | lund...@lundman.net
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo| +81 (0)90-5578-8500  (cell)
Japan| +81 (0)3 -3375-1767  (home)


Re: Performance issues lately.

2010-11-14 Thread Howard Chu

Jorgen Lundman wrote:

Now repeat the db_stat call and see how long it takes the 2nd time.

It does indeed speed up, if I do not wait too long between tests.

real    1m27.712s
real    0m29.696s
real    0m4.332s
real    0m4.452s


If it slows down after you wait a while, that means some other process on the 
machine is using the system RAM and forcing the BDB data out of the system 
cache. Find out what other program is hogging the memory, it's obvious that 
BDB is not doing anything wrong on its own.


4 seconds is much nicer. So what you are saying is that BDB uses mmap, and
operations inside this memory will trigger reads inside the kernel which do not
show as libc syscalls. Rats. So it may be IO? I need to throw even more memory
at it, and live with the increasing startup times?

How does the set_cachesize relate to the mmap usage?



--
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/


Re: Performance issues lately.

2010-11-14 Thread Jorgen Lundman



Howard Chu wrote:


If it slows down after you wait a while, that means some other process
on the machine is using the system RAM and forcing the BDB data out of
the system cache. Find out what other program is hogging the memory,
it's obvious that BDB is not doing anything wrong on its own.


If I db_stat another large file, like dn2id.bdb, the subsequent id2entry.bdb 
will be slower. So maybe it is fighting itself.


However, since I am executing separate db_stat processes each time, the 
set_cachesize would have no chance to help improve things. I will have to try 
different values with slapd running.


Could be that I should investigate various Solaris-specific process limits as 
well. It is all 64-bit now, but per-process limits may still be interfering.




--
Jorgen Lundman   | lund...@lundman.net
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo| +81 (0)90-5578-8500  (cell)
Japan| +81 (0)3 -3375-1767  (home)