Hi all,

As promised, here are some more statistics as we fill a single partition up to 2 million entries.


---------------
750,000 Entries
---------------

    [cached] [indexed] [time (seconds)]

(1)    no       yes        7.26
(2)    yes      yes        5.04
(3)    no       no         565 (9m25s)

#1-#2 = 2.22

search results = 1039 entries

NOTE:
-----
I started changing settings a little from here on, because the number of entries returned was far more than the caches we had allocated could hold. I changed the memory and cache settings from the defaults. Here's what I did:

 o upped memory from 384 MB to 1024 MB
 o increased the initials index cache from 100 to 5000
 o increased the entry cache from 100 to 5000
 o increased the system index caches, except the alias indices, from 100 to 1000

I turned things up a bit so I can load the database as fast as possible, since we're increasing the amount added per run to 250K entries now.
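
To put the cache numbers in perspective: an entry cache of 100 can never hold a 1039-entry result set, so even the 'cached' searches were mostly re-reading entries from disk. Below is a minimal sketch of the kind of bounded LRU cache involved. This is an illustration of the concept only, not ApacheDS's actual cache implementation:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Bounded LRU cache sketch: once 'capacity' entries are held, the
    // least recently used entry is evicted to make room for the next.
    public class EntryCache<K, V> extends LinkedHashMap<K, V>
    {
        private final int capacity;

        public EntryCache( int capacity )
        {
            super( 16, 0.75f, true );  // accessOrder = true gives LRU order
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry( Map.Entry<K, V> eldest )
        {
            return size() > capacity;
        }
    }

With a capacity of 100, iterating over a 1039-entry result evicts constantly; at 5000 the whole result set stays resident.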

Let's do the 750K again:

---------------
750,000 Entries
---------------

    [cached] [indexed] [time (seconds)]

(1)    no       yes        8.22
(2)    yes      yes        2.61
(3)    no       no         stopped

#1-#2 = 5.61

search results = 1039 entries


-----------------
1,000,000 Entries
-----------------

    [cached] [indexed] [time (seconds)]

(1)    no       yes        10.22
(2)    yes      yes         3.59

#1-#2 = 6.63
search results = 1373 entries



-----------------
1,250,000 Entries
-----------------

    [cached] [indexed] [time (seconds)]

(1)    no       yes        12.71
(2)    yes      yes         4.50

#1-#2 = 8.21
search results = 1738 entries


-----------------
1,500,000 Entries
-----------------

    [cached] [indexed] [time (seconds)]

(1)    no       yes        14.74
(2)    yes      yes        11.84


#1-#2 = 2.90
search results = 2123 entries

Now let's search this 1.5 million entry partition by limiting the number of returned entries to 1000.

no cache: 8.10
w/ cache: 3.04

#1-#2 = 5.06

Very similar to the values we got for 0.75 million entries, where 1039 entries were returned.
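
As an aside, a client can request this kind of limit via JNDI's SearchControls. Here's a minimal sketch of the relevant call, assuming an already-open DirContext (a fuller client sketch appears further down, in the quoted message):

    import javax.naming.NamingEnumeration;
    import javax.naming.NamingException;
    import javax.naming.directory.DirContext;
    import javax.naming.directory.SearchControls;
    import javax.naming.directory.SearchResult;

    public class LimitedSearch
    {
        // Ask the server for at most 1000 entries; 'ctx' is an open
        // DirContext (see the fuller client sketch in the quoted
        // message below for how one gets created).
        static NamingEnumeration<SearchResult> search( DirContext ctx )
            throws NamingException
        {
            SearchControls controls = new SearchControls();
            controls.setSearchScope( SearchControls.SUBTREE_SCOPE );
            controls.setCountLimit( 1000 );  // size limit requested by the client

            return ctx.search( "ou=users,dc=example,dc=com",
                               "(initials=aa)", controls );
        }
    }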

For the last value I'll load the partition up to 2 million entries and run the same tests again:

-----------------
2,000,000 Entries
-----------------

    [cached] [indexed] [time (seconds)]

(1)    no       yes        18.58
(2)    yes      yes        15.38


#1-#2 = 3.20
search results = 2851 entries

Now let's search this 2 million entry partition by limiting the number of returned entries to 1000.

no cache:   8.04
w/ cache:   2.55
#1-#2 = 5.49

Wow: performance for returning 1000 entries is not diminishing as we scale from 1 million to 2 million entries, for both cached and non-cached returns.

Looking at these results, I think we can easily accommodate 10 million entries per partition, perhaps with a capacity limit on the order of 100 million entries per partition. Note that you can have as many partitions as you like in a single server.

Alex


Alex Karasulu wrote:
Hi again,

I've completed my optimization of indices under high partition capacities. The new results are listed side by side with the old ones. If you're not interested in the exact optimization that took place, skip down to the results.


The problem:
------------

Up to now, duplicate keys (keys with many values) had their values stored in a TreeSet within the B+Trees of indices. An index usually maps some attribute value to an entry's ID in the master table; the ID is a sequentially assigned, unique value given to the entry.

So if 100 people have initials 'AA', there would be 100 values in the TreeSet for the key 'AA' in the initials index, and that whole TreeSet would be serialized and deserialized to and from disk. At these low numbers there's really very little impact. However, certain indices were seriously impacted, like the objectClass index or the hierarchy-based system index. Just imagine having 1 million inetOrgPerson entries: the objectClass index would then have 1 million values in the TreeSet for the key 'inetOrgPerson'. This seriously hurts performance from both a space and a time perspective, and essentially negates the benefit of using B+Trees in the first place with large numbers of entries.
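
In code terms the old scheme looked roughly like this. This is a simplified sketch of the idea, not the actual JDBM partition source:

    import java.util.TreeMap;
    import java.util.TreeSet;

    // Simplified sketch of the old scheme: each index key maps to a
    // TreeSet holding every matching entry ID, and the whole TreeSet
    // is serialized and deserialized as a single value under that key.
    public class NaiveIndex
    {
        private final TreeMap<String, TreeSet<Long>> index =
            new TreeMap<String, TreeSet<Long>>();

        public void add( String key, long entryId )
        {
            TreeSet<Long> ids = index.get( key );

            if ( ids == null )
            {
                ids = new TreeSet<Long>();
                index.put( key, ids );
            }

            // With 1 million inetOrgPerson entries the set under the key
            // 'inetOrgPerson' holds 1 million IDs and travels to and from
            // disk in its entirety.
            ids.add( entryId );
        }
    }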


Solution:
---------

What I did was add a configurable threshold parameter: once the number of duplicate values for a key exceeds it, instead of a TreeSet, another B+Tree is created and used within the same index database file to store that key's values. The default value, which these tests were conducted with, is 1000. So after 1000 duplicate values, the server switches from an in-memory TreeSet to an on-disk B+Tree for storing the values of that key.
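
Here's a simplified sketch of that switch-over logic. Again, this is an illustration rather than the actual JDBM partition code, and 'DiskBTree' is a stand-in for the on-disk B+Tree:

    import java.util.TreeMap;
    import java.util.TreeSet;

    public class DupsIndex
    {
        private static final int DUP_THRESHOLD = 1000;  // default used in these tests

        private final TreeMap<String, Object> index = new TreeMap<String, Object>();

        @SuppressWarnings("unchecked")
        public void add( String key, long entryId )
        {
            Object values = index.get( key );

            if ( values == null )
            {
                TreeSet<Long> set = new TreeSet<Long>();
                set.add( entryId );
                index.put( key, set );
            }
            else if ( values instanceof TreeSet )
            {
                TreeSet<Long> set = (TreeSet<Long>) values;
                set.add( entryId );

                if ( set.size() > DUP_THRESHOLD )
                {
                    // threshold crossed: migrate this key's values into
                    // a dedicated B+Tree instead of one big TreeSet
                    DiskBTree tree = new DiskBTree();
                    for ( Long id : set )
                    {
                        tree.insert( id );
                    }
                    index.put( key, tree );
                }
            }
            else
            {
                ( (DiskBTree) values ).insert( entryId );
            }
        }
    }

    // Stand-in for the on-disk B+Tree; persistence details omitted.
    class DiskBTree
    {
        private final TreeSet<Long> values = new TreeSet<Long>();

        void insert( long id )
        {
            values.add( id );
        }
    }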


Alex Karasulu wrote:
Hi all,

I've been testing the search performance boost gained from indexing
attributes before starting development on an optimization to improve
index performance and in memory size.  I thought I'd share these
dramatic results of my pre-optimization tests with you since they
clearly show the benefits of indices.

Before I do this let me list the characteristics of the hardware used
and my configuration settings:


-------------
Machine Setup
-------------

CPU: Dual Athlon MP 1900
OS: Linux 2.6.15
Mem: 2 GB @ 266 MHz


--------------
ApacheDS Setup
--------------

ApacheDS: stock RC4 (pre-release image) w/ modifications
  - Using 1024MB Memory
  - Indexed st and initials


----------
Data Setup
----------

Wrote a simple tool to generate random values for decent-sized entries.
The data looks something like this for a user entry:

    dn: uid=user.1,ou=users,dc=example,dc=com
    uid: user.1
    initials: yq
    description: cFocJATNuhlXisDCqGtY
    pager: FYyimqyZRW
    cn: HSGMzajYKmicUTe
    postalcode: WiXXA
    st: xy
    street: kpCCqmrsCzkpdtHXWMfY
    l: KqmAXFYTrI
    objectclass: person
    objectclass: organizationalPerson
    objectclass: inetOrgPerson
    sn: nuymgOwpm
    homephone: PERamkCtsv
    mobile: vkIviOGNTC
    telephonenumber: 7248889026
    mail: pYvEoOjSnEymcWD
    givenname: IVHJZB
    postaladdress: crObexKoUTIFdzNHcZMr
    employeenumber: 1
    userpassword:: cGFzc3dvcmQ=
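
The generator does something along these lines; this is a sketch, not the actual tool, and it only covers a few of the attributes shown above:

    import java.util.Random;

    public class RandomEntries
    {
        private static final Random RAND = new Random();

        static String randomString( int length )
        {
            StringBuilder sb = new StringBuilder( length );
            for ( int i = 0; i < length; i++ )
            {
                sb.append( (char) ( 'a' + RAND.nextInt( 26 ) ) );
            }
            return sb.toString();
        }

        public static void main( String[] args )
        {
            int count = Integer.parseInt( args[0] );

            // emit LDIF for 'count' random user entries
            for ( int i = 1; i <= count; i++ )
            {
                System.out.println( "dn: uid=user." + i + ",ou=users,dc=example,dc=com" );
                System.out.println( "uid: user." + i );
                System.out.println( "initials: " + randomString( 2 ) );
                System.out.println( "cn: " + randomString( 15 ) );
                System.out.println( "sn: " + randomString( 9 ) );
                System.out.println( "objectclass: person" );
                System.out.println( "objectclass: organizationalPerson" );
                System.out.println( "objectclass: inetOrgPerson" );
                System.out.println( "employeenumber: " + i );
                System.out.println();
            }
        }
    }

With two random lowercase letters there are 26 x 26 = 676 possible initials, so roughly 1 in 676 entries should match 'aa'. That squares with the result counts reported below, e.g. about 148 expected at 100,000 entries versus the 153/146 observed.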

I started loading a partition with these entries, 100,000 at a time, then
performed the following searches for all entries with initials 'aa':

(1) index on initials but no cached entries
(2) index on initials with cached entries
(3) no index without cached entries
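
For reference, these timed searches can be reproduced with a plain JNDI client along the lines below. This is a minimal sketch; the host, port, and anonymous bind are assumptions, not necessarily the setup actually used:

    import java.util.Hashtable;
    import javax.naming.Context;
    import javax.naming.NamingEnumeration;
    import javax.naming.NamingException;
    import javax.naming.directory.DirContext;
    import javax.naming.directory.InitialDirContext;
    import javax.naming.directory.SearchControls;
    import javax.naming.directory.SearchResult;

    public class SearchTimer
    {
        public static void main( String[] args ) throws NamingException
        {
            Hashtable<String, String> env = new Hashtable<String, String>();
            env.put( Context.INITIAL_CONTEXT_FACTORY,
                     "com.sun.jndi.ldap.LdapCtxFactory" );
            // assumed host and port; adjust to the actual server
            env.put( Context.PROVIDER_URL, "ldap://localhost:10389" );

            DirContext ctx = new InitialDirContext( env );

            SearchControls controls = new SearchControls();
            controls.setSearchScope( SearchControls.SUBTREE_SCOPE );

            long start = System.currentTimeMillis();
            NamingEnumeration<SearchResult> results = ctx.search(
                "ou=users,dc=example,dc=com", "(initials=aa)", controls );

            int count = 0;
            while ( results.hasMore() )
            {
                results.next();
                count++;
            }

            long elapsed = System.currentTimeMillis() - start;
            System.out.println( count + " entries in " + elapsed + " ms" );
            ctx.close();
        }
    }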

Here are the results at the various capacities:

---------------
100,000 Entries
---------------

    [cached] [indexed] [old time (s)] [new time (s)]

(1)    no       yes        3.30           2.37
(2)    yes      yes        0.72           0.76
(3)    no       no         30.63          42.08

search results = 153 entries (old), 146 (new)

Notice that if #2 is the total time for cached entries, we can conclude that #1 - #2 is a decent measure of the time spent pulling entries from disk. So let's see what that number looks like:

Old: 2.58
New: 1.61



---------------
200,000 Entries
---------------

    [cached] [indexed] [old time (s)] [new time (s)]

(1)    no       yes        6.04           3.29
(2)    yes      yes        1.44           1.49
(3)    no       no         82             83

search results = 302 entries (old), 297 (new)
#1 - #2 values:

Old: 4.60
New: 1.80



---------------
300,000 Entries
---------------

    [cached] [indexed] [old time (s)] [new time (s)]

(1)    no       yes        7.54           4.11
(2)    yes      yes        1.95           2.22
(3)    no       no         146            128

search results = 451 entries (old), 435 (new)

#1 - #2 values:

Old: 5.59
New: 1.89


---------------
400,000 Entries
---------------

    [cached] [indexed] [old time (s)] [new time (s)]

(1)    no       yes        9.24           4.87
(2)    yes      yes        3.80           2.69
(3)    no       no         196            168

search results = 586 entries (old), 577 (new)

#1 - #2 values:

Old: 5.44
New: 2.18



---------------
500,000 Entries
---------------

    [cached] [indexed] [old time (s)] [new time (s)]

(1)    no       yes        11.96          5.06
(2)    yes      yes        3.21           3.21
(3)    no       no         224            209

search results = 748 entries (old), 706 (new)

#1 - #2 values:

Old: 8.75
New: 1.85


Besides these numbers, I've noticed that adds are now faster and total garbage collection is lower. These are eyeballed figures: for example, add times used to grow to 28 minutes per 100K entries for the entries between 400K and 500K; this has dropped to 18 minutes.

Now it would be interesting to see what happens to these trends once the initials index converts from a TreeSet to a B+Tree for the values of the key 'aa'. So, just to see, we'll keep loading the partition until there are over 1000 entries with initials 'aa'.

I'll report on this tomorrow as the server loads these entries.

I guess the conclusion here is that the optimization had a significant effect and is worth the additional complexity it introduced into the JDBM partition implementation. Here are the access times (#1-#2) as a
function of the number of entries:


Retrieval Times (s) | 100K | 200K | 300K | 400K | 500K
------------------------------------------------------
Before:             | 2.58 | 4.60 | 5.59 | 5.44 | 8.75
After:              | 1.61 | 1.80 | 1.89 | 2.18 | 1.85


Before the optimization there seems to be a linear growth to access time whereas after the optimization the growth rate is almost unmeasurable above the noise.

-Alex


