On Wed, Apr 4, 2012 at 3:22 PM, Emmanuel Lécharny <elecha...@gmail.com> wrote:
> Hi guys,
>
> since I started working on index removals last week, I have been seeing
> strange behaviors that I attributed to some wrong modification I had made.
> Today, as I was removing the last call to the OneLevelIndex to replace it
> with the rdnIndex, the core-integ tests started blocking.
>
> I did a kill -3 to see where it was blocking, and here is what I got :
>
> "main" prio=5 tid=7fd9db800800 nid=0x10d310000 waiting on condition [10d30d000]
>   java.lang.Thread.State: TIMED_WAITING (sleeping)
>        at java.lang.Thread.sleep(Native Method)
>        at jdbm.helper.LRUCache.put(LRUCache.java:330)
>        at jdbm.recman.SnapshotRecordManager.update(SnapshotRecordManager.java:401)
>        at jdbm.btree.BPage.remove(BPage.java:605)
>        at jdbm.btree.BPage.remove(BPage.java:611)
>        at jdbm.btree.BTree.remove(BTree.java:464)
>        at org.apache.directory.server.core.partition.impl.btree.jdbm.JdbmTable.remove(JdbmTable.java:741)
>        - locked <7c226be90> (a org.apache.directory.server.core.partition.impl.btree.jdbm.JdbmTable)
>        at org.apache.directory.server.core.partition.impl.btree.jdbm.JdbmRdnIndex.drop(JdbmRdnIndex.java:157)
>        at org.apache.directory.server.core.partition.impl.btree.jdbm.JdbmRdnIndex.drop(JdbmRdnIndex.java:49)
>        at org.apache.directory.server.core.partition.impl.btree.AbstractBTreePartition.delete(AbstractBTreePartition.java:891)
> ...
>
> The associated code in LRUCache is :
>
>    public void put( K key, V value, long newVersion, Serializer serializer,
>        boolean neverReplace ) throws IOException, CacheEvictionException
>    {
>    ...
>        while ( true )
>        {
>        ...
>                else
>                {
>                    entry = this.findNewEntry( key, latchIndex );
>                    ...
>                }
>            }
>            catch ( CacheEvictionException e )
>            {
>                e.printStackTrace(); // Added for debug purposes
>                sleepForFreeEntry = totalSleepTime < this.MAX_WRITE_SLEEP_TIME;
>
>                ...
>            }
>            ...
>
>            if ( sleepForFreeEntry )
>            {
>                try
>                {
>                    Thread.sleep( sleepInterval );
>                ...
>                totalSleepTime += sleepInterval;
>            }
>            else
>            {
>                break;
>            }
>        }
>
> Basically, we try to add a new element to the cache, it's full, we then try
> to evict one entry, that fails with a CacheEvictionException, and we go to
> sleep for up to 600 seconds...
>
> It's systematic, and I guess that the fact that we now pound the RdnIndex
> table way more often than before (just because we no longer call the
> OneLevelIndex) causes the cache to fill up and not be released fast enough.
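[Editor's note: the failure mode described above can be reduced to a small, self-contained sketch. This is not the real JDBM code; the class name, the `pinnedEntries` parameter, and the constants are all illustrative (the intervals are shrunk from the real 600-second cap so the sketch runs quickly). It only shows the pattern visible in the stack trace: a writer that cannot evict anything sleeps in a loop until a total-sleep cap is reached.]

```java
// Minimal sketch of the retry-with-sleep pattern seen in LRUCache.put().
// Illustrative only; not the real JDBM LRUCache API.
public class SleepyCache
{
    static final long SLEEP_INTERVAL = 10;       // ms per retry (shrunk for illustration)
    static final long MAX_WRITE_SLEEP_TIME = 50; // total sleep cap (600 s in the real cache)

    private final int capacity;
    private int size;

    SleepyCache( int capacity ) { this.capacity = capacity; }

    // pinnedEntries: entries a reader/snapshot still holds, so they cannot be evicted.
    // Returns the total time the writer spent sleeping before the put succeeded.
    long put( int pinnedEntries )
    {
        long totalSleepTime = 0;

        while ( true )
        {
            if ( size < capacity )
            {
                size++;                  // free slot available, no sleep needed
                return totalSleepTime;
            }

            if ( pinnedEntries < capacity )
            {
                return totalSleepTime;   // evict an unpinned victim and reuse its slot
            }

            // CacheEvictionException case: nothing is evictable, so sleep and retry
            if ( totalSleepTime >= MAX_WRITE_SLEEP_TIME )
            {
                throw new IllegalStateException( "writer starved: every cached entry is pinned" );
            }

            try
            {
                Thread.sleep( SLEEP_INTERVAL );
            }
            catch ( InterruptedException e )
            {
                Thread.currentThread().interrupt();
                throw new IllegalStateException( e );
            }

            totalSleepTime += SLEEP_INTERVAL;
        }
    }
}
```

With the real constants, a writer facing a fully pinned cache sits in this loop for up to 600 seconds before giving up, which is exactly the TIMED_WAITING state the thread dump shows.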

Do we hold a cursor open while this code gets stuck? I would think we
would have to hold a cursor open and modify quite a few jdbm btree pages
for this kind of behavior to happen.

>
> As we don't set any size for the cache, its default size is 1024. For some
> of the tests, this might not be enough, as we load a lot of entries
> (typically the schema elements), plus many others that get added and removed
> while running tests in revert mode.
>
> If I increase the default size to 65536, the tests are passing.
>
> Ok, now, I have to admit I haven't - yet - looked closely at the LRUCache
> code, and my analysis is just based on a quick look at the code, the stack
> traces I have added and a few blind guesses.
> However, I think we have a serious issue here. As far as I can tell, the
> code itself is probably not responsible for this behaviour, but the way we
> use it is.
>
> Did I miss something ? Is there anything we can do - except increasing the
> cache size - to get the tests passing ?
>
> I'm more concerned about what could occur in real life, when some users
> load the server up to the point where it just stops responding...
To avoid this issue, we can let the writers allocate more cache pages
(rather than keeping the cache size fixed) so that they do not loop
waiting for a replaceable cache entry. However, I would again suggest
making sure we do not leave a cursor open. If we leave a cursor open
and keep allocating new cache pages for writes, we will have other
problems.
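[Editor's note: the growth idea suggested above could look roughly like the sketch below. It is illustrative only (the class and method names are not the real LRUCache API), and it deliberately ignores the harder half of the problem Selcuk points out: if a forgotten cursor pins entries forever, the cache simply grows without bound instead of sleeping.]

```java
// Sketch of letting a starved writer grow the cache instead of sleeping.
// Illustrative only; not the real JDBM LRUCache API.
public class GrowableCache
{
    private int capacity;
    private int size;

    GrowableCache( int capacity ) { this.capacity = capacity; }

    int capacity() { return capacity; }

    // pinnedEntries: entries that cannot be evicted yet (held by readers/snapshots).
    void put( int pinnedEntries )
    {
        if ( size < capacity )
        {
            size++;              // free slot available
        }
        else if ( pinnedEntries < capacity )
        {
            // evict an unpinned victim and reuse its slot: size is unchanged
        }
        else
        {
            capacity++;          // nothing evictable: grow instead of sleeping
            size++;
        }
    }
}
```

The trade-off is visible in the last branch: the writer never blocks, but a leaked cursor turns the 600-second sleep into unbounded memory growth, which is why closing cursors promptly matters either way.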


>
> Anyone ?
>
> Thanks !
>
> --
> Regards,
> Cordialement,
> Emmanuel Lécharny
> www.iktek.com
>

thanks
Selcuk
