Hope you don't mind me cc:ing this to the list (standard procedure here).

It looks like your timeouts are occurring on the Local Replica Catalog (LRC) 
-> updating -> Replica Location Index (RLI). You've bumped your timeout up to 
120 seconds, which is a good thing to try. Since the timeouts (at least the 
ones I can see in your log) are happening on a server-to-server update, and 
since your catalog contains 70,000+ entries, this may be a good time to switch 
over to the compressed "bloomfilter" updates. Currently, with your setup using 
uncompressed updates, the LRC sends all 70,000 logical names (the full 
strings) to the index (which, in your case, is the same server). I'm not sure 
this will resolve the issues entirely, but my guess is that things are getting 
backlogged: an update times out, and the thread then sits in limbo until the 
cleanup process kills it (which is also visible in the log).

Here's how to switch over to bloomfilters:

1) start your server (in isolation, if possible)
2) use the admin tool to tell it to stop sending itself updates: 
globus-rls-admin -d rls://hostname rls://hostname
3) stop the server
4) in your globus-rls-server.conf change the bloomfilter setting from 'false' 
to 'true'.
5) restart your server
6) use the admin tool to tell it to start sending itself the compressed 
updates: globus-rls-admin -A rls://hostname rls://hostname

Note: it's possible to do this without ever stopping/restarting an RLS server, 
but for simplicity these instructions take the straightforward route.
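For step 4, the relevant line in globus-rls-server.conf should look something like the fragment below. I'm writing the option name (rli_bloomfilter) from memory, so verify it against the comments in your own config file:

```
# In globus-rls-server.conf -- switch LRC->RLI updates to compressed
# bloomfilters (option name from memory; check your installed config).
rli_bloomfilter true
```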

Once you've done the above, the LRC will build bloomfilters and send them to 
the RLI. This should be much faster than sending the full LFN-list update.
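RLS's actual bloomfilter wire format isn't shown here, but a generic Python sketch illustrates why the compressed update stays small: a Bloom filter packs set membership into a fixed-size bit array, so the update size doesn't grow with the number of LFNs (the LFN names, array size, and hash scheme below are all made up for illustration):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes into an m-bit array."""
    def __init__(self, m_bits=1 << 20, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _probes(self, name):
        # Derive k independent bit positions from SHA-256 of the name.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{name}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, name):
        for p in self._probes(name):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, name):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._probes(name))

# 70,000 made-up LFNs, roughly the size of this catalog.
bf = BloomFilter()
for i in range(70000):
    bf.add(f"lfn://example/file{i}")

print(len(bf.bits))                  # 131072 bytes, no matter how many LFNs
print("lfn://example/file42" in bf)  # True: members are never missed
print("lfn://example/nope" in bf)    # almost certainly False (false
                                     # positives are possible but rare here)
```

This also explains the wildcard caveat: the filter answers only exact-membership probes, so the index can no longer enumerate or pattern-match the names it holds.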

One caveat of using bloomfilters is that (a) you cannot use wildcard queries 
against the index (wildcards still work against the catalog, though), and (b) 
you cannot partition your updates across RLI servers.

Of course, another approach would be to raise your timeout even further; then 
you could continue using the LFN-list updates. Note that a bloomfilter update 
*could* also take more than 120 seconds as the filter grows, but 120 ought to 
be sufficient for your current catalog size.

So, start with this and see if it eliminates those timeouts. With luck it will 
also resolve the other issue, but I'm not certain.

As another side note, I've tested a SQLite database with up to 5M entries. It 
worked smoothly up to 1M-2M entries and degraded gracefully after that. The 
main issue I see with SQLite is that it doesn't handle many concurrent users 
well, so if you have 20+ clients simultaneously hitting the RLS, the database 
will run into lock contention ("database is locked" errors).
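To make the failure mode concrete, here's a minimal Python sketch (file path and table name are made up, and the timeout is forced to zero so the error shows up immediately instead of after a retry window): one connection holding a write transaction locks out every other writer, which is exactly the "database is locked" symptom in your subject line.

```python
import os
import sqlite3
import tempfile

# Hypothetical database file standing in for the RLS catalog db.
path = os.path.join(tempfile.mkdtemp(), "rls-demo.db")

# timeout=0 makes SQLite raise immediately instead of retrying --
# effectively what a saturated server experiences under load.
writer = sqlite3.connect(path, timeout=0)
client = sqlite3.connect(path, timeout=0)

writer.execute("CREATE TABLE mapping (lfn TEXT, pfn TEXT)")
writer.commit()

# The writer opens a transaction and takes the write lock...
writer.execute("BEGIN IMMEDIATE")
writer.execute("INSERT INTO mapping VALUES ('lfn://a', 'pfn://a')")

# ...and a second connection is now locked out, even for its own write.
try:
    client.execute("BEGIN IMMEDIATE")
except sqlite3.OperationalError as exc:
    print(exc)  # database is locked

writer.commit()  # releasing the lock lets the second connection proceed
client.execute("BEGIN IMMEDIATE")
client.execute("INSERT INTO mapping VALUES ('lfn://b', 'pfn://b')")
client.commit()
```

With 20+ clients, the window in which some connection holds the write lock is nearly always occupied, so most requests hit this error or queue up behind it.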

rob


-----Original Message-----
From: [EMAIL PROTECTED] on behalf of Adam Bazinet
Sent: Fri 8/15/2008 10:00 AM
To: Robert Schuler
Subject: Re: [gt-user] RLS woes: database is locked
 
Dear Robert,

I hope all is well with you.  We enjoyed a period of prosperity with RLS
where there were no issues, but now I'm afraid that once again I can't keep
the server up and running for any extended period of time.  FYI, here is the
current state:

[EMAIL PROTECTED]:/export/work/globus-4.1.0>
../globus-4.0.6/bin/globus-rls-admin -S rlsn://asparagine
Version:    4.6
Uptime:     00:02:20
LRC stats
  update method: lfnlist
  update method: bloomfilter
  updates lfnlist:     rlsn://asparagine.umiacs.umd.edu:39281 last 12/31/69 19:00:00
  lfnlist update interval: 86400
  bloomfilter update interval: 900
  numlfn: 71210
  numpfn: 142159
  nummap: 142159
RLI stats
  updated by: rlsn://asparagine.umiacs.umd.edu:39281 last 08/15/08 12:26:05
  updated via lfnlists
  numlfn: 71139
  numlrc: 1
  numsender: 1
  nummap: 71139

It has lots of entries that I can't afford to lose right now, so I can't
very well scrap the sqlite database files and start over.  So when I say I
can't get it to stay up, I mean one of two things happens:

1) by far the more common thing, I just get timeouts with any sort of RLS
query using globus-rls-cli
2) the RLS server just crashes.

Now, what I'll do is kill it off, bring it back up in isolation for a good
5-10 minutes (it seems happier that way), and then turn our Grid back on
which immediately generates lots of (sometimes simultaneous) queries.  It
may hold up for a while, but before long usually the timeouts start to
occur.  Scanning through this old thread I decided to try the -dL3 option
you suggested, and I'm attaching three log files that I generated during 3
separate attempts at keeping the server up:

1) first attempt did not use the LD_ASSUME_KERNEL=2.4.1, server crashed
2) second attempt did not use the LD_ASSUME_KERNEL=2.4.1, timeout occurred
3) third attempt DID use LD_ASSUME..., timeouts occurred

Usually looking at the end of the log is sufficient to see some of the
problems, but I'm including all of it in case you want to look at our
settings or what not.  I may not have had GLOBUS_ERROR_VERBOSE on, I just
turned it on now.  Will that cause more information to be printed to this
log, or to /var/log/messages?

I really don't know what to do.  The next step, if I have to take it, would
be to attempt to get it working with Postgres and somehow dump/transfer the
existing data.  My hunch is that most of these problems are SQLite
specific.  I don't have a good explanation as to why things broke after all
this time, except we may have had more simultaneous queries lately.  Thanks
for any ideas or help you can provide.

Adam
