On 08/10/2013 10:28 AM, Christian Boos wrote:
> Hello Eli!

:)  Missed working on Trac; finally had something I *had* to deal with
that gave me enough time to tackle this.


> On 8/10/2013 4:54 AM, Eli Carter wrote:
>> All,
>>
>> Everyone knows the "database is locked" issue with sqlite.  According to
>> http://trac.edgewall.org/wiki/MostFrequentDuplicates it should be gone
>> with 0.12dev.  But, it isn't; I have a way to get that error for 15
>> minutes at a time on 0.12.5.
> 
> Indeed. As with any request taking an unreasonable amount of time, the
> whole-db-lock nature of SQLite makes it prone to this error when one
> tries to write to it, like a lot of requests do, if only for saving
> cookies or trying to persist the session.
> 
>>
>> The root of the problem that I'm seeing is that viewing an individual
>> file in the source browser will take 15 minutes if that file has only
>> one commit in the repository history.  This is because of the SQL query
>> run in CachedRepository._next_prev_rev() which is called in two forms
>> during the request.  That SQL query is searching the node_change table,
>> and for a node with a single commit, will return no results.  In my
>> particular case, that table has 3.5 million entries (from one svn repo).
>>  Running those two queries will take a total of ~950 seconds to complete
>> on my machine.  During that time, if a commit is made to the repository,
>> every request will attempt to sync the repository, and fail because it
>> cannot get a write lock.  This includes requests that would otherwise be
>> read-only operations.
> 
> Well, the sync-per-request problem is simple to avoid:
> http://trac.edgewall.org/wiki/TracRepositoryAdmin#ExplicitSync
> ... which doesn't mean we can't improve the case of the implicit sync:

Well, simple when things work well... not so simple when the explicit
sync fails because someone has a 15-minute lock on the database... the
post-commit hook needs to return quickly and not fail, but it also needs
to retry the explicit sync sometime later if the sync fails.  It looks
like contrib/trac-svn-hook is not that sophisticated.
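For the hook side, a wrapper that retries on lock contention with
backoff might look roughly like this (a hypothetical sketch only --
`retry_on_lock` and its parameters are made up, and a real post-commit
hook should hand the sync off to a background process so the commit
itself returns quickly):

```python
import sqlite3
import time

def retry_on_lock(fn, attempts=5, delay=1.0, backoff=2.0):
    """Call fn(), retrying with exponential backoff when SQLite
    reports the database is locked.  Returns fn()'s result, or
    re-raises the last error once the attempts are exhausted."""
    for i in range(attempts):
        try:
            return fn()
        except sqlite3.OperationalError as e:
            # Only retry lock contention; anything else is a real error.
            if 'locked' not in str(e) or i == attempts - 1:
                raise
            time.sleep(delay)
            delay *= backoff
```

With a 15-minute lock even this would eventually give up, which is why
the hook really needs to queue the failed sync for a later retry rather
than block inside the commit.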


>> I'd like to improve this situation in a few ways:
>>
>> 1. Add a try/except around repo.sync() and turn that exception into a
>> add_warning().  While this does nothing to address the real underlying
>> issue, this makes the problem much less painful and allows the user to
>> at least look at their data.  (db-locked-mitigation-part1.patch)
> 
> Maybe we should add a hint about configuring explicit sync, something like:
> "If this is a recurring problem, you can set up an _explicit
> synchronization_" (with that linking to the same URL as given above).

I didn't know about [trac] repository_sync_per_request; glad to see
that's already configurable.  I'll rework this patch with that.  Right
now it's doing an "except Exception", which is a bit broad; should I
narrow that to the SQLite-specific exception for a locked db?  I can
make it fall back to Exception if the sqlite import fails, so those
who use other DBs don't have to have sqlite installed.
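A rough sketch of the narrowing I have in mind (the `safe_sync` helper
and its wiring to add_warning are illustrative, not the actual patch):

```python
try:
    # Narrow catch when the SQLite bindings are available.
    from sqlite3 import OperationalError as DatabaseLockedError
except ImportError:
    # Other backends: fall back to the broad catch so sqlite
    # need not be installed.
    DatabaseLockedError = Exception

def safe_sync(repos, add_warning):
    """Attempt a repository sync, turning a locked-database error
    into a warning instead of a failed request."""
    try:
        repos.sync()
    except DatabaseLockedError as e:
        add_warning('Repository sync failed (%s); displaying '
                    'possibly stale data.' % e)
```

The trade-off is that on SQLite any OperationalError (not just a lock)
would be downgraded to a warning; matching on the message text would
tighten it further.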

>> 2. Add an explicit duration check around sql execute() calls when [trac]
>> debug_sql is enabled, and log SQL operations that take over 10 seconds
>> to complete at the warn level.  This will let an admin enable debug_sql
>> and set the logging level to INFO and see that the database is the
>> bottleneck, and what query is the cause.  Identifying the exact query
>> will allow developers to figure out the root of the issue and identify
>> the specific operations that limit scalability.
>> (db-locked-mitigation-part2.patch)
> 
> Good idea.

I'll commit this change first then.
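Roughly, the duration check amounts to something like the following
(a simplified sketch; the `timed_execute` name and threshold plumbing
are illustrative, not the patch itself):

```python
import logging
import time

SLOW_QUERY_THRESHOLD = 10.0  # seconds

def timed_execute(cursor, sql, args=(),
                  log=logging.getLogger(__name__),
                  threshold=SLOW_QUERY_THRESHOLD):
    """Run cursor.execute() and log a warning when the statement
    exceeds the threshold, so the offending query shows up even
    with the log level at INFO."""
    start = time.time()
    try:
        return cursor.execute(sql, args)
    finally:
        elapsed = time.time() - start
        if elapsed > threshold:
            log.warning('Slow SQL (%.1f s): %s', elapsed, sql)
```

Logging in a finally block means a query that times out and then raises
still gets reported, which is exactly the lock-contention case.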

>> 3. Address half of the root problem by making the
>> CachedRepository.previous_rev() call self.repo.previous_rev()
>> unconditionally and not use the database cache.  The database solution
>> will take time based on the size of the node_change table, but the
>> repository already has links to parent revs and can find the information
>> much more directly. In the case I was testing, the previous_rev() call
>> was taking over 200 seconds, and with the patch now takes under a
>> second.  (part of db-locked-mitigation-part3.patch)
> 
> Ok, that seems to help. The "real" solution would be to find an
> appropriate way to index that table and optimize that query. I tried a
> few things so far, but without success: simply adding an index on
> repo/path won't help, since SQLite won't use it with that query
> (EXPLAIN QUERY PLAN is helpful here).

Yeah, I blindly attempted adding that index and compared performance: no
change, as you already knew.  IIRC, the indexing took ~7 minutes and
added nearly 400MB to the .db file.
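For anyone else poking at this, EXPLAIN QUERY PLAN is easy to drive
from Python against a scratch copy of the schema.  A toy reproduction
(the table and index here are cut-down stand-ins for node_change, and
this simple equality lookup *does* use the index -- the point is just
the inspection technique, since the real _next_prev_rev query is the
one the planner refuses to index):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
# Simplified stand-in for the node_change table.
cur.execute('CREATE TABLE node_change '
            '(repos INTEGER, rev INTEGER, path TEXT, change_type TEXT)')
cur.execute('CREATE INDEX idx_nc_repos_path ON node_change (repos, path)')

def query_plan(sql):
    """Return SQLite's plan rows for a statement; the last column of
    each row is the human-readable detail string."""
    return list(cur.execute('EXPLAIN QUERY PLAN ' + sql))

plan = query_plan("SELECT rev FROM node_change "
                  "WHERE repos = 1 AND path = 'trunk/foo.py'")
print(plan)
```

Running the actual query from _next_prev_rev through query_plan() would
show whether the planner picks a full scan, which is the quickest way to
test candidate indexes without a 7-minute rebuild each time.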

>> 4. That leaves the second half of the root problem; namely, that
>> CachedRepository.next_rev() takes a very long time under these
>> conditions.  (12+ minutes for my particular testcase.)  Worse,
>> self.repo.next_rev() takes an even longer time.  (Did not complete over
>> lunch.)
> 
> That's why in #8813 I said "better not use directly the svnfs API and
> use the DB instead". But we should really make a difference here between
> next_rev and prev_rev, at least until we find a way to make the DB query
> faster.

I'll split the part3 patch in two; this first, and the rest second.
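The prev_rev half then boils down to delegating to the underlying
repository instead of consulting the cache.  A much-simplified sketch of
the idea (the class shape here is illustrative, not Trac's actual
CachedRepository):

```python
class CachedRepository(object):
    """Simplified sketch of the cache wrapper around a real repository."""

    def __init__(self, repos):
        # The underlying (e.g. svnfs-backed) repository.
        self.repos = repos

    def previous_rev(self, rev, path=''):
        # Walk backwards via the repository's own parent links instead
        # of scanning the node_change table: proportional to the walk,
        # not to a 3.5M-row table scan.
        return self.repos.previous_rev(rev, path)
```

next_rev keeps going through the database for now, since the svnfs walk
is even slower in that direction.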

> Note that we already have a ticket for improving this specific "query of
> hell", it's #11104 (reported in the context of changesets and MySQL, but
> it's the same thing).
> 
>>  While I do not have a solution for this, making the problem
>> visible is a step in the right direction.  Therefore, I added a duration
>> check around the next_rev() call which logs a warning if that takes more
>> than 10 seconds.  The warning includes a suggestion to enable [trac]
>> debug_sql to lead an admin to the database level check from part 2.
>> (the other part of db-locked-mitigation-part3.patch)  It appears that on
>> my system this is I/O bound in the sqlite database, leading me to doubt
>> that Postgresql would solve the problem in my case, though I have not
>> attempted that.  This needs further investigation.
> 
> One "solution" proposed in #8639 was to make an indirect link (e.g.
> ?rev_following=...) with the link showing a hint about the risk that the
> request could take a long time to complete.

The purpose of the page is to show the content of that node; the link to
adjacent versions is secondary to that.  I think making the link
indirect until there is a better solution would be a good thing.

>> I think patches 1 and 2 should be reasonably uncontroversial.  Patch #3
>> may warrant some discussion and benchmarking.  The patches are all
>> against the 0.12-stable branch, but they will also apply (with offsets)
>> to trunk.
> 
> I'll rebase and try them out. Thanks a lot for throwing in some work and
> new ideas on this topic. Together with the mergeinfo issue that Jun has
> started to tackle recently (#11219), those are some of the worst
> performance issues we have.
> 
>>
>> If feedback on the patches is favorable, I'll commit them to 0.12-stable
>> and trunk.  (Assuming I still can... it's been a while.)
> 
> No worries, you don't get removed unless you ask to be removed ;-)
> OTOH, if you forgot your password in the meantime, just PM me.

I believe I still have it.  Thanks.  :)

Eli
