Re: [cas-dev] JpaTicketRegistry -- A Sinking Ship

Scott Battaglia Tue, 03 Jan 2012 08:57:34 -0800

On Tue, Jan 3, 2012 at 11:51 AM, Joachim Fritschi <jfrits...@freenet.de> wrote:


> I definitely feel your pain.
>
> I ran into the same issue on our MySQL database once. After an
> application malfunction, a service generated massive numbers of tickets
> for every user: it had implemented a clever redirection loop and had a
> bug where a new ticket was created for every page impression.
> The bug was introduced right in the peak period at the start of the
> semester and caused something like 500 tickets/second to be generated. Once
> the cleaner started running after our 2h expiry time, the CAS server tanked
> with an OOM. After every restart the server was pretty much stuck, and the
> only quick solution to get up and running was to drop all ticket tables and
> be done with it.
> Increasing the memory and shortening the cleaner period were added as a
> temporary fix to get up and running, with crappy performance due to the
> DoS-like situation at that point. These fixes limited the number of tickets
> to be cleaned per run and at least achieved a stable but slow service. It
> worked as a temporary fix until the rogue app was fixed and served us well
> in a similar incident later.
> This was all before the CAS throttling feature was introduced...
>

We specifically coded for that issue in the CAS4 code base (i.e. there is a
configurable batch size you can set for the cleaning).  That clearly
doesn't help you now ;-)
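
To make the idea concrete, here is a minimal sketch of batch-limited
cleaning. It is not the CAS4 code: Ticket, cleanBatch, cleanAll and the
in-memory list are made-up names standing in for the registry, and a real
JPA-backed registry would presumably bound the query itself rather than
iterate over a list.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: none of these names come from the CAS code base.
class BatchedCleanerSketch {

    // Minimal stand-in for a stored ticket.
    static final class Ticket {
        final String id;
        final long expiresAtMillis;

        Ticket(String id, long expiresAtMillis) {
            this.id = id;
            this.expiresAtMillis = expiresAtMillis;
        }

        boolean isExpired(long nowMillis) {
            return expiresAtMillis <= nowMillis;
        }
    }

    // Removes at most batchSize expired tickets and returns how many were removed.
    static int cleanBatch(List<Ticket> store, long nowMillis, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int removed = 0;
        Iterator<Ticket> it = store.iterator();
        while (removed < batchSize && it.hasNext()) {
            if (it.next().isExpired(nowMillis)) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    // Runs bounded batches until a batch comes back short, i.e. nothing is left.
    static int cleanAll(List<Ticket> store, long nowMillis, int batchSize) {
        int total = 0;
        int removed;
        do {
            removed = cleanBatch(store, nowMillis, batchSize);
            total += removed;
        } while (removed == batchSize);
        return total;
    }

    public static void main(String[] args) {
        List<Ticket> store = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            // The first seven tickets are already expired at t=500.
            store.add(new Ticket("ST-" + i, i < 7 ? 100L : 10_000L));
        }
        int removed = cleanAll(store, 500L, 3);
        System.out.println("removed=" + removed + " remaining=" + store.size());
    }
}
```

The point of the bound is that each pass touches a fixed number of tickets
at most, so memory use no longer scales with the size of the backlog.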



>
> Joachim
>
>
>
> On 03.01.2012 15:40, Marvin Addison wrote:
>
>> The subject is intentionally provocative and based at least in part
>> on the production headaches it caused me over a holiday weekend
>> around 5AM.  I'd like to provide a brief overview of the problem and
>> resolution steps since it may help others to evaluate
>> JpaTicketRegistry in considering a ticket storage backend.
>>
>> Around 0500 I got a call from our NOC that CAS was unavailable, which
>> in this case meant the /login URI was throwing HTTP 500s.  This of
>> course meant that CAS was entirely unusable.  I confirmed the issue
>> then started a shell session on both hosts.  Top and a quick log review
>> suggested that both nodes were OOM, and the logs also suggested that the
>> root cause was an attempt to clean up a massive amount of tickets.
>> Recall that the effect of RegistryCleaner running on a
>> JpaTicketRegistry is to buffer _all_ tickets into memory in order to
>> perform cleanup.  I queried the database and confirmed there was an
>> unusually high number of tickets in the registry, which indicated that
>> I had to clean up tickets in order to triage the problem.  I
>> temporarily disabled the cleaner trigger that drives
>> RegistryCleaner#clean() and redeployed CAS to get it back online, then
>> went about the work of cleaning up tickets.
>>
>> Due to the self-referential nature of TGTs (a PGT is simply a TGT that
>> points to a parent TGT), this is tedious to impossible to do with
>> manual queries.  Thankfully in our case we have exclusively proxy
>> tickets of chain length one, and the following two queries (on
>> PostgreSQL) issued sequentially will suffice:
>>
>> delete from ticketgrantingticket
>>   where to_timestamp(creation_time/1000) < $DATE
>>   and ticketgrantingticket_id is not null;
>> delete from ticketgrantingticket
>>   where to_timestamp(creation_time/1000) < $DATE;
>>
>> This cleans up all children before the parents and respects FK
>> constraints.  This approach would not work with more complex proxy
>> chains.  The only way to handle this situation generally with manual
>> queries would be to cascade deletes to child records, which is
>> fortunately possible on our platform (PostgreSQL) via the ON DELETE
>> CASCADE clause on the foreign keys.  Unfortunately, Hibernate schema
>> creation does not specify this clause, so it would need to be
>> added manually.  Tragically, making constraint changes on PostgreSQL
>> tables requires an exclusive table lock, which is simply not viable
>> for active production systems.
>>
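
For deeper proxy chains, one alternative to ON DELETE CASCADE is to work
out a child-first deletion order on the application side. This is a minimal
sketch, not anything in CAS: it assumes the expired rows have already been
read as an id-to-parent-id map and that the chains contain no cycles, and
the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: orders tickets so that children are deleted before parents.
class ChildFirstDeleteOrderSketch {

    // parentOf maps a ticket id to its parent ticket id (null for a root TGT).
    // Assumes the parent links are acyclic.
    static List<String> deletionOrder(Map<String, String> parentOf) {
        Map<String, Integer> depth = new HashMap<>();
        for (String id : parentOf.keySet()) {
            depth.put(id, depthOf(id, parentOf));
        }
        List<String> order = new ArrayList<>(parentOf.keySet());
        order.sort((a, b) -> {
            // Deepest tickets first; ties broken by id to keep the order stable.
            int byDepth = Integer.compare(depth.get(b), depth.get(a));
            return byDepth != 0 ? byDepth : a.compareTo(b);
        });
        return order;
    }

    // Number of parent links between a ticket and the top of its chain.
    static int depthOf(String id, Map<String, String> parentOf) {
        int d = 0;
        String parent = parentOf.get(id);
        while (parent != null) {
            d++;
            parent = parentOf.get(parent);
        }
        return d;
    }

    public static void main(String[] args) {
        Map<String, String> parentOf = new HashMap<>();
        parentOf.put("TGT-1", null);
        parentOf.put("PGT-2", "TGT-1");
        parentOf.put("PGT-3", "PGT-2");
        System.out.println(deletionOrder(parentOf));
    }
}
```

Because a child is always exactly one level deeper than its parent, deleting
in this order should never violate the self-referencing foreign key.
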
>> It's worth discussing briefly the cause of the large number of expired
>> tickets at the root of this incident.  PostgreSQL implements
>> BLOBs via a custom data type called a large object (lo) where columns
>> of the SQL LOB type are simply references to the lo objects (they
>> contain an int which is a handle to the lo).  Since they are
>> references, you can get into two situations:
>>  - Orphaned large objects (the vacuumlo tool and triggers alleviate
>> this situation)
>>  - References to large objects that no longer exist
>>
>> For some unknown reason, large objects are getting removed while
>> records still exist that reference them.  Any attempt to load a
>> non-existent lo causes a SQLException on the Java side.  These
>> exceptions tank the entire RegistryCleaner#clean() cycle, and
>> apparently they were happening often and early enough that cleanup was
>> effectively not happening.  Logjam ensued.
>>
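
One way to keep a single unreadable record from aborting the whole cycle is
to handle failures per ticket. This is a rough sketch under assumptions, not
the RegistryCleaner code: ExpiryCheck and Result are invented here, and the
lambda in main only simulates a missing large object.

```java
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these names are illustrative, not CAS or JPA APIs.
class FaultTolerantCleanSketch {

    // Stand-in for "load the ticket and check whether it has expired".
    interface ExpiryCheck {
        boolean isExpired(String ticketId) throws Exception;
    }

    static final class Result {
        // Ids found to be expired, i.e. the ones a real cleaner would delete.
        final List<String> expired = new ArrayList<>();
        // Ids whose check threw, kept aside for later inspection.
        final List<String> failed = new ArrayList<>();
    }

    static Result clean(List<String> ticketIds, ExpiryCheck check) {
        Result result = new Result();
        for (String id : ticketIds) {
            try {
                if (check.isExpired(id)) {
                    result.expired.add(id);
                }
            } catch (Exception e) {
                // Record the failure and move on instead of ending the cycle.
                result.failed.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> ids = List.of("TGT-1", "TGT-2", "TGT-3");
        Result r = clean(ids, id -> {
            if (id.equals("TGT-2")) {
                throw new SQLException("simulated: large object does not exist");
            }
            return true;
        });
        System.out.println("expired=" + r.expired + " failed=" + r.failed);
    }
}
```

The failed list gives you something to investigate afterwards without
letting the backlog of expired tickets grow in the meantime.
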
>> I have spent significant development time on JpaTicketRegistry and
>> related components and on tuning our production CAS servers on two
>> different database platforms (Oracle and PostgreSQL).  So I'm invested
>> in the approach, but I believe this recent incident is the last straw.
>>  There are fundamental problems with JpaTicketRegistry and it will
>> take a fairly broad redesign of the TicketRegistry API to resolve them
>> adequately.  I believe the use of the factory pattern that Scott has
>> explored in the feature-cas4api branch is at least on the right track,
>> but those are big changes that we simply can't wait for.  Sure we
>> could fix some of the problems now and work around others, but I'm
>> coming to see that a database is not the best storage back end for our
>> needs.  (If you're using the "Remember Me" feature, it starts to look
>> a lot more attractive.  We don't, and the very durability of
>> database-backed tickets is a liability.)
>>
>> M
>>
>>
>
> --
> You are currently subscribed to cas-dev@lists.jasig.org as:
> scott.battag...@gmail.com
> To unsubscribe, change settings or access archives, see
> http://www.ja-sig.org/wiki/display/JSG/cas-dev
>

