On Tue, Jan 3, 2012 at 11:51 AM, Joachim Fritschi <jfrits...@freenet.de> wrote:
> I definitely feel your pain.
>
> I ran into the same issue on our MySQL database once. After some
> application malfunction we had massive amounts of tickets being generated
> by a service for every user, since they had implemented some clever
> redirection loop and some issue where a new ticket was created for every
> page impression.
> The bug was introduced right in the peak period at the start of the
> semester and caused something like 500 tickets/second to be generated. Once
> the cleaner started running after our 2h expiry time, the CAS server tanked
> with an OOM. After every restart the server was pretty much stuck, and the
> only quick solution to get up and running was to drop all ticket tables and
> be done with it.
> Increasing the memory and shortening the cleaner period were added as a
> temporary fix to get up and running, with crappy performance due to the
> DoS-like situation at that point. These fixes limited the number of tickets
> to be cleaned per run and at least achieved a stable but slow service. It
> worked as a temporary fix until the rogue app was fixed and served us well
> in a similar incident later.
> This was all before the CAS throttling feature was introduced...
>

We specifically coded for that issue in the CAS4 code base (i.e. there is a
configurable batch size you can set for the cleaning). That clearly doesn't
help you now ;-)

>
> Joachim
>
> On 03.01.2012 15:40, Marvin Addison wrote:
>
>> The subject is intentionally provocative and based at least in part
>> on the production headaches it caused me over a holiday weekend
>> around 5AM. I'd like to provide a brief overview of the problem and
>> resolution steps since it may help others to evaluate
>> JpaTicketRegistry in considering a ticket storage backend.
>>
>> Around 0500 I got a call from our NOC that CAS was unavailable, which
>> in this case meant the /login URI was throwing HTTP 500s. This of
>> course meant that CAS was entirely unusable.
>> I confirmed the issue, then started a shell session on both hosts.
>> Top and a quick log review both suggested that both nodes were OOM,
>> and the logs also suggested that the root cause was an attempt to
>> clean up a massive number of tickets. Recall that the effect of
>> RegistryCleaner running on a JpaTicketRegistry is to buffer _all_
>> tickets into memory in order to perform cleanup. I queried the
>> database and confirmed there was an unusually high number of tickets
>> in the registry, which indicated that I had to clean up tickets in
>> order to triage the problem. I temporarily disabled the cleaner
>> trigger that drives RegistryCleaner#clean() and redeployed CAS to get
>> it back online, then went about the work of cleaning up tickets.
>>
>> Due to the self-referential nature of TGTs (a PGT is simply a TGT that
>> points to a parent TGT), this is tedious to impossible to do with
>> manual queries. Thankfully in our case we have exclusively proxy
>> tickets of chain length one, and the following two queries (on
>> PostgreSQL) issued sequentially will suffice:
>>
>> delete from ticketgrantingticket where
>> to_timestamp(creation_time/1000) < $DATE and ticketgrantingticket_id
>> is not null;
>> delete from ticketgrantingticket where
>> to_timestamp(creation_time/1000) < $DATE;
>>
>> This cleans up all children before the parents and respects FK
>> constraints. This approach would not work with more complex proxy
>> chains. The only way to handle this situation generally with manual
>> queries would be to cascade deletes to child records, which is
>> fortunately possible on our platform (PostgreSQL) via the ON DELETE
>> CASCADE clause on the foreign keys. Unfortunately, Hibernate schema
>> creation does not specify this clause, so it would need to be
>> added manually. Tragically, making constraint changes on PostgreSQL
>> tables requires an exclusive table lock, which is simply not viable
>> for active production systems.
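
For proxy chains longer than one, the same children-before-parents ordering can be approximated without ON DELETE CASCADE by repeatedly deleting expired rows that no other row points to, in bounded batches. The sketch below is only an illustration under stated assumptions: it runs against an in-memory SQLite table rather than PostgreSQL, the "id" primary key name is made up (only creation_time and ticketgrantingticket_id appear in the queries above), and delete_expired_leaf_first is a hypothetical helper, not part of CAS.

```python
import sqlite3

# Hypothetical, simplified schema. Only creation_time (epoch milliseconds)
# and the ticketgrantingticket_id parent reference come from the thread;
# the "id" primary key name is an assumption.
SCHEMA = """
CREATE TABLE ticketgrantingticket (
    id TEXT PRIMARY KEY,
    creation_time INTEGER NOT NULL,
    ticketgrantingticket_id TEXT REFERENCES ticketgrantingticket(id)
)
"""

# Delete at most N expired rows that currently have no child rows.
DELETE_EXPIRED_LEAVES = """
DELETE FROM ticketgrantingticket
WHERE id IN (
    SELECT t.id
    FROM ticketgrantingticket AS t
    WHERE t.creation_time < ?
      AND NOT EXISTS (
          SELECT 1 FROM ticketgrantingticket AS c
          WHERE c.ticketgrantingticket_id = t.id
      )
    LIMIT ?
)
"""


def delete_expired_leaf_first(conn, cutoff_ms, batch_size=1000):
    """Delete expired tickets children-first; return the number removed.

    Each pass removes only expired rows without children, so a parent
    becomes eligible on a later pass once its expired children are gone.
    An expired parent with a live child is left in place.
    """
    total = 0
    while True:
        cursor = conn.execute(DELETE_EXPIRED_LEAVES, (cutoff_ms, batch_size))
        conn.commit()  # one short transaction per batch
        if cursor.rowcount == 0:
            return total
        total += cursor.rowcount
```

Usage would look like opening a connection, enabling foreign keys (in SQLite, "PRAGMA foreign_keys = ON"), and calling delete_expired_leaf_first(conn, cutoff_ms) with the expiry cutoff in milliseconds. Committing per batch keeps each transaction small; whether the same statement shape performs acceptably on a large PostgreSQL table is not something this sketch establishes.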
>>
>> It's worth discussing briefly the cause of the large number of expired
>> tickets at the root of this incident. PostgreSQL implements
>> BLOBs via a custom data type called a large object (lo) where columns
>> of the SQL LOB type are simply references to the lo objects (they
>> contain an int which is a handle to the lo). Since they are
>> references, you can get into two situations:
>> - Orphaned large objects (the vacuumlo tool and triggers alleviate
>> this situation)
>> - References to large objects that no longer exist
>>
>> For some unknown reason, large objects are getting removed while
>> records still exist that reference them. Any attempt to load a
>> non-existent lo causes a SQLException on the Java side. These
>> exceptions tank the entire RegistryCleaner#clean() cycle, and
>> apparently they were happening often and early enough that cleanup was
>> effectively not happening. Logjam ensued.
>>
>> I have spent significant development time on JpaTicketRegistry and
>> related components and on tuning our production CAS servers on two
>> different database platforms (Oracle and PostgreSQL). So I'm invested
>> in the approach, but I believe this recent incident is the last straw.
>> There are fundamental problems with JpaTicketRegistry and it will
>> take a fairly broad redesign of the TicketRegistry API to resolve them
>> adequately. I believe the use of the factory pattern that Scott has
>> explored in the feature-cas4api branch is at least on the right track,
>> but those are big changes that we simply can't wait for. Sure we
>> could fix some of the problems now and work around others, but I'm
>> coming to see that a database is not the best storage back end for our
>> needs. (If you're using the "Remember Me" feature, it starts to look
>> a lot more attractive. We don't, and the very durability of
>> database-backed tickets is a liability.)
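
The failure mode described above, where one unreadable record aborts the whole clean() cycle, is the kind of thing per-ticket error isolation is meant to contain. The following is a generic sketch of that pattern and not how RegistryCleaner is implemented: clean_expired and its load, is_expired, and delete callbacks are hypothetical names introduced for illustration. Tickets are loaded one at a time from an iterable of ids, so nothing requires buffering the whole registry in memory.

```python
from typing import Any, Callable, Iterable, List, Tuple


def clean_expired(
    ticket_ids: Iterable[str],
    load: Callable[[str], Any],
    is_expired: Callable[[Any], bool],
    delete: Callable[[str], None],
) -> Tuple[List[str], List[str]]:
    """Delete expired tickets one by one, isolating load failures.

    Returns (deleted_ids, unreadable_ids). A ticket that cannot be
    loaded (for example because a referenced large object is gone) is
    recorded and skipped instead of aborting the remaining cleanup.
    """
    deleted: List[str] = []
    unreadable: List[str] = []
    for ticket_id in ticket_ids:
        try:
            ticket = load(ticket_id)
        except Exception:
            # Assumption: any load error means the record is unusable.
            unreadable.append(ticket_id)
            continue
        if is_expired(ticket):
            delete(ticket_id)
            deleted.append(ticket_id)
    return deleted, unreadable
```

What to do with the unreadable ids (log them, delete the rows directly, or alert) is a policy decision this sketch leaves to the caller.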
>>
>> M
>>
>
> --
> You are currently subscribed to cas-dev@lists.jasig.org as:
> scott.battag...@gmail.com
> To unsubscribe, change settings or access archives, see
> http://www.ja-sig.org/wiki/display/JSG/cas-dev