Hi all,

Starting Wednesday night, we observed several unusual errors indicative
of data corruption shortly before a CPU spike and complete crash on
our master db server (opera.oursite.com). opera.oursite.com had
crashed twice with signal 11 in recent weeks, but we had never
observed any data corruption issues before this. The errors and crash
came about 15 minutes after an inadvertent and short-lived server id
conflict between two slave servers (serenade.oursite.com and
adagio.oursite.com). Shortly afterward, we replaced the master with
sonata.oursite.com, did a full mysqldump from sonata, imported that
dump onto our three other db servers, and resumed slaving (with opera
as a slave to sonata). On Thursday morning, we brought opera back
online as our master. See the server list [1] and timeline [2].
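
For reference, re-pointing the slaves at the new master was done with
the usual CHANGE MASTER procedure, roughly along these lines (the
host, user, and binlog coordinates below are placeholders, not our
actual values):

  -- confirm each server has a unique server id (adagio had briefly
  -- come up with serenade's)
  SELECT @@server_id;

  -- point a slave at the new master after importing the dump
  STOP SLAVE;
  CHANGE MASTER TO
    MASTER_HOST='sonata.oursite.com',
    MASTER_USER='repl',
    MASTER_PASSWORD='xxxxxxxx',
    MASTER_LOG_FILE='mysql-bin.000001',
    MASTER_LOG_POS=4;
  START SLAVE;

  -- verify replication is running and caught up
  SHOW SLAVE STATUS\G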

Between Thursday and Saturday, we continued to observe apparent data
corruption errors, now on sonata as well as opera, along with many
dropped and/or failed connections at unexpected times, often one error
immediately after another. [3]

We took opera offline completely on Sunday morning, with serenade as
our new master, but we continued to observe data integrity problems on
sonata. On Sunday night we disabled the use of slaves entirely, and we
haven't had any issues since (yet). Note that adagio, which is no
longer in production, never had any issues either.

What insights might you have into this behavior? Might it be due to a
known bug in MySQL 5.0.27? How would you go about investigating the
cause of this? I am happy to provide any other information you might
think relevant.

Below is a list of our DB servers for reference, a timeline of
events, and some representative examples of the errors we received.
Any help you can provide is very much appreciated!

Thanks,
David


[1] List of db servers
opera.oursite.com - original master, currently out of production
sonata.oursite.com - slave, then temporary master on Wednesday, currently
out of production
serenade.oursite.com - slave, master since Sunday morning
adagio.oursite.com - slave that was briefly brought up with serenade's server id on Wednesday

[2] Timeline
Wednesday, September 5th, 8:00 PM - We launch adagio with a conflicting server id
Wednesday, September 5th, 8:01 PM - We stop adagio and relaunch it with the correct id
Wednesday, September 5th, 8:05 PM - We restart replication on adagio; it catches up
Wednesday, September 5th, 8:16 PM - Data corruption errors & CPU spike on opera
Wednesday, September 5th, 8:18 PM - Opera dies
Wednesday, September 5th, 8:30 PM - Sonata becomes master
Wednesday, September 5th, 8:40 PM - Opera comes back online after reboot
Wednesday, September 5th, 9:30 PM - Sonata dies with signal 11
Wednesday, September 5th, 9:40 PM - Lost DB connections on sonata
Wednesday, September 5th, 10:18 PM - Another lost DB connection on sonata

Thursday, September 6th, 3:00 AM - Dump is performed on sonata
Thursday, September 6th, 4:00 AM - Dump imported on opera, serenade, and adagio
Thursday, September 6th, 5:00 AM - Opera becomes master again
                                   Serenade and adagio replicate from it
Thursday, September 6th, 3:00 PM - Import of the dump on sonata done
                                   Sonata back into production
Thursday, September 6th, Afternoon - Sonata's replication lagging behind
                                     Lots of IO wait on sonata
                                     Sonata pulled out of production
Thursday, September 6th, 7:05 PM to 9:29 PM - More apparent data
corruption errors and lost connections on opera
Thursday, September 6th, 10:19 PM to 11:20 PM - A ton of failed
connections to opera
Thursday, September 6th, 11:04 PM to Friday, September 7th, 1:32 AM -
More data corruption errors

Friday, September 7th, 3:16 AM - Opera dies again with signal 11
Friday, September 7th, 6:37 AM - Opera dies again with signal 11 (and a
bunch of failed connections)
Friday, September 7th, 9:18 PM - A bunch more failed/lost connections

Sunday, September 9th, 5:00 AM - Opera taken out of production
                                 Sonata and Adagio are slaves, serenade master
Sunday, September 9th, 3:06 PM - Incorrect key file error on sonata
                                 work_music table (MyISAM) marked as crashed
                                 More apparent DB corruption, this time on sonata
Sunday, September 9th, 3:10 PM to 3:12 PM - Error 127 reading table work_music on sonata
Sunday, September 9th, 10:13 PM to 11:39 PM - Error 134 reading table production_favs on sonata
Sunday, September 9th, 11:39 PM - Slaves taken completely offline,
                                  serenade now the only master


[3] Representative Errors
(Note that the vast majority of our tables are MyISAM -- including the ones we had errors with)

UPDATE work_music, (SELECT SUM(count) AS num_views, COUNT(*) AS
num_viewers FROM workmusic_hits WHERE work_music_id='36079') AS hits
SET work_music.__num_views=hits.num_views,
work_music.__num_viewers=hits.num_viewers WHERE
work_music.work_music_id='36079' [nativecode=1031 ** Table storage
engine for 'hits' doesn't have this option]

insert into production_hits (production_id, user_id, first_hit) values
('57760', '13241', now())
on duplicate key update count=count+1, last_hit=now()
[nativecode=1062 ** Duplicate entry '57760-13241' for key 1]

SELECT id FROM user WHERE favorite_id='194074' [nativecode=1194 **
Table 'user' is marked as crashed and should be repaired]

SELECT SUM(count) AS num_views FROM production_hits WHERE production_id='64667'
[nativecode=1030 ** Got error 134 from storage engine]
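
For completeness, the storage-engine error codes above decode as
follows, and the affected tables are plain MyISAM tables that can be
inspected with the standard maintenance statements (table names are
taken from the errors above; nothing else here is specific to our
setup):

  -- error codes 127 and 134 are MyISAM-level errors (per perror):
  --   perror 127  ->  Record-file is crashed
  --   perror 134  ->  Record was already deleted (or record file crashed)

  CHECK TABLE work_music, production_hits, production_favs, user EXTENDED;

  -- if CHECK TABLE reports corruption, REPAIR TABLE rebuilds the
  -- MyISAM data/index files
  REPAIR TABLE work_music;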

