We have suddenly run into a replication problem between our master and
our slave databases. It had been running without a problem for at least
several months since our last upgrade, and for several years before
that.  I've tried a number of debugging techniques, and now I'm hoping
somebody here can make sense of it.  On the slave I get error messages
like:

040630  2:43:52  Slave: reconnected to master
'[EMAIL PROTECTED]:3306',replication resumed in log 'mysql-bin.163' at
position 37919441

It does that several times between 2:20am and 4:30am, and every few
nights the slave gives up:

040630  4:15:34  Slave thread exiting, replication stopped in log
'mysql-bin.163' at position 99512496

Nothing gets logged on the master.  The 2am to 5am window coincides with
a lot of MySQL update/insert traffic on the master, spread across a lot
of different connections.  Master and slave are running Linux.  The
master is running the 3.23.57 binary and the slave is running the
3.23.58 binary.  I checked the changelog on MySQL's site, and I don't
see why the version difference would be a problem (but I am open to
upgrading the master!).  Both master and slave have max_allowed_packet
set to 16M.  My initial thought was hardware (since everything was great
before), and I feel like I've almost eliminated that by doing the
following:

Changed the ethernet boards on both master and slave (to a totally
different brand of hardware, even)
Changed cables
Used different switch ports
Turned off auto-negotiate on the switch port

Then I:
Swapped the slave hardware for completely different hardware (removed
the RAID drives, stuffed them into a similar server)

Upgraded the kernels on both machines to 2.4.26

I also set up an overnight ping to both the master and slave servers
looking for packet loss; there was none.  I wrote a TCP-based ping
between both of those servers and our administrative server.  It kept
passing data back and forth and recorded the total time.  No slowdowns
or network errors during the time periods MySQL has communication
problems.
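
The TCP-based ping was nothing fancy; a stripped-down sketch of it looks
something like this (host names, port and interval here are placeholders,
not the real values):

#!/usr/bin/env python
# Stripped-down sketch of the TCP round-trip test: connect, bounce a
# small payload off an echo service on the far end, and log the elapsed
# time.  Host names, port and interval are placeholders.
import socket
import time

HOSTS = ["master.example.com", "admin.example.com"]   # hypothetical names
PORT = 7          # assumes an echo service is listening on the far end
INTERVAL = 10     # seconds between probes

while True:
    for host in HOSTS:
        start = time.time()
        try:
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.settimeout(30)
            s.connect((host, PORT))
            s.sendall(b"ping\n")
            s.recv(64)                    # wait for the echoed payload
            s.close()
            print("%s %s %.3f" % (time.strftime("%H:%M:%S"), host,
                                  time.time() - start))
        except socket.error:
            print("%s %s ERROR" % (time.strftime("%H:%M:%S"), host))
    time.sleep(INTERVAL)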

So then I set up a second slave.  This box runs Debian (the other slave
ran Red Hat).  This machine is running the exact same MySQL binary that
was running on the other slave.  This slave ran fine for 3 days (which
led me to make some false assumptions that it was working fine).  But
now its replication breaks too.  Interestingly, both slaves will break
at *approximately* the same time (about a minute apart; both machines
are ntp-synced to the second, so it's not *EXACTLY* the same time), and
both will be pretty *close* in the binlog, but not exact.

I have looked in the binlog at the various queries that it's "crapped
out" on, and they're nothing interesting.  I've looked at about 10 of
them; some are very simple updates, some are long complex inserts.  A
few nights I have run SLAVE STOP on the slave and waited until the
morning to roll the slave log forward.  When I run the slave forward in
the morning, I get the same exact disconnects.  That's the first time
I've seen the slave disconnect outside the 2:30-4am window.  So clearly
this seems to be a MySQL problem... something in the binlog is causing
it to disconnect.  But why do the two slaves not disconnect in the same
*EXACT* place?
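
My next step is probably to poll SHOW SLAVE STATUS on each slave all
night and log exactly when and where it stops.  A rough sketch of what I
mean (assuming the Python MySQLdb module; host, credentials and the log
path are placeholders):

#!/usr/bin/env python
# Sketch: poll SHOW SLAVE STATUS every few seconds and append the raw
# row, with a timestamp, to a log file so the exact stop time and
# position are on record.  Host, credentials and log path are
# placeholders.  The whole row is logged because the column layout
# differs between 3.23 and 4.x.
import time
import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user="repl_mon", passwd="secret")
log = open("/var/tmp/slave-status.log", "a")
while True:
    cur = conn.cursor()
    cur.execute("SHOW SLAVE STATUS")
    row = cur.fetchone()
    cur.close()
    log.write("%s %r\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), row))
    log.flush()
    time.sleep(5)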

So, before putting the axe through the MySQL server, I busted out my
packet sniffer.  I sniffed the packets from both sides of the connection
until it failed.  From my reading, it looks like the master is tearing
down the connection.  It basically looks like this:

Master sends lots of data
Master sends a PSH ACK packet
Master sends lots of data
Master sends a PSH ACK packet
Master sends lots of data
Master sends a PSH ACK FIN packet
Slave sends a ACK FIN packet

(with slave ACKs in there too)
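
In case anyone wants to double-check my read of the trace, pulling just
the TCP flags out of a capture looks roughly like this (only a sketch,
assuming the dpkt Python module and a pcap of the port 3306 traffic; not
necessarily the tool I used):

#!/usr/bin/env python
# Sketch: walk a pcap of the replication connection and print the TCP
# flags on each segment, so the PSH/ACK/FIN sequence is easy to see.
# Assumes the dpkt module and a capture written with tcpdump -w; the
# file name is a placeholder.
import socket
import dpkt

FLAG_NAMES = [(dpkt.tcp.TH_FIN, "FIN"), (dpkt.tcp.TH_SYN, "SYN"),
              (dpkt.tcp.TH_RST, "RST"), (dpkt.tcp.TH_PUSH, "PSH"),
              (dpkt.tcp.TH_ACK, "ACK")]

for ts, buf in dpkt.pcap.Reader(open("repl.pcap", "rb")):
    eth = dpkt.ethernet.Ethernet(buf)
    ip = eth.data
    if not isinstance(ip, dpkt.ip.IP):
        continue
    tcp = ip.data
    if not isinstance(tcp, dpkt.tcp.TCP):
        continue
    if 3306 not in (tcp.sport, tcp.dport):
        continue                          # only the MySQL connection
    flags = " ".join([name for bit, name in FLAG_NAMES if tcp.flags & bit])
    print("%.3f %s:%d -> %s:%d len=%d %s" % (
        ts, socket.inet_ntoa(ip.src), tcp.sport,
        socket.inet_ntoa(ip.dst), tcp.dport, len(tcp.data), flags))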

Soooo, it looks like the master is tearing down the connection.  Anybody
have any thoughts?  I can upgrade the master to 3.23.58, but I don't see
anything in the MySQL changelog that implies that will help.  (Bringing
down the master server requires much dancing around and appeasing the
customers due to the outage.)  My other thought is going to MySQL 4.0.x,
but again, I don't generally like doing things just because they might
help.  We have had a plan to upgrade to 4.0.x for some time (we
certified our software on it), but we haven't had an urgent enough need
to budget the resources required to do it.

Is it possible there is some sort of race in the MySQL binlog writing?
One of the reasons this might only have started cropping up now is that
the 2:30-4am slot has been getting progressively busier and busier, with
a huge number of parallel inserts/updates (it's a 4-CPU box).  The rest
of the day the traffic isn't even close to that time period's, and
replication works like a rockstar.

Settings on the master:

set-variable    = key_buffer=1024M
set-variable    = tmp_table_size=1024M
set-variable    = max_allowed_packet=16M
set-variable    = thread_stack=128K
set-variable    = max_connections=2000
set-variable    = max_connect_errors=999999999
set-variable    = table_cache=1024
set-variable    = myisam_max_sort_file_size=4096
set-variable    = myisam_sort_buffer_size=512M
set-variable    = join_buffer_size=512M
set-variable    = sort_buffer=512M

Settings on the slave:
[mysqld]
server-id=5
master-host=xx.x.x.xx
master-user=yyy
master-password=zzz
master-port=3306
set-variable = max_allowed_packet=16M
set-variable = max_connections=2000
set-variable = max_connect_errors=999999999
set-variable = table_cache=256
socket=/tmp/xyz-slave.sock
bind-address=yy.yy.yy.yy


Any help would be appreciated,

Thanks,
-Joe



