We have suddenly run into a replication problem between our master and slave databases. It had been running without a problem for at least several months since our last upgrade, and for several years before that. I've tried a number of debugging techniques, and now I'm hoping somebody here can make sense of it. On the slave I get error messages like:

    040630  2:43:52  Slave: reconnected to master '[EMAIL PROTECTED]:3306', replication resumed in log 'mysql-bin.163' at position 37919441

It does that several times between 2:20am and 4:30am, and every few nights the slave gives up entirely:

    040630  4:15:34  Slave thread exiting, replication stopped in log 'mysql-bin.163' at position 99512496

Nothing corresponding is recorded on the master. This 2am to 5am window coincides with a lot of MySQL update/insert traffic on the master, spread across a lot of different parallel connections.
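When the slave thread exits like that, I bring it back by hand with something like the following (just a sketch; the file name and position are the example values from the error above, and in practice SLAVE START alone resumes from the coordinates saved in master.info, so the explicit CHANGE MASTER TO is only for re-pointing):

    # on the slave: see where the slave thread stopped
    SHOW SLAVE STATUS;

    # normally this is enough; the slave resumes from the coordinates in master.info
    SLAVE START;

    # or re-point explicitly first (coordinates here are the ones from the error above)
    CHANGE MASTER TO
        MASTER_LOG_FILE='mysql-bin.163',
        MASTER_LOG_POS=37919441;
    SLAVE START;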
Master and slave are both running Linux. The master is running the 3.23.57 binary and the slave the 3.23.58 binary. I checked the changelog on mysql.com, and I don't see why the version difference would be a problem (but I am open to upgrading the master!). Both master and slave have max_allowed_packet set to 16M.

My initial thought was hardware (since everything was great before), but I feel like I've all but eliminated it by doing the following:

- changed the ethernet boards on both master and slave (to a totally different brand of hardware, even)
- changed the cables
- used different switch ports
- turned off auto-negotiate on the switch port

Then I:

- swapped the slave for completely different hardware (removed the RAID drives and stuffed them into a similar server)
- upgraded the kernels on both machines to 2.4.26

I also set up an overnight ping to both the master and slave servers looking for packet loss; there was none. I wrote a TCP-based ping between both of those servers and our administrative server; it passed data back and forth continuously and recorded the total time. No slowdowns or network errors during the periods when MySQL has its communication problems.

So then I set up a second slave. This box runs Debian (the other slave ran Red Hat) with the exact same MySQL binary the other slave was running. It ran fine for 3 days (which led me to some false assumptions that it was working), but now its slave thread breaks too. The interesting part: both slaves break at *approximately* the same time (about a minute apart; both machines are NTP-synced to the second, so it's not *exactly* the same time), and both stop pretty *close* together in the binlog, but not at the same spot.

I have looked in the binlog at the various queries it has "crapped out" on, and they're nothing interesting. I've looked at about 10 of them; some are very simple updates, some are long, complex inserts. A few nights I ran SLAVE STOP on the slave and waited until morning to roll the slave log forward. When I run the slave forward in the morning, I get the exact same disconnects, and that's the first time I've seen the slave disconnect outside the 2:30-4am window. So this clearly seems to be a MySQL problem: something in the binlog is causing the disconnect. But then why do the two slaves not disconnect in the *exact* same place?

So, before putting the axe through the MySQL server, I busted out my packet sniffer and captured both sides of the connection until it failed. From my reading, the master is tearing down the connection. It basically looks like this:

- master sends lots of data
- master sends a PSH ACK packet
- master sends lots of data
- master sends a PSH ACK packet
- master sends lots of data
- master sends a PSH ACK FIN packet
- slave sends an ACK FIN packet (with slave ACKs in there throughout)

So by all appearances the master is the one tearing down the connection.
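Next time it breaks I'll watch the master's side at the same moment. A minimal check, assuming I'm right that 3.23 shows each replication feed in the processlist as a "Binlog Dump" thread:

    # on the master, while the slaves are connected:
    SHOW PROCESSLIST;     # each slave's feed should appear as a "Binlog Dump" thread
    SHOW MASTER STATUS;   # binlog file and position the master is currently writing

If the Binlog Dump threads disappear from the processlist at the moment the slaves log the disconnect, that would line up with what the packet capture shows.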
Anybody have any thoughts? I can upgrade the master to 3.23.58, but I don't see anything in the MySQL changelog that implies it would help. (Bringing down the master server requires much dancing around and appeasing the customers due to the outage.) My other thought is moving to MySQL 4.0.x, but again, I don't generally like doing things just because they *might* help. We have had a plan to upgrade to 4.0.x for some time (we certified our software on it), but we don't have an urgent need to budget the resources required to do it.

Is it possible there is some sort of race in the MySQL binlog writing? One reason this might only have started cropping up now is that the 2:30-4am slot has been getting progressively busier and busier, with a huge number of parallel inserts/updates (it's a 4-CPU box). The rest of the day the traffic isn't even close to that time period, and replication works like a rockstar.

Settings on the master:

    set-variable = key_buffer=1024M
    set-variable = tmp_table_size=1024M
    set-variable = max_allowed_packet=16M
    set-variable = thread_stack=128K
    set-variable = max_connections=2000
    set-variable = max_connect_errors=999999999
    set-variable = table_cache=1024
    set-variable = myisam_max_sort_file_size=4096
    set-variable = myisam_sort_buffer_size=512M
    set-variable = join_buffer_size=512M
    set-variable = sort_buffer=512M

Settings on the slave:

    [mysqld]
    server-id=5
    master-host=xx.x.x.xx
    master-user=yyy
    master-password=zzz
    master-port=3306
    set-variable = max_allowed_packet=16M
    set-variable = max_connections=2000
    set-variable = max_connect_errors=999999999
    set-variable = table_cache=256
    socket=/tmp/xyz-slave.sock
    bind-address=yy.yy.yy.yy

Any help would be appreciated.

Thanks,
-Joe