Just for others to know, it was the memory problem. I re-set the memory parameters for ndbmtd (two nodes) to minimum. Then I could run the backup successfully.
Thanks BA On Wed, Oct 24, 2012 at 11:57 AM, Bheemsen Aitha <pgb...@motorola.com>wrote: > Hi, > > After following the steps at the following website, I tried to do an > online backup of the cluster. > > > http://dev.mysql.com/doc/refman/5.0/en/mysql-cluster-backup-using-management-client.html > > It is a plain vanilla command which is below. The cluster is almost an > empty database, but backup is crashing at least one data node and was never > successful. > > ndb_mgm -e "START BACKUP WAIT COMPLETED" > > Did anyone have this kind of error before? I tried searching on web but > could not find a solution. > > Here is the error I received. > > Connected to Management Server at: localhost:1186 > Waiting for completed, this may take several minutes > Backup failed > * 3001: Could not start backup > * Backup aborted due to node failure: Permanent error: Internal > error > > ALERT -- Node 2: Backup 2 started from 49 has been aborted. Error: 1326 > > > Here is little background about our setup. > > OS: Redhat Linux 5.8 > Cluster: MySQL 5.5, NDB 7.2.7 > Cluster was installed and set up on two hosts using MCM, one host hosting > mysqld, ndb_mgmd and the other hosting > ndbmtd1 and ndbmtd2. > > I even tried by setting up the following parameters, but got the same > error again. > > > set BackupDataBufferSize:ndbmtd=256M attcluster; > set BackupLogBufferSize:ndbmtd=256M attcluster; > set BackupMemory:ndbmtd=512M attcluster; > set BackupWriteSize:ndbmtd=32M attcluster; > set BackupMaxWriteSize:ndbmtd=128M attcluster; > > Here are some links I found on web similar to my error. > > > http://forums.mysql.com/read.php?25,230891,230959#msg-230959 > http://grokbase.com/t/mysql/cluster/0578z8cj71/backup-error > http://bugs.mysql.com/bug.php?id=66104 > > > mcm> show status -r attcluster; > +--------+----------+-----------------+---------+-----------+---------+ > | NodeId | Process | Host | Status | Nodegroup | Package | > +--------+----------+-----------------+---------+-----------+---------+ > | 49 | ndb_mgmd | ut06sandboxdb01 | running | | 7.2.7 | > | 50 | mysqld | ut06sandboxdb01 | running | | 7.2.7 | > | 1 | ndbmtd | ut06sandboxdb02 | failed | 0 | 7.2.7 | > | 2 | ndbmtd | ut06sandboxdb02 | running | 0 | 7.2.7 | > +--------+----------+-----------------+---------+-----------+---------+ > 4 rows in set (0.07 sec) > > mcm> > > I see the core dump in DataDir of node 1. > > [root@ut06sandboxdb02 data]# ls -ltr > /opt/app/mcm-1.1.6_64-linux-rhel5-x86/mcm_data/clusters/attcluster/1/data > total 16949760 > -rw-rw-r-- 1 mysql mysql 0 Oct 19 12:23 ndb_1_out.err > -rw-rw-r-- 1 mysql mysql 1 Oct 21 04:02 ndb_1_trace.log.next > -rw-rw-r-- 1 mysql mysql 568 Oct 21 04:02 ndb_1_error.log > -rw-rw-r-- 1 mysql mysql 12202 Oct 21 04:02 ndb_1_trace.log.1_t4 > -rw-rw-r-- 1 mysql mysql 923467 Oct 21 04:02 ndb_1_trace.log.1_t3 > -rw-rw-r-- 1 mysql mysql 923489 Oct 21 04:02 ndb_1_trace.log.1_t2 > -rw-rw-r-- 1 mysql mysql 934663 Oct 21 04:02 ndb_1_trace.log.1_t1 > -rw-rw-r-- 1 mysql mysql 948989 Oct 21 04:02 ndb_1_trace.log.1 > -rw------- 1 mysql mysql 4104044544 Oct 23 11:04 core.21529 > -rw------- 1 mysql mysql 5880332288 Oct 23 18:22 core.8108 > -rw------- 1 mysql mysql 4538155008 Oct 23 23:56 core.1124 > -rw------- 1 mysql mysql 2924789760 Oct 24 00:32 core.9176 > -rw-rw-r-- 1 mysql mysql 460826 Oct 24 00:33 ndb_1_out.log > > Here is more info from mcmd.log. I really appreciate any help on this. > > > 2012-10-23 18:09:58.193: (message) [T0x19add970 RECFG]: > [1,ndb_mgmd,0]: 2012-10-23 18:09:58 [MgmtSrvr] WARNING -- Node 2: Failed > to memlock pages, error: 12 (Cannot allocate memory) > [1,ndb_mgmd,0]: 2012-10-23 18:09:58 [MgmtSrvr] INFO -- Node 2: Waiting > 30 sec for nodes 1 to connect, nodes [ all: 1 and 2 connected: 2 no-wait: ] > > 2012-10-23 18:09:58.193: (message) [T0x19add970 RECFG]: > [1,ndb_mgmd,0]: 2012-10-23 18:09:58 [MgmtSrvr] INFO -- Node 2: Waiting > 30 sec for nodes 1 to connect, nodes [ all: 1 and 2 connected: 2 no-wait: ] > > 2012-10-23 18:09:58.286: (message) [T0x19add970 RECFG]: > [1,ndb_mgmd,0]: 2012-10-23 18:09:58 [MgmtSrvr] INFO -- Node 2: > Communication to Node 1 opened > [1,ndb_mgmd,0]: 2012-10-23 18:09:58 [MgmtSrvr] INFO -- Node 2: Node 1 > Connected > > 2012-10-23 18:09:58.347: (message) last message repeated 1 times > 2012-10-23 18:09:58.347: (message) [T0x19add970 RECFG]: > [1,ndb_mgmd,0]: 2012-10-23 18:09:58 [MgmtSrvr] INFO -- Node 1: Node 2 > Connected > [1,ndb_mgmd,0]: 2012-10-23 18:09:58 [MgmtSrvr] INFO -- Node 1: Node 2: > API mysql-5.5.25 ndb-7.2.7 > [1,ndb_mgmd,0]: 2012-10-23 18:09:58 [MgmtSrvr] INFO -- Node 2: > CM_REGCONF president = 1, own Node = 2, our dynamic id = 0/13 > [1,ndb_mgmd,0]: 2012-10-23 18:09:58 [MgmtSrvr] INFO -- Node 2: Node 1: > API mysql-5.5.25 ndb-7.2.7 > [1,ndb_mgmd,0]: 2012-10-23 18:09:58 [MgmtSrvr] INFO -- Node 2: Start > phase 1 completed > > 2012-10-23 18:09:58.437: (message) last message repeated 1 times > 2012-10-23 18:09:58.437: (message) [T0x19add970 RECFG]: > [1,ndb_mgmd,0]: 2012-10-23 18:09:58 [MgmtSrvr] INFO -- Node 2: Start > phase 2 completed (node restart) > > 2012-10-23 18:09:58.467: (message) last message repeated 1 times > 2012-10-23 18:09:58.467: (message) [T0x19add970 RECFG]: > [1,ndb_mgmd,0]: 2012-10-23 18:09:58 [MgmtSrvr] INFO -- Node 1: DICT: > locked by node 2 for NodeRestart > [1,ndb_mgmd,0]: 2012-10-23 18:09:58 [MgmtSrvr] INFO -- Node 2: Start > phase 3 completed (node restart) > [1,ndb_mgmd,0]: 2012-10-23 18:09:58 [MgmtSrvr] INFO -- Node 2: Start > phase 4 completed (node restart) > > 2012-10-23 18:09:59.029: (message) last message repeated 1 times > 2012-10-23 18:09:59.029: (message) [T0x19add970 RECFG]: > [1,ndb_mgmd,0]: 2012-10-23 18:09:59 [MgmtSrvr] INFO -- Node 2: Receive > arbitrator node 49 [ticket=1fac001848b1b289] > > 2012-10-23 18:09:59.711: (message) last message repeated 1 times > 2012-10-23 18:09:59.711: (message) [T0x19add970 RECFG]: > [1,ndb_mgmd,0]: 2012-10-23 18:09:59 [MgmtSrvr] INFO -- Node 2: > Starting to restore schema > > 2012-10-23 18:10:00.185: (message) last message repeated 1 times > 2012-10-23 18:10:00.185: (message) [T0x19add970 RECFG]: > [1,ndb_mgmd,0]: 2012-10-23 18:10:00 [MgmtSrvr] INFO -- Node 2: Restore > of schema complete > > 2012-10-23 18:10:00.216: (message) last message repeated 1 times > 2012-10-23 18:10:00.216: (message) [T0x19add970 RECFG]: > [1,ndb_mgmd,0]: 2012-10-23 18:10:00 [MgmtSrvr] INFO -- Node 1: Node > restart completed copy of dictionary information > [1,ndb_mgmd,0]: 2012-10-23 18:10:00 [MgmtSrvr] INFO -- Node 2: DICT: > activate index 6 done (sys/def/5/ndb_index_stat_sample_x1) > > 2012-10-23 18:10:00.227: (message) last message repeated 1 times > 2012-10-23 18:10:00.227: (message) [T0x19add970 RECFG]: > [1,ndb_mgmd,0]: 2012-10-23 18:10:00 [MgmtSrvr] INFO -- Node 2: GCP > Monitor: unlimited lags allowed > [1,ndb_mgmd,0]: 2012-10-23 18:10:00 [MgmtSrvr] INFO -- Node 2: Node > restart starting to copy the fragments to Node 2 > [1,ndb_mgmd,0]: 2012-10-23 18:10:00 [MgmtSrvr] INFO -- Node 2: Node: 2 > StartLog: [GCI Keep: 124385 LastCompleted: 124437 NewestRestorable: 124454] > > 2012-10-23 18:10:02.187: (message) last message repeated 1 times > 2012-10-23 18:10:02.187: (message) [T0x19add970 RECFG]: > [1,ndb_mgmd,0]: 2012-10-23 18:10:02 [MgmtSrvr] INFO -- Node 2: LQH: > Starting to rebuild ordered indexes > [1,ndb_mgmd,0]: 2012-10-23 18:10:02 [MgmtSrvr] INFO -- Node 2: LQH: > index 6 rebuild done > [1,ndb_mgmd,0]: 2012-10-23 18:10:02 [MgmtSrvr] INFO -- Node 2: LQH: > Rebuild ordered indexes complete > > 2012-10-23 18:10:02.296: (message) last message repeated 3 times > 2012-10-23 18:10:02.296: (message) [T0x19add970 RECFG]: > [1,ndb_mgmd,0]: 2012-10-23 18:10:02 [MgmtSrvr] INFO -- Node 2: Node > restart completed copying the fragments to Node 2 > > 2012-10-23 18:10:03.431: (message) last message repeated 1 times > 2012-10-23 18:10:03.431: (message) [T0x19add970 RECFG]: > [1,ndb_mgmd,0]: 2012-10-23 18:10:03 [MgmtSrvr] INFO -- Node 1: Local > checkpoint 87 started. Keep GCI = 124428 oldest restorable GCI = 124426 > > > > > Thanks >