[jira] [Commented] (HBASE-9469) Synchronous replication

2014-02-27 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13915412#comment-13915412
 ] 

terry zhang commented on HBASE-9469:


Hi Feng Honghua, what about the MySQL semi-sync replication solution
(https://code.google.com/p/google-mysql-tools/wiki/SemiSyncReplicationDesign)?
We need to make sure the client's writes to both sides form a single transaction if we want the data to stay consistent.
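
To make concrete what "the client's writes to both sides form a single transaction" would demand from the client today, here is a purely illustrative sketch; putBoth is not an HBase API, just a toy wrapper for discussion:

{code:title=Illustration only|borderStyle=solid}
// Toy client-side wrapper: only acknowledge a write once both clusters have
// accepted it, and try to undo the A-side write if the B-side write fails.
// Plain asynchronous master-master replication enforces none of this, which
// is exactly the inconsistency described in this issue.
void putBoth(HTableInterface tableA, HTableInterface tableB, Put put) throws IOException {
  tableA.put(put);
  try {
    tableB.put(put);
  } catch (IOException e) {
    // best-effort rollback so the two clusters do not diverge
    tableA.delete(new Delete(put.getRow()));
    throw e;
  }
}
{code}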

 Synchronous replication
 ---

 Key: HBASE-9469
 URL: https://issues.apache.org/jira/browse/HBASE-9469
 Project: HBase
  Issue Type: New Feature
Reporter: Feng Honghua

 Scenario: 
 A/B clusters with master-master replication, client writes to A cluster and A 
 pushes all writes to B cluster, and when A cluster is down, client switches 
 writing to B cluster.
 But the client's write switch is unsafe because the replication between A and B is asynchronous: a delete sent to the B cluster, which aims to remove a put written earlier, can fail to do so because that put was written to the A cluster and wasn't successfully pushed to B before A went down. It can be worse: if this delete is collected (a flush and then a major compaction occur) before the A cluster comes back up and that put is eventually pushed to B, the put won't ever be deleted.
 Can we provide per-table/per-peer synchronous replication which ships the corresponding hlog entry of a write before responding with write success to the client? With this we can guarantee to the client that all write requests for which it got a success response when writing to the A cluster are already in the B cluster as well.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (HBASE-9214) CatalogJanitor delete region info in Meta during Restore snapshot

2013-08-14 Thread terry zhang (JIRA)
terry zhang created HBASE-9214:
--

 Summary: CatalogJanitor delete region info in Meta during Restore 
snapshot
 Key: HBASE-9214
 URL: https://issues.apache.org/jira/browse/HBASE-9214
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.94.10
Reporter: terry zhang




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-9214) CatalogJanitor delete region info in Meta during Restore snapshot

2013-08-14 Thread terry zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terry zhang updated HBASE-9214:
---

Description: 
Today I hit an issue during restore snapshot. It can be reproduced with the steps below:

1. Take a snapshot s1 of table t1 successfully.
2. Region r1 in t1 splits.
3. The CatalogJanitor chore runs and finds that the daughters no longer hold references, so r1 can be deleted.
4. Restore snapshot s1. RestoreSnapshotHelper adds region r1 back to the meta table.
5. CatalogJanitor deletes the r1 region info in meta which RestoreSnapshotHelper just inserted.
6. The restore snapshot finishes.

Then we find there is a hole in t1 after the restore snapshot. Data loss.



 CatalogJanitor delete region info in Meta during Restore snapshot
 -

 Key: HBASE-9214
 URL: https://issues.apache.org/jira/browse/HBASE-9214
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.94.10
Reporter: terry zhang

 Today I hit an issue during restore snapshot. It can be reproduced with the steps below:
 1. Take a snapshot s1 of table t1 successfully.
 2. Region r1 in t1 splits.
 3. The CatalogJanitor chore runs and finds that the daughters no longer hold references, so r1 can be deleted.
 4. Restore snapshot s1. RestoreSnapshotHelper adds region r1 back to the meta table.
 5. CatalogJanitor deletes the r1 region info in meta which RestoreSnapshotHelper just inserted.
 6. The restore snapshot finishes.
 Then we find there is a hole in t1 after the restore snapshot. Data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-9214) CatalogJanitor delete region info in Meta during Restore snapshot

2013-08-14 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739503#comment-13739503
 ] 

terry zhang commented on HBASE-9214:


Below are some logs from testing:

1. CatalogJanitor decides to delete the region

2013-08-13 16:19:34,268 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Daughter regiondir does not exist: hdfs://dw77.kgb.sqa.cm4:9900/hbase-test3-snap/writetest/b95023663816ecf208f7ee3d69d8fb9c
2013-08-13 16:19:34,268 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Daughter regiondir does not exist: hdfs://dw77.kgb.sqa.cm4:9900/hbase-test3-snap/writetest/1c47cf89406349c159f36ae2e0c55582
2013-08-13 16:19:34,268 DEBUG org.apache.hadoop.hbase.master.CatalogJanitor: Deleting region writetest,P3,1376370447956.bce523f334ca2c78f2c4ad360dc2a6a4. because daughter splits no longer hold references

2. RestoreSnapshotHelper inserts the region info into META

2013-08-13 16:19:34,273 DEBUG org.apache.hadoop.hbase.catalog.MetaEditor: Add to META, regions: [
{NAME => 'writetest,HZZZZX,1376370447956.05a9a252b41e8e41aa3a6a8a797380ab.', STARTKEY => 'HX', ENDKEY => 'LI', ENCODED => 05a9a252b41e8e41aa3a6a8a797380ab,},
{NAME => 'writetest,3L,1376370447956.5689d3dbc9dc97106c28945ab7bb5104.', STARTKEY => '3L', ENDKEY => '76', ENCODED => 5689d3dbc9dc97106c28945ab7bb5104,},
{NAME => 'writetest,76,1376370447956.5b5a7331c26eadc41ea2944939fd2cee.', STARTKEY => '76', ENDKEY => 'AR', ENCODED => 5b5a7331c26eadc41ea2944939fd2cee,},
{NAME => 'writetest,AR,1376370447956.60226c1fbedc9d98f1192e37d7d3b6af.', STARTKEY => 'AR', ENDKEY => 'EC', ENCODED => 60226c1fbedc9d98f1192e37d7d3b6af,},
{NAME => 'writetest,LLLLLI,1376370447956.70523ec9e600da52b18caa45cf76cf66.', STARTKEY => 'LI', ENDKEY => 'P3', ENCODED => 70523ec9e600da52b18caa45cf76cf66,},
{NAME => 'writetest,SO,1376370447956.74f9eef1cbb3e4f6627ee39a0996dec8.', STARTKEY => 'SO', ENDKEY => 'W9', ENCODED => 74f9eef1cbb3e4f6627ee39a0996dec8,},
{NAME => 'writetest,W9,1376370447956.82cf586c79382a89508c3325304904f3.', STARTKEY => 'W9', ENDKEY => '', ENCODED => 82cf586c79382a89508c3325304904f3,},
{NAME => 'writetest,P3,1376370447956.bce523f334ca2c78f2c4ad360dc2a6a4.', STARTKEY => 'P3', ENDKEY => 'SO', ENCODED => bce523f334ca2c78f2c4ad360dc2a6a4,},
{NAME => 'writetest,,1376370447944.d3eb975033a94730c88bf6696a413e9e.', STARTKEY => '', ENDKEY => '3L', ENCODED => d3eb975033a94730c88bf6696a413e9e,},
{NAME => 'writetest,EC,1376370447956.f03a2b6805852c00e4fe029f3a9e7261.', STARTKEY => 'EC', ENDKEY => 'HX', ENCODED => f03a2b6805852c00e4fe029f3a9e7261,}]

3. The region info is deleted by CatalogJanitor

2013-08-13 16:19:34,333 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Deleted region writetest,P77773,1376370447956.bce523f334ca2c78f2c4ad360dc2a6a4. from META

 CatalogJanitor delete region info in Meta during Restore snapshot
 -

 Key: HBASE-9214
 URL: https://issues.apache.org/jira/browse/HBASE-9214
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.94.10
Reporter: terry zhang

 Today I hit an issue during restore snapshot. It can be reproduced with the steps below:
 1. Take a snapshot s1 of table t1 successfully.
 2. Region r1 in t1 splits.
 3. The CatalogJanitor chore runs and finds that the daughters no longer hold references, so r1 can be deleted.
 4. Restore snapshot s1. RestoreSnapshotHelper adds region r1 back to the meta table.
 5. CatalogJanitor 

[jira] [Commented] (HBASE-9214) CatalogJanitor delete region info in Meta during Restore snapshot

2013-08-14 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739505#comment-13739505
 ] 

terry zhang commented on HBASE-9214:


I wonder if we could change the CatalogJanitor so that it does not remove split region info while the table is disabled, to avoid this issue?
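
As a rough illustration only (getZKTable()/isDisabledTable() are written from memory and may not match the 0.94 API exactly), the proposed guard in the janitor's split-parent cleanup might look like this:

{code:title=Hypothetical sketch of the proposed check|borderStyle=solid}
// Illustrative only, not the actual patch: skip cleaning up a split parent
// while its table is disabled (as it is during a snapshot restore), so the
// janitor cannot race with RestoreSnapshotHelper on the same meta rows.
String tableName = parent.getTableNameAsString();
if (services.getAssignmentManager().getZKTable().isDisabledTable(tableName)) {
  LOG.debug("Table " + tableName + " is disabled, skipping split parent " + parent);
  return false;  // leave the parent's meta row alone for now
}
// otherwise fall through to the existing daughter-reference check
{code}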

 CatalogJanitor delete region info in Meta during Restore snapshot
 -

 Key: HBASE-9214
 URL: https://issues.apache.org/jira/browse/HBASE-9214
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.94.10
Reporter: terry zhang

 Today I hit an issue during restore snapshot. It can be reproduced with the steps below:
 1. Take a snapshot s1 of table t1 successfully.
 2. Region r1 in t1 splits.
 3. The CatalogJanitor chore runs and finds that the daughters no longer hold references, so r1 can be deleted.
 4. Restore snapshot s1. RestoreSnapshotHelper adds region r1 back to the meta table.
 5. CatalogJanitor deletes the r1 region info in meta which RestoreSnapshotHelper just inserted.
 6. The restore snapshot finishes.
 Then we find there is a hole in t1 after the restore snapshot. Data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HBASE-9201) Hfile will be deleted after deleteColumn instead be archived

2013-08-12 Thread terry zhang (JIRA)
terry zhang created HBASE-9201:
--

 Summary: Hfile will be deleted after deleteColumn instead be 
archived
 Key: HBASE-9201
 URL: https://issues.apache.org/jira/browse/HBASE-9201
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.94.10
Reporter: terry zhang




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-9201) Hfile will be deleted after deleteColumn instead be archived

2013-08-12 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13737756#comment-13737756
 ] 

terry zhang commented on HBASE-9201:


Now we can see that the hfile is deleted in MasterFileSystem instead of being archived.
{code:title=MasterFileSystem.java|borderStyle=solid}
  public void deleteFamilyFromFS(HRegionInfo region, byte[] familyName)
      throws IOException {
    // archive family store files
    Path tableDir = FSUtils.getTableDir(rootdir, region.getTableName());
    HFileArchiver.archiveFamily(fs, conf, region, tableDir, familyName);

    // delete the family folder
    Path familyDir = new Path(tableDir,
        new Path(region.getEncodedName(), Bytes.toString(familyName)));
    if (fs.delete(familyDir, true) == false) {
      throw new IOException("Could not delete family "
          + Bytes.toString(familyName) + " from FileSystem for region "
          + region.getRegionNameAsString() + "(" + region.getEncodedName()
          + ")");
    }
  }
{code}
Should we use the archiveStoreFiles interface instead of fs.delete?

 Hfile will be deleted after deleteColumn instead be archived
 

 Key: HBASE-9201
 URL: https://issues.apache.org/jira/browse/HBASE-9201
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.94.10
Reporter: terry zhang



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-9201) Hfile will be deleted after deleteColumn instead be archived

2013-08-12 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13737759#comment-13737759
 ] 

terry zhang commented on HBASE-9201:


Now we have found that this issue is serious when we deleteColumn after creating a snapshot. If we restore the snapshot, the hlink files cannot be found and the table cannot be enabled.

 Hfile will be deleted after deleteColumn instead be archived
 

 Key: HBASE-9201
 URL: https://issues.apache.org/jira/browse/HBASE-9201
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.94.10
Reporter: terry zhang



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-9201) Hfile will be deleted after deleteColumn instead be archived

2013-08-12 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13737768#comment-13737768
 ] 

terry zhang commented on HBASE-9201:


deleteTable still has the same problem:
{code:title=MasterFileSystem.java|borderStyle=solid}
  public void deleteTable(byte[] tableName) throws IOException {
    fs.delete(new Path(rootdir, Bytes.toString(tableName)), true);
  }
{code}
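
A purely illustrative sketch of the "archive instead of delete" idea for deleteTable (the .archive layout and the use of rename() are assumptions here, not the real MasterFileSystem behavior):

{code:title=Illustrative sketch only|borderStyle=solid}
// Move the table directory under an archive root instead of deleting it, so
// hfiles that snapshots still reference remain readable. The path layout is
// made up for the example.
public void deleteTable(byte[] tableName) throws IOException {
  Path tableDir = new Path(rootdir, Bytes.toString(tableName));
  Path archiveDir = new Path(new Path(rootdir, ".archive"), Bytes.toString(tableName));
  fs.mkdirs(archiveDir.getParent());
  if (!fs.rename(tableDir, archiveDir)) {
    throw new IOException("Could not archive table dir " + tableDir);
  }
}
{code}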

 Hfile will be deleted after deleteColumn instead be archived
 

 Key: HBASE-9201
 URL: https://issues.apache.org/jira/browse/HBASE-9201
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.94.10
Reporter: terry zhang



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (HBASE-9201) Hfile will be deleted after deleteColumn instead be archived

2013-08-12 Thread terry zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terry zhang resolved HBASE-9201.


Resolution: Not A Problem

 Hfile will be deleted after deleteColumn instead be archived
 

 Key: HBASE-9201
 URL: https://issues.apache.org/jira/browse/HBASE-9201
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.94.10
Reporter: terry zhang



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7735) Prevent regions from moving during online snapshot.

2013-03-19 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13606149#comment-13606149
 ] 

terry zhang commented on HBASE-7735:


Hi @Jonathan Hsieh, I mean the region is moving between region servers before the regionserver's getOnlineRegions call. This means the moving region doesn't belong to any regionserver at that moment. Would this case lead to data loss in the snapshot without the user knowing about it?

 Prevent regions from moving during online snapshot.
 ---

 Key: HBASE-7735
 URL: https://issues.apache.org/jira/browse/HBASE-7735
 Project: HBase
  Issue Type: Sub-task
Reporter: Jonathan Hsieh

 To increase the probability of snapshots succeeding, we should attempt to 
 prevent splits and region moves from happening.  Currently we take region 
 locks but this could be too late and results in an aborted snapshot.  
 We should probably take the table lock (0.96) when starting a snapshot and 
 for  a 0.94 backport we should probably disable the balancer.
 This will probably not be tackled until after trunk merge.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7735) Prevent regions from moving during online snapshot.

2013-03-19 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13607173#comment-13607173
 ] 

terry zhang commented on HBASE-7735:


Hi Jonathan Hsieh, can we use the approach below to increase the probability of snapshots succeeding?

1. Get the full region list in the master, including moving regions, online regions and splitting regions (parent and child).
2. Check the region list to make sure there is no hole in it.
3. Generate a snapshot task assignment map file in the target snapshot folder (also assign the moving regions and splitting parent regions to some region server).
4. When the regionserver starts buildSubprocedure, it compares its online regions with the assigned region list in the task file the master generated. If a region is online it becomes a FlushSnapshotSubprocedure. If it is not online we can treat it as a closed region: we do not need to flush the cache, we only need to create the reference file (an empty file). If the region is already in the snapshot folder we can just skip it. (A rough sketch of this step follows below.)

I think this would be helpful because in a large cluster moving and splitting regions are a normal situation, so the snapshot may otherwise always fail during verification. What do you think?
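
Here is the rough sketch of step 4 mentioned above. It is pseudo-Java for discussion only; apart from FlushSnapshotSubprocedure, which the proposal names, every class and method in it is hypothetical:

{code:title=Sketch of step 4 (hypothetical names)|borderStyle=solid}
// For each region the master's task file assigned to this regionserver:
for (HRegionInfo region : assignedRegionsFromTaskFile) {
  if (onlineRegions.contains(region)) {
    // normal case: flush the memstore and snapshot the online region
    subprocedures.add(new FlushSnapshotSubprocedure(region));
  } else if (!snapshotDirAlreadyContains(region)) {
    // region is moving or already closed: nothing to flush, just write the
    // (empty) reference files so the snapshot manifest stays complete
    subprocedures.add(new OfflineRegionReferenceTask(region));
  }
  // otherwise another server already handled this region; skip it
}
{code}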

 Prevent regions from moving during online snapshot.
 ---

 Key: HBASE-7735
 URL: https://issues.apache.org/jira/browse/HBASE-7735
 Project: HBase
  Issue Type: Sub-task
Reporter: Jonathan Hsieh

 To increase the probability of snapshots succeeding, we should attempt to 
 prevent splits and region moves from happening.  Currently we take region 
 locks but this could be too late and results in an aborted snapshot.  
 We should probably take the table lock (0.96) when starting a snapshot and 
 for  a 0.94 backport we should probably disable the balancer.
 This will probably not be tackled until after trunk merge.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7735) Prevent regions from moving during online snapshot.

2013-03-18 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13606003#comment-13606003
 ] 

terry zhang commented on HBASE-7735:


Hi Jonathan, I wonder how we can make sure that every regionserver's snapshot manager gets the correct set of regions when taking a snapshot. If some of the regions are moving or splitting, getOnlineRegions will miss them, the snapshot will lose data, and there will be a hole in meta after we restore the snapshot.

 Prevent regions from moving during online snapshot.
 ---

 Key: HBASE-7735
 URL: https://issues.apache.org/jira/browse/HBASE-7735
 Project: HBase
  Issue Type: Sub-task
Reporter: Jonathan Hsieh

 To increase the probability of snapshots succeeding, we should attempt to 
 prevent splits and region moves from happening.  Currently we take region 
 locks but this could be too late and results in an aborted snapshot.  
 We should probably take the table lock (0.96) when starting a snapshot and 
 for  a 0.94 backport we should probably disable the balancer.
 This will probably not be tackled until after trunk merge.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HBASE-7886) [replication] hlog zk node will not delete if client roll hlog

2013-02-19 Thread terry zhang (JIRA)
terry zhang created HBASE-7886:
--

 Summary: [replication] hlog zk node will not delete if client roll 
hlog
 Key: HBASE-7886
 URL: https://issues.apache.org/jira/browse/HBASE-7886
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.94.4
Reporter: terry zhang
Assignee: terry zhang


If we use the hbase shell command hlog_roll on a regionserver which has replication configured, the hlog zk node under /hbase/replication/rs/1 cannot be deleted.

This issue is caused by HBASE-6758.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7886) [replication] hlog zk node will not delete if client roll hlog

2013-02-19 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13581990#comment-13581990
 ] 

terry zhang commented on HBASE-7886:


The hlog zk node is deleted in shipEdits() or in:
{code:title=ReplicationSource.java|borderStyle=solid}
 if (this.isActive() && (gotIOE || currentNbEntries == 0)) {
   if (this.lastLoggedPosition != this.position) {
     this.manager.logPositionAndCleanOldLogs(this.currentPath,
         this.peerClusterZnode, this.position, queueRecovered,
         currentWALisBeingWrittenTo);
     this.lastLoggedPosition = this.position;
   }
   if (sleepForRetries("Nothing to replicate", sleepMultiplier)) {
     sleepMultiplier++;
   }
   continue;
 }
{code} 
But after the HBASE-6758 patch, logPositionAndCleanOldLogs cannot delete the hlog zk node when currentWALisBeingWrittenTo is true. And when the log is switched we hit:

// If we didn't get anything and the queue has an object, it means we
// hit the end of the file for sure
return seenEntries == 0 && processEndOfFile(); // seenEntries is 0 when 'hlog_roll' is run from the shell

So the ReplicationSource will continue and the hlog zk node cannot be deleted.

{code:title=ReplicationSource.java|borderStyle=solid}

if (readAllEntriesToReplicateOrNextFile(currentWALisBeingWrittenTo)) {
  continue;
}
{code} 
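
The attached patch isn't reproduced here. Just to make the intent concrete, below is one possible shape of a fix, purely illustrative (considerDumped is a hypothetical flag, and this may well not be what HBASE-7886.patch actually does):

{code:title=Illustrative sketch only, not the attached patch|borderStyle=solid}
// considerDumped stands for "processEndOfFile() just told us this log is
// finished and a newer log exists". In that case the old log is no longer
// being written to, so logPositionAndCleanOldLogs may remove its znode even
// though nothing was shipped from it.
if (gotIOE || currentNbEntries == 0) {
  if (this.lastLoggedPosition != this.position || considerDumped) {
    this.manager.logPositionAndCleanOldLogs(this.currentPath,
        this.peerClusterZnode, this.position, queueRecovered,
        currentWALisBeingWrittenTo && !considerDumped);
    this.lastLoggedPosition = this.position;
  }
}
{code}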


 [replication] hlog zk node will not delete if client roll hlog
 --

 Key: HBASE-7886
 URL: https://issues.apache.org/jira/browse/HBASE-7886
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.94.4
Reporter: terry zhang
Assignee: terry zhang

 If we use the hbase shell command hlog_roll on a regionserver which has replication configured, the hlog zk node under /hbase/replication/rs/1 cannot be deleted.
 This issue is caused by HBASE-6758.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7886) [replication] hlog zk node will not delete if client roll hlog

2013-02-19 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13581993#comment-13581993
 ] 

terry zhang commented on HBASE-7886:


This issue can also be reproduced when no data is written to the cluster, which is the same as running 'hlog_roll' in the shell.


 [replication] hlog zk node will not delete if client roll hlog
 --

 Key: HBASE-7886
 URL: https://issues.apache.org/jira/browse/HBASE-7886
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.94.4
Reporter: terry zhang
Assignee: terry zhang

 If we use the hbase shell command hlog_roll on a regionserver which has replication configured, the hlog zk node under /hbase/replication/rs/1 cannot be deleted.
 This issue is caused by HBASE-6758.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-7886) [replication] hlog zk node will not delete if client roll hlog

2013-02-19 Thread terry zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terry zhang updated HBASE-7886:
---

Status: Patch Available  (was: Open)

 [replication] hlog zk node will not delete if client roll hlog
 --

 Key: HBASE-7886
 URL: https://issues.apache.org/jira/browse/HBASE-7886
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.94.4
Reporter: terry zhang
Assignee: terry zhang

 If we use the hbase shell command hlog_roll on a regionserver which has replication configured, the hlog zk node under /hbase/replication/rs/1 cannot be deleted.
 This issue is caused by HBASE-6758.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-7886) [replication] hlog zk node will not delete if client roll hlog

2013-02-19 Thread terry zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terry zhang updated HBASE-7886:
---

Status: Open  (was: Patch Available)

 [replication] hlog zk node will not delete if client roll hlog
 --

 Key: HBASE-7886
 URL: https://issues.apache.org/jira/browse/HBASE-7886
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.94.4
Reporter: terry zhang
Assignee: terry zhang

 If we use the hbase shell command hlog_roll on a regionserver which has replication configured, the hlog zk node under /hbase/replication/rs/1 cannot be deleted.
 This issue is caused by HBASE-6758.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-7886) [replication] hlog zk node will not delete if client roll hlog

2013-02-19 Thread terry zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terry zhang updated HBASE-7886:
---

Attachment: HBASE-7886.patch

 [replication] hlog zk node will not delete if client roll hlog
 --

 Key: HBASE-7886
 URL: https://issues.apache.org/jira/browse/HBASE-7886
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.94.4
Reporter: terry zhang
Assignee: terry zhang
 Attachments: HBASE-7886.patch


 If we use the hbase shell command hlog_roll on a regionserver which has replication configured, the hlog zk node under /hbase/replication/rs/1 cannot be deleted.
 This issue is caused by HBASE-6758.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-7886) [replication] hlog zk node will not be deleted if client roll hlog

2013-02-19 Thread terry zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terry zhang updated HBASE-7886:
---

Summary: [replication] hlog zk node will not be deleted if client roll hlog 
 (was: [replication] hlog zk node will not delete if client roll hlog)

 [replication] hlog zk node will not be deleted if client roll hlog
 --

 Key: HBASE-7886
 URL: https://issues.apache.org/jira/browse/HBASE-7886
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 0.94.4
Reporter: terry zhang
Assignee: terry zhang
 Attachments: HBASE-7886.patch


 If we use the hbase shell command hlog_roll on a regionserver which has replication configured, the hlog zk node under /hbase/replication/rs/1 cannot be deleted.
 This issue is caused by HBASE-6758.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-6770) Allow scanner setCaching to specify size instead of number of rows

2013-01-25 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562557#comment-13562557
 ] 

terry zhang commented on HBASE-6770:


Hi Karthik Ranganathan, I saw this patch was checked in to the fb branch 0.89-fb last October. When are we going to check it in to trunk? This is a good feature to avoid RS OOM.

 Allow scanner setCaching to specify size instead of number of rows
 --

 Key: HBASE-6770
 URL: https://issues.apache.org/jira/browse/HBASE-6770
 Project: HBase
  Issue Type: Sub-task
  Components: Client, regionserver
Reporter: Karthik Ranganathan
Assignee: Chen Jin

 Currently, we have the following api's to customize the behavior of scans:
 setCaching() - how many rows to cache on client to speed up scans
 setBatch() - max columns per row to return per row to prevent a very large 
 response.
 Ideally, we should be able to specify a memory buffer size because:
 1. that would take care of both of these use cases.
 2. it does not need any knowledge of the size of the rows or cells, as the 
 final thing we are worried about is the available memory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HBASE-7451) [snapshot] regionserver will be deadlock when GlobalSnapshotOperation timeout happen

2012-12-28 Thread terry zhang (JIRA)
terry zhang created HBASE-7451:
--

 Summary: [snapshot] regionserver will be deadlock when 
GlobalSnapshotOperation timeout happen
 Key: HBASE-7451
 URL: https://issues.apache.org/jira/browse/HBASE-7451
 Project: HBase
  Issue Type: Bug
  Components: snapshots
Reporter: terry zhang
Assignee: terry zhang


Hi Matteo Bertozzi and Jesse Yates, my observation is based on the code on GitHub:
https://github.com/matteobertozzi/hbase/
If we create a snapshot and hit a regionserver timeout, the RS will be locked and cannot take any puts. Please take a look at the log below:

// regionserver snapshot timeout
org.apache.hadoop.hbase.server.commit.distributed.DistributedCommitException: 
org.apache.hadoop.hbase.server.errorhandling.exception.OperationAttemptTimeoutException:
 Timeout elapsed! Start:1356518666984, End:1356518667584, diff:600, max:600 ms
at 
org.apache.hadoop.hbase.server.commit.distributed.DistributedThreePhaseCommitErrorDispatcher.wrap(DistributedThreePhaseCommitErrorDispatcher.java:135)
at 
org.apache.hadoop.hbase.server.commit.distributed.DistributedThreePhaseCommitErrorDispatcher.operationTimeout(DistributedThreePhaseCommitErrorDispatcher.java:71)
at 
org.apache.hadoop.hbase.server.commit.ThreePhaseCommit$1.receiveError(ThreePhaseCommit.java:92)
at 
org.apache.hadoop.hbase.server.commit.ThreePhaseCommit$1.receiveError(ThreePhaseCommit.java:89)
at 
org.apache.hadoop.hbase.server.errorhandling.OperationAttemptTimer$1.run(OperationAttemptTimer.java:71)
at java.util.TimerThread.mainLoop(Timer.java:512)
at java.util.TimerThread.run(Timer.java:462)
Caused by: 
org.apache.hadoop.hbase.server.errorhandling.exception.OperationAttemptTimeoutException:
 Timeout elapsed! Start:1356518666984, End:1356518667584, diff:600, max:600 ms
... 3 more
2012-12-26 18:44:57,211 DEBUG 
org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running cleanup phase.
2012-12-26 18:44:57,211 DEBUG 
org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: 
Cleanup snapshot - handled in sub-tasks on error
2012-12-26 18:44:57,212 DEBUG org.apache.hadoop.hbase.serv

// Waiting for the 'commit allowed' latch, and it never exits

2012-12-26 18:44:57,211 DEBUG 
org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running cleanup phase.
2012-12-26 18:44:57,211 DEBUG 
org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: 
Cleanup snapshot - handled in sub-tasks on error
2012-12-26 18:44:57,212 DEBUG 
org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running finish phase.
2012-12-26 18:44:57,212 DEBUG 
org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: 
Finish snapshot - handling in subtasks on error
2012-12-26 18:44:57,212 WARN 
org.apache.hadoop.hbase.server.errorhandling.OperationAttemptTimer: Timer 
already marked completed, ignoring!
2012-12-26 18:45:01,990 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 
'commit allowed' latch. (sleep:5000 ms)
2012-12-26 18:45:06,990 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 
'commit allowed' latch. (sleep:5000 ms)
2012-12-26 18:45:11,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 
'commit allowed' latch. (sleep:5000 ms)
2012-12-26 18:45:16,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 
'commit allowed' latch. (sleep:5000 ms)
2012-12-26 18:45:17,002 INFO 
org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
 Received children changed event:/hbase-TERRY-73/online-snapshot/prepare
2012-12-26 18:45:17,002 INFO 
org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
 Recieved start event.
2012-12-26 18:45:17,002 DEBUG 
org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
 Looking for new operations under znode:/hbase-TERRY-73/online-snapshot/prepare
2012-12-26 18:45:17,003 INFO 
org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
 Received children changed event:/hbase-TERRY-73/online-snapshot/abort
2012-12-26 18:45:17,003 INFO 
org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
 Recieved abort event.
2012-12-26 18:45:17,003 DEBUG 
org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
 Checking for aborted operations on node:/hbase-TERRY-73/online-snapshot/abort
2012-12-26 18:45:21,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 
'commit allowed' latch. (sleep:5000 ms)
2012-12-26 18:45:26,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 
'commit allowed' latch. (sleep:5000 ms)
2012-12-26 18:45:31,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 
'commit allowed' latch. (sleep:5000 ms)
2012-12-26 18:45:36,992 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting for 
'commit 

[jira] [Commented] (HBASE-7451) [snapshot] regionserver will be deadlock when GlobalSnapshotOperation timeout happen

2012-12-28 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13540414#comment-13540414
 ] 

terry zhang commented on HBASE-7451:


This is because GloballyConsistentRegionLockTask extends TwoPhaseCommit (not ThreePhaseCommit), so it does not have an OperationAttemptTimer to handle the timeout exception. So when GlobalSnapshotOperation hits a timeout, the code below will not be executed and the AllowCommit latch will not be released.

{code:title=GlobalSnapshotOperation.java|borderStyle=solid}

  @Override
  public void commit() throws DistributedCommitException {
// Release all the locks taken on the involved regions
if (ops == null || ops.size() == 0) {
  LOG.debug("No region operations to release from the snapshot because we didn't get a chance"
      + " to create them.");
  return;
}
LOG.debug("Releasing commit barrier for globally consistent snapshot.");
for (RegionSnapshotOperation op : ops) {
  ((GloballyConsistentRegionLockTask) op).getAllowCommitLatch().countDown();
}

// wait for all the outstanding tasks
waitUntilDone();
  }
{code} 

So GloballyConsistentRegionLockTask will wait forever.

 

 [snapshot] regionserver will be deadlock when GlobalSnapshotOperation timeout 
 happen
 

 Key: HBASE-7451
 URL: https://issues.apache.org/jira/browse/HBASE-7451
 Project: HBase
  Issue Type: Bug
  Components: snapshots
Reporter: terry zhang
Assignee: terry zhang

 Hi Matteo Bertozzi and Jesse Yates, my observation is based on the code on GitHub: https://github.com/matteobertozzi/hbase/
 If we create a snapshot and hit a regionserver timeout, the RS will be locked and cannot take any puts. Please take a look at the log below:
 // regionserver snapshot timeout
 org.apache.hadoop.hbase.server.commit.distributed.DistributedCommitException: 
 org.apache.hadoop.hbase.server.errorhandling.exception.OperationAttemptTimeoutException:
  Timeout elapsed! Start:1356518666984, End:1356518667584, diff:600, max:600 ms
 at 
 org.apache.hadoop.hbase.server.commit.distributed.DistributedThreePhaseCommitErrorDispatcher.wrap(DistributedThreePhaseCommitErrorDispatcher.java:135)
 at 
 org.apache.hadoop.hbase.server.commit.distributed.DistributedThreePhaseCommitErrorDispatcher.operationTimeout(DistributedThreePhaseCommitErrorDispatcher.java:71)
 at 
 org.apache.hadoop.hbase.server.commit.ThreePhaseCommit$1.receiveError(ThreePhaseCommit.java:92)
 at 
 org.apache.hadoop.hbase.server.commit.ThreePhaseCommit$1.receiveError(ThreePhaseCommit.java:89)
 at 
 org.apache.hadoop.hbase.server.errorhandling.OperationAttemptTimer$1.run(OperationAttemptTimer.java:71)
 at java.util.TimerThread.mainLoop(Timer.java:512)
 at java.util.TimerThread.run(Timer.java:462)
 Caused by: 
 org.apache.hadoop.hbase.server.errorhandling.exception.OperationAttemptTimeoutException:
  Timeout elapsed! Start:1356518666984, End:1356518667584, diff:600, max:600 ms
 ... 3 more
 2012-12-26 18:44:57,211 DEBUG 
 org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running cleanup phase.
 2012-12-26 18:44:57,211 DEBUG 
 org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: 
 Cleanup snapshot - handled in sub-tasks on error
 2012-12-26 18:44:57,212 DEBUG org.apache.hadoop.hbase.serv
 //Waiting for 'commit allowed' latch and do not exist
 2012-12-26 18:44:57,211 DEBUG 
 org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running cleanup phase.
 2012-12-26 18:44:57,211 DEBUG 
 org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: 
 Cleanup snapshot - handled in sub-tasks on error
 2012-12-26 18:44:57,212 DEBUG 
 org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running finish phase.
 2012-12-26 18:44:57,212 DEBUG 
 org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: 
 Finish snapshot - handling in subtasks on error
 2012-12-26 18:44:57,212 WARN 
 org.apache.hadoop.hbase.server.errorhandling.OperationAttemptTimer: Timer 
 already marked completed, ignoring!
 2012-12-26 18:45:01,990 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
 for 'commit allowed' latch. (sleep:5000 ms)
 2012-12-26 18:45:06,990 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
 for 'commit allowed' latch. (sleep:5000 ms)
 2012-12-26 18:45:11,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
 for 'commit allowed' latch. (sleep:5000 ms)
 2012-12-26 18:45:16,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
 for 'commit allowed' latch. (sleep:5000 ms)
 2012-12-26 18:45:17,002 INFO 
 org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
  Received children changed event:/hbase-TERRY-73/online-snapshot/prepare
 2012-12-26 18:45:17,002 INFO 
 

[jira] [Updated] (HBASE-7451) [snapshot] regionserver will be deadlock when GlobalSnapshotOperation timeout happened

2012-12-28 Thread terry zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-7451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terry zhang updated HBASE-7451:
---

Summary: [snapshot] regionserver will be deadlock when 
GlobalSnapshotOperation timeout happened  (was: [snapshot] regionserver will be 
deadlock when GlobalSnapshotOperation timeout happen)

 [snapshot] regionserver will be deadlock when GlobalSnapshotOperation timeout 
 happened
 --

 Key: HBASE-7451
 URL: https://issues.apache.org/jira/browse/HBASE-7451
 Project: HBase
  Issue Type: Bug
  Components: snapshots
Reporter: terry zhang
Assignee: terry zhang

 Hi Matteo Bertozzi and Jesse Yates, my observation is based on the code on GitHub: https://github.com/matteobertozzi/hbase/
 If we create a snapshot and hit a regionserver timeout, the RS will be locked and cannot take any puts. Please take a look at the log below:
 // regionserver snapshot timeout
 org.apache.hadoop.hbase.server.commit.distributed.DistributedCommitException: 
 org.apache.hadoop.hbase.server.errorhandling.exception.OperationAttemptTimeoutException:
  Timeout elapsed! Start:1356518666984, End:1356518667584, diff:600, max:600 ms
 at 
 org.apache.hadoop.hbase.server.commit.distributed.DistributedThreePhaseCommitErrorDispatcher.wrap(DistributedThreePhaseCommitErrorDispatcher.java:135)
 at 
 org.apache.hadoop.hbase.server.commit.distributed.DistributedThreePhaseCommitErrorDispatcher.operationTimeout(DistributedThreePhaseCommitErrorDispatcher.java:71)
 at 
 org.apache.hadoop.hbase.server.commit.ThreePhaseCommit$1.receiveError(ThreePhaseCommit.java:92)
 at 
 org.apache.hadoop.hbase.server.commit.ThreePhaseCommit$1.receiveError(ThreePhaseCommit.java:89)
 at 
 org.apache.hadoop.hbase.server.errorhandling.OperationAttemptTimer$1.run(OperationAttemptTimer.java:71)
 at java.util.TimerThread.mainLoop(Timer.java:512)
 at java.util.TimerThread.run(Timer.java:462)
 Caused by: 
 org.apache.hadoop.hbase.server.errorhandling.exception.OperationAttemptTimeoutException:
  Timeout elapsed! Start:1356518666984, End:1356518667584, diff:600, max:600 ms
 ... 3 more
 2012-12-26 18:44:57,211 DEBUG 
 org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running cleanup phase.
 2012-12-26 18:44:57,211 DEBUG 
 org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: 
 Cleanup snapshot - handled in sub-tasks on error
 2012-12-26 18:44:57,212 DEBUG org.apache.hadoop.hbase.serv
 //Waiting for 'commit allowed' latch and do not exist
 2012-12-26 18:44:57,211 DEBUG 
 org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running cleanup phase.
 2012-12-26 18:44:57,211 DEBUG 
 org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: 
 Cleanup snapshot - handled in sub-tasks on error
 2012-12-26 18:44:57,212 DEBUG 
 org.apache.hadoop.hbase.server.commit.TwoPhaseCommit: Running finish phase.
 2012-12-26 18:44:57,212 DEBUG 
 org.apache.hadoop.hbase.regionserver.snapshot.operation.SnapshotOperation: 
 Finish snapshot - handling in subtasks on error
 2012-12-26 18:44:57,212 WARN 
 org.apache.hadoop.hbase.server.errorhandling.OperationAttemptTimer: Timer 
 already marked completed, ignoring!
 2012-12-26 18:45:01,990 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
 for 'commit allowed' latch. (sleep:5000 ms)
 2012-12-26 18:45:06,990 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
 for 'commit allowed' latch. (sleep:5000 ms)
 2012-12-26 18:45:11,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
 for 'commit allowed' latch. (sleep:5000 ms)
 2012-12-26 18:45:16,991 DEBUG org.apache.hadoop.hbase.util.Threads: Waiting 
 for 'commit allowed' latch. (sleep:5000 ms)
 2012-12-26 18:45:17,002 INFO 
 org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
  Received children changed event:/hbase-TERRY-73/online-snapshot/prepare
 2012-12-26 18:45:17,002 INFO 
 org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
  Recieved start event.
 2012-12-26 18:45:17,002 DEBUG 
 org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
  Looking for new operations under 
 znode:/hbase-TERRY-73/online-snapshot/prepare
 2012-12-26 18:45:17,003 INFO 
 org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
  Received children changed event:/hbase-TERRY-73/online-snapshot/abort
 2012-12-26 18:45:17,003 INFO 
 org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
  Recieved abort event.
 2012-12-26 18:45:17,003 DEBUG 
 org.apache.hadoop.hbase.server.commit.distributed.zookeeper.ZKTwoPhaseCommitCohortMemberController:
  Checking for aborted operations on 

[jira] [Commented] (HBASE-6802) Export Snapshot

2012-11-20 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13501693#comment-13501693
 ] 

terry zhang commented on HBASE-6802:


Hi Jesse, where is your snapshot branch? Could you please give us the URL to check out the project? Thanks so much!

 Export Snapshot
 ---

 Key: HBASE-6802
 URL: https://issues.apache.org/jira/browse/HBASE-6802
 Project: HBase
  Issue Type: Sub-task
  Components: snapshots
Reporter: Matteo Bertozzi
Assignee: Matteo Bertozzi
 Fix For: hbase-6055, 0.96.0

 Attachments: HBASE-6802-v1.patch


 Export a snapshot to another cluster.
  - Copy the .snapshot/name folder with all the references
  - Copy the hfiles/hlogs needed by the snapshot
 Once the other cluster has the files and the snapshot information it can 
 restore the snapshot.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-6695) [Replication] Data will lose if RegionServer down during transferqueue

2012-09-12 Thread terry zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terry zhang updated HBASE-6695:
---

Attachment: HBASE-6695-4trunk_v3.patch

The trunk version of createEphemeralNodeAndWatch doesn't throw NodeExistsException when the node exists. Added a new function createNoneExistEphemeralNodeAndWatch that throws NodeExistsException when creating a node that already exists.
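
A rough sketch of what such a helper could look like, for illustration only (the real patch may differ):

{code:title=Sketch of the proposed helper (illustrative only)|borderStyle=solid}
// Like createEphemeralNodeAndWatch, but let NodeExistsException escape to the
// caller instead of swallowing it, so a failover that finds a leftover node
// can react to it.
public static void createNoneExistEphemeralNodeAndWatch(ZooKeeperWatcher zkw,
    String znode, byte[] data) throws KeeperException {
  try {
    zkw.getRecoverableZooKeeper().create(znode, data, Ids.OPEN_ACL_UNSAFE,
        CreateMode.EPHEMERAL);
  } catch (InterruptedException e) {
    zkw.interruptedException(e);
  }
  // a NodeExistsException thrown by create() propagates to the caller;
  // the caller then sets its own watch on the node it just created
}
{code}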

 [Replication] Data will lose if RegionServer down during transferqueue
 --

 Key: HBASE-6695
 URL: https://issues.apache.org/jira/browse/HBASE-6695
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.1
Reporter: terry zhang
Priority: Critical
 Fix For: 0.96.0, 0.94.3

 Attachments: HBASE-6695-4trunk.patch, HBASE-6695-4trunk_v2.patch, 
 HBASE-6695-4trunk_v3.patch, HBASE-6695.patch


 When we were testing the replication failover feature, we found that if we kill a regionserver while it is transferring the queue, only part of the hlog znodes are copied to the right path because the failover process is interrupted.
 Log:
 2012-08-29 12:20:05,660 INFO 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: 
 Moving dw92.kgb.sqa.cm4,60020,1346210789716's hlogs to my queue
 2012-08-29 12:20:05,765 DEBUG 
 org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
 dw92.kgb.sqa.cm4%2C60020%2C13462107 89716.1346213720708 with data 210508162
 2012-08-29 12:20:05,850 DEBUG 
 org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
 dw92.kgb.sqa.cm4%2C60020%2C13462107 89716.1346213886800 with data
 2012-08-29 12:20:05,938 DEBUG 
 org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
 dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213830559 with data
 2012-08-29 12:20:06,055 DEBUG 
 org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
 dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213775146 with data
 2012-08-29 12:20:06,277 WARN 
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: 
 Failed all from region=.ME
 TA.,,1.1028785192, hostname=dw93.kgb.sqa.cm4, port=60020
 java.util.concurrent.ExecutionException: java.net.ConnectException: 
 Connection refused
 at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
 at java.util.concurrent.FutureTask.get(FutureTask.java:83)
 at 
 ..
 This server is down .
 ZK node status:
 [zk: 10.232.98.77:2181(CONNECTED) 6] ls 
 /hbase-test3-repl/replication/rs/dw92.kgb.sqa.cm4,60020,1346210789716
 [lock, 1, 1-dw89.kgb.sqa.cm4,60020,1346202436268]
  
 dw92 is down , but Node dw92.kgb.sqa.cm4,60020,1346210789716 can't be deleted

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-6533) [replication] replication will block if WAL compress set differently in master and slave configuration

2012-09-12 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13453788#comment-13453788
 ] 

terry zhang commented on HBASE-6533:


Yes, Daniel and Stack. Right now replication can't work in hlog compress mode, because compress mode needs to read the hlog sequentially to construct the compressionContext dictionary. But replication doesn't read the entries in the hlog one by one (it uses seek), so it can only get a tag (dictIdx) from the hlog, and the original data does not exist in the compressionContext. Usually we get the error below:

java.lang.IndexOutOfBoundsException: index (2) must be less than size (1)
at 
com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:301)
at 
com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:280)
at 
org.apache.hadoop.hbase.regionserver.wal.LRUDictionary$BidirectionalLRUMap.get(LRUDictionary.java:122)
at 
org.apache.hadoop.hbase.regionserver.wal.LRUDictionary$BidirectionalLRUMap.access$000(LRUDictionary.java:69)
at 
org.apache.hadoop.hbase.regionserver.wal.LRUDictionary.getEntry(LRUDictionary.java:40)
at 
org.apache.hadoop.hbase.regionserver.wal.Compressor.readCompressed(Compressor.java:111)
at org.apache.hadoop.hbase.regionserver.wal.HLogKey.readFields(HLogKey.java:321)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1851)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1891)
at 
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:235)
at 
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:206)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.readAllEntriesToReplicateOrNextFile(ReplicationSource.java:435)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:311)
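
As a toy illustration of the mechanism described above (this is not HBase code, just the dictionary idea in miniature):

{code:title=Toy illustration only|borderStyle=solid}
// A dictionary-compressed stream writes a value in full only on its first
// occurrence; later entries carry just an index into the values seen so far.
// A reader that seeks past the first occurrence never filled its dictionary,
// so the index lookup fails just like the LRUDictionary.get() in the trace above.
java.util.List<byte[]> readerDict = new java.util.ArrayList<byte[]>();
// ... the reader seeks directly to a later entry without replaying earlier ones ...
int dictIdx = 2;                          // tag read from the seeked-to entry
byte[] value = readerDict.get(dictIdx);   // IndexOutOfBoundsException
{code}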

 [replication] replication will block if WAL compress set differently in 
 master and slave configuration
 --

 Key: HBASE-6533
 URL: https://issues.apache.org/jira/browse/HBASE-6533
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.0
Reporter: terry zhang
Assignee: terry zhang
Priority: Critical
 Fix For: 0.94.3

 Attachments: hbase-6533.patch


 As we know, in hbase 0.94.0 we have the configuration below:
   <property>
     <name>hbase.regionserver.wal.enablecompression</name>
     <value>true</value>
   </property>
 If we enable it in the master cluster and disable it in the slave cluster, then replication will not work. It will throw unwrapRemoteException again and again in the master cluster.
 2012-08-09 12:49:55,892 WARN 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Can't 
 replicate because of an error
  on the remote cluster: 
 java.io.IOException: IPC server unable to read call parameters: Error in 
 readFields
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
 at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at 
 org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
 at 
 org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:635)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:365)
 Caused by: org.apache.hadoop.ipc.RemoteException: IPC server unable to read 
 call parameters: Error in readFields
 at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:921)
 at 
 org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:151)
 at $Proxy13.replicateLogEntries(Unknown Source)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:616)
 ... 1 more 
 This is because Slave cluster can not parse the hlog entry .
 2012-08-09 14:46:05,891 WARN org.apache.hadoop.ipc.HBaseServer: Unable to 
 read call parameters for client 10.232.98.89
 java.io.IOException: Error in readFields
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:685)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:586)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:635)

[jira] [Commented] (HBASE-6533) [replication] replication will block if WAL compress set differently in master and slave configuration

2012-09-12 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13453791#comment-13453791
 ] 

terry zhang commented on HBASE-6533:


Can we disable hlog compress mode when we start replication?

{code:title=HRegionServer.java|borderStyle=solid}
 if (!conf.getBoolean(HConstants.REPLICATION_ENABLE_KEY, false)) {
   return;
 }
+if (conf.getBoolean(HConstants.ENABLE_WAL_COMPRESSION, false)) {
+  throw new RegionServerRunningException("Region server master cluster doesn't support " +
+      "Hlog working in compression mode!");
+}

 // read in the name of the source replication class from the config file.
 String sourceClassname = conf.get(HConstants.REPLICATION_SOURCE_SERVICE_CLASSNAME,
{code} 

Or we need to change replication so that it does not use seek when reading the hlog, and does not close the hlog again and again when we meet an EOF exception. Which one is better?

 [replication] replication will block if WAL compress set differently in 
 master and slave configuration
 --

 Key: HBASE-6533
 URL: https://issues.apache.org/jira/browse/HBASE-6533
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.0
Reporter: terry zhang
Assignee: terry zhang
Priority: Critical
 Fix For: 0.94.3

 Attachments: hbase-6533.patch


 As we know, in hbase 0.94.0 we have the configuration below:
   <property>
     <name>hbase.regionserver.wal.enablecompression</name>
     <value>true</value>
   </property>
 If we enable it in the master cluster and disable it in the slave cluster, then replication will not work. It will throw unwrapRemoteException again and again in the master cluster.
 2012-08-09 12:49:55,892 WARN 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Can't 
 replicate because of an error
  on the remote cluster: 
 java.io.IOException: IPC server unable to read call parameters: Error in 
 readFields
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
 at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at 
 org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
 at 
 org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:635)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:365)
 Caused by: org.apache.hadoop.ipc.RemoteException: IPC server unable to read 
 call parameters: Error in readFields
 at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:921)
 at 
 org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:151)
 at $Proxy13.replicateLogEntries(Unknown Source)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:616)
 ... 1 more 
 This is because Slave cluster can not parse the hlog entry .
 2012-08-09 14:46:05,891 WARN org.apache.hadoop.ipc.HBaseServer: Unable to 
 read call parameters for client 10.232.98.89
 java.io.IOException: Error in readFields
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:685)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:586)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:635)
 at 
 org.apache.hadoop.hbase.ipc.Invocation.readFields(Invocation.java:125)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:1292)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:1207)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:735)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.doRunLoop(HBaseServer.java:524)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:499)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 Caused by: java.io.EOFException
 at java.io.DataInputStream.readFully(DataInputStream.java:180)
 

[jira] [Commented] (HBASE-6533) [replication] replication will be block if WAL compress set differently in master and slave configuration

2012-09-05 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13448537#comment-13448537
 ] 

terry zhang commented on HBASE-6533:


This is because the master sends the hlog entry in compressed mode, but the slave does 
not know about it. So when the slave's IPC HBaseServer deserializes the buffer and reads 
the hlog entry fields, the error happens. We can let the master send the buffer in 
non-compressed mode; then, whether the master uses hlog compression or not, the slave 
can work fine.
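
A minimal sketch of that idea, assuming the 0.94 setCompressionContext setters on 
HLogKey and WALEdit (the exact shipping path may differ; this only illustrates the intent):

{code:title=ReplicationSource.java (sketch)|borderStyle=solid}
// Before shipping, drop the compression context so every entry is serialized
// as plain KeyValues, no matter how the master's WAL was written. The
// KeyValues read through the WAL dictionary are already expanded in memory.
private HLog.Entry[] stripCompression(HLog.Entry[] entries) {
  for (HLog.Entry entry : entries) {
    entry.getKey().setCompressionContext(null);
    entry.getEdit().setCompressionContext(null);
  }
  return entries;
}
{code}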

 [replication] replication will be block if WAL compress set differently in 
 master and slave configuration
 -

 Key: HBASE-6533
 URL: https://issues.apache.org/jira/browse/HBASE-6533
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.0
Reporter: terry zhang
Priority: Critical

 as we know in hbase 0.94.0 we have a configuration below
   <property>
     <name>hbase.regionserver.wal.enablecompression</name>
     <value>true</value>
   </property>
 if we enable it in master cluster and disable it in slave cluster . Then 
 replication will not work. It will throw unwrapRemoteException again and 
 again in master cluster.
 2012-08-09 12:49:55,892 WARN 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Can't 
 replicate because of an error
  on the remote cluster: 
 java.io.IOException: IPC server unable to read call parameters: Error in 
 readFields
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
 at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at 
 org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
 at 
 org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:635)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:365)
 Caused by: org.apache.hadoop.ipc.RemoteException: IPC server unable to read 
 call parameters: Error in readFields
 at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:921)
 at 
 org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:151)
 at $Proxy13.replicateLogEntries(Unknown Source)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:616)
 ... 1 more 
 This is because Slave cluster can not parse the hlog entry .
 2012-08-09 14:46:05,891 WARN org.apache.hadoop.ipc.HBaseServer: Unable to 
 read call parameters for client 10.232.98.89
 java.io.IOException: Error in readFields
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:685)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:586)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:635)
 at 
 org.apache.hadoop.hbase.ipc.Invocation.readFields(Invocation.java:125)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:1292)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:1207)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:735)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.doRunLoop(HBaseServer.java:524)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:499)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 Caused by: java.io.EOFException
 at java.io.DataInputStream.readFully(DataInputStream.java:180)
 at org.apache.hadoop.hbase.KeyValue.readFields(KeyValue.java:2254)
 at 
 org.apache.hadoop.hbase.regionserver.wal.WALEdit.readFields(WALEdit.java:146)
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLog$Entry.readFields(HLog.java:1767)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:682)
 ... 11 more 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more 

[jira] [Updated] (HBASE-6533) [replication] replication will be block if WAL compress set differently in master and slave configuration

2012-09-05 Thread terry zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terry zhang updated HBASE-6533:
---

Priority: Critical  (was: Major)

 [replication] replication will be block if WAL compress set differently in 
 master and slave configuration
 -

 Key: HBASE-6533
 URL: https://issues.apache.org/jira/browse/HBASE-6533
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.0
Reporter: terry zhang
Priority: Critical

 as we know in hbase 0.94.0 we have a configuration below
   <property>
     <name>hbase.regionserver.wal.enablecompression</name>
     <value>true</value>
   </property>
 if we enable it in master cluster and disable it in slave cluster . Then 
 replication will not work. It will throw unwrapRemoteException again and 
 again in master cluster.
 2012-08-09 12:49:55,892 WARN 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Can't 
 replicate because of an error
  on the remote cluster: 
 java.io.IOException: IPC server unable to read call parameters: Error in 
 readFields
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
 at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at 
 org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
 at 
 org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:635)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:365)
 Caused by: org.apache.hadoop.ipc.RemoteException: IPC server unable to read 
 call parameters: Error in readFields
 at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:921)
 at 
 org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:151)
 at $Proxy13.replicateLogEntries(Unknown Source)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:616)
 ... 1 more 
 This is because Slave cluster can not parse the hlog entry .
 2012-08-09 14:46:05,891 WARN org.apache.hadoop.ipc.HBaseServer: Unable to 
 read call parameters for client 10.232.98.89
 java.io.IOException: Error in readFields
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:685)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:586)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:635)
 at 
 org.apache.hadoop.hbase.ipc.Invocation.readFields(Invocation.java:125)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:1292)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:1207)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:735)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.doRunLoop(HBaseServer.java:524)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:499)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 Caused by: java.io.EOFException
 at java.io.DataInputStream.readFully(DataInputStream.java:180)
 at org.apache.hadoop.hbase.KeyValue.readFields(KeyValue.java:2254)
 at 
 org.apache.hadoop.hbase.regionserver.wal.WALEdit.readFields(WALEdit.java:146)
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLog$Entry.readFields(HLog.java:1767)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:682)
 ... 11 more 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-6533) [replication] replication will be block if WAL compress set differently in master and slave configuration

2012-09-05 Thread terry zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terry zhang updated HBASE-6533:
---

Attachment: hbase-6533.patch

 [replication] replication will be block if WAL compress set differently in 
 master and slave configuration
 -

 Key: HBASE-6533
 URL: https://issues.apache.org/jira/browse/HBASE-6533
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.0
Reporter: terry zhang
Priority: Critical
 Attachments: hbase-6533.patch


 as we know in hbase 0.94.0 we have a configuration below
   <property>
     <name>hbase.regionserver.wal.enablecompression</name>
     <value>true</value>
   </property>
 if we enable it in master cluster and disable it in slave cluster . Then 
 replication will not work. It will throw unwrapRemoteException again and 
 again in master cluster.
 2012-08-09 12:49:55,892 WARN 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Can't 
 replicate because of an error
  on the remote cluster: 
 java.io.IOException: IPC server unable to read call parameters: Error in 
 readFields
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
 at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at 
 org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
 at 
 org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:635)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:365)
 Caused by: org.apache.hadoop.ipc.RemoteException: IPC server unable to read 
 call parameters: Error in readFields
 at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:921)
 at 
 org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:151)
 at $Proxy13.replicateLogEntries(Unknown Source)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:616)
 ... 1 more 
 This is because Slave cluster can not parse the hlog entry .
 2012-08-09 14:46:05,891 WARN org.apache.hadoop.ipc.HBaseServer: Unable to 
 read call parameters for client 10.232.98.89
 java.io.IOException: Error in readFields
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:685)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:586)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:635)
 at 
 org.apache.hadoop.hbase.ipc.Invocation.readFields(Invocation.java:125)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:1292)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:1207)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:735)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.doRunLoop(HBaseServer.java:524)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:499)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 Caused by: java.io.EOFException
 at java.io.DataInputStream.readFully(DataInputStream.java:180)
 at org.apache.hadoop.hbase.KeyValue.readFields(KeyValue.java:2254)
 at 
 org.apache.hadoop.hbase.regionserver.wal.WALEdit.readFields(WALEdit.java:146)
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLog$Entry.readFields(HLog.java:1767)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:682)
 ... 11 more 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-6533) [replication] replication will be block if WAL compress set differently in master and slave configuration

2012-09-05 Thread terry zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terry zhang updated HBASE-6533:
---

Fix Version/s: 0.94.3

 [replication] replication will be block if WAL compress set differently in 
 master and slave configuration
 -

 Key: HBASE-6533
 URL: https://issues.apache.org/jira/browse/HBASE-6533
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.0
Reporter: terry zhang
Priority: Critical
 Fix For: 0.94.3

 Attachments: hbase-6533.patch


 as we know in hbase 0.94.0 we have a configuration below
   <property>
     <name>hbase.regionserver.wal.enablecompression</name>
     <value>true</value>
   </property>
 if we enable it in master cluster and disable it in slave cluster . Then 
 replication will not work. It will throw unwrapRemoteException again and 
 again in master cluster.
 2012-08-09 12:49:55,892 WARN 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Can't 
 replicate because of an error
  on the remote cluster: 
 java.io.IOException: IPC server unable to read call parameters: Error in 
 readFields
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
 at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at 
 org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
 at 
 org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:635)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:365)
 Caused by: org.apache.hadoop.ipc.RemoteException: IPC server unable to read 
 call parameters: Error in readFields
 at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:921)
 at 
 org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:151)
 at $Proxy13.replicateLogEntries(Unknown Source)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:616)
 ... 1 more 
 This is because Slave cluster can not parse the hlog entry .
 2012-08-09 14:46:05,891 WARN org.apache.hadoop.ipc.HBaseServer: Unable to 
 read call parameters for client 10.232.98.89
 java.io.IOException: Error in readFields
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:685)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:586)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:635)
 at 
 org.apache.hadoop.hbase.ipc.Invocation.readFields(Invocation.java:125)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:1292)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:1207)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:735)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.doRunLoop(HBaseServer.java:524)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:499)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 Caused by: java.io.EOFException
 at java.io.DataInputStream.readFully(DataInputStream.java:180)
 at org.apache.hadoop.hbase.KeyValue.readFields(KeyValue.java:2254)
 at 
 org.apache.hadoop.hbase.regionserver.wal.WALEdit.readFields(WALEdit.java:146)
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLog$Entry.readFields(HLog.java:1767)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:682)
 ... 11 more 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-6533) [replication] replication will be block if WAL compress set differently in master and slave configuration

2012-09-05 Thread terry zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terry zhang updated HBASE-6533:
---

Assignee: terry zhang

 [replication] replication will be block if WAL compress set differently in 
 master and slave configuration
 -

 Key: HBASE-6533
 URL: https://issues.apache.org/jira/browse/HBASE-6533
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.0
Reporter: terry zhang
Assignee: terry zhang
Priority: Critical
 Fix For: 0.94.3

 Attachments: hbase-6533.patch


 as we know in hbase 0.94.0 we have a configuration below
   <property>
     <name>hbase.regionserver.wal.enablecompression</name>
     <value>true</value>
   </property>
 if we enable it in master cluster and disable it in slave cluster . Then 
 replication will not work. It will throw unwrapRemoteException again and 
 again in master cluster.
 2012-08-09 12:49:55,892 WARN 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Can't 
 replicate because of an error
  on the remote cluster: 
 java.io.IOException: IPC server unable to read call parameters: Error in 
 readFields
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
 at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at 
 org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
 at 
 org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:635)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:365)
 Caused by: org.apache.hadoop.ipc.RemoteException: IPC server unable to read 
 call parameters: Error in readFields
 at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:921)
 at 
 org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:151)
 at $Proxy13.replicateLogEntries(Unknown Source)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:616)
 ... 1 more 
 This is because Slave cluster can not parse the hlog entry .
 2012-08-09 14:46:05,891 WARN org.apache.hadoop.ipc.HBaseServer: Unable to 
 read call parameters for client 10.232.98.89
 java.io.IOException: Error in readFields
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:685)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:586)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:635)
 at 
 org.apache.hadoop.hbase.ipc.Invocation.readFields(Invocation.java:125)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:1292)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:1207)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:735)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.doRunLoop(HBaseServer.java:524)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:499)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 Caused by: java.io.EOFException
 at java.io.DataInputStream.readFully(DataInputStream.java:180)
 at org.apache.hadoop.hbase.KeyValue.readFields(KeyValue.java:2254)
 at 
 org.apache.hadoop.hbase.regionserver.wal.WALEdit.readFields(WALEdit.java:146)
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLog$Entry.readFields(HLog.java:1767)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:682)
 ... 11 more 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HBASE-6719) [replication] Data will lose if open a Hlog failed more than maxRetriesMultiplier

2012-09-05 Thread terry zhang (JIRA)
terry zhang created HBASE-6719:
--

 Summary: [replication] Data will lose if open a Hlog failed more 
than maxRetriesMultiplier
 Key: HBASE-6719
 URL: https://issues.apache.org/jira/browse/HBASE-6719
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.1
Reporter: terry zhang
Assignee: terry zhang
Priority: Critical
 Fix For: 0.94.2


Please take a look at the code below

{code:title=ReplicationSource.java|borderStyle=solid}

protected boolean openReader(int sleepMultiplier) {
{
  ...
  catch (IOException ioe) {

  LOG.warn(peerClusterZnode + " Got: ", ioe);
  // TODO Need a better way to determinate if a file is really gone but
  // TODO without scanning all logs dir
  if (sleepMultiplier == this.maxRetriesMultiplier) {
    LOG.warn("Waited too long for this file, considering dumping");
    return !processEndOfFile(); // Open a file failed over maxRetriesMultiplier (default 10)
  }
}
return true;


  ...
}

  protected boolean processEndOfFile() {
if (this.queue.size() != 0) {// Skipped this Hlog . Data loss
  this.currentPath = null;
  this.position = 0;
  return true;
} else if (this.queueRecovered) {   // Terminate Failover Replication 
source thread ,data loss
  this.manager.closeRecoveredQueue(this);
  LOG.info("Finished recovering the queue");
  this.running = false;
  return true;
}
return false;
  }

{code} 


Sometimes HDFS will hit a problem while the Hlog file is actually OK, so after HDFS 
comes back, some data will be lost and cannot be found in the slave cluster.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-6719) [replication] Data will lose if open a Hlog failed more than maxRetriesMultiplier

2012-09-05 Thread terry zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terry zhang updated HBASE-6719:
---

Attachment: hbase-6719.patch

 [replication] Data will lose if open a Hlog failed more than 
 maxRetriesMultiplier
 -

 Key: HBASE-6719
 URL: https://issues.apache.org/jira/browse/HBASE-6719
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.1
Reporter: terry zhang
Assignee: terry zhang
Priority: Critical
 Fix For: 0.94.2

 Attachments: hbase-6719.patch


 Please Take a look below code
 {code:title=ReplicationSource.java|borderStyle=solid}
 protected boolean openReader(int sleepMultiplier) {
 {
   ...
   catch (IOException ioe) {
   LOG.warn(peerClusterZnode +  Got: , ioe);
   // TODO Need a better way to determinate if a file is really gone but
   // TODO without scanning all logs dir
   if (sleepMultiplier == this.maxRetriesMultiplier) {
 LOG.warn(Waited too long for this file, considering dumping);
 return !processEndOfFile(); // Open a file failed over 
 maxRetriesMultiplier(default 10)
   }
 }
 return true;
   ...
 }
   protected boolean processEndOfFile() {
 if (this.queue.size() != 0) {// Skipped this Hlog . Data loss
   this.currentPath = null;
   this.position = 0;
   return true;
 } else if (this.queueRecovered) {   // Terminate Failover Replication 
 source thread ,data loss
   this.manager.closeRecoveredQueue(this);
   LOG.info(Finished recovering the queue);
   this.running = false;
   return true;
 }
 return false;
   }
 {code} 
 Some Time HDFS will meet some problem but actually Hlog file is OK , So after 
 HDFS back  ,Some data will lose and can not find them back in slave cluster.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-6719) [replication] Data will lose if open a Hlog failed more than maxRetriesMultiplier

2012-09-05 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13448578#comment-13448578
 ] 

terry zhang commented on HBASE-6719:


I think we need to handle the IOException carefully and had better not skip the Hlog 
unless it is really corrupted. We can log this failure as fatal in the regionserver's 
log and skip the Hlog (by deleting its zk node manually) only if we have to.
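
As a rough illustration only (this is a fragment of the catch block in openReader; 
field names like this.fs and this.currentPath follow the snippet in the description, 
the check itself is hypothetical), the idea is to refuse to dump an hlog that still 
exists and still has data:

{code:title=ReplicationSource.java (sketch)|borderStyle=solid}
// Sketch: before giving up after maxRetriesMultiplier, check whether the file
// is really gone or empty. If it still holds data, log it as fatal and keep
// retrying so an operator can step in, instead of silently skipping the hlog.
try {
  FileStatus stat = this.fs.getFileStatus(this.currentPath);
  if (stat.getLen() > 0) {
    LOG.fatal("HLog " + this.currentPath + " still has " + stat.getLen()
        + " bytes, refusing to skip it");
    return true;                       // stay in the loop, do not dump the log
  }
} catch (FileNotFoundException fnfe) {
  // really gone, fall through to processEndOfFile()
}
return !processEndOfFile();
{code}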

 [replication] Data will lose if open a Hlog failed more than 
 maxRetriesMultiplier
 -

 Key: HBASE-6719
 URL: https://issues.apache.org/jira/browse/HBASE-6719
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.1
Reporter: terry zhang
Assignee: terry zhang
Priority: Critical
 Fix For: 0.94.2

 Attachments: hbase-6719.patch


 Please Take a look below code
 {code:title=ReplicationSource.java|borderStyle=solid}
 protected boolean openReader(int sleepMultiplier) {
 {
   ...
   catch (IOException ioe) {
   LOG.warn(peerClusterZnode +  Got: , ioe);
   // TODO Need a better way to determinate if a file is really gone but
   // TODO without scanning all logs dir
   if (sleepMultiplier == this.maxRetriesMultiplier) {
 LOG.warn(Waited too long for this file, considering dumping);
 return !processEndOfFile(); // Open a file failed over 
 maxRetriesMultiplier(default 10)
   }
 }
 return true;
   ...
 }
   protected boolean processEndOfFile() {
 if (this.queue.size() != 0) {// Skipped this Hlog . Data loss
   this.currentPath = null;
   this.position = 0;
   return true;
 } else if (this.queueRecovered) {   // Terminate Failover Replication 
 source thread ,data loss
   this.manager.closeRecoveredQueue(this);
   LOG.info(Finished recovering the queue);
   this.running = false;
   return true;
 }
 return false;
   }
 {code} 
 Some Time HDFS will meet some problem but actually Hlog file is OK , So after 
 HDFS back  ,Some data will lose and can not find them back in slave cluster.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-6719) [replication] Data will lose if open a Hlog failed more than maxRetriesMultiplier

2012-09-05 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13448584#comment-13448584
 ] 

terry zhang commented on HBASE-6719:


Now we can handle it like below (see the code sketch after this list):

hlog size = 0, Hlog queue = 0, recovery thread = yes: terminate the recovery thread (return !processEndOfFile())
hlog size = 0, Hlog queue = 0, recovery thread = no: continue the loop (return !processEndOfFile())
hlog size = 0, Hlog queue != 0, recovery thread = yes: skip the hlog (return !processEndOfFile())
hlog size = 0, Hlog queue != 0, recovery thread = no: skip the hlog (return !processEndOfFile())

hlog size = 1, Hlog queue = 0, recovery thread = yes: LOG as a fatal mistake in the regionserver's log
hlog size = 1, Hlog queue = 0, recovery thread = no: LOG as a fatal mistake in the regionserver's log
hlog size = 1, Hlog queue != 0, recovery thread = yes: LOG as a fatal mistake in the regionserver's log
hlog size = 1, Hlog queue != 0, recovery thread = no: LOG as a fatal mistake in the regionserver's log
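
A sketch of that decision table in code ("hlog size = 1" means the size is non-zero, as 
noted below; hlogHasData is a hypothetical flag for that, and processEndOfFile(), 
currentPath and LOG come from the snippet in the description):

{code:title=ReplicationSource.java (sketch)|borderStyle=solid}
// hlog size == 0: safe to move on; processEndOfFile() already decides between
// terminating the recovered queue, continuing the loop, or skipping the hlog.
// hlog size != 0: never skip silently, record a fatal entry and keep retrying.
private boolean handleOpenFailure(boolean hlogHasData) {
  if (hlogHasData) {
    LOG.fatal("Could not open " + this.currentPath + " but it still has data");
    return true;                      // keep the source alive and keep retrying
  }
  return !processEndOfFile();
}
{code}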

 [replication] Data will lose if open a Hlog failed more than 
 maxRetriesMultiplier
 -

 Key: HBASE-6719
 URL: https://issues.apache.org/jira/browse/HBASE-6719
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.1
Reporter: terry zhang
Assignee: terry zhang
Priority: Critical
 Fix For: 0.94.2

 Attachments: hbase-6719.patch


 Please Take a look below code
 {code:title=ReplicationSource.java|borderStyle=solid}
 protected boolean openReader(int sleepMultiplier) {
 {
   ...
   catch (IOException ioe) {
   LOG.warn(peerClusterZnode +  Got: , ioe);
   // TODO Need a better way to determinate if a file is really gone but
   // TODO without scanning all logs dir
   if (sleepMultiplier == this.maxRetriesMultiplier) {
 LOG.warn(Waited too long for this file, considering dumping);
 return !processEndOfFile(); // Open a file failed over 
 maxRetriesMultiplier(default 10)
   }
 }
 return true;
   ...
 }
   protected boolean processEndOfFile() {
 if (this.queue.size() != 0) {// Skipped this Hlog . Data loss
   this.currentPath = null;
   this.position = 0;
   return true;
 } else if (this.queueRecovered) {   // Terminate Failover Replication 
 source thread ,data loss
   this.manager.closeRecoveredQueue(this);
   LOG.info(Finished recovering the queue);
   this.running = false;
   return true;
 }
 return false;
   }
 {code} 
 Some Time HDFS will meet some problem but actually Hlog file is OK , So after 
 HDFS back  ,Some data will lose and can not find them back in slave cluster.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-6719) [replication] Data will lose if open a Hlog failed more than maxRetriesMultiplier

2012-09-05 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13448586#comment-13448586
 ] 

terry zhang commented on HBASE-6719:


hlog size = 1 means the hlog size is not 0 (hlog size != 0).

 [replication] Data will lose if open a Hlog failed more than 
 maxRetriesMultiplier
 -

 Key: HBASE-6719
 URL: https://issues.apache.org/jira/browse/HBASE-6719
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.1
Reporter: terry zhang
Assignee: terry zhang
Priority: Critical
 Fix For: 0.94.2

 Attachments: hbase-6719.patch


 Please Take a look below code
 {code:title=ReplicationSource.java|borderStyle=solid}
 protected boolean openReader(int sleepMultiplier) {
 {
   ...
   catch (IOException ioe) {
   LOG.warn(peerClusterZnode +  Got: , ioe);
   // TODO Need a better way to determinate if a file is really gone but
   // TODO without scanning all logs dir
   if (sleepMultiplier == this.maxRetriesMultiplier) {
 LOG.warn(Waited too long for this file, considering dumping);
 return !processEndOfFile(); // Open a file failed over 
 maxRetriesMultiplier(default 10)
   }
 }
 return true;
   ...
 }
   protected boolean processEndOfFile() {
 if (this.queue.size() != 0) {// Skipped this Hlog . Data loss
   this.currentPath = null;
   this.position = 0;
   return true;
 } else if (this.queueRecovered) {   // Terminate Failover Replication 
 source thread ,data loss
   this.manager.closeRecoveredQueue(this);
   LOG.info(Finished recovering the queue);
   this.running = false;
   return true;
 }
 return false;
   }
 {code} 
 Some Time HDFS will meet some problem but actually Hlog file is OK , So after 
 HDFS back  ,Some data will lose and can not find them back in slave cluster.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-6695) [Replication] Data will lose if RegionServer down during transferqueue

2012-09-03 Thread terry zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terry zhang updated HBASE-6695:
---

Attachment: HBASE-6695-4trunk_v2.patch

Check regionserver stopper during loop
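
A rough illustration of what "check the regionserver stopper during the loop" means here 
(the method and field names are assumptions, not the actual patch):

{code:title=ReplicationZookeeper.java (sketch)|borderStyle=solid}
// While copying each hlog znode from the dead RS's queue, bail out early if
// this region server is itself being stopped, so we never silently abandon a
// half-transferred queue.
for (String hlog : hlogsByQueue) {
  if (this.stopper != null && this.stopper.isStopped()) {
    LOG.warn("Stopped while transferring " + rsZnode + "'s hlogs, aborting the copy");
    return null;                  // caller treats null as "nothing transferred"
  }
  // ... copy the znode for this hlog as before ...
}
{code}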

 [Replication] Data will lose if RegionServer down during transferqueue
 --

 Key: HBASE-6695
 URL: https://issues.apache.org/jira/browse/HBASE-6695
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.1
Reporter: terry zhang
Priority: Critical
 Fix For: 0.96.0, 0.94.3

 Attachments: HBASE-6695-4trunk.patch, HBASE-6695-4trunk_v2.patch, 
 HBASE-6695.patch


 When we were testing the replication failover feature, we found that if we kill a 
 regionserver during its transferqueue, only part of the hlog znodes are copied to 
 the right path because the failover process is interrupted. 
 Log:
 2012-08-29 12:20:05,660 INFO 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: 
 Moving dw92.kgb.sqa.cm4,60020,1346210789716's hlogs to my queue
 2012-08-29 12:20:05,765 DEBUG 
 org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
 dw92.kgb.sqa.cm4%2C60020%2C13462107 89716.1346213720708 with data 210508162
 2012-08-29 12:20:05,850 DEBUG 
 org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
 dw92.kgb.sqa.cm4%2C60020%2C13462107 89716.1346213886800 with data
 2012-08-29 12:20:05,938 DEBUG 
 org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
 dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213830559 with data
 2012-08-29 12:20:06,055 DEBUG 
 org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
 dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213775146 with data
 2012-08-29 12:20:06,277 WARN 
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: 
 Failed all from region=.META.,,1.1028785192, hostname=dw93.kgb.sqa.cm4, port=60020
 java.util.concurrent.ExecutionException: java.net.ConnectException: 
 Connection refused
 at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
 at java.util.concurrent.FutureTask.get(FutureTask.java:83)
 at 
 ..
 This server is down .
 ZK node status:
 [zk: 10.232.98.77:2181(CONNECTED) 6] ls 
 /hbase-test3-repl/replication/rs/dw92.kgb.sqa.cm4,60020,1346210789716
 [lock, 1, 1-dw89.kgb.sqa.cm4,60020,1346202436268]
  
 dw92 is down , but Node dw92.kgb.sqa.cm4,60020,1346210789716 can't be deleted

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-6695) [Replication] Data will lose if RegionServer down during transferqueue

2012-09-02 Thread terry zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terry zhang updated HBASE-6695:
---

Attachment: HBASE-6695-4trunk.patch

add patch for trunk

 [Replication] Data will lose if RegionServer down during transferqueue
 --

 Key: HBASE-6695
 URL: https://issues.apache.org/jira/browse/HBASE-6695
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.1
Reporter: terry zhang
Priority: Critical
 Fix For: 0.96.0, 0.94.3

 Attachments: HBASE-6695-4trunk.patch, HBASE-6695.patch


 When we ware testing Replication failover feature we found if we kill a 
 regionserver during it transferqueue ,we found only part of the hlog znode 
 copy to the right path because failover process is interrupted. 
 Log:
 2012-08-29 12:20:05,660 INFO 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: 
 Moving dw92.kgb.sqa.cm4,60020,1346210789716's hlogs to my queue
 2012-08-29 12:20:05,765 DEBUG 
 org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
 dw92.kgb.sqa.cm4%2C60020%2C13462107 89716.1346213720708 with data 210508162
 2012-08-29 12:20:05,850 DEBUG 
 org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
 dw92.kgb.sqa.cm4%2C60020%2C13462107 89716.1346213886800 with data
 2012-08-29 12:20:05,938 DEBUG 
 org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
 dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213830559 with data
 2012-08-29 12:20:06,055 DEBUG 
 org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
 dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213775146 with data
 2012-08-29 12:20:06,277 WARN 
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: 
 Failed all from region=.ME
 TA.,,1.1028785192, hostname=dw93.kgb.sqa.cm4, port=60020
 java.util.concurrent.ExecutionException: java.net.ConnectException: 
 Connection refused
 at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
 at java.util.concurrent.FutureTask.get(FutureTask.java:83)
 at 
 ..
 This server is down .
 ZK node status:
 [zk: 10.232.98.77:2181(CONNECTED) 6] ls 
 /hbase-test3-repl/replication/rs/dw92.kgb.sqa.cm4,60020,1346210789716
 [lock, 1, 1-dw89.kgb.sqa.cm4,60020,1346202436268]
  
 dw92 is down , but Node dw92.kgb.sqa.cm4,60020,1346210789716 can't be deleted

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HBASE-6700) [replication] replication node will never delete if copy newQueues size is 0

2012-08-31 Thread terry zhang (JIRA)
terry zhang created HBASE-6700:
--

 Summary: [replication] replication node will never delete if copy 
newQueues size is 0
 Key: HBASE-6700
 URL: https://issues.apache.org/jira/browse/HBASE-6700
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.1
Reporter: terry zhang


Please check the code below

{code:title=ReplicationSourceManager.java|borderStyle=solid}
// NodeFailoverWorker class
public void run() {
{
...

  LOG.info("Moving " + rsZnode + "'s hlogs to my queue");
  SortedMap<String, SortedSet<String>> newQueues =
      zkHelper.copyQueuesFromRS(rsZnode);   // Node create here*
  zkHelper.deleteRsQueues(rsZnode); 
  if (newQueues == null || newQueues.size() == 0) {
return;  
  }
...
}


  public void closeRecoveredQueue(ReplicationSourceInterface src) {
    LOG.info("Done with the recovered queue " + src.getPeerClusterZnode());
    this.oldsources.remove(src);
    this.zkHelper.deleteSource(src.getPeerClusterZnode(), false);  // Node delete here*
  }
{code} 

So from the code we can see that if newQueues == null or newQueues.size() == 0, the 
failover replication source will never start and the failover zk node will never be deleted.


e.g. the failover node below will never be deleted:

[zk: 10.232.98.77:2181(CONNECTED) 16] ls 
/hbase-test3-repl/replication/rs/dw93.kgb.sqa.cm4,60020,1346337383956/1-dw93.kgb.sqa.cm4,60020,1346309263932-dw91.kgb.sqa.cm4,60020,1346307150041-dw89.kgb.sqa.cm4,60020,1346307911711-dw93.kgb.sqa.cm4,60020,1346312019213-dw88.kgb.sqa.cm4,60020,1346311774939-dw89.kgb.sqa.cm4,60020,1346312314229-dw93.kgb.sqa.cm4,60020,1346312524307-dw88.kgb.sqa.cm4,60020,1346313203367-dw89.kgb.sqa.cm4,60020,1346313944402-dw88.kgb.sqa.cm4,60020,1346314214286-dw91.kgb.sqa.cm4,60020,1346315119613-dw93.kgb.sqa.cm4,60020,1346314186436-dw88.kgb.sqa.cm4,60020,1346315594396-dw89.kgb.sqa.cm4,60020,1346315909491-dw92.kgb.sqa.cm4,60020,1346315315634-dw89.kgb.sqa.cm4,60020,1346316742242-dw93.kgb.sqa.cm4,60020,1346317604055-dw92.kgb.sqa.cm4,60020,1346318098972-dw91.kgb.sqa.cm4,60020,1346317855650-dw93.kgb.sqa.cm4,60020,1346318532530-dw92.kgb.sqa.cm4,60020,1346318573238-dw89.kgb.sqa.cm4,60020,1346321299040-dw91.kgb.sqa.cm4,60020,1346321304393-dw92.kgb.sqa.cm4,60020,1346325755894-dw89.kgb.sqa.cm4,60020,1346326520895-dw91.kgb.sqa.cm4,60020,1346328246992-dw92.kgb.sqa.cm4,60020,1346327290653-dw93.kgb.sqa.cm4,60020,1346337303018-dw91.kgb.sqa.cm4,60020,1346337318929
[]   <= Empty node
   



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-6700) [replication] replication node will never delete if copy newQueues size is 0

2012-08-31 Thread terry zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terry zhang updated HBASE-6700:
---

Description: 
Please check code below

{code:title=ReplicationSourceManager.java|borderStyle=solid}
// NodeFailoverWorker class
public void run() {
{
...

  LOG.info(Moving  + rsZnode + 's hlogs to my queue);
  SortedMapString, SortedSetString newQueues =
  zkHelper.copyQueuesFromRS(rsZnode);   // Node create here*
  zkHelper.deleteRsQueues(rsZnode); 
  if (newQueues == null || newQueues.size() == 0) {
return;  
  }
...
}


  public void closeRecoveredQueue(ReplicationSourceInterface src) {
LOG.info(Done with the recovered queue  + src.getPeerClusterZnode());
this.oldsources.remove(src);
this.zkHelper.deleteSource(src.getPeerClusterZnode(), false);  // Node 
delete here*
  }
{code} 

So from code we can see if newQueues == null or newQueues.size() == 0, Failover 
replication Source will never start and the failover zk node will never deleted.


eg below failover node will never be delete:

[zk: 10.232.98.77:2181(CONNECTED) 16] ls 
/hbase-test3-repl/replication/rs/dw93.kgb.sqa.cm4,60020,1346337383956/1-dw93.kgb.sqa.cm4,60020,1346309263932-dw91.kgb.sqa.cm4,60020,1346307150041-dw89.kgb.sqa.cm4,60020,1346307911711-dw93.kgb.sqa.cm4,60020,1346312019213-dw88.kgb.sqa.cm4,60020,1346311774939-dw89.kgb.sqa.cm4,60020,1346312314229-dw93.kgb.sqa.cm4,60020,1346312524307-dw88.kgb.sqa.cm4,60020,1346313203367-dw89.kgb.sqa.cm4,60020,1346313944402-dw88.kgb.sqa.cm4,60020,1346314214286-dw91.kgb.sqa.cm4,60020,1346315119613-dw93.kgb.sqa.cm4,60020,1346314186436-dw88.kgb.sqa.cm4,60020,1346315594396-dw89.kgb.sqa.cm4,60020,1346315909491-dw92.kgb.sqa.cm4,60020,1346315315634-dw89.kgb.sqa.cm4,60020,1346316742242-dw93.kgb.sqa.cm4,60020,1346317604055-dw92.kgb.sqa.cm4,60020,1346318098972-dw91.kgb.sqa.cm4,60020,1346317855650-dw93.kgb.sqa.cm4,60020,1346318532530-dw92.kgb.sqa.cm4,60020,1346318573238-dw89.kgb.sqa.cm4,60020,1346321299040-dw91.kgb.sqa.cm4,60020,1346321304393-dw92.kgb.sqa.cm4,60020,1346325755894-dw89.kgb.sqa.cm4,60020,1346326520895-dw91.kgb.sqa.cm4,60020,1346328246992-dw92.kgb.sqa.cm4,60020,1346327290653-dw93.kgb.sqa.cm4,60020,1346337303018-dw91.kgb.sqa.cm4,60020,1346337318929
[] 
   



  was:
Please check code below

{code:title=ReplicationSourceManager.java|borderStyle=solid}
// NodeFailoverWorker class
public void run() {
{
...

  LOG.info(Moving  + rsZnode + 's hlogs to my queue);
  SortedMapString, SortedSetString newQueues =
  zkHelper.copyQueuesFromRS(rsZnode);   // Node create here*
  zkHelper.deleteRsQueues(rsZnode); 
  if (newQueues == null || newQueues.size() == 0) {
return;  
  }
...
}


  public void closeRecoveredQueue(ReplicationSourceInterface src) {
LOG.info(Done with the recovered queue  + src.getPeerClusterZnode());
this.oldsources.remove(src);
this.zkHelper.deleteSource(src.getPeerClusterZnode(), false);  // Node 
delete here*
  }
{code} 

So from code we can see if newQueues == null or newQueues.size() == 0, Failover 
replication Source will never start and the failover zk node will never deleted.


eg below failover node will never be delete:

[zk: 10.232.98.77:2181(CONNECTED) 16] ls 
/hbase-test3-repl/replication/rs/dw93.kgb.sqa.cm4,60020,1346337383956/1-dw93.kgb.sqa.cm4,60
020,1346309263932-dw91.kgb.sqa.cm4,60020,1346307150041-dw89.kgb.sqa.cm4,60020,1346307911711-dw93.kgb.sqa.cm4,60020,1346312019213-dw8
8.kgb.sqa.cm4,60020,1346311774939-dw89.kgb.sqa.cm4,60020,1346312314229-dw93.kgb.sqa.cm4,60020,1346312524307-dw88.kgb.sqa.cm4,60020,1
346313203367-dw89.kgb.sqa.cm4,60020,1346313944402-dw88.kgb.sqa.cm4,60020,1346314214286-dw91.kgb.sqa.cm4,60020,1346315119613-dw93.kgb
.sqa.cm4,60020,1346314186436-dw88.kgb.sqa.cm4,60020,1346315594396-dw89.kgb.sqa.cm4,60020,1346315909491-dw92.kgb.sqa.cm4,60020,134631
5315634-dw89.kgb.sqa.cm4,60020,1346316742242-dw93.kgb.sqa.cm4,60020,1346317604055-dw92.kgb.sqa.cm4,60020,1346318098972-dw91.kgb.sqa.
cm4,60020,1346317855650-dw93.kgb.sqa.cm4,60020,1346318532530-dw92.kgb.sqa.cm4,60020,1346318573238-dw89.kgb.sqa.cm4,60020,13463212990
40-dw91.kgb.sqa.cm4,60020,1346321304393-dw92.kgb.sqa.cm4,60020,1346325755894-dw89.kgb.sqa.cm4,60020,1346326520895-dw91.kgb.sqa.cm4,6
0020,1346328246992-dw92.kgb.sqa.cm4,60020,1346327290653-dw93.kgb.sqa.cm4,60020,1346337303018-dw91.kgb.sqa.cm4,60020,1346337318929
[] *= Empty node *
   




 [replication] replication node will never delete if copy newQueues size is 0
 

 Key: HBASE-6700
 URL: https://issues.apache.org/jira/browse/HBASE-6700
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.1
Reporter: terry zhang

 Please check code below
 

[jira] [Updated] (HBASE-6700) [replication] replication node will never delete if copy newQueues size is 0

2012-08-31 Thread terry zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terry zhang updated HBASE-6700:
---

Description: 
Please check code below

{code:title=ReplicationSourceManager.java|borderStyle=solid}
// NodeFailoverWorker class
public void run() {
{
...

  LOG.info(Moving  + rsZnode + 's hlogs to my queue);
  SortedMapString, SortedSetString newQueues =
  zkHelper.copyQueuesFromRS(rsZnode);   // Node create here*
  zkHelper.deleteRsQueues(rsZnode); 
  if (newQueues == null || newQueues.size() == 0) {
return;  
  }
...
}


  public void closeRecoveredQueue(ReplicationSourceInterface src) {
LOG.info(Done with the recovered queue  + src.getPeerClusterZnode());
this.oldsources.remove(src);
this.zkHelper.deleteSource(src.getPeerClusterZnode(), false);  // Node 
delete here*
  }
{code} 

So from code we can see if newQueues == null or newQueues.size() == 0, Failover 
replication Source will never start and the failover zk node will never deleted.


eg below failover node will never be delete:

[zk: 10.232.98.77:2181(CONNECTED) 16] ls 
/hbase-test3-repl/replication/rs/dw93.kgb.sqa.cm4,60020,1346337383956/1-dw93.kgb.sqa.cm4,60020,1346309263932-dw91.kgb.sqa.cm4,60020,1346307150041-dw89.kgb.sqa.cm4,60020,1346307911711-dw93.kgb.sqa.cm4,60020,1346312019213-dw88.kgb.sqa.cm4,60020,1346311774939-dw89.kgb.sqa.cm4,60020,1346312314229-dw93.kgb.sqa.cm4,60020,1346312524307-dw88.kgb.sqa.cm4,60020,1346313203367-dw89.kgb.sqa.cm4,60020,1346313944402-dw88.kgb.sqa.cm4,60020,1346314214286-dw91.kgb.sqa.cm4,60020,1346315119613-dw93.kgb.sqa.cm4,60020,1346314186436-dw88.kgb.sqa.cm4,60020,1346315594396-dw89.kgb.sqa.cm4,60020,1346315909491-dw92.kgb.sqa.cm4,60020,1346315315634-dw89.kgb.sqa.cm4,60020,1346316742242-dw93.kgb.sqa.cm4,60020,1346317604055-dw92.kgb.sqa.cm4,60020,1346318098972-dw91.kgb.sqa.cm4,60020,1346317855650-dw93.kgb.sqa.cm4,60020,1346318532530-dw92.kgb.sqa.cm4,60020,1346318573238-dw89.kgb.sqa.cm4,60020,1346321299040-dw91.kgb.sqa.cm4,60020,1346321304393-dw92.kgb.sqa.cm4,60020,1346325755894-dw89.kgb.sqa.cm4,60020,1346326520895-dw91.kgb.sqa.cm4,60020,1346328246992-dw92.kgb.sqa.cm4,60020,1346327290653-dw93.kgb.sqa.cm4,60020,1346337303018-dw91.kgb.sqa.cm4,60020,1346337318929
[] // empty node will never be deleted
   



  was:
Please check code below

{code:title=ReplicationSourceManager.java|borderStyle=solid}
// NodeFailoverWorker class
public void run() {
{
...

  LOG.info(Moving  + rsZnode + 's hlogs to my queue);
  SortedMapString, SortedSetString newQueues =
  zkHelper.copyQueuesFromRS(rsZnode);   // Node create here*
  zkHelper.deleteRsQueues(rsZnode); 
  if (newQueues == null || newQueues.size() == 0) {
return;  
  }
...
}


  public void closeRecoveredQueue(ReplicationSourceInterface src) {
LOG.info(Done with the recovered queue  + src.getPeerClusterZnode());
this.oldsources.remove(src);
this.zkHelper.deleteSource(src.getPeerClusterZnode(), false);  // Node 
delete here*
  }
{code} 

So from code we can see if newQueues == null or newQueues.size() == 0, Failover 
replication Source will never start and the failover zk node will never deleted.


eg below failover node will never be delete:

[zk: 10.232.98.77:2181(CONNECTED) 16] ls 
/hbase-test3-repl/replication/rs/dw93.kgb.sqa.cm4,60020,1346337383956/1-dw93.kgb.sqa.cm4,60020,1346309263932-dw91.kgb.sqa.cm4,60020,1346307150041-dw89.kgb.sqa.cm4,60020,1346307911711-dw93.kgb.sqa.cm4,60020,1346312019213-dw88.kgb.sqa.cm4,60020,1346311774939-dw89.kgb.sqa.cm4,60020,1346312314229-dw93.kgb.sqa.cm4,60020,1346312524307-dw88.kgb.sqa.cm4,60020,1346313203367-dw89.kgb.sqa.cm4,60020,1346313944402-dw88.kgb.sqa.cm4,60020,1346314214286-dw91.kgb.sqa.cm4,60020,1346315119613-dw93.kgb.sqa.cm4,60020,1346314186436-dw88.kgb.sqa.cm4,60020,1346315594396-dw89.kgb.sqa.cm4,60020,1346315909491-dw92.kgb.sqa.cm4,60020,1346315315634-dw89.kgb.sqa.cm4,60020,1346316742242-dw93.kgb.sqa.cm4,60020,1346317604055-dw92.kgb.sqa.cm4,60020,1346318098972-dw91.kgb.sqa.cm4,60020,1346317855650-dw93.kgb.sqa.cm4,60020,1346318532530-dw92.kgb.sqa.cm4,60020,1346318573238-dw89.kgb.sqa.cm4,60020,1346321299040-dw91.kgb.sqa.cm4,60020,1346321304393-dw92.kgb.sqa.cm4,60020,1346325755894-dw89.kgb.sqa.cm4,60020,1346326520895-dw91.kgb.sqa.cm4,60020,1346328246992-dw92.kgb.sqa.cm4,60020,1346327290653-dw93.kgb.sqa.cm4,60020,1346337303018-dw91.kgb.sqa.cm4,60020,1346337318929
[] 
   




 [replication] replication node will never delete if copy newQueues size is 0
 

 Key: HBASE-6700
 URL: https://issues.apache.org/jira/browse/HBASE-6700
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.1
Reporter: terry zhang

 Please check code below
 

[jira] [Commented] (HBASE-6695) [Replication] Data will lose if RegionServer down during transferqueue

2012-08-31 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13445727#comment-13445727
 ] 

terry zhang commented on HBASE-6695:


[~lhofhansl]
{noformat} 
logQueue.add(hlog);
+ ZKUtil.deleteNodeRecursively(this.zookeeper, z);
}
{noformat} 
If we delete the hlog zk node right after copying it to the new rs, then one HLog 
won't be replayed by two region servers.

 [Replication] Data will lose if RegionServer down during transferqueue
 --

 Key: HBASE-6695
 URL: https://issues.apache.org/jira/browse/HBASE-6695
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.1
Reporter: terry zhang
Priority: Critical
 Fix For: 0.96.0, 0.94.3

 Attachments: HBASE-6695.patch


 When we ware testing Replication failover feature we found if we kill a 
 regionserver during it transferqueue ,we found only part of the hlog znode 
 copy to the right path because failover process is interrupted. 
 Log:
 2012-08-29 12:20:05,660 INFO 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: 
 Moving dw92.kgb.sqa.cm4,60020,1346210789716's hlogs to my queue
 2012-08-29 12:20:05,765 DEBUG 
 org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
 dw92.kgb.sqa.cm4%2C60020%2C13462107 89716.1346213720708 with data 210508162
 2012-08-29 12:20:05,850 DEBUG 
 org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
 dw92.kgb.sqa.cm4%2C60020%2C13462107 89716.1346213886800 with data
 2012-08-29 12:20:05,938 DEBUG 
 org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
 dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213830559 with data
 2012-08-29 12:20:06,055 DEBUG 
 org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
 dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213775146 with data
 2012-08-29 12:20:06,277 WARN 
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: 
 Failed all from region=.ME
 TA.,,1.1028785192, hostname=dw93.kgb.sqa.cm4, port=60020
 java.util.concurrent.ExecutionException: java.net.ConnectException: 
 Connection refused
 at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
 at java.util.concurrent.FutureTask.get(FutureTask.java:83)
 at 
 ..
 This server is down .
 ZK node status:
 [zk: 10.232.98.77:2181(CONNECTED) 6] ls 
 /hbase-test3-repl/replication/rs/dw92.kgb.sqa.cm4,60020,1346210789716
 [lock, 1, 1-dw89.kgb.sqa.cm4,60020,1346202436268]
  
 dw92 is down , but Node dw92.kgb.sqa.cm4,60020,1346210789716 can't be deleted

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-6700) [replication] replication node will never delete if copy newQueues size is 0

2012-08-31 Thread terry zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terry zhang updated HBASE-6700:
---

Attachment: HBASE-6700.patch

 [replication] replication node will never delete if copy newQueues size is 0
 

 Key: HBASE-6700
 URL: https://issues.apache.org/jira/browse/HBASE-6700
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.1
Reporter: terry zhang
 Attachments: HBASE-6700.patch


 Please check code below
 {code:title=ReplicationSourceManager.java|borderStyle=solid}
 // NodeFailoverWorker class
 public void run() {
 {
 ...
   LOG.info(Moving  + rsZnode + 's hlogs to my queue);
   SortedMapString, SortedSetString newQueues =
   zkHelper.copyQueuesFromRS(rsZnode);   // Node create here*
   zkHelper.deleteRsQueues(rsZnode); 
   if (newQueues == null || newQueues.size() == 0) {
 return;  
   }
 ...
 }
   public void closeRecoveredQueue(ReplicationSourceInterface src) {
 LOG.info(Done with the recovered queue  + src.getPeerClusterZnode());
 this.oldsources.remove(src);
 this.zkHelper.deleteSource(src.getPeerClusterZnode(), false);  // Node 
 delete here*
   }
 {code} 
 So from code we can see if newQueues == null or newQueues.size() == 0, 
 Failover replication Source will never start and the failover zk node will 
 never deleted.
 eg below failover node will never be delete:
 [zk: 10.232.98.77:2181(CONNECTED) 16] ls 
 /hbase-test3-repl/replication/rs/dw93.kgb.sqa.cm4,60020,1346337383956/1-dw93.kgb.sqa.cm4,60020,1346309263932-dw91.kgb.sqa.cm4,60020,1346307150041-dw89.kgb.sqa.cm4,60020,1346307911711-dw93.kgb.sqa.cm4,60020,1346312019213-dw88.kgb.sqa.cm4,60020,1346311774939-dw89.kgb.sqa.cm4,60020,1346312314229-dw93.kgb.sqa.cm4,60020,1346312524307-dw88.kgb.sqa.cm4,60020,1346313203367-dw89.kgb.sqa.cm4,60020,1346313944402-dw88.kgb.sqa.cm4,60020,1346314214286-dw91.kgb.sqa.cm4,60020,1346315119613-dw93.kgb.sqa.cm4,60020,1346314186436-dw88.kgb.sqa.cm4,60020,1346315594396-dw89.kgb.sqa.cm4,60020,1346315909491-dw92.kgb.sqa.cm4,60020,1346315315634-dw89.kgb.sqa.cm4,60020,1346316742242-dw93.kgb.sqa.cm4,60020,1346317604055-dw92.kgb.sqa.cm4,60020,1346318098972-dw91.kgb.sqa.cm4,60020,1346317855650-dw93.kgb.sqa.cm4,60020,1346318532530-dw92.kgb.sqa.cm4,60020,1346318573238-dw89.kgb.sqa.cm4,60020,1346321299040-dw91.kgb.sqa.cm4,60020,1346321304393-dw92.kgb.sqa.cm4,60020,1346325755894-dw89.kgb.sqa.cm4,60020,1346326520895-dw91.kgb.sqa.cm4,60020,1346328246992-dw92.kgb.sqa.cm4,60020,1346327290653-dw93.kgb.sqa.cm4,60020,1346337303018-dw91.kgb.sqa.cm4,60020,1346337318929
 [] // empty node will never be deleted


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-6700) [replication] replication node will never delete if copy newQueues size is 0

2012-08-31 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13445734#comment-13445734
 ] 

terry zhang commented on HBASE-6700:


We can let the NodeFailoverWorker create the newClusterZnode only after checking 
the hlog queue size.
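
A minimal sketch of that idea, assuming the NodeFailoverWorker.run() shown above 
(readQueuesFromRS is a hypothetical read-only helper; the actual change is in the 
attached HBASE-6700.patch):

{code:title=NodeFailoverWorker.run() sketch|borderStyle=solid}
public void run() {
  // Peek at the dead RS's queues first, and only create the recovered-queue
  // znode under this RS once we know there is something to recover;
  // otherwise nothing will ever call closeRecoveredQueue() to delete it.
  SortedMap<String, SortedSet<String>> queues =
      zkHelper.readQueuesFromRS(rsZnode);      // hypothetical read-only helper
  if (queues == null || queues.isEmpty()) {
    zkHelper.deleteRsQueues(rsZnode);          // nothing to fail over
    return;
  }
  SortedMap<String, SortedSet<String>> newQueues =
      zkHelper.copyQueuesFromRS(rsZnode);      // znode is created only now
  zkHelper.deleteRsQueues(rsZnode);
  // ... start the recovered ReplicationSource as before; it will call
  // closeRecoveredQueue() and delete the znode when it is done ...
}
{code}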

 [replication] replication node will never delete if copy newQueues size is 0
 

 Key: HBASE-6700
 URL: https://issues.apache.org/jira/browse/HBASE-6700
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.1
Reporter: terry zhang
 Attachments: HBASE-6700.patch


 Please check code below
 {code:title=ReplicationSourceManager.java|borderStyle=solid}
 // NodeFailoverWorker class
 public void run() {
   ...
   LOG.info("Moving " + rsZnode + "'s hlogs to my queue");
   SortedMap<String, SortedSet<String>> newQueues =
       zkHelper.copyQueuesFromRS(rsZnode);   // znode created here
   zkHelper.deleteRsQueues(rsZnode);
   if (newQueues == null || newQueues.size() == 0) {
     return;
   }
   ...
 }

 public void closeRecoveredQueue(ReplicationSourceInterface src) {
   LOG.info("Done with the recovered queue " + src.getPeerClusterZnode());
   this.oldsources.remove(src);
   this.zkHelper.deleteSource(src.getPeerClusterZnode(), false);  // znode deleted here
 }
 {code} 
 So from the code we can see that if newQueues == null or newQueues.size() == 0, 
 the failover replication source will never start and the failover zk node will 
 never be deleted.
 E.g., the failover node below will never be deleted:
 [zk: 10.232.98.77:2181(CONNECTED) 16] ls 
 /hbase-test3-repl/replication/rs/dw93.kgb.sqa.cm4,60020,1346337383956/1-dw93.kgb.sqa.cm4,60020,1346309263932-dw91.kgb.sqa.cm4,60020,1346307150041-dw89.kgb.sqa.cm4,60020,1346307911711-dw93.kgb.sqa.cm4,60020,1346312019213-dw88.kgb.sqa.cm4,60020,1346311774939-dw89.kgb.sqa.cm4,60020,1346312314229-dw93.kgb.sqa.cm4,60020,1346312524307-dw88.kgb.sqa.cm4,60020,1346313203367-dw89.kgb.sqa.cm4,60020,1346313944402-dw88.kgb.sqa.cm4,60020,1346314214286-dw91.kgb.sqa.cm4,60020,1346315119613-dw93.kgb.sqa.cm4,60020,1346314186436-dw88.kgb.sqa.cm4,60020,1346315594396-dw89.kgb.sqa.cm4,60020,1346315909491-dw92.kgb.sqa.cm4,60020,1346315315634-dw89.kgb.sqa.cm4,60020,1346316742242-dw93.kgb.sqa.cm4,60020,1346317604055-dw92.kgb.sqa.cm4,60020,1346318098972-dw91.kgb.sqa.cm4,60020,1346317855650-dw93.kgb.sqa.cm4,60020,1346318532530-dw92.kgb.sqa.cm4,60020,1346318573238-dw89.kgb.sqa.cm4,60020,1346321299040-dw91.kgb.sqa.cm4,60020,1346321304393-dw92.kgb.sqa.cm4,60020,1346325755894-dw89.kgb.sqa.cm4,60020,1346326520895-dw91.kgb.sqa.cm4,60020,1346328246992-dw92.kgb.sqa.cm4,60020,1346327290653-dw93.kgb.sqa.cm4,60020,1346337303018-dw91.kgb.sqa.cm4,60020,1346337318929
 [] // empty node will never be deleted


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HBASE-6695) [Replication] Data will lose if RegionServer down during transferqueue

2012-08-30 Thread terry zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terry zhang updated HBASE-6695:
---

Description: 
When we were testing the replication failover feature, we found that if we kill a 
regionserver while it is transferring its queue, only part of the hlog znodes are 
copied to the right path because the failover process is interrupted. 

Log:

2012-08-29 12:20:05,660 INFO 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: 
Moving dw92.kgb.sqa.cm4,60020,1346210789716's hlogs to my queue

2012-08-29 12:20:05,765 DEBUG 
org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
dw92.kgb.sqa.cm4%2C60020%2C13462107 89716.1346213720708 with data 210508162
2012-08-29 12:20:05,850 DEBUG 
org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
dw92.kgb.sqa.cm4%2C60020%2C13462107 89716.1346213886800 with data

2012-08-29 12:20:05,938 DEBUG 
org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213830559 with data

2012-08-29 12:20:06,055 DEBUG 
org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213775146 with data

2012-08-29 12:20:06,277 WARN 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: 
Failed all from region=.META.,,1.1028785192, hostname=dw93.kgb.sqa.cm4, port=60020
java.util.concurrent.ExecutionException: java.net.ConnectException: Connection 
refused
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
at java.util.concurrent.FutureTask.get(FutureTask.java:83)
at 
..
{color:red} 
This server is down .
{color}

ZK node status:

[zk: 10.232.98.77:2181(CONNECTED) 6] ls 
/hbase-test3-repl/replication/rs/dw92.kgb.sqa.cm4,60020,1346210789716
[lock, 1, 1-dw89.kgb.sqa.cm4,60020,1346202436268]



{color:red} 
dw92 is down , but Node dw92.kgb.sqa.cm4,60020,1346210789716 can't be deleted
{color}









  was:
When we ware testing Replication failover feature we found if we kill a 
regionserver during it transferqueue ,we found only part of the hlog znode copy 
to the right path because failover process is interrupted. 

Log:

2012-08-29 12:20:05,660 INFO 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: 
Moving dw92.kgb.sqa.cm4,60020,1346210789716's hlogs to my queue

2012-08-29 12:20:05,765 DEBUG 
org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
dw92.kgb.sqa.cm4%2C60020%2C13462107 89716.1346213720708 with data 210508162
2012-08-29 12:20:05,850 DEBUG 
org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
dw92.kgb.sqa.cm4%2C60020%2C13462107 89716.1346213886800 with data

2012-08-29 12:20:05,938 DEBUG 
org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213830559 with data

2012-08-29 12:20:06,055 DEBUG 
org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
dw92.kgb.sqa.cm4%2C60020%2C1346210789716.1346213775146 with data

2012-08-29 12:20:06,277 WARN 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: 
Failed all from region=.META.,,1.1028785192, hostname=dw93.kgb.sqa.cm4, port=60020
java.util.concurrent.ExecutionException: java.net.ConnectException: Connection 
refused
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
at java.util.concurrent.FutureTask.get(FutureTask.java:83)
at 
..
{color:red} 
This server is down .
{color}

ZK node status:
[zk: 10.232.98.77:2181(CONNECTED) 6] ls 
/hbase-test3-repl/replication/rs/dw92.kgb.sqa.cm4,60020,1346210789716
[lock, 1, 1-dw89.kgb.sqa.cm4,60020,1346202436268]
{color:red} 
dw92 is down , but Node dw92.kgb.sqa.cm4,60020,1346210789716 can't be deleted
{color}










 [Replication] Data will lose if RegionServer down during transferqueue
 --

 Key: HBASE-6695
 URL: https://issues.apache.org/jira/browse/HBASE-6695
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.1
Reporter: terry zhang
Priority: Critical

 When we were testing the replication failover feature, we found that if we kill a 
 regionserver while it is transferring its queue, only part of the hlog znodes are 
 copied to the right path because the failover process is interrupted. 
 Log:
 2012-08-29 12:20:05,660 INFO 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: 
 Moving dw92.kgb.sqa.cm4,60020,1346210789716's hlogs to my queue
 2012-08-29 12:20:05,765 DEBUG 
 org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
 dw92.kgb.sqa.cm4%2C60020%2C13462107 89716.1346213720708 with data 210508162
 2012-08-29 12:20:05,850 DEBUG 
 org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating 
 

[jira] [Commented] (HBASE-6652) [replication]replicationQueueSizeCapacity and replicationQueueNbCapacity default value is too big, Slave regionserver maybe outmemory after master start replication

2012-08-24 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13440981#comment-13440981
 ] 

terry zhang commented on HBASE-6652:


If we use patch HBASE-6165 and don't set a custom queue size, replication will 
use the IPC call queue. So if hbase.region.server.handler.count is set too high, the 
slave cluster regionservers may run out of memory while replication is running. So 
can we change the replicationQueueSizeCapacity default value to 4M?
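
For reference, a rough sketch of where that default could be lowered (assuming these 
two fields are read in ReplicationSource.init() from the 0.94 config keys 
replication.source.size.capacity and replication.source.nb.capacity):

{code:title=ReplicationSource.init() sketch|borderStyle=solid}
// shrink the per-shipment buffer from 64 MB to 4 MB so that a single
// replicateLogEntries RPC cannot overwhelm a small slave cluster's call queue
this.replicationQueueSizeCapacity =
    this.conf.getLong("replication.source.size.capacity", 4 * 1024 * 1024);
this.replicationQueueNbCapacity =
    this.conf.getInt("replication.source.nb.capacity", 25000);
{code}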


 [replication]replicationQueueSizeCapacity and replicationQueueNbCapacity 
 default value is too big, Slave regionserver maybe outmemory after master 
 start replication
 

 Key: HBASE-6652
 URL: https://issues.apache.org/jira/browse/HBASE-6652
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.1
Reporter: terry zhang
Assignee: terry zhang

 Now our replication replicationQueueSizeCapacity is set to 64M and 
 replicationQueueNbCapacity is set to 25000. So when a master cluster with 
 many regionservers replicates to a small cluster, the slave RPC queue will fill 
 up and run out of memory.
 java.util.concurrent.ExecutionException: java.io.IOException: Call queue is 
 full, is ipc.server.max.callqueue.size too small?
 at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
 at java.util.concurrent.FutureTask.get(FutureTask.java:83)
 at 
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:
 1524)
 at 
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1376)
 at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:700)
 at 
 org.apache.hadoop.hbase.client.HTablePool$PooledHTable.batch(HTablePool.java:361)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.batch(ReplicationSink.java:172)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.replicateEntries(ReplicationSink.java:129)
 at 
 org.apache.hadoop.hbase.replication.regionserver.Replication.replicateLogEntries(Replication.java:139)
 at 
 org.apache.hadoop.hbase.regionserver.HRegionServer.replicateLogEntries(HRegionServer.java:4018)
 at sun.reflect.GeneratedMethodAccessor41.invoke(Unknown Source)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at 
 org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:361)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1414)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6652) [replication]replicationQueueSizeCapacity and replicationQueueNbCapacity default value is too big, Slave regionserver maybe outmemory after master start replication

2012-08-24 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13440985#comment-13440985
 ] 

terry zhang commented on HBASE-6652:


Another case that will cause slave regionserver OOM is when the master cluster 
disables replication and restarts many times. When we then enable replication, the 
master regionservers will start many recovery threads (many zk nodes under 
replication/rs/xxx/), which again puts the slave RSs under very heavy load.

 [replication]replicationQueueSizeCapacity and replicationQueueNbCapacity 
 default value is too big, Slave regionserver maybe outmemory after master 
 start replication
 

 Key: HBASE-6652
 URL: https://issues.apache.org/jira/browse/HBASE-6652
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.1
Reporter: terry zhang
Assignee: terry zhang

 Now our replication replicationQueueSizeCapacity is set to 64M and 
 replicationQueueNbCapacity is set to 25000. So when a master cluster with 
 many regionservers replicates to a small cluster, the slave RPC queue will fill 
 up and run out of memory.
 java.util.concurrent.ExecutionException: java.io.IOException: Call queue is 
 full, is ipc.server.max.callqueue.size too small?
 at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
 at java.util.concurrent.FutureTask.get(FutureTask.java:83)
 at 
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:
 1524)
 at 
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1376)
 at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:700)
 at 
 org.apache.hadoop.hbase.client.HTablePool$PooledHTable.batch(HTablePool.java:361)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.batch(ReplicationSink.java:172)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.replicateEntries(ReplicationSink.java:129)
 at 
 org.apache.hadoop.hbase.replication.regionserver.Replication.replicateLogEntries(Replication.java:139)
 at 
 org.apache.hadoop.hbase.regionserver.HRegionServer.replicateLogEntries(HRegionServer.java:4018)
 at sun.reflect.GeneratedMethodAccessor41.invoke(Unknown Source)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at 
 org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:361)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1414)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HBASE-6652) [replication]replicationQueueSizeCapacity and replicationQueueNbCapacity default value is too big, Slave regionserver maybe outmemory after master start replication

2012-08-23 Thread terry zhang (JIRA)
terry zhang created HBASE-6652:
--

 Summary: [replication]replicationQueueSizeCapacity and 
replicationQueueNbCapacity default value is too big, Slave regionserver maybe 
outmemory after master start replication
 Key: HBASE-6652
 URL: https://issues.apache.org/jira/browse/HBASE-6652
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.1
Reporter: terry zhang
Assignee: terry zhang


Now our replication replicationQueueSizeCapacity is set to 64M and 
replicationQueueNbCapacity is set to 25000. So when a master cluster with many 
regionservers replicates to a small cluster, the slave RPC queue will fill up and 
run out of memory.


java.util.concurrent.ExecutionException: java.io.IOException: Call queue is 
full, is ipc.server.max.callqueue.size too small?
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
at java.util.concurrent.FutureTask.get(FutureTask.java:83)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:
1524)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1376)
at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:700)
at 
org.apache.hadoop.hbase.client.HTablePool$PooledHTable.batch(HTablePool.java:361)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.batch(ReplicationSink.java:172)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSink.replicateEntries(ReplicationSink.java:129)
at 
org.apache.hadoop.hbase.replication.regionserver.Replication.replicateLogEntries(Replication.java:139)
at 
org.apache.hadoop.hbase.regionserver.HRegionServer.replicateLogEntries(HRegionServer.java:4018)
at sun.reflect.GeneratedMethodAccessor41.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:361)
at 
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1414)




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HBASE-6624) [Replication]currentNbOperations should set to 0 after update the shippedOpsRate

2012-08-21 Thread terry zhang (JIRA)
terry zhang created HBASE-6624:
--

 Summary: [Replication]currentNbOperations should set to 0 after 
update the shippedOpsRate
 Key: HBASE-6624
 URL: https://issues.apache.org/jira/browse/HBASE-6624
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.0
Reporter: terry zhang
Assignee: terry zhang


Currently currentNbOperations is not reset to 0 and keeps increasing after 
replication starts. This value is used to calculate shippedOpsRate; if it is not 
reset to 0, shippedOpsRate is not correct.
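
A minimal sketch of the kind of change this implies (assuming shipEdits() publishes 
the shipped-operations metric from currentNbOperations; the exact change is in the 
attached patch):

{code:title=ReplicationSource.shipEdits() sketch|borderStyle=solid}
// after the batch has been shipped: publish the number of shipped operations,
// then reset the counter so the next interval's shippedOpsRate starts from zero
this.metrics.shippedOpsRate.inc(this.currentNbOperations);
this.currentNbOperations = 0;
{code}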

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6624) [Replication]currentNbOperations should set to 0 after update the shippedOpsRate

2012-08-21 Thread terry zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terry zhang updated HBASE-6624:
---

Attachment: jira-6624.patch

 [Replication]currentNbOperations should set to 0 after update the 
 shippedOpsRate
 

 Key: HBASE-6624
 URL: https://issues.apache.org/jira/browse/HBASE-6624
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.0
Reporter: terry zhang
Assignee: terry zhang
 Attachments: jira-6624.patch


 Currently currentNbOperations is not reset to 0 and keeps increasing after 
 replication starts. This value is used to calculate shippedOpsRate; if it is not 
 reset to 0, shippedOpsRate is not correct.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HBASE-6623) [replication] replication metrics value AgeOfLastShippedOp is not set correctly

2012-08-20 Thread terry zhang (JIRA)
terry zhang created HBASE-6623:
--

 Summary: [replication] replication metrics value 
AgeOfLastShippedOp is not set correctly
 Key: HBASE-6623
 URL: https://issues.apache.org/jira/browse/HBASE-6623
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.1
Reporter: terry zhang
Assignee: terry zhang
Priority: Minor


From the code below we can see that AgeOfLastShippedOp is not set correctly.



{code:title=ReplicationSource.java|borderStyle=solid}
// entriesArray init
public void init() {
  this.entriesArray = new HLog.Entry[this.replicationQueueNbCapacity];
  for (int i = 0; i < this.replicationQueueNbCapacity; i++) {
    this.entriesArray[i] = new HLog.Entry();
  }
}

// when setting the metrics value we should not use the array length
protected void shipEdits() {
  ...
  this.metrics.setAgeOfLastShippedOp(
      this.entriesArray[this.entriesArray.length - 1].getKey().getWriteTime());
  ...
}

{code} 



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6623) [replication] replication metrics value AgeOfLastShippedOp is not set correctly

2012-08-20 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13438461#comment-13438461
 ] 

terry zhang commented on HBASE-6623:


we can use currentNbEntries instead of this.entriesArray.length
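
Something along these lines (a sketch, assuming currentNbEntries counts how many 
slots of entriesArray were actually filled for the current batch):

{code:title=ReplicationSource.shipEdits() sketch|borderStyle=solid}
// use the write time of the last entry actually read in this batch,
// not the last slot of the pre-allocated array
this.metrics.setAgeOfLastShippedOp(
    this.entriesArray[currentNbEntries - 1].getKey().getWriteTime());
{code}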

 [replication] replication metrics value AgeOfLastShippedOp is not set 
 correctly
 ---

 Key: HBASE-6623
 URL: https://issues.apache.org/jira/browse/HBASE-6623
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.1
Reporter: terry zhang
Assignee: terry zhang
Priority: Minor

 From the code below we can see that AgeOfLastShippedOp is not set correctly.
 {code:title=ReplicationSource.java|borderStyle=solid}
 // entriesArray init
 public void init() {
   this.entriesArray = new HLog.Entry[this.replicationQueueNbCapacity];
   for (int i = 0; i < this.replicationQueueNbCapacity; i++) {
     this.entriesArray[i] = new HLog.Entry();
   }
 }

 // when setting the metrics value we should not use the array length
 protected void shipEdits() {
   ...
   this.metrics.setAgeOfLastShippedOp(
       this.entriesArray[this.entriesArray.length - 1].getKey().getWriteTime());
   ...
 }
 {code} 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6623) [replication] replication metrics value AgeOfLastShippedOp is not set correctly

2012-08-20 Thread terry zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terry zhang updated HBASE-6623:
---

Attachment: jira-6623.patch

 [replication] replication metrics value AgeOfLastShippedOp is not set 
 correctly
 ---

 Key: HBASE-6623
 URL: https://issues.apache.org/jira/browse/HBASE-6623
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.1
Reporter: terry zhang
Assignee: terry zhang
Priority: Minor
 Attachments: jira-6623.patch


 From the code below we can see that AgeOfLastShippedOp is not set correctly.
 {code:title=ReplicationSource.java|borderStyle=solid}
 // entriesArray init
 public void init() {
   this.entriesArray = new HLog.Entry[this.replicationQueueNbCapacity];
   for (int i = 0; i < this.replicationQueueNbCapacity; i++) {
     this.entriesArray[i] = new HLog.Entry();
   }
 }

 // when setting the metrics value we should not use the array length
 protected void shipEdits() {
   ...
   this.metrics.setAgeOfLastShippedOp(
       this.entriesArray[this.entriesArray.length - 1].getKey().getWriteTime());
   ...
 }
 {code} 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6533) [replication] replication will be block if WAL compress set differently in master and slave configuration

2012-08-09 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431640#comment-13431640
 ] 

terry zhang commented on HBASE-6533:


So sorry for creating so many duplicate issues because of a problem with my IE. 
Could anyone help me delete them?

 [replication] replication will be block if WAL compress set differently in 
 master and slave configuration
 -

 Key: HBASE-6533
 URL: https://issues.apache.org/jira/browse/HBASE-6533
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.0
Reporter: terry zhang

 as we know in hbase 0.94.0 we have the configuration below
   <property>
     <name>hbase.regionserver.wal.enablecompression</name>
     <value>true</value>
   </property>
 if we enable it in the master cluster and disable it in the slave cluster, then 
 replication will not work. It will throw unwrapRemoteException again and 
 again in the master cluster.
 2012-08-09 12:49:55,892 WARN 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Can't 
 replicate because of an error
  on the remote cluster: 
 java.io.IOException: IPC server unable to read call parameters: Error in 
 readFields
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
 at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at 
 org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
 at 
 org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:635)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:365)
 Caused by: org.apache.hadoop.ipc.RemoteException: IPC server unable to read 
 call parameters: Error in readFields
 at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:921)
 at 
 org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:151)
 at $Proxy13.replicateLogEntries(Unknown Source)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:616)
 ... 1 more 
 This is because Slave cluster can not parse the hlog entry .
 2012-08-09 14:46:05,891 WARN org.apache.hadoop.ipc.HBaseServer: Unable to 
 read call parameters for client 10.232.98.89
 java.io.IOException: Error in readFields
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:685)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:586)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:635)
 at 
 org.apache.hadoop.hbase.ipc.Invocation.readFields(Invocation.java:125)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:1292)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:1207)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:735)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.doRunLoop(HBaseServer.java:524)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:499)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 Caused by: java.io.EOFException
 at java.io.DataInputStream.readFully(DataInputStream.java:180)
 at org.apache.hadoop.hbase.KeyValue.readFields(KeyValue.java:2254)
 at 
 org.apache.hadoop.hbase.regionserver.wal.WALEdit.readFields(WALEdit.java:146)
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLog$Entry.readFields(HLog.java:1767)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:682)
 ... 11 more 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6533) [replication] replication will be block if WAL compress set differently in master and slave configuration

2012-08-09 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13431643#comment-13431643
 ] 

terry zhang commented on HBASE-6533:


  Right now the only way to work around this issue is to set the master back to 
uncompressed mode, delete the zk node replication/rs, and restart the master 
cluster, because the replication slave doesn't support reading compressed hlogs.
  But if we have multiple masters and some of them set the hlog to compressed 
mode, then we cannot handle this situation.


 [replication] replication will be block if WAL compress set differently in 
 master and slave configuration
 -

 Key: HBASE-6533
 URL: https://issues.apache.org/jira/browse/HBASE-6533
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.0
Reporter: terry zhang

 as we know in hbase 0.94.0 we have the configuration below
   <property>
     <name>hbase.regionserver.wal.enablecompression</name>
     <value>true</value>
   </property>
 if we enable it in the master cluster and disable it in the slave cluster, then 
 replication will not work. It will throw unwrapRemoteException again and 
 again in the master cluster.
 2012-08-09 12:49:55,892 WARN 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Can't 
 replicate because of an error
  on the remote cluster: 
 java.io.IOException: IPC server unable to read call parameters: Error in 
 readFields
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
 at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at 
 org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
 at 
 org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:635)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:365)
 Caused by: org.apache.hadoop.ipc.RemoteException: IPC server unable to read 
 call parameters: Error in readFields
 at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:921)
 at 
 org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:151)
 at $Proxy13.replicateLogEntries(Unknown Source)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:616)
 ... 1 more 
 This is because Slave cluster can not parse the hlog entry .
 2012-08-09 14:46:05,891 WARN org.apache.hadoop.ipc.HBaseServer: Unable to 
 read call parameters for client 10.232.98.89
 java.io.IOException: Error in readFields
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:685)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:586)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:635)
 at 
 org.apache.hadoop.hbase.ipc.Invocation.readFields(Invocation.java:125)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:1292)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:1207)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:735)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.doRunLoop(HBaseServer.java:524)
 at 
 org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:499)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 Caused by: java.io.EOFException
 at java.io.DataInputStream.readFully(DataInputStream.java:180)
 at org.apache.hadoop.hbase.KeyValue.readFields(KeyValue.java:2254)
 at 
 org.apache.hadoop.hbase.regionserver.wal.WALEdit.readFields(WALEdit.java:146)
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLog$Entry.readFields(HLog.java:1767)
 at 
 org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:682)
 ... 11 more 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-6453) Hbase Replication point in time feature

2012-07-26 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13422872#comment-13422872
 ] 

terry zhang commented on HBASE-6453:


Hi, Stack. I think we can use the point-in-time feature together with the Snapshots 
feature (HBASE-6055) in the case below:
1. The master cluster in the China data center takes a snapshot at time A.
2. Copy the snapshot to the slave cluster in the US data center and set the 
replication timestamp to A.
3. Restore the snapshot (HBASE-6230) on the slave cluster and start replication.

Then the slave cluster's data will be the same as the master cluster's. The data is 
safer, and US users can read from the slave cluster for gets or scans, which reduces 
the load on the China data center. Enabling replication alone cannot control the 
exact cut-over time, so it may lose some data or replicate some useless data. MySQL 
also has a point in time/position feature in its replication framework, which is 
very convenient for data center administrators to use.

We can give this operation a better name since I am not good at naming 
...

 Hbase Replication point in time feature
 ---

 Key: HBASE-6453
 URL: https://issues.apache.org/jira/browse/HBASE-6453
 Project: HBase
  Issue Type: New Feature
  Components: replication
Affects Versions: 0.94.0
Reporter: terry zhang
Assignee: terry zhang
 Attachments: hbase-6453-v1.patch


 Now we cannot control when hbase replication starts to work. This patch lets 
 us set a timestamp filter; all rows whose timestamp is below this timestamp will 
 not be replicated. We can also delete and show this timestamp in the hbase shell 
 if we want to change it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HBASE-6453) Hbase Replication point in time feature

2012-07-25 Thread terry zhang (JIRA)
terry zhang created HBASE-6453:
--

 Summary: Hbase Replication point in time feature
 Key: HBASE-6453
 URL: https://issues.apache.org/jira/browse/HBASE-6453
 Project: HBase
  Issue Type: New Feature
  Components: replication
Affects Versions: 0.94.0
Reporter: terry zhang
Assignee: terry zhang


Now we cannot control when hbase replication starts to work. This patch lets us set 
a timestamp filter; all rows whose timestamp is below this timestamp will not be 
replicated. We can also delete and show this timestamp in the hbase shell if we 
want to change it.
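
A rough sketch of the filtering this implies on the shipping side (timeFilter is a 
hypothetical per-peer timestamp loaded from ZooKeeper; the real change is in the 
attached hbase-6453-v1.patch):

{code:title=timestamp filter sketch|borderStyle=solid}
// drop every KeyValue written before the configured point-in-time boundary
// before the batch is shipped to the slave cluster
private void filterOldEdits(WALEdit edit, long timeFilter) {
  List<KeyValue> kvs = edit.getKeyValues();
  for (Iterator<KeyValue> it = kvs.iterator(); it.hasNext();) {
    if (it.next().getTimestamp() < timeFilter) {
      it.remove();
    }
  }
}
{code}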

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6453) Hbase Replication point in time feature

2012-07-25 Thread terry zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terry zhang updated HBASE-6453:
---

Attachment: hbase-6453-v1.patch

 Hbase Replication point in time feature
 ---

 Key: HBASE-6453
 URL: https://issues.apache.org/jira/browse/HBASE-6453
 Project: HBase
  Issue Type: New Feature
  Components: replication
Affects Versions: 0.94.0
Reporter: terry zhang
Assignee: terry zhang
 Attachments: hbase-6453-v1.patch


 Now we cannot control when hbase replication starts to work. This patch lets 
 us set a timestamp filter; all rows whose timestamp is below this timestamp will 
 not be replicated. We can also delete and show this timestamp in the hbase shell 
 if we want to change it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6453) Hbase Replication point in time feature

2012-07-25 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13422147#comment-13422147
 ] 

terry zhang commented on HBASE-6453:


{code}
hbase(main):001:0> set_timefilter

ERROR: wrong number of arguments (0 for 2)

Here is some help for this command:
Set a peer cluster time filter to replicate to, the row which time stamp is before
the timestamp will be filtered.

Examples:

  hbase> set_timefilter '1', 1329896850047
  hbase> set_timefilter '2', 1329896850047


hbase(main):002:0> set_timefilter '1',1329896850047
0 row(s) in 0.3000 seconds
{code} 

Set the timestamp to 1329896850047. Then all the KVs earlier than 1329896850047 
will be filtered.

 Hbase Replication point in time feature
 ---

 Key: HBASE-6453
 URL: https://issues.apache.org/jira/browse/HBASE-6453
 Project: HBase
  Issue Type: New Feature
  Components: replication
Affects Versions: 0.94.0
Reporter: terry zhang
Assignee: terry zhang
 Attachments: hbase-6453-v1.patch


 Now we cannot control when hbase replication starts to work. This patch lets 
 us set a timestamp filter; all rows whose timestamp is below this timestamp will 
 not be replicated. We can also delete and show this timestamp in the hbase shell 
 if we want to change it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6453) Hbase Replication point in time feature

2012-07-25 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13422148#comment-13422148
 ] 

terry zhang commented on HBASE-6453:


{code}
hbase(main):003:0> get_timefilter '1'
PEER_ID                      TIME_FILTER
 1                           1329896850047
{code}
We can show the timestamp by clusterId and check that it is set correctly.

 Hbase Replication point in time feature
 ---

 Key: HBASE-6453
 URL: https://issues.apache.org/jira/browse/HBASE-6453
 Project: HBase
  Issue Type: New Feature
  Components: replication
Affects Versions: 0.94.0
Reporter: terry zhang
Assignee: terry zhang
 Attachments: hbase-6453-v1.patch


 Now we cannot control when hbase replication starts to work. This patch lets 
 us set a timestamp filter; all rows whose timestamp is below this timestamp will 
 not be replicated. We can also delete and show this timestamp in the hbase shell 
 if we want to change it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6453) Hbase Replication point in time feature

2012-07-25 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13422149#comment-13422149
 ] 

terry zhang commented on HBASE-6453:


We can also drop the time filter. After we drop it, the timestamp changes to zero 
and all the KVs will be replicated.

{code}
hbase(main):004:0> drop_timefilter '1'
0 row(s) in 0.0030 seconds

hbase(main):005:0> get_timefilter '1'
PEER_ID                      TIME_FILTER
 1                           0
{code}

 Hbase Replication point in time feature
 ---

 Key: HBASE-6453
 URL: https://issues.apache.org/jira/browse/HBASE-6453
 Project: HBase
  Issue Type: New Feature
  Components: replication
Affects Versions: 0.94.0
Reporter: terry zhang
Assignee: terry zhang
 Attachments: hbase-6453-v1.patch


 Now we cannot control when hbase replication starts to work. This patch lets 
 us set a timestamp filter; all rows whose timestamp is below this timestamp will 
 not be replicated. We can also delete and show this timestamp in the hbase shell 
 if we want to change it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (HBASE-6446) Replication source will throw EOF exception when hlog size is 0

2012-07-24 Thread terry zhang (JIRA)
terry zhang created HBASE-6446:
--

 Summary: Replication source will throw EOF exception when hlog 
size is 0
 Key: HBASE-6446
 URL: https://issues.apache.org/jira/browse/HBASE-6446
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.0
Reporter: terry zhang


When the master cluster starts up, new hlogs whose size is 0 will be created. If we 
start replication, the replication source will print many EOF exceptions from 
openReader. I think we need to ignore this case and not print so many warning 
logs for this exception.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6446) Replication source will throw EOF exception when hlog size is 0

2012-07-24 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13421310#comment-13421310
 ] 

terry zhang commented on HBASE-6446:


[]$ hadoop dfs -ls /hbase-73/.logs/dw73.kgb.sqa.cm4,60020,1343114427581
Found 3 items
-rw-r--r-- 3 wuting supergroup 578 2012-07-24 15:20 
/hbase-73/.logs/dw73.kgb.sqa.cm4,60020,1343114427581/dw73.kgb.sqa.cm4%2C60020%2C1343114427581.1343114427921
-rw-r--r-- 3 wuting supergroup 399 2012-07-24 15:20 
/hbase-73/.logs/dw73.kgb.sqa.cm4,60020,1343114427581/dw73.kgb.sqa.cm4%2C60020%2C1343114427581.1343114433385
{color:red} 
-rw-r--r-- 3 wuting supergroup 0 2012-07-24 15:20 
/hbase-73/.logs/dw73.kgb.sqa.cm4,60020,1343114427581/dw73.kgb.sqa.cm4%2C60020%2C1343114427581.1343114434732
{color} 
2012-07-24 15:24:55,516 DEBUG 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log 
for replication dw73.kgb.sqa.cm4%2C60020%2C1343114427581.1343114434732 at 0
2012-07-24 15:24:55,521 WARN 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: 1 Got: 
java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at java.io.DataInputStream.readFully(DataInputStream.java:152)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1465)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1437)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1424)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1419)
at 
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.init(SequenceFileLogReader.java:55)
at 
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175)
at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:721)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:475)
at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:286)

 Replication source will throw EOF exception when hlog size is 0
 ---

 Key: HBASE-6446
 URL: https://issues.apache.org/jira/browse/HBASE-6446
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.0
Reporter: terry zhang

 When the master cluster starts up, new hlogs whose size is 0 will be created. If we 
 start replication, the replication source will print many EOF exceptions from 
 openReader. I think we need to ignore this case and not print so many warning 
 logs for this exception.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6446) Replication source will throw EOF exception when hlog size is 0

2012-07-24 Thread terry zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terry zhang updated HBASE-6446:
---

Attachment: hbase-6446.patch

 Replication source will throw EOF exception when hlog size is 0
 ---

 Key: HBASE-6446
 URL: https://issues.apache.org/jira/browse/HBASE-6446
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.0
Reporter: terry zhang
 Attachments: hbase-6446.patch


 When the master cluster starts up, new hlogs whose size is 0 will be created. If we 
 start replication, the replication source will print many EOF exceptions from 
 openReader. I think we need to ignore this case and not print so many warning 
 logs for this exception.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-6446) Replication source will throw EOF exception when hlog size is 0

2012-07-24 Thread terry zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13421963#comment-13421963
 ] 

terry zhang commented on HBASE-6446:


Hi Daniel, what about checking the hlog queue length?

 {code:title=ReplicationSource.java|borderStyle=solid}
catch (IOException ioe) {
  if (this.queue.size() != 0) {
    LOG.warn(peerClusterZnode + " Got: ", ioe);
  }
  // TODO Need a better way to determine if a file is really gone but
  // TODO without scanning all logs dir
  if (sleepMultiplier == this.maxRetriesMultiplier) {
    LOG.warn("Waited too long for this file, considering dumping");
    return !processEndOfFile();
  }
}
{code} 

This can prevent the warning exception from being printed again and again when the 
master RS starts up and no data has been put in yet.

 Replication source will throw EOF exception when hlog size is 0
 ---

 Key: HBASE-6446
 URL: https://issues.apache.org/jira/browse/HBASE-6446
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.0
Reporter: terry zhang
 Attachments: hbase-6446.patch


 When the master cluster starts up, new hlogs whose size is 0 will be created. If we 
 start replication, the replication source will print many EOF exceptions from 
 openReader. I think we need to ignore this case and not print so many warning 
 logs for this exception.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-6446) Replication source will throw EOF exception when hlog size is 0

2012-07-24 Thread terry zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terry zhang updated HBASE-6446:
---

Attachment: hbase-6446-v2.patch

OK, Daniel. Let's change it from warning level to debug level to prevent warnings 
in the online cluster.
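
I.e. roughly the following, based on the catch block quoted in the previous comment 
(a sketch; the exact change is in hbase-6446-v2.patch):

{code:title=ReplicationSource.openReader() sketch|borderStyle=solid}
// log at DEBUG instead of WARN so a freshly rolled, still-empty hlog
// does not flood the logs with EOFException warnings
LOG.debug(peerClusterZnode + " Got: ", ioe);
{code}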

 Replication source will throw EOF exception when hlog size is 0
 ---

 Key: HBASE-6446
 URL: https://issues.apache.org/jira/browse/HBASE-6446
 Project: HBase
  Issue Type: Bug
  Components: replication
Affects Versions: 0.94.0
Reporter: terry zhang
 Attachments: hbase-6446-v2.patch, hbase-6446.patch


 When the master cluster starts up, new hlogs whose size is 0 will be created. If we 
 start replication, the replication source will print many EOF exceptions from 
 openReader. I think we need to ignore this case and not print so many warning 
 logs for this exception.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira