[jira] [Created] (HBASE-11868) Data loss in hlog when the hdfs is unavailable
Liu Shaohui created HBASE-11868: --- Summary: Data loss in hlog when the hdfs is unavailable Key: HBASE-11868 URL: https://issues.apache.org/jira/browse/HBASE-11868 Project: HBase Issue Type: Bug Affects Versions: 0.98.5 Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Blocker

When using the new write thread model in HBase, we found a bug that may cause data loss when HDFS is unavailable. When writing WAL edits to the hlog in doMiniBatchMutation of HRegion, the hlog first calls appendNoSync to write the edits and then calls sync with the txid. Assume the txid of the current write is 10, syncedTillHere in the hlog is 9, and failedTxid is 0. When HDFS is unavailable, the AsyncWriter or AsyncSyncer fails to append the edits or sync, but still updates syncedTillHere to 10 and failedTxid to 10. When the hlog then calls sync with txid 10, failedTxid is never checked, because txid is no longer greater than syncedTillHere and the wait loop below is never entered. The client thinks the write succeeded, but the data was only written to the memstore, not the hlog. If the regionserver goes down before the memstore is flushed, the data is lost.
{code}
// sync all transactions up to the specified txid
private void syncer(long txid) throws IOException {
  synchronized (this.syncedTillHere) {
    while (this.syncedTillHere.get() < txid) {
      try {
        this.syncedTillHere.wait();
        if (txid <= this.failedTxid.get()) {
          assert asyncIOE != null :
              "current txid is among(under) failed txids, but asyncIOE is null!";
          throw asyncIOE;
        }
      } catch (InterruptedException e) {
        LOG.debug("interrupted while waiting for notification from AsyncNotifier");
      }
    }
  }
}
{code}
We can fix this issue by moving the comparison of txid and failedTxid outside the while block.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
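The proposed fix can be sketched as a minimal, self-contained simulation (plain Java; the AtomicLong fields and the failUpTo helper are illustrative stand-ins for the HLog internals, not HBase code). The failedTxid check is moved outside the wait loop, so a write that failed after syncedTillHere was advanced still surfaces the IOException:

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

public class SyncerFix {
    private final AtomicLong syncedTillHere = new AtomicLong(0);
    private final AtomicLong failedTxid = new AtomicLong(0);
    private volatile IOException asyncIOE;

    // Simulate the AsyncWriter/AsyncSyncer failing: it still advances
    // syncedTillHere, and records the highest failed txid plus the exception.
    void failUpTo(long txid, IOException e) {
        asyncIOE = e;
        failedTxid.set(txid);
        synchronized (syncedTillHere) {
            syncedTillHere.set(txid);
            syncedTillHere.notifyAll();
        }
    }

    // Fixed syncer: the failedTxid check now sits OUTSIDE the wait loop,
    // so it runs even when the loop body is never entered.
    void syncer(long txid) throws IOException {
        synchronized (syncedTillHere) {
            while (syncedTillHere.get() < txid) {
                try {
                    syncedTillHere.wait();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
            if (txid <= failedTxid.get()) {
                assert asyncIOE != null : "txid failed but asyncIOE is null!";
                throw asyncIOE;
            }
        }
    }

    public static void main(String[] args) {
        SyncerFix hlog = new SyncerFix();
        hlog.failUpTo(10, new IOException("hdfs unavailable"));
        boolean threw = false;
        try {
            // Loop is skipped (10 <= 10), but the check still fires.
            hlog.syncer(10);
        } catch (IOException expected) {
            threw = true;
        }
        System.out.println(threw);
    }
}
```

With the buggy ordering, syncer(10) would return normally here and the client would believe the write was durable.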
[jira] [Created] (HBASE-11869) Support snapshot owner
Liu Shaohui created HBASE-11869: --- Summary: Support snapshot owner Key: HBASE-11869 URL: https://issues.apache.org/jira/browse/HBASE-11869 Project: HBase Issue Type: Improvement Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor

In the current codebase, table snapshot operations can only be done by a global admin, not by the table admin. In a multi-tenant hbase cluster, each table has a different snapshot policy, e.g. take a snapshot every week, or take a snapshot after new data is imported. We want to delegate the snapshot permission to each table admin. Following [~mbertozzi]'s suggestion, we implemented the snapshot owner feature:
* A user with table admin permission can create a snapshot, and the owner of that snapshot is this user.
* The owner of a snapshot can delete and restore the snapshot.
* Only a user with global admin permission can clone a snapshot, because that operation creates a new table.
[jira] [Created] (HBASE-11877) Make TableSplit more readable
Liu Shaohui created HBASE-11877: --- Summary: Make TableSplit more readable Key: HBASE-11877 URL: https://issues.apache.org/jira/browse/HBASE-11877 Project: HBase Issue Type: Improvement Reporter: Liu Shaohui Priority: Minor

When debugging MR jobs that read from an hbase table, it's important to figure out which region a map task is reading from, but the table split object is hard to read. E.g.:
{code}
2014-09-01 20:58:39,783 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: lg-hadoop-prc-st40.bj:,0
{code}
We should make it more readable.
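A more readable rendering might look like the sketch below (the describe helper and its field labels are purely illustrative, not the TableSplit API; the point is that table name, row range, and region location should all appear in the task log):

```java
public class SplitDescription {
    // Hypothetical formatting helper: render a table split the way a
    // human would want to read it in MR task logs.
    static String describe(String table, String startRow, String endRow,
                           String location, long length) {
        return "HBase table split(table name: " + table
                + ", start row: " + startRow
                + ", end row: " + endRow
                + ", region location: " + location
                + ", encoded region length: " + length + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe("webtable", "row-1000", "row-2000",
                "lg-hadoop-prc-st40.bj", 0));
    }
}
```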
[jira] [Created] (HBASE-11897) Add append and remove table-cfs cmds for replication
Liu Shaohui created HBASE-11897: --- Summary: Add append and remove table-cfs cmds for replication Key: HBASE-11897 URL: https://issues.apache.org/jira/browse/HBASE-11897 Project: HBase Issue Type: Improvement Reporter: Liu Shaohui Priority: Minor

HBASE-8751 introduced the tables/table-column-families config for a replication peer. It's very flexible for practical replication in hbase clusters. But it is easy to make mistakes when adding or removing a table/table-column family for an existing peer, especially when the table-cfs list is very long: we need to copy the peer's current table-cfs first, then add or remove a table/table-column family to/from it, and finally set the table-cfs back using the cmd set_peer_tableCFs. So we implemented two new cmds, append_peer_tableCFs and remove_peer_tableCFs, to add and remove a table/table-column family. They are useful operation tools.
[jira] [Created] (HBASE-11957) Backport HBASE-5974 to 0.94
Liu Shaohui created HBASE-11957: --- Summary: Backport HBASE-5974 to 0.94 Key: HBASE-11957 URL: https://issues.apache.org/jira/browse/HBASE-11957 Project: HBase Issue Type: Bug Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Critical Fix For: 0.94.24

HBASE-5974: Scanner retry behavior with RPC timeout on next() seems incorrect, which causes data to go missing in hbase scans. I think we should fix it in 0.94. [~lhofhansl]
[jira] [Created] (HBASE-11958) Add documents about snapshot owner
Liu Shaohui created HBASE-11958: --- Summary: Add documents about snapshot owner Key: HBASE-11958 URL: https://issues.apache.org/jira/browse/HBASE-11958 Project: HBase Issue Type: Sub-task Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor

HBASE-11869 introduced the snapshot owner feature. We need to add documentation about it.
[jira] [Created] (HBASE-12241) The crash of a regionserver when taking over a dead server's queue breaks replication
Liu Shaohui created HBASE-12241: --- Summary: The crash of a regionserver when taking over a dead server's queue breaks replication Key: HBASE-12241 URL: https://issues.apache.org/jira/browse/HBASE-12241 Project: HBase Issue Type: Bug Components: Replication Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Critical

When a regionserver crashes, another regionserver will try to take over its replication hlog queue and help the dead regionserver finish the replication. See NodeFailoverWorker in ReplicationSourceManager. Currently hbase.zookeeper.useMulti is false in the default configuration, so the operation of taking over a replication queue is not atomic. The ReplicationSourceManager first locks the replication node of the dead regionserver, then copies the replication queue, and deletes the replication node of the dead regionserver at last. The lockOtherRS operation just creates a persistent zk node named "lock", which prevents other regionservers from taking over the replication queue. See:
{code}
public boolean lockOtherRS(String znode) {
  try {
    String parent = ZKUtil.joinZNode(this.rsZNode, znode);
    if (parent.equals(rsServerNameZnode)) {
      LOG.warn("Won't lock because this is us, we're dead!");
      return false;
    }
    String p = ZKUtil.joinZNode(parent, RS_LOCK_ZNODE);
    ZKUtil.createAndWatch(this.zookeeper, p, Bytes.toBytes(rsServerNameZnode));
  } catch (KeeperException e) {
    ...
    return false;
  }
  return true;
}
{code}
But if a regionserver crashes after creating this "lock" zk node and before copying the replication queue into its own queue, the "lock" zk node will be left forever and no other regionserver can take over the replication queue. We encountered this problem in our production cluster: the replication queue was still there, no regionserver took it over, and a "lock" zk node was left behind.
{quote}
hbase.32561.log:2014-09-24,14:09:28,790 INFO org.apache.hadoop.hbase.replication.ReplicationZookeeper: Won't transfer the queue, another RS took care of it because of: KeeperErrorCode = NoNode for /hbase/hhsrv-micloud/replication/rs/hh-hadoop-srv-st09.bj,12610,1410937824255/lock
hbase.32561.log:2014-09-24,14:14:45,148 INFO org.apache.hadoop.hbase.replication.ReplicationZookeeper: Won't transfer the queue, another RS took care of it because of: KeeperErrorCode = NoNode for /hbase/hhsrv-micloud/replication/rs/hh-hadoop-srv-st10.bj,12600,1410937795685/lock
{quote}
A quick solution is for the lock operation to create an ephemeral "lock" zookeeper node instead; when the lock node is deleted, other regionservers will be notified to check whether a replication queue was left behind. Suggestions are welcomed! Thanks.
[jira] [Created] (HBASE-12263) RegionServer listens on localhost in distributed cluster when DNS is unavailable
Liu Shaohui created HBASE-12263: --- Summary: RegionServer listens on localhost in distributed cluster when DNS is unavailable Key: HBASE-12263 URL: https://issues.apache.org/jira/browse/HBASE-12263 Project: HBase Issue Type: Bug Reporter: Liu Shaohui Priority: Minor

When DNS is unavailable, newly started regionservers will listen on localhost (127.0.0.1) in a distributed cluster, with the result that the hmaster fails to assign regions to those regionservers.
{quote}
2014-10-15,04:26:42,273 WARN org.apache.hadoop.net.DNS: Unable to determine local hostname -falling back to "localhost"
java.net.UnknownHostException: xx-hadoop-srv-st01.bj: xx-hadoop-srv-st01.bj
 at java.net.InetAddress.getLocalHost(InetAddress.java:1360)
 at org.apache.hadoop.net.DNS.resolveLocalHostname(DNS.java:260)
 at org.apache.hadoop.net.DNS.(DNS.java:58)
 at org.apache.hadoop.hbase.regionserver.HRegionServer.(HRegionServer.java:472)
{quote}
{quote}
$ netstat -nap | grep 13748
tcp0 0 127.0.0.1:12610 0.0.0.0:* LISTEN 13748/java
tcp0 0 0.0.0.0:12611 0.0.0.0:* LISTEN 13748/java
{quote}
In this situation, I think we should throw an exception and make the regionserver startup fail.
[jira] [Created] (HBASE-12336) RegionServer failed to shutdown for NodeFailoverWorker thread
Liu Shaohui created HBASE-12336: --- Summary: RegionServer failed to shutdown for NodeFailoverWorker thread Key: HBASE-12336 URL: https://issues.apache.org/jira/browse/HBASE-12336 Project: HBase Issue Type: Bug Affects Versions: 0.94.11 Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor

After enabling hbase.zookeeper.useMulti in our hbase cluster, we found that a regionserver failed to shut down. All other threads had exited except a NodeFailoverWorker thread.
{code}
"ReplicationExecutor-0" prio=10 tid=0x7f0d40195ad0 nid=0x73a in Object.wait() [0x7f0dc8fe6000]
 java.lang.Thread.State: WAITING (on object monitor)
 at java.lang.Object.wait(Native Method)
 at java.lang.Object.wait(Object.java:485)
 at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1309)
 - locked <0x0005a16df080> (a org.apache.zookeeper.ClientCnxn$Packet)
 at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:930)
 at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:912)
 at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:531)
 at org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1518)
 at org.apache.hadoop.hbase.replication.ReplicationZookeeper.copyQueuesFromRSUsingMulti(ReplicationZookeeper.java:804)
 at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$NodeFailoverWorker.run(ReplicationSourceManager.java:612)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
{code}
We are sure that the executor's shutdown method is called in ReplicationSourceManager#join. I am looking for the root cause; suggestions are welcomed. Thanks
[jira] [Created] (HBASE-12361) Show data locality of region in table page
Liu Shaohui created HBASE-12361: --- Summary: Show data locality of region in table page Key: HBASE-12361 URL: https://issues.apache.org/jira/browse/HBASE-12361 Project: HBase Issue Type: New Feature Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor

Data locality is an important metric, added in HBASE-4114, for read performance. It would be useful to show it on the table page.
[jira] [Created] (HBASE-12434) Add a command to compact all the regions in a regionserver
Liu Shaohui created HBASE-12434: --- Summary: Add a command to compact all the regions in a regionserver Key: HBASE-12434 URL: https://issues.apache.org/jira/browse/HBASE-12434 Project: HBase Issue Type: Improvement Components: shell Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Fix For: 2.0.0

When a new regionserver is added to the hbase cluster, the data locality of the regions on the new regionserver is very low: most HDFS read requests from these regions are remote reads, not local reads, so the latency of read requests to these regions is not stable. Usually we can compact all the regions on the new regionserver during off-peak hours to improve data locality. So we added a compact_rs command to the hbase shell.
[jira] [Created] (HBASE-12451) IncreasingToUpperBoundRegionSplitPolicy may cause unnecessary region split in rolling update of cluster
Liu Shaohui created HBASE-12451: --- Summary: IncreasingToUpperBoundRegionSplitPolicy may cause unnecessary region split in rolling update of cluster Key: HBASE-12451 URL: https://issues.apache.org/jira/browse/HBASE-12451 Project: HBase Issue Type: Bug Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Fix For: 2.0.0

Currently IncreasingToUpperBoundRegionSplitPolicy is the default region split policy. In this policy, the split size is the number of regions on this server that belong to the same table, cubed, times twice the region flush size. But when unloading the regions of a regionserver with region_mover.rb, the number of that table's regions on the server decreases, so the split size decreases too, which may cause the remaining regions on the regionserver to split. Region splits also happen when loading regions onto a regionserver. An improvement may be to set a minimum split size in IncreasingToUpperBoundRegionSplitPolicy. Suggestions are welcomed. Thanks~
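The size computation, plus the proposed floor, can be sketched as follows (a simplification of the policy's formula as described above; the minSplitSize parameter is the suggested improvement, not an existing HBase config):

```java
public class SplitSizeSketch {
    // Split size = min(regionCount^3 * 2 * flushSize, maxFileSize),
    // clamped below by a proposed minimum so that unloading regions
    // cannot collapse the threshold and trigger spurious splits.
    static long splitSize(int regionCount, long flushSize,
                          long maxFileSize, long minSplitSize) {
        long cubed = (long) regionCount * regionCount * regionCount;
        long size = Math.min(cubed * 2 * flushSize, maxFileSize);
        return Math.max(size, minSplitSize);
    }

    public static void main(String[] args) {
        long flush = 128L << 20;  // 128 MB region flush size
        long max = 10L << 30;     // 10 GB max store file size
        // With 3 regions of the table on the server: 27 * 2 * 128 MB = 6.75 GB
        System.out.println(splitSize(3, flush, max, 0));
        // After unloading down to 1 region the threshold collapses to 256 MB,
        // which is what triggers the unwanted splits; a 2 GB floor prevents it.
        System.out.println(splitSize(1, flush, max, 2L << 30));
    }
}
```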
[jira] [Created] (HBASE-12462) Support deleting all columns of the specified family of a row in hbase shell
Liu Shaohui created HBASE-12462: --- Summary: Support deleting all columns of the specified family of a row in hbase shell Key: HBASE-12462 URL: https://issues.apache.org/jira/browse/HBASE-12462 Project: HBase Issue Type: New Feature Components: shell Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Fix For: 2.0.0

Currently, the HBase shell only supports deleting a single column of a row in a table. In some scenarios, we want to delete all the columns under a column family of a row, but there may be many columns there, and it's difficult to delete them one by one in the shell. It's easy to add this feature to the shell since the Delete class already has an API for deleting a whole family.
[jira] [Created] (HBASE-12534) Wrong region location cache in client after regions are moved
Liu Shaohui created HBASE-12534: --- Summary: Wrong region location cache in client after regions are moved Key: HBASE-12534 URL: https://issues.apache.org/jira/browse/HBASE-12534 Project: HBase Issue Type: Bug Affects Versions: 0.94.24 Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Critical

In our 0.94 hbase cluster, we found that the client got a wrong region location cache and did not update it after a region was moved to another regionserver. The reason is a wrong client config combined with a bug in RpcRetryingCaller of the hbase client. The rpc configs are the following:
{code}
hbase.rpc.timeout=1000
hbase.client.pause=200
hbase.client.operation.timeout=1200
{code}
But the client retry number is 3:
{code}
hbase.client.retries.number=3
{code}
Assume a region was at regionserver A before and is then moved to regionserver B. The client tries to make a call to regionserver A and gets a NotServingRegionException. Because the retry number is not 1, the region location cache is not cleaned. See RpcRetryingCaller.java#141 and RegionServerCallable.java#127:
{code}
@Override
public void throwable(Throwable t, boolean retrying) {
  if (t instanceof SocketTimeoutException ||
  } else if (t instanceof NotServingRegionException && !retrying) {
    // Purge cache entries for this specific region from hbase:meta cache
    // since we don't call connect(true) when number of retries is 1.
    getConnection().deleteCachedRegionLocation(location);
  }
}
{code}
But the call does not actually retry: it throws a SocketTimeoutException because the time the call would take is larger than the operation timeout. See RpcRetryingCaller.java#152:
{code}
expectedSleep = callable.sleep(pause, tries + 1);
// If, after the planned sleep, there won't be enough time left, we stop now.
long duration = singleCallDuration(expectedSleep);
if (duration > callTimeout) {
  String msg = "callTimeout=" + callTimeout + ", callDuration=" + duration
      + ": " + callable.getExceptionMessageAdditionalDetail();
  throw (SocketTimeoutException)(new SocketTimeoutException(msg).initCause(t));
}
{code}
As a result, the wrong region location is never cleaned up. [~lhofhansl] In hbase 0.94, MIN_RPC_TIMEOUT in singleCallDuration is 2000 by default, which triggers this bug.
{code}
private long singleCallDuration(final long expectedSleep) {
  return (EnvironmentEdgeManager.currentTimeMillis() - this.globalStartTime)
      + MIN_RPC_TIMEOUT + expectedSleep;
}
{code}
But there is a risk in the master code too.
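The failing interplay can be sketched with the 0.94 numbers from the description (a simplification: only MIN_RPC_TIMEOUT and the config values quoted above are used, not the full RpcRetryingCaller logic):

```java
public class RetryBudgetSketch {
    static final long MIN_RPC_TIMEOUT = 2000; // 0.94 default noted above

    // Mirrors singleCallDuration(): elapsed time so far, plus the minimum
    // rpc timeout, plus the sleep planned before the next attempt.
    static long singleCallDuration(long elapsed, long expectedSleep) {
        return elapsed + MIN_RPC_TIMEOUT + expectedSleep;
    }

    public static void main(String[] args) {
        long callTimeout = 1200;  // hbase.client.operation.timeout
        long pause = 200;         // hbase.client.pause
        long elapsed = 100;       // first attempt failed quickly with NSRE
        long duration = singleCallDuration(elapsed, pause);
        // 100 + 2000 + 200 = 2300 > 1200: the caller gives up before ever
        // retrying, so the !retrying branch that would purge the stale
        // location cache is never taken.
        System.out.println(duration > callTimeout);
    }
}
```

With these configs, MIN_RPC_TIMEOUT alone already exceeds the operation timeout, so no retry can ever be scheduled and the stale cache entry survives indefinitely.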
[jira] [Created] (HBASE-12542) Delete a family of table online will crash regionserver
Liu Shaohui created HBASE-12542: --- Summary: Delete a family of table online will crash regionserver Key: HBASE-12542 URL: https://issues.apache.org/jira/browse/HBASE-12542 Project: HBase Issue Type: Bug Components: regionserver Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Critical Fix For: 2.0.0, 0.94.25

Using the alter command to delete a family of a table online will crash the regionservers that serve the regions of the table.
{code}
alter 't', NAME => 'f', METHOD => 'delete'
{code}
The reason is that TableDeleteFamilyHandler in HMaster deletes the family dir first and then reopens all the regions of the table. When a regionserver reopens a region, it crashes on the exception thrown while flushing the memstore of the deleted family to an hfile during the region close, because the parent dir of that hfile has already been deleted by TableDeleteFamilyHandler. See TableDeleteFamilyHandler.java#57. A simple solution is to change the order of operations in TableDeleteFamilyHandler:
- update the table descriptor first,
- reopen all the regions,
- delete the family dir last.
Suggestions are welcomed.
[jira] [Created] (HBASE-12635) Delete acl notify znode of table after the table is deleted
Liu Shaohui created HBASE-12635: --- Summary: Delete acl notify znode of table after the table is deleted Key: HBASE-12635 URL: https://issues.apache.org/jira/browse/HBASE-12635 Project: HBase Issue Type: Bug Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor

In our multi-tenant hbase cluster, we found over 1M znodes under the acl node. The reason is that users frequently create and delete tables with different names, and the acl notify znodes are left behind after the tables are deleted. A simple solution is to delete the acl notify znode of a table in AccessController when the table is deleted.
[jira] [Created] (HBASE-12636) Avoid too many write operations on zookeeper in replication
Liu Shaohui created HBASE-12636: --- Summary: Avoid too many write operations on zookeeper in replication Key: HBASE-12636 URL: https://issues.apache.org/jira/browse/HBASE-12636 Project: HBase Issue Type: Improvement Affects Versions: 0.94.11 Reporter: Liu Shaohui Assignee: Liu Shaohui Fix For: 1.0.0

In our production cluster, we found over 1k write operations per second on zookeeper coming from hbase replication. The reason is that the replication source writes the log position to zookeeper for every edit shipment. If the WAL currently being replicated is the one the regionserver is writing to, each shipment is very small but the frequency is very high, which causes many write operations on zookeeper. A simple solution is to write the log position to zookeeper only when the position diff or the number of shipped edits exceeds a threshold, instead of on every shipment. Suggestions are welcomed, thx~
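The threshold idea can be sketched as a small decision helper (illustrative only; the class, its field names, and the 64 KB / 100-edit thresholds are assumptions, not the eventual patch):

```java
public class PositionUpdateThrottle {
    private final long positionThreshold;  // bytes of WAL progress per zk write
    private final int editCountThreshold;  // shipped edits per zk write
    private long lastWrittenPosition = 0;
    private int editsSinceLastWrite = 0;

    PositionUpdateThrottle(long positionThreshold, int editCountThreshold) {
        this.positionThreshold = positionThreshold;
        this.editCountThreshold = editCountThreshold;
    }

    // Called after each shipment; returns true only when the position
    // should actually be persisted to zookeeper.
    boolean onShipment(long newPosition, int shippedEdits) {
        editsSinceLastWrite += shippedEdits;
        boolean flush = newPosition - lastWrittenPosition >= positionThreshold
                || editsSinceLastWrite >= editCountThreshold;
        if (flush) {
            lastWrittenPosition = newPosition;
            editsSinceLastWrite = 0;
        }
        return flush;
    }

    public static void main(String[] args) {
        PositionUpdateThrottle t = new PositionUpdateThrottle(64 * 1024, 100);
        System.out.println(t.onShipment(1024, 10));   // tiny diff: skip zk write
        System.out.println(t.onShipment(70000, 10));  // crossed 64 KB: write
    }
}
```

The tradeoff is that after a regionserver failure, replication may re-ship the edits since the last persisted position, which is acceptable because replication already tolerates duplicates.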
[jira] [Created] (HBASE-12641) Grant all permissions of hbase zookeeper node to hbase superuser in a secure cluster
Liu Shaohui created HBASE-12641: --- Summary: Grant all permissions of hbase zookeeper node to hbase superuser in a secure cluster Key: HBASE-12641 URL: https://issues.apache.org/jira/browse/HBASE-12641 Project: HBase Issue Type: Improvement Components: Zookeeper Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor

Currently in a secure cluster, only the master/regionserver kerberos user can manage the znodes of hbase. But the master/regionserver kerberos user is for rpc connections, and we usually use another super user to manage the cluster. In some special scenarios, we need to manage the znode data with that super user, e.g.:
a, To get the data of a znode for debugging.
b, HBASE-8253: We need to delete the znode for a corrupted hlog to avoid it blocking replication.
So we grant all permissions on the hbase zookeeper nodes to the hbase superuser when creating these znodes. Suggestions are welcomed. [~apurtell]
[jira] [Created] (HBASE-12739) Avoid too large identifier of ZooKeeperWatcher
Liu Shaohui created HBASE-12739: --- Summary: Avoid too large identifier of ZooKeeperWatcher Key: HBASE-12739 URL: https://issues.apache.org/jira/browse/HBASE-12739 Project: HBase Issue Type: Bug Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor

For each SyncConnected event, the ZooKeeperWatcher appends the session id to its identifier. During a zk failover, the zookeeper client can connect to the zk server, but the zk server cannot serve the requests, so the client retries continually, which produces many SyncConnected events and a very large ZooKeeperWatcher identifier in the hbase log.
{code}
2014-12-22,12:38:56,296 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: master:16500-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-0x349cbb4e4a7f0ba-...
{code}
A simple patch can fix this problem.
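One simple shape the fix could take is to rebuild the identifier from a fixed prefix on each reconnect instead of appending to the previous value (a sketch; the class and method names are illustrative, not the actual patch):

```java
public class WatcherIdentifier {
    private final String prefix;  // e.g. "master:16500"
    private volatile String identifier;

    WatcherIdentifier(String prefix) {
        this.prefix = prefix;
        this.identifier = prefix;
    }

    // On every SyncConnected event, derive the identifier from the fixed
    // prefix plus the CURRENT session id, instead of appending to the old
    // identifier, so a flapping session cannot make it grow without bound.
    void onSyncConnected(long sessionId) {
        identifier = prefix + "-0x" + Long.toHexString(sessionId);
    }

    String getIdentifier() {
        return identifier;
    }

    public static void main(String[] args) {
        WatcherIdentifier w = new WatcherIdentifier("master:16500");
        // Simulate a flapping zk session delivering 1000 SyncConnected events.
        for (int i = 0; i < 1000; i++) {
            w.onSyncConnected(0x349cbb4e4a7f0baL);
        }
        System.out.println(w.getIdentifier()); // stays one session id long
    }
}
```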
[jira] [Created] (HBASE-12801) Failed to truncate a table while maintaining binary region boundaries
Liu Shaohui created HBASE-12801: --- Summary: Failed to truncate a table while maintaining binary region boundaries Key: HBASE-12801 URL: https://issues.apache.org/jira/browse/HBASE-12801 Project: HBase Issue Type: Bug Components: shell Affects Versions: 0.94.11 Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor

Binary region boundaries become corrupted when they are converted from byte arrays to normal strings and back to byte arrays in truncate_preserve of admin.rb, which makes the table truncation fail. See the truncate_preserve method in admin.rb:
{code}
splits = h_table.getRegionLocations().keys().map{|i| Bytes.toString(i.getStartKey)}.delete_if{|k| k == ""}.to_java :String
splits = org.apache.hadoop.hbase.util.Bytes.toByteArrays(splits)
{code}
E.g.:
{code}
\xFA\x00\x00\x00\x00\x00\x00\x00 -> \xEF\xBF\xBD\x00\x00\x00\x00\x00\x00\x00
\xFC\x00\x00\x00\x00\x00\x00\x00 -> \xEF\xBF\xBD\x00\x00\x00\x00\x00\x00\x00
{code}
A simple patch is to use binary strings instead of normal strings.
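The corruption itself is easy to reproduce in plain Java (a standalone demonstration of the round trip, not HBase code): any byte that is not valid UTF-8 decodes to U+FFFD and re-encodes as EF BF BD, exactly as in the example above, so distinct boundaries like \xFA... and \xFC... collapse into the same key:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BoundaryRoundTrip {
    // Decode a binary region start key to a String via UTF-8 (as
    // Bytes.toString does) and encode it back, the way the old
    // truncate_preserve effectively did.
    static byte[] roundTrip(byte[] key) {
        String s = new String(key, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] key = {(byte) 0xFA, 0, 0, 0, 0, 0, 0, 0};
        byte[] damaged = roundTrip(key);
        // 0xFA is not a valid UTF-8 sequence, so the decoder replaces it
        // with U+FFFD, which re-encodes as the three bytes EF BF BD.
        System.out.println(Arrays.toString(Arrays.copyOf(damaged, 3)));
        // \xFC suffers the same replacement, so two different boundaries
        // now map to identical split keys and the truncation fails.
    }
}
```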
[jira] [Created] (HBASE-12865) Wals may be deleted before they are replicated to peers
Liu Shaohui created HBASE-12865: --- Summary: Wals may be deleted before they are replicated to peers Key: HBASE-12865 URL: https://issues.apache.org/jira/browse/HBASE-12865 Project: HBase Issue Type: Bug Reporter: Liu Shaohui

By design, the ReplicationLogCleaner guarantees that WALs in a replication queue can't be deleted by the HMaster. The ReplicationLogCleaner gets the WAL set from zookeeper by scanning the replication zk node, but it may get an incomplete WAL set during replication failover because the scan operation is not atomic. For example: there are three region servers, rs1, rs2, rs3, and a peer with id 10. The layout of the replication zookeeper nodes is:
{code}
/hbase/replication/rs/rs1/10/wals
                     /rs2/10/wals
                     /rs3/10/wals
{code}
- t1: the ReplicationLogCleaner finishes scanning the replication queue of rs1 and starts to scan the queue of rs2.
- t2: region server rs3 goes down, and rs1 takes over rs3's replication queue. The new layout is:
{code}
/hbase/replication/rs/rs1/10/wals
                     /rs1/10-rs3/wals
                     /rs2/10/wals
                     /rs3
{code}
- t3: the ReplicationLogCleaner finishes scanning the queue of rs2 and starts to scan the node of rs3, but the queue has already been moved to /hbase/replication/rs/rs1/10-rs3/wals.
So the ReplicationLogCleaner will miss the WALs of rs3 in peer 10, and the hmaster may delete these WALs before they are replicated to the peer clusters. We encountered this problem in our cluster, and I think it's a serious bug for replication. Suggestions are welcomed to fix this bug. thx~
[jira] [Created] (HBASE-12884) Asynchronous Event Notification in HBase
Liu Shaohui created HBASE-12884: --- Summary: Asynchronous Event Notification in HBase Key: HBASE-12884 URL: https://issues.apache.org/jira/browse/HBASE-12884 Project: HBase Issue Type: New Feature Reporter: Liu Shaohui

*Background*
In many scenarios, we need an asynchronous event notification mechanism on HBase to know which data has changed, so that users can trigger pre-defined reactions to these events. For example:
* Incremental statistics of data in HBase
* Auditing changes of important data
* Cleaning invalid data in other cache systems

*Possible features*
* The mechanism is scalable.
* The notification is asynchronous. We don't want to affect the write performance of HBase.
* The notification is reliable. Events can't be lost, but we can tolerate duplicated events.

*Possible solution*
* Event notification based on replication: transform the WAL edits to events and replicate them to a special peer that users implement.

This is just a brief thought about this feature. Discussions and suggestions are welcomed! Thanks.
[jira] [Created] (HBASE-12916) No access control for replicating WAL entries
Liu Shaohui created HBASE-12916: --- Summary: No access control for replicating WAL entries Key: HBASE-12916 URL: https://issues.apache.org/jira/browse/HBASE-12916 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 0.94.26, 2.0.0, 0.98.12 Reporter: Liu Shaohui Assignee: Liu Shaohui

Currently, there is no access control for replicating WAL entries into a secure HBase cluster. Any authenticated user can write any data they want to any table of a secure cluster by using the replication api. A simple solution is to add a permission check before replicating WAL entries, so that only a user with the global write permission can replicate WAL entries to the cluster. Another option is to add a "Replication" action in hbase, so that only a user with the "Replication" permission can replicate WAL entries to the cluster. [~apurtell] What's your suggestion? Thanks
[jira] [Created] (HBASE-12921) Port HBASE-5356 'region_mover.rb can hang if table region it belongs to is deleted' to 0.94
Liu Shaohui created HBASE-12921: --- Summary: Port HBASE-5356 'region_mover.rb can hang if table region it belongs to is deleted' to 0.94 Key: HBASE-12921 URL: https://issues.apache.org/jira/browse/HBASE-12921 Project: HBase Issue Type: Bug Affects Versions: 0.94.11 Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Fix For: 0.94.28

This is a backport of HBASE-5356 'region_mover.rb can hang if table region it belongs to is deleted' to 0.94. [~lhofhansl]
[jira] [Created] (HBASE-12943) Set sun.net.inetaddr.ttl in HBase
Liu Shaohui created HBASE-12943: --- Summary: Set sun.net.inetaddr.ttl in HBase Key: HBASE-12943 URL: https://issues.apache.org/jira/browse/HBASE-12943 Project: HBase Issue Type: Bug Reporter: Liu Shaohui Assignee: Liu Shaohui

The default value of the config sun.net.inetaddr.ttl is -1, so Java processes will cache the mapping of hostname to ip address forever. See: http://docs.oracle.com/javase/7/docs/technotes/guides/net/properties.html
But things go wrong when a regionserver with the same hostname and a different ip address rejoins the hbase cluster: the HMaster gets the wrong ip address for the regionserver from this cache, and every region assignment to this regionserver is blocked for a time because the HMaster can't communicate with the regionserver. A tradeoff is to set sun.net.inetaddr.ttl to 10m or 1h so that the stale cache entry expires. Suggestions are welcomed. Thanks~
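Besides passing -Dsun.net.inetaddr.ttl=600 on the JVM command line, the same knob has a documented spelling as a JDK security property; a minimal sketch (the 600-second value is the tradeoff suggested above, not an HBase default, and the property must be set before the first lookup is cached to take effect):

```java
import java.security.Security;

public class DnsCacheTtl {
    public static void main(String[] args) {
        // networkaddress.cache.ttl is the security-property form of the
        // sun.net.inetaddr.ttl system property: seconds to cache a
        // successful hostname-to-address lookup (-1 means forever).
        Security.setProperty("networkaddress.cache.ttl", "600");
        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
    }
}
```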
[jira] [Created] (HBASE-13199) Some small improvements on canary tool
Liu Shaohui created HBASE-13199: --- Summary: Some small improvements on canary tool Key: HBASE-13199 URL: https://issues.apache.org/jira/browse/HBASE-13199 Project: HBase Issue Type: Improvement Reporter: Liu Shaohui Assignee: Liu Shaohui

Improvements:
- Make the sniffing of regions and regionservers parallel using a thread pool, to support large clusters with 1+ regions and 500+ regionservers.
- Set cacheBlocks to false in gets and scans to avoid polluting the block cache.
- Add a FirstKeyOnlyFilter to gets and scans to avoid reading and transferring too much data from HBase; there may be many columns under a column family in a flat-wide table.
- Randomly select the region when sniffing a regionserver.
[~stack] Suggestions are welcomed. Thanks~
Another question: why check each column family with a separate request when sniffing a region? Can we just check a single column family of a region?
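The parallel-sniff idea can be sketched with a plain thread pool (a simulation: the task body is a placeholder where the real canary would issue a Get with setCacheBlocks(false) and a FirstKeyOnlyFilter; class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSniff {
    // Sniff every region concurrently and collect one latency per region,
    // instead of probing thousands of regions one after another.
    static List<Long> sniffAll(List<String> regions, int poolSize)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (final String region : regions) {
                futures.add(pool.submit(new Callable<Long>() {
                    public Long call() {
                        long start = System.nanoTime();
                        // Real code would issue a lightweight read against
                        // `region` here (cacheBlocks=false, FirstKeyOnlyFilter).
                        return System.nanoTime() - start;
                    }
                }));
            }
            List<Long> latencies = new ArrayList<>();
            for (Future<Long> f : futures) {
                latencies.add(f.get()); // propagate any sniff failure
            }
            return latencies;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> regions = java.util.Arrays.asList("r1", "r2", "r3");
        System.out.println(sniffAll(regions, 2).size());
    }
}
```

The pool size caps the concurrent load the canary puts on the cluster while still letting the total sniff time scale with the slowest probe rather than the sum of all probes.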
[jira] [Created] (HBASE-13216) Add version info in RPC connection header
Liu Shaohui created HBASE-13216: --- Summary: Add version info in RPC connection header Key: HBASE-13216 URL: https://issues.apache.org/jira/browse/HBASE-13216 Project: HBase Issue Type: Improvement Components: Client, rpc Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Fix For: 2.0.0

When operating a cluster, we usually want to know which clients are using an HBase client version with critical bugs, or a version too old to be supported in the future. By adding version info to the RPC connection header, we can get this information from the audit log and prompt those users to upgrade before a deadline. Discussions and suggestions are welcomed. Thanks.
[jira] [Created] (HBASE-13280) TestSecureRPC failed
Liu Shaohui created HBASE-13280: --- Summary: TestSecureRPC failed Key: HBASE-13280 URL: https://issues.apache.org/jira/browse/HBASE-13280 Project: HBase Issue Type: Test Reporter: Liu Shaohui Priority: Minor

{code}
Running org.apache.hadoop.hbase.security.TestSecureRPC
Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 33.795 sec <<< FAILURE! - in org.apache.hadoop.hbase.security.TestSecureRPC
testRpc(org.apache.hadoop.hbase.security.TestSecureRPC) Time elapsed: 14.963 sec <<< ERROR!
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
 at java.util.ArrayList.rangeCheck(ArrayList.java:635)
 at java.util.ArrayList.get(ArrayList.java:411)
 at org.apache.hadoop.hbase.security.TestSecureRPC.testRpcCallWithEnabledKerberosSaslAuth(TestSecureRPC.java:160)
 at org.apache.hadoop.hbase.security.TestSecureRPC.testRpc(TestSecureRPC.java:102)

testAsyncRpc(org.apache.hadoop.hbase.security.TestSecureRPC) Time elapsed: 5.15 sec <<< ERROR!
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
 at java.util.ArrayList.rangeCheck(ArrayList.java:635)
 at java.util.ArrayList.get(ArrayList.java:411)
 at org.apache.hadoop.hbase.security.TestSecureRPC.testRpcCallWithEnabledKerberosSaslAuth(TestSecureRPC.java:160)
 at org.apache.hadoop.hbase.security.TestSecureRPC.testAsyncRpc(TestSecureRPC.java:107)
{code}
From the log we saw:
{code}
2015-03-19 11:27:02,271 WARN [Thread-5] ipc.RpcClientImpl$Connection$1(662): Couldn't setup connection for hbase/liushaohui-optiplex-...@example.com to hbase/liushaohui-optiplex-...@example.com
Exception in thread "Thread-5" java.lang.RuntimeException: com.google.protobuf.ServiceException: java.io.IOException: Couldn't setup connection for hbase/liushaohui-optiplex-...@example.com to hbase/liushaohui-optiplex-...@example.com
 at org.apache.hadoop.hbase.ipc.TestDelayedRpc$TestThread.run(TestDelayedRpc.java:275)
Caused by: com.google.protobuf.ServiceException: java.io.IOException: Couldn't setup connection for hbase/liushaohui-optiplex-...@example.com to hbase/liushaohui-optiplex-...@example.com
 at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:223)
 at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
 at org.apache.hadoop.hbase.ipc.protobuf.generated.TestDelayedRpcProtos$TestDelayedService$BlockingStub.test(TestDelayedRpcProtos.java:1115)
 at org.apache.hadoop.hbase.ipc.TestDelayedRpc$TestThread.run(TestDelayedRpc.java:272)
Caused by: java.io.IOException: Couldn't setup connection for hbase/liushaohui-optiplex-...@example.com to hbase/liushaohui-optiplex-...@example.com
 at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$1.run(RpcClientImpl.java:663)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
 at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.handleSaslConnectionFailure(RpcClientImpl.java:635)
 at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:743)
 at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:885)
 at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:854)
 at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1170)
 at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
 ... 3 more
Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7) - Server not found in Kerberos database)]
 at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
 at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
 at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:609)
 at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$600(RpcClientImpl.java:154)
 at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:735)
 at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:732)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
 at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstr
[jira] [Created] (HBASE-13319) Support 64-bit total row number in PerformanceEvaluation
Liu Shaohui created HBASE-13319: --- Summary: Support 64-bit total row number in PerformanceEvaluation Key: HBASE-13319 URL: https://issues.apache.org/jira/browse/HBASE-13319 Project: HBase Issue Type: Improvement Components: Performance Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Currently the total row number in PerformanceEvaluation is a 32-bit value, which is not enough when testing a large hbase cluster with PerformanceEvaluation in mapreduce mode. Suggestions are welcome. Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332)
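A minimal sketch of the 32-bit problem: computing the total row count with int arithmetic silently wraps around, while widening to long keeps the true value. The numbers and method names below are illustrative, not taken from PerformanceEvaluation itself.

```java
// Demonstrates why a 32-bit total row counter is not enough for large runs.
public class RowCountOverflow {

    // 32-bit computation, as with an int total: silently wraps past 2^31 - 1.
    public static int totalRowsInt(int rowsPerClient, int clients) {
        return rowsPerClient * clients;
    }

    // Widening one operand to long before multiplying preserves the real total.
    public static long totalRowsLong(int rowsPerClient, int clients) {
        return (long) rowsPerClient * clients;
    }

    public static void main(String[] args) {
        int perClient = 10_000_000;
        int clients = 1_000; // 10 billion rows intended
        System.out.println(totalRowsInt(perClient, clients));  // wrapped value
        System.out.println(totalRowsLong(perClient, clients)); // 10000000000
    }
}
```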
[jira] [Created] (HBASE-13348) Separate the thread number configs for meta server and server operations
Liu Shaohui created HBASE-13348: --- Summary: Separate the thread number configs for meta server and server operations Key: HBASE-13348 URL: https://issues.apache.org/jira/browse/HBASE-13348 Project: HBase Issue Type: Bug Reporter: Liu Shaohui Assignee: Liu Shaohui Currently, the config keys for the thread numbers of meta server and server operations in HMaster are the same. See: HMaster.java #993 {code}
this.service.startExecutorService(ExecutorType.MASTER_SERVER_OPERATIONS,
    conf.getInt("hbase.master.executor.serverops.threads", 5));
this.service.startExecutorService(ExecutorType.MASTER_META_SERVER_OPERATIONS,
    conf.getInt("hbase.master.executor.serverops.threads", 5));
{code} In a large cluster, we usually enlarge the thread number for server operations separately, so that the master handles regionserver shutdown events quickly in some extreme cases. So I think we need to separate the thread number configs for the two operations. Suggestions are welcome. Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
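A standalone sketch of the proposed separation, using `java.util.Properties` in place of HBase's Configuration. The key name `hbase.master.executor.meta.serverops.threads` is a proposed name, not necessarily what the final patch used; falling back to the generic server-ops value keeps old configs working.

```java
import java.util.Properties;

// Sketch: read two separate thread-count keys instead of one shared key.
public class ExecutorThreadConfig {
    static final String SERVER_OPS_KEY = "hbase.master.executor.serverops.threads";
    // Proposed new key for meta server operations (hypothetical name).
    static final String META_SERVER_OPS_KEY = "hbase.master.executor.meta.serverops.threads";

    public static int serverOpsThreads(Properties conf) {
        return Integer.parseInt(conf.getProperty(SERVER_OPS_KEY, "5"));
    }

    // Falls back to the generic server-ops value so existing configs keep working.
    public static int metaServerOpsThreads(Properties conf) {
        return Integer.parseInt(conf.getProperty(META_SERVER_OPS_KEY,
                String.valueOf(serverOpsThreads(conf))));
    }

    // Convenience used below: parse "key=value" pairs into a Properties object.
    static Properties confOf(String... pairs) {
        Properties p = new Properties();
        for (String pair : pairs) {
            int eq = pair.indexOf('=');
            p.setProperty(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return p;
    }

    public static void main(String[] args) {
        Properties conf = confOf(SERVER_OPS_KEY + "=30");
        System.out.println(serverOpsThreads(conf));     // 30
        System.out.println(metaServerOpsThreads(conf)); // 30 (fallback)
        conf.setProperty(META_SERVER_OPS_KEY, "10");
        System.out.println(metaServerOpsThreads(conf)); // 10
    }
}
```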
[jira] [Created] (HBASE-13366) Throw DoNotRetryIOException instead of read only IOException
Liu Shaohui created HBASE-13366: --- Summary: Throw DoNotRetryIOException instead of read only IOException Key: HBASE-13366 URL: https://issues.apache.org/jira/browse/HBASE-13366 Project: HBase Issue Type: Bug Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Currently, a read-only region just throws an IOException to clients that send write requests to it, which causes the clients to retry the configured number of times or until the operation times out. Changing this exception to DoNotRetryIOException will make the client fail fast. Suggestions are welcome. Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332)
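A self-contained sketch of the fail-fast behavior. `DoNotRetryIOException` is stubbed here so the example runs on its own; in HBase it is `org.apache.hadoop.hbase.DoNotRetryIOException`, and the check would live in the region's write path rather than this illustrative class.

```java
import java.io.IOException;

// Sketch: a read-only region throwing a non-retriable exception on writes.
public class ReadOnlyRegionSketch {
    // Stub standing in for org.apache.hadoop.hbase.DoNotRetryIOException.
    static class DoNotRetryIOException extends IOException {
        DoNotRetryIOException(String msg) { super(msg); }
    }

    private final boolean readOnly;
    ReadOnlyRegionSketch(boolean readOnly) { this.readOnly = readOnly; }

    void checkWriteAllowed() throws IOException {
        if (readOnly) {
            // A plain IOException makes clients retry until timeout;
            // a DoNotRetryIOException tells them to give up immediately.
            throw new DoNotRetryIOException("region is read only");
        }
    }

    // True when a write to a read-only region fails with the fail-fast type.
    public static boolean failsFast() {
        try {
            new ReadOnlyRegionSketch(true).checkWriteAllowed();
            return false;
        } catch (IOException e) {
            return e instanceof DoNotRetryIOException;
        }
    }

    public static void main(String[] args) {
        System.out.println(failsFast()); // true
    }
}
```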
[jira] [Created] (HBASE-13367) Add a replication label to mutations from replication
Liu Shaohui created HBASE-13367: --- Summary: Add a replication label to mutations from replication Key: HBASE-13367 URL: https://issues.apache.org/jira/browse/HBASE-13367 Project: HBase Issue Type: Improvement Reporter: Liu Shaohui Assignee: Liu Shaohui In some scenarios, regions need to distinguish mutations issued by actual users from mutations shipped by replication from a peer cluster:
- Lower the priority of mutations from replication, to improve the latency of requests from actual users.
- Put a table into a replicated state to keep data consistent. In this state, the table rejects mutations from users but accepts mutations from replication and read requests from users.
So we need to add a replication label to mutations from replication. Suggestions and discussion are welcome. Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-13396) Cleanup unclosed writers in later writer rolling
Liu Shaohui created HBASE-13396: --- Summary: Cleanup unclosed writers in later writer rolling Key: HBASE-13396 URL: https://issues.apache.org/jira/browse/HBASE-13396 Project: HBase Issue Type: Bug Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Currently, the default value of hbase.regionserver.logroll.errors.tolerated is 2, which means a regionserver can tolerate at most two consecutive failures to close writers. Temporary network or namenode problems may cause such failures. After them, the HDFS client in the RS may keep renewing the lease on the writer's hlog, and the namenode will not recover the lease of that hlog. So the last block of the hlog stays in RBW (replica being written) state until the regionserver goes down. Blocks in this state block datanode decommission and other operations in HDFS. So I think we need a mechanism to clean up those unclosed writers afterwards. A simple solution is to record the unclosed writers and keep attempting to close them until it succeeds. Discussion and suggestions are welcome. Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332)
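The "record and retry on later rolls" idea can be sketched as below. The class and method names (`UnclosedWriterCleaner`, `remember`, `retryCloses`) are illustrative, not from the HBase patch; `flaky` is a demo helper simulating a writer whose close fails a few times before succeeding.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch: writers that fail to close during a log roll are remembered,
// and each later roll retries closing them until it succeeds.
public class UnclosedWriterCleaner {
    private final List<Closeable> unclosed = new ArrayList<>();

    // Called when a log roll fails to close the old writer.
    public void remember(Closeable writer) { unclosed.add(writer); }

    // Called on every later roll: retry each pending close,
    // keeping the ones that still fail for the next attempt.
    public void retryCloses() {
        for (Iterator<Closeable> it = unclosed.iterator(); it.hasNext(); ) {
            try {
                it.next().close();
                it.remove();
            } catch (IOException e) {
                // Still failing (e.g. transient network issue); keep for next roll.
            }
        }
    }

    public int pending() { return unclosed.size(); }

    // Demo helper: a writer whose close() fails the first `failures` calls.
    static Closeable flaky(int failures) {
        int[] left = { failures };
        return () -> { if (left[0]-- > 0) throw new IOException("transient close failure"); };
    }

    public static void main(String[] args) {
        UnclosedWriterCleaner cleaner = new UnclosedWriterCleaner();
        cleaner.remember(flaky(1));
        cleaner.retryCloses();                 // first retry still fails
        System.out.println(cleaner.pending()); // 1
        cleaner.retryCloses();                 // second retry succeeds
        System.out.println(cleaner.pending()); // 0
    }
}
```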
[jira] [Created] (HBASE-13988) Add exception handler for lease thread
Liu Shaohui created HBASE-13988: --- Summary: Add exception handler for lease thread Key: HBASE-13988 URL: https://issues.apache.org/jira/browse/HBASE-13988 Project: HBase Issue Type: Bug Affects Versions: 2.0.0 Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor In a prod cluster, a region server exited because some important threads were no longer alive. After ruling out other threads from the log, we suspected the lease thread was the root cause. So we need to add an exception handler to the lease thread, so that we can see why it exits in the future. {quote}
2015-06-29,12:46:09,222 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: One or more threads are no longer alive -- stop
2015-06-29,12:46:09,223 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 21600 ...
2015-06-29,12:46:09,330 INFO org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
2015-06-29,12:46:09,330 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Thread-37 exiting
2015-06-29,12:46:09,330 INFO org.apache.hadoop.hbase.regionserver.HRegionServer$CompactionChecker: regionserver21600.compactionChecker exiting
2015-06-29,12:46:12,403 INFO org.apache.hadoop.hbase.regionserver.HRegionServer$PeriodicMemstoreFlusher: regionserver21600.periodicFlusher exiting
{quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
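The standard way to capture why a background thread died is `Thread.setUncaughtExceptionHandler`, sketched below. The thread body and class name are stand-ins, not the actual `Leases` implementation; only the handler wiring is the point.

```java
// Sketch: attach an uncaught-exception handler to a daemon thread (such as the
// lease checker) so the reason for a silent thread death is at least logged.
public class LeaseThreadHandlerSketch {
    static volatile Throwable lastUncaught;

    public static Thread startWithHandler(Runnable body, String name) {
        Thread t = new Thread(body, name);
        t.setDaemon(true);
        // Without this handler, a RuntimeException kills the thread with only
        // a default stack trace on stderr and no record of the cause.
        t.setUncaughtExceptionHandler((thread, e) -> {
            lastUncaught = e;
            System.err.println("Thread " + thread.getName() + " died: " + e);
        });
        t.start();
        return t;
    }

    // Demo: run a crashing thread and return the captured exception's message.
    public static String crashAndCapture() {
        Thread t = startWithHandler(
                () -> { throw new IllegalStateException("boom"); },
                "regionserver.leaseChecker-demo");
        try { t.join(); } catch (InterruptedException ie) { Thread.currentThread().interrupt(); }
        return lastUncaught == null ? null : lastUncaught.getMessage();
    }

    public static void main(String[] args) {
        System.out.println(crashAndCapture()); // boom
    }
}
```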
[jira] [Created] (HBASE-13996) Add write sniffing in canary
Liu Shaohui created HBASE-13996: --- Summary: Add write sniffing in canary Key: HBASE-13996 URL: https://issues.apache.org/jira/browse/HBASE-13996 Project: HBase Issue Type: New Feature Components: canary Reporter: Liu Shaohui Assignee: Liu Shaohui Currently the canary tool only sniffs read operations, so it is hard to find problems in the write path. To support write sniffing, we create a system table named '_canary_' in the canary tool. The tool makes sure that the number of regions is larger than the number of regionservers and that the regions are distributed onto all regionservers. Periodically, the tool writes data to these regions to calculate the write availability of HBase and sends alerts if needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-9526) LocalHBaseCluster.shutdown hangs while the regionserver thread waits on the zk deleteMyEphemeralNode packet
Liu Shaohui created HBASE-9526: -- Summary: LocalHBaseCluster.shutdown hangs while the regionserver thread waits on the zk deleteMyEphemeralNode packet Key: HBASE-9526 URL: https://issues.apache.org/jira/browse/HBASE-9526 Project: HBase Issue Type: Test Affects Versions: 0.94.3 Reporter: Liu Shaohui Priority: Minor Attachments: stack.log When LocalHBaseCluster is shut down, it joins all regionserver threads. A regionserver thread tries to delete its ephemeral node and waits for the zk packet, but the ZooKeeper send thread no longer exists, so no one notifies the regionserver thread. {noformat} "RegionServer:0;10.237.14.236,43311,1378958529812" prio=10 tid=0x7f0a9d02 nid=0x18db in Object.wait() [0x7f0a8aa35000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:485) at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1309) - locked <0x8ece3de8> (a org.apache.zookeeper.ClientCnxn$Packet) at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:866) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:127) at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1038) at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1027) at org.apache.hadoop.hbase.regionserver.HRegionServer.deleteMyEphemeralNode(HRegionServer.java:1073) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:851) at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.runRegionServer(MiniHBaseCluster.java:147) at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.access$000(MiniHBaseCluster.java:100) at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer$1.run(MiniHBaseCluster.java:131) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:337) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1340) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.util.Methods.call(Methods.java:37) at org.apache.hadoop.hbase.security.User.call(User.java:603) at org.apache.hadoop.hbase.security.User.access$700(User.java:50) at org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs(User.java:443) at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.run(MiniHBaseCluster.java:129) {noformat} This situation occurs randomly; if I rerun the test, it may pass. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-9543) Impl unique aggregation
Liu Shaohui created HBASE-9543: -- Summary: Impl unique aggregation Key: HBASE-9543 URL: https://issues.apache.org/jira/browse/HBASE-9543 Project: HBase Issue Type: New Feature Components: Coprocessors Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Implement unique aggregation: return the set of all distinct column values in a scan. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-9568) backport HBASE-6508 to 0.94
Liu Shaohui created HBASE-9568: -- Summary: backport HBASE-6508 to 0.94 Key: HBASE-9568 URL: https://issues.apache.org/jira/browse/HBASE-9568 Project: HBase Issue Type: Improvement Components: MTTR Reporter: Liu Shaohui Priority: Minor Backport HBASE-6508: Filter out edits at log split time to hbase 0.94 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-9764) htable AutoFlush is hardcoded as false in PerformanceEvaluation
Liu Shaohui created HBASE-9764: -- Summary: htable AutoFlush is hardcoded as false in PerformanceEvaluation Key: HBASE-9764 URL: https://issues.apache.org/jira/browse/HBASE-9764 Project: HBase Issue Type: Bug Components: Performance, test Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor In PerformanceEvaluation, the htable AutoFlush option is hardcoded as false: {code:title=PerformanceEvaluation.java|borderStyle=solid}
void testSetup() throws IOException {
  this.admin = new HBaseAdmin(conf);
  this.table = new HTable(conf, tableName);
  this.table.setAutoFlush(false);
  this.table.setScannerCaching(30);
}
{code} This makes the write performance unrealistic. Should we add an autoflush option to PerformanceEvaluation? -- This message was sent by Atlassian JIRA (v6.1#6144)
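A minimal sketch of such an option, assuming a `--autoflush=` command-line flag; the flag name and parsing style are a proposal, not necessarily what the eventual patch added. The parsed value would then feed `this.table.setAutoFlush(...)` in `testSetup` instead of the hardcoded `false`.

```java
// Sketch: parse an --autoflush flag for PerformanceEvaluation instead of
// hardcoding setAutoFlush(false).
public class AutoFlushOption {
    public static boolean parseAutoFlush(String[] args) {
        for (String arg : args) {
            if (arg.startsWith("--autoflush=")) {
                return Boolean.parseBoolean(arg.substring("--autoflush=".length()));
            }
        }
        return false; // default preserves today's behavior: buffered writes
    }

    public static void main(String[] args) {
        System.out.println(parseAutoFlush(new String[] {"--autoflush=true", "randomWrite"})); // true
        System.out.println(parseAutoFlush(new String[] {"randomWrite"}));                     // false
    }
}
```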
[jira] [Created] (HBASE-9780) Total row number to write in PerformanceEvaluation varies with thread number
Liu Shaohui created HBASE-9780: -- Summary: Total row number to write in PerformanceEvaluation varies with thread number Key: HBASE-9780 URL: https://issues.apache.org/jira/browse/HBASE-9780 Project: HBase Issue Type: Improvement Reporter: Liu Shaohui Priority: Minor The total number of rows to write in PerformanceEvaluation varies with the thread number: {code}
// Set total number of rows to write.
this.R = this.R * N;
{code} A different row count may result in different random read performance: more threads mean more rows, and thus a lower block cache hit ratio. Should we keep the total row count constant across different thread numbers? -- This message was sent by Atlassian JIRA (v6.1#6144)
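One way to keep the total fixed is to divide the requested total among the clients instead of multiplying by their count. The variable names mirror PE's `R` (rows) and `N` (clients), but the split logic below is our proposal, not PE's current code.

```java
// Sketch: a fixed total row count, split across N clients
// (the last client absorbs the remainder).
public class FixedTotalRows {
    public static long rowsForClient(long totalRows, int clients, int clientIndex) {
        long share = totalRows / clients;
        return clientIndex == clients - 1 ? totalRows - share * (clients - 1) : share;
    }

    // Sum of all shares, showing the total is independent of the client count.
    public static long totalAcrossClients(long totalRows, int clients) {
        long sum = 0;
        for (int i = 0; i < clients; i++) {
            sum += rowsForClient(totalRows, clients, i);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(totalAcrossClients(1_000_001, 3)); // 1000001
        System.out.println(totalAcrossClients(1_000_001, 7)); // 1000001
    }
}
```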
[jira] [Created] (HBASE-9873) Some improvements in hlog and hlog split
Liu Shaohui created HBASE-9873: -- Summary: Some improvements in hlog and hlog split Key: HBASE-9873 URL: https://issues.apache.org/jira/browse/HBASE-9873 Project: HBase Issue Type: Improvement Reporter: Liu Shaohui Some improvements in hlog and hlog split:
1) Try to clean old hlogs after each memstore flush, to avoid unnecessary hlog splits in failover. Currently hlog cleaning only runs when rolling the hlog writer.
2) Add a background hlog compaction thread that compacts hlogs by removing the entries whose data have already been flushed to hfiles. The scenario: in a shared cluster, the write requests of a table may be very small and periodic, so many hlogs cannot be cleaned because they contain entries of that table.
3) Rely on the smallest of the largest hfile seqIds of the previously served regions to skip some entries. Facebook implemented this in HBASE-6508 and we backported it to hbase 0.94 in HBASE-9568.
4) Support running multiple hlog splitters on a single RS and on the master (the latter can boost split efficiency for a tiny cluster).
5) Enable multiple splitters on a 'big' hlog file by logically splitting the hlog into slices of configurable size (e.g. the hdfs block size, 64M), and support concurrent split tasks on a single hlog file slice.
6) Do not cancel a timed-out split task until another task reports success (this avoids the scenario where the split of an hlog file fails because no single task can succeed within the timeout period), and reschedule an identical split task to reduce split time (to avoid stragglers in hlog split).
7) Consider hlog data locality when scheduling split tasks: schedule the hlog to a splitter that is near the hlog data.
8) Support multiple hlog writers and switch to another hlog writer when write latency to the current hlog grows, due to a possible temporary network spike.
This is a draft listing the hlog improvements we plan to implement in the near future. Comments and discussion are welcome. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (HBASE-9892) Add info port to ServerName to support multiple instances on a node
Liu Shaohui created HBASE-9892: -- Summary: Add info port to ServerName to support multiple instances on a node Key: HBASE-9892 URL: https://issues.apache.org/jira/browse/HBASE-9892 Project: HBase Issue Type: Improvement Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor The full GC time of a regionserver with a big heap (> 30G) usually cannot be kept under 30s, while servers with 64G memory are common. So we deploy multiple RS instances (2-3) on a single node, each with a heap of about 20G ~ 24G. Most things work fine, except the hbase web UI: the master gets the RS info port from the conf, which is not suitable when a node runs multiple RS instances. So we add the info port to ServerName:
a. At startup, the RS reports its info port to the HMaster.
b. For the root region, the RS writes the servername with info port to the zookeeper root-region-server node.
c. For meta regions, the RS writes the servername with info port to the root region.
d. For user regions, the RS writes the servername with info port to the meta regions.
So the HMaster and clients can get the info port from the servername. To test this feature, I changed the RS number from 1 to 3 in standalone mode, so we can test it in standalone mode. I think Hoya (hbase on yarn) will encounter the same problem. Does anyone know how Hoya handles this? PS: There are different formats for the servername in the zk node and the meta table; I think we need to unify them and refactor the code. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (HBASE-9974) Rest sometimes returns incomplete xml/json data
Liu Shaohui created HBASE-9974: -- Summary: Rest sometimes returns incomplete xml/json data Key: HBASE-9974 URL: https://issues.apache.org/jira/browse/HBASE-9974 Project: HBase Issue Type: Bug Components: REST Reporter: Liu Shaohui Rest sometimes returns incomplete xml/json data. We found these exceptions in the REST server. 13/11/15 11:40:51 ERROR mortbay.log:/log/1A:23:11:0C:06:22* javax.ws.rs.WebApplicationException: javax.xml.bind.MarshalException - with linked exception: [org.mortbay.jetty.EofException] at com.sun.jersey.core.provider.jaxb.AbstractRootElementProvider.writeTo(AbstractRootElementProvider.java:159) at com.sun.jersey.spi.container.ContainerResponse.write(ContainerResponse.java:306) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1437) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:699) at javax.servlet.http.HttpServlet.service(HttpServlet.java:847) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) at org.apache.hadoop.hbase.rest.filter.GzipFilter.doFilter(GzipFilter.java:73) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at 
org.mortbay.jetty.Server.handle(Server.java:322) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) Caused by: javax.xml.bind.MarshalException - with linked exception: [org.mortbay.jetty.EofException] at com.sun.xml.bind.v2.runtime.MarshallerImpl.write(MarshallerImpl.java:325) at com.sun.xml.bind.v2.runtime.MarshallerImpl.marshal(MarshallerImpl.java:249) at javax.xml.bind.helpers.AbstractMarshallerImpl.marshal(AbstractMarshallerImpl.java:75) at com.sun.jersey.json.impl.JSONMarshallerImpl.marshal(JSONMarshallerImpl.java:74) at com.sun.jersey.core.provider.jaxb.AbstractRootElementProvider.writeTo(AbstractRootElementProvider.java:179) at com.sun.jersey.core.provider.jaxb.AbstractRootElementProvider.writeTo(AbstractRootElementProvider.java:157) ... 
24 more Caused by: org.mortbay.jetty.EofException at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:791) at org.mortbay.jetty.AbstractGenerator$Output.blockForOutput(AbstractGenerator.java:551) at org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:572) at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:1012) at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:651) at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:580) at com.sun.jersey.spi.container.servlet.WebComponent$Writer.write(WebComponent.java:307) at com.sun.jersey.spi.container.ContainerResponse$CommittingOutputStream.write(ContainerResponse.java:134) at com.sun.xml.bind.v2.runtime.output.UTF8XmlOutput.flushBuffer(UTF8XmlOutput.java:416) at com.sun.xml.bind.v2.runtime.output.UTF8XmlOutput.text(UTF8XmlOutput.java:369) at com.sun.xml.bind.v2.runtime.unmarshaller.Base64Data.writeTo(Base64Data.java:303) at com.sun.xml.bind.v2.runtime.output.UTF8XmlOutput.text(UTF8XmlOutput.java:310) at com.sun.xml.bind.v2.runtime.XMLSerializer.text(XMLSerializer.java:425) at com.sun.xml.bind.v2.model.impl.RuntimeBuiltinLeafInfoImpl$PcdataImpl.writeText(RuntimeBuiltinLeafInfoImpl.java:
[jira] [Resolved] (HBASE-9974) Rest sometimes returns incomplete xml/json data
[ https://issues.apache.org/jira/browse/HBASE-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liu Shaohui resolved HBASE-9974. Resolution: Not A Problem Assignee: Liu Shaohui
[jira] [Created] (HBASE-10048) Add hlog number metric in regionserver
Liu Shaohui created HBASE-10048: --- Summary: Add hlog number metric in regionserver Key: HBASE-10048 URL: https://issues.apache.org/jira/browse/HBASE-10048 Project: HBase Issue Type: Improvement Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Add an hlog number metric in the regionserver. We can use this metric to alert on memstore flushes forced by too many hlogs. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (HBASE-10049) Small improvements in region_mover.rb
Liu Shaohui created HBASE-10049: --- Summary: Small improvements in region_mover.rb Key: HBASE-10049 URL: https://issues.apache.org/jira/browse/HBASE-10049 Project: HBase Issue Type: Improvement Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor We use region_mover.rb for graceful upgrades of an hbase cluster. Here are some small improvements:
a. Remove the table.close(), because the htable can be reused.
b. Add more info to the log message for moving a region.
c. Add a 20s sleep to the load command to make sure the RS has finished initializing its RPC server; there is a time gap between the RS startup report and the RPC server initialization.
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (HBASE-10054) Add the default column compression option
Liu Shaohui created HBASE-10054: --- Summary: Add the default column compression option Key: HBASE-10054 URL: https://issues.apache.org/jira/browse/HBASE-10054 Project: HBase Issue Type: Improvement Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Add a cluster-level default column compression option. If users do not set compression for a column family, the default compression should be used. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (HBASE-10055) Add option to limit the scan speed in CopyTable and VerifyReplication
Liu Shaohui created HBASE-10055: --- Summary: Add option to limit the scan speed in CopyTable and VerifyReplication Key: HBASE-10055 URL: https://issues.apache.org/jira/browse/HBASE-10055 Project: HBase Issue Type: Improvement Reporter: Liu Shaohui Priority: Minor Add an option to limit the scan speed in CopyTable and VerifyReplication. When adding a new replication peer, we use 'CopyTable' to copy old data from the online cluster to the peer cluster. After that, we use 'VerifyReplication' to check the data consistency between the two clusters. To reduce the impact on the online cluster's service, we add an option to limit the scan speed. -- This message was sent by Atlassian JIRA (v6.1#6144)
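A simple way to implement such a limit is a per-mapper throttle: after each batch of rows, sleep long enough that the average rate stays at or below a configured rows-per-second ceiling. The sketch below is our illustration of the idea; the class and method names are not from the HBase patch.

```java
// Sketch: a rows-per-second throttle for a scanning mapper.
public class ScanThrottle {
    // Pure helper: pause (ms) needed after `rowsSeen` rows in `elapsedMs` ms
    // to keep the average rate at `rowsPerSecond`.
    public static long pauseMillis(long rowsSeen, long rowsPerSecond, long elapsedMs) {
        long expectedMs = rowsSeen * 1000 / rowsPerSecond;
        return Math.max(0, expectedMs - elapsedMs);
    }

    private final long rowsPerSecond;
    private final long startNanos = System.nanoTime();
    private long rowsSeen;

    public ScanThrottle(long rowsPerSecond) { this.rowsPerSecond = rowsPerSecond; }

    // Called by the mapper after processing a batch of rows.
    public void throttle(long batchRows) throws InterruptedException {
        rowsSeen += batchRows;
        long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
        long pause = pauseMillis(rowsSeen, rowsPerSecond, elapsedMs);
        if (pause > 0) Thread.sleep(pause);
    }

    public static void main(String[] args) {
        // 500 rows at a 1000 rows/s limit with only 100 ms elapsed -> wait 400 ms.
        System.out.println(pauseMillis(500, 1000, 100)); // 400
        // Already behind schedule -> no pause needed.
        System.out.println(pauseMillis(500, 1000, 600)); // 0
    }
}
```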
[jira] [Created] (HBASE-10335) AuthFailedException in zookeeper may block replication forever
Liu Shaohui created HBASE-10335: --- Summary: AuthFailedException in zookeeper may block replication forever Key: HBASE-10335 URL: https://issues.apache.org/jira/browse/HBASE-10335 Project: HBase Issue Type: Bug Components: Replication, security Reporter: Liu Shaohui ReplicationSource rechooses sinks when it encounters exceptions while shipping edits to the current sink. But if the zookeeper client for the peer cluster goes into the AUTH_FAILED state, ReplicationSource will always get AuthFailedException. ReplicationSource does not reconnect the peer, because reconnectPeer only handles ConnectionLossException and SessionExpiredException. As a result, replication just logs: {quote}
2014-01-14,12:07:06,892 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting 0 rs from peer cluster # 20
2014-01-14,12:07:06,892 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Slave cluster looks down: 20 has 0 region servers
{quote} and is blocked forever. I think other places may have the same problem of not handling AuthFailedException from zookeeper, e.g. HBASE-8675. [~apurtell] -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (HBASE-10370) Compaction in out-of-date Store causes region split failed
Liu Shaohui created HBASE-10370: --- Summary: Compaction in an out-of-date Store causes region split to fail Key: HBASE-10370 URL: https://issues.apache.org/jira/browse/HBASE-10370 Project: HBase Issue Type: Bug Components: Compaction Reporter: Liu Shaohui Priority: Critical In our production cluster, we encountered a problem where two daughter regions could not be opened because of FileNotFoundException. {quote} 2014-01-14,20:12:46,927 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup of failed split of user_profile,x,1389671863815.99e016485b0bc142d67ae07a884f6966.; Failed lg-hadoop-st34.bj,21600,1389060755669-daughterOpener=ec8bbda0f132c481b451fa40e7152b98 java.io.IOException: Failed lg-hadoop-st34.bj,21600,1389060755669-daughterOpener=ec8bbda0f132c481b451fa40e7152b98 at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughters(SplitTransaction.java:375) at org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:467) at org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:69) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.io.IOException: java.io.IOException: java.io.FileNotFoundException: File does not exist: /hbase/lgprc-xiaomi/user_profile/99e016485b0bc142d67ae07a884f6966/A/5e05d706e4a84f34acc2cf00f089a4cf {quote} The reason is that a compaction in an out-of-date Store deletes the hfiles which are referenced by the daughter regions after the split. This causes the daughter regions to never be opened. The timeline is as follows. Assumption: there are two hfiles, a and b, in Store A in Region R. t0: A compaction request for Store A(a+b) in Region R is sent. t1: A split for Region R. But the split times out and is rolled back. In the rollback, the region reinitializes all store objects, see SplitTransaction #824. 
Now the store in Region R is A'(a+b). t2: Run compaction(a + b -> c): A(a+b) -> A(c). Hfiles a and b are archived. t3: A split for Region R. R splits into two regions R.0 and R.1, which create hfile references to hfiles a and b from Store A'(a + b). t4: Because hfiles a and b have been deleted, the opening of regions R.0 and R.1 will fail with FileNotFoundException. I have added a test to reproduce this problem. After searching JIRA, HBASE-8502 may be the same problem. [~goldin] -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (HBASE-10535) Table trash to recover table deleted by mistake
Liu Shaohui created HBASE-10535: --- Summary: Table trash to recover table deleted by mistake Key: HBASE-10535 URL: https://issues.apache.org/jira/browse/HBASE-10535 Project: HBase Issue Type: Improvement Reporter: Liu Shaohui Priority: Minor When a table is deleted, only the hfiles are moved to the archive dir; the table and region infos are deleted immediately. So it's very difficult to recover tables which are deleted by mistake. I think we can introduce a table trash dir in HDFS. When a table is deleted, the entire table dir is moved to the trash dir, and after a configurable TTL the dir is actually deleted. This can be done by the HMaster. If we want to recover a deleted table, we can use a tool which moves the table dir out of the trash and recovers the table's metadata. The recovery tool will encounter many problems, eg: parent and daughter regions are all in the table dir. But I think this feature is useful to handle some special cases. Discussions are welcomed. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (HBASE-10583) backport HBASE-8402 to 0.94
Liu Shaohui created HBASE-10583: --- Summary: backport HBASE-8402 to 0.94 Key: HBASE-10583 URL: https://issues.apache.org/jira/browse/HBASE-10583 Project: HBase Issue Type: Bug Reporter: Liu Shaohui see HBASE-8402 -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (HBASE-10617) Value lost if "$" element is before "column" element in json when posted to Rest Server
Liu Shaohui created HBASE-10617: --- Summary: Value lost if "$" element is before "column" element in json when posted to Rest Server Key: HBASE-10617 URL: https://issues.apache.org/jira/browse/HBASE-10617 Project: HBase Issue Type: Bug Components: REST Affects Versions: 0.94.11 Reporter: Liu Shaohui Priority: Minor When posting the following json data to the rest server, it returns 200, but the value is null in HBase {code} {"Row": { "key":"cjI=", "Cell": {"$":"ZGF0YTE=", "column":"ZjE6YzI="}}} {code} From the rest server log, we found the length of the value is 0 after the server parses the json into a RowModel object {code} 14/02/26 17:52:14 DEBUG rest.RowResource: PUT {"totalColumns":1,"families":{"f1":[{"timestamp":9223372036854775807,"qualifier":"c2","vlen":0}]},"row":"r2"} {code} When the order is "column" before "$", it works fine. {code} {"Row": { "key":"cjI=", "Cell": {"column":"ZjE6YzI=", "$":"ZGF0YTE=" }}} {code} Different json libs may produce a different order of these two elements even if "column" is put before "$". -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (HBASE-10627) A logic mistake in HRegionServer isHealthy
Liu Shaohui created HBASE-10627: --- Summary: A logic mistake in HRegionServer isHealthy Key: HBASE-10627 URL: https://issues.apache.org/jira/browse/HBASE-10627 Project: HBase Issue Type: Bug Reporter: Liu Shaohui Priority: Minor After reading isHealthy in HRegionServer, I think there is a logic mistake. {code} // Verify that all threads are alive if (!(leases.isAlive() && cacheFlusher.isAlive() && hlogRoller.isAlive() && this.compactionChecker.isAlive()) < logic wrong here && this.periodicFlusher.isAlive()) { stop("One or more threads are no longer alive -- stop"); return false; } {code} which should be {code} // Verify that all threads are alive if (!(leases.isAlive() && cacheFlusher.isAlive() && hlogRoller.isAlive() && this.compactionChecker.isAlive() && this.periodicFlusher.isAlive())) { stop("One or more threads are no longer alive -- stop"); return false; } {code} Please point it out if I am wrong. Thx -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (HBASE-10692) The Multi TableMap job doesn't support secure HBase clusters
Liu Shaohui created HBASE-10692: --- Summary: The Multi TableMap job doesn't support secure HBase clusters Key: HBASE-10692 URL: https://issues.apache.org/jira/browse/HBASE-10692 Project: HBase Issue Type: Bug Components: mapreduce Reporter: Liu Shaohui Priority: Minor HBASE-3996 adds support for multiple tables and scanners as input to the mapper in map/reduce jobs. But it doesn't support secure HBase clusters. [~erank] [~bbaugher] Ps: HBASE-3996 only supports multiple tables from the same HBase cluster. Should we support multiple tables from different clusters? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-10774) Restore TestMultiTableInputFormat
Liu Shaohui created HBASE-10774: --- Summary: Restore TestMultiTableInputFormat Key: HBASE-10774 URL: https://issues.apache.org/jira/browse/HBASE-10774 Project: HBase Issue Type: Test Reporter: Liu Shaohui Priority: Minor TestMultiTableInputFormat was removed in HBASE-9009 because this test made the CI fail. But in HBASE-10692 we need to add a new test, TestSecureMultiTableInputFormat, which depends on it. So we try to restore it in this issue. I reran the test several times and it passed. {code} Running org.apache.hadoop.hbase.mapreduce.TestMultiTableInputFormat Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 314.163 sec {code} [~stack] -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-10782) Hadoop2 MR tests fail occasionally because mapreduce.jobhistory.address is not set in the job conf
Liu Shaohui created HBASE-10782: --- Summary: Hadoop2 MR tests fail occasionally because mapreduce.jobhistory.address is not set in the job conf Key: HBASE-10782 URL: https://issues.apache.org/jira/browse/HBASE-10782 Project: HBase Issue Type: Test Reporter: Liu Shaohui Priority: Minor Hadoop2 MR tests fail occasionally with output like this: {code} --- Test set: org.apache.hadoop.hbase.mapreduce.TestTableInputFormatScan1 --- Tests run: 5, Failures: 0, Errors: 5, Skipped: 0, Time elapsed: 347.57 sec <<< FAILURE! testScanEmptyToAPP(org.apache.hadoop.hbase.mapreduce.TestTableInputFormatScan1) Time elapsed: 50.047 sec <<< ERROR! java.io.IOException: java.net.ConnectException: Call From liushaohui-OptiPlex-990/127.0.0.1 to 0.0.0.0:10020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:334) at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:419) at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:524) at org.apache.hadoop.mapreduce.Job$1.run(Job.java:314) at org.apache.hadoop.mapreduce.Job$1.run(Job.java:311) at java.security.AccessController.doPrivileged(Native Method) ... {code} The reason is that while the MR job was running, the job client pulled the job status from the AppMaster. When the job is completed, the AppMaster exits. At this point, if the job client has not yet received the job-completed event from the AppMaster, it will try to get the job report from the history server. But in HBaseTestingUtility#startMiniMapReduceCluster, the config mapreduce.jobhistory.address is not copied to the TestUtil's config. CRUNCH-249 reported the same problem. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-10788) Add 99th percentile of latency in PE
Liu Shaohui created HBASE-10788: --- Summary: Add 99th percentile of latency in PE Key: HBASE-10788 URL: https://issues.apache.org/jira/browse/HBASE-10788 Project: HBase Issue Type: Improvement Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor In a production env, the 99th percentile of latency is more important than the average. The 99th percentile is helpful to measure the influence of GC and of slow reads/writes on HDFS. -- This message was sent by Atlassian JIRA (v6.2#6252)
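As a sketch of the metric itself, here is a nearest-rank 99th percentile over collected latency samples. This is only an illustration of the statistic; the eventual PE patch may use a different interpolation method.

```java
import java.util.Arrays;

public class LatencyStats {
  /**
   * Nearest-rank percentile: the smallest sample value that covers at least
   * fraction p of all samples (p in (0, 1], e.g. 0.99 for the 99th percentile).
   */
  public static long percentile(long[] latenciesMs, double p) {
    long[] sorted = latenciesMs.clone();
    Arrays.sort(sorted);
    // 1-based nearest rank ceil(p * n), converted to a 0-based index and clamped
    int idx = (int) Math.ceil(p * sorted.length) - 1;
    return sorted[Math.max(0, Math.min(idx, sorted.length - 1))];
  }
}
```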
[jira] [Created] (HBASE-10790) make assembly:single as default in pom.xml
Liu Shaohui created HBASE-10790: --- Summary: make assembly:single as default in pom.xml Key: HBASE-10790 URL: https://issues.apache.org/jira/browse/HBASE-10790 Project: HBase Issue Type: Improvement Reporter: Liu Shaohui Priority: Minor Now, to compile an HBase tar release package, we must use the cmd: {code} mvn clean package assembly:single {code}, which is not convenient. We can make assembly:single the default and run the assembly plugin in the maven package phase. Then we can just use the cmd {code} mvn clean package {code} to get a release package. Other suggestions are welcomed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-10806) Two protos missing in hbase-protocol/pom.xml
Liu Shaohui created HBASE-10806: --- Summary: Two protos missing in hbase-protocol/pom.xml Key: HBASE-10806 URL: https://issues.apache.org/jira/browse/HBASE-10806 Project: HBase Issue Type: Bug Environment: VisibilityLabels.proto and Encryption.proto are missing from hbase-protocol/pom.xml. The corresponding classes are not regenerated by the maven cmd: {code} mvn compile -Pcompile-protobuf {code} Reporter: Liu Shaohui -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-10839) NullPointerException in construction of RegionServer in Security Cluster
Liu Shaohui created HBASE-10839: --- Summary: NullPointerException in construction of RegionServer in Security Cluster Key: HBASE-10839 URL: https://issues.apache.org/jira/browse/HBASE-10839 Project: HBase Issue Type: Bug Components: regionserver Reporter: Liu Shaohui Priority: Critical The initialization of the secure rpc server depends on the regionserver's servername and zooKeeper watcher. But after HBASE-10569, they are null when the secure rpc services are created. [~jxiang] {code} Caused by: java.lang.NullPointerException at org.apache.hadoop.hbase.ipc.RpcServer.createSecretManager(RpcServer.java:1974) at org.apache.hadoop.hbase.ipc.RpcServer.start(RpcServer.java:1945) at org.apache.hadoop.hbase.regionserver.RSRpcServices.(RSRpcServices.java:706) at org.apache.hadoop.hbase.master.MasterRpcServices.(MasterRpcServices.java:190) at org.apache.hadoop.hbase.master.HMaster.createRpcServices(HMaster.java:297) at org.apache.hadoop.hbase.regionserver.HRegionServer.(HRegionServer.java:431) at org.apache.hadoop.hbase.master.HMaster.(HMaster.java:234) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-10846) Links between active and backup masters are broken
Liu Shaohui created HBASE-10846: --- Summary: Links between active and backup masters are broken Key: HBASE-10846 URL: https://issues.apache.org/jira/browse/HBASE-10846 Project: HBase Issue Type: Bug Reporter: Liu Shaohui Priority: Minor Links between active and backup masters are broken because of the blank before the info port in the url. {code} href="//wcc-hadoop-tst-ct01.bj: 12501/master-status" {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-10881) Support reverse scan in thrift2
Liu Shaohui created HBASE-10881: --- Summary: Support reverse scan in thrift2 Key: HBASE-10881 URL: https://issues.apache.org/jira/browse/HBASE-10881 Project: HBase Issue Type: New Feature Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Support reverse scan in thrift2. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-10943) Backport HBASE-7329 to 0.94
Liu Shaohui created HBASE-10943: --- Summary: Backport HBASE-7329 to 0.94 Key: HBASE-10943 URL: https://issues.apache.org/jira/browse/HBASE-10943 Project: HBase Issue Type: Improvement Affects Versions: 0.94.18 Reporter: Liu Shaohui Priority: Minor See HBASE-7329 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-11043) Users with table's read/write permission can't get table's description
Liu Shaohui created HBASE-11043: --- Summary: Users with table's read/write permission can't get table's description Key: HBASE-11043 URL: https://issues.apache.org/jira/browse/HBASE-11043 Project: HBase Issue Type: Bug Components: security Affects Versions: 0.99.0 Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor AccessController#preGetTableDescriptors only allows users with admin or create permission to get a table's description. {quote} requirePermission("getTableDescriptors", nameAsBytes, null, null, Permission.Action.ADMIN, Permission.Action.CREATE); {quote} I think users with a table's read/write permission should also be able to get the table's description. Eg: when creating a Hive table on HBase, Hive gets the table description to check whether the mapping is right. Usually the Hive users only have read permission on the table. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-11095) Add ip restriction in user permissions
Liu Shaohui created HBASE-11095: --- Summary: Add ip restriction in user permissions Key: HBASE-11095 URL: https://issues.apache.org/jira/browse/HBASE-11095 Project: HBase Issue Type: New Feature Components: security Reporter: Liu Shaohui Priority: Minor For some sensitive data, users want to restrict the source IPs of hbase users, like mysql access control. One direct solution is to add the candidate IPs when granting user permissions. {quote} grant [ [ [ ] ] ] {quote} Any comments and suggestions are welcomed. [~apurtell] -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-11115) Support setting max version per column family in Get
Liu Shaohui created HBASE-11115: --- Summary: Support setting max version per column family in Get Key: HBASE-11115 URL: https://issues.apache.org/jira/browse/HBASE-11115 Project: HBase Issue Type: Improvement Reporter: Liu Shaohui Priority: Minor The Get operation only supports setting the max versions for all column families. But different column families may have different versions of data, and users may want to get data with different versions from different column families in a single Get operation. Though we can translate this kind of Get into multiple single-column-family Gets, those Gets are sequential in the regionserver and have different mvcc. Comments and suggestions are welcomed. Thx -- This message was sent by Atlassian JIRA (v6.2#6252)
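The requested semantics can be modeled with plain collections: given all cell timestamps per column family, keep only the newest maxVersions per family. This is an illustrative sketch of the behavior only, not the server-side implementation or the real HBase API; the class, method, and parameter names are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerFamilyVersions {
  /**
   * Keep only the newest timestamps for each column family, honoring a
   * per-family max-versions override and a default for unlisted families.
   */
  public static Map<String, List<Long>> trim(Map<String, List<Long>> timestampsByFamily,
                                             Map<String, Integer> maxVersionsByFamily,
                                             int defaultMaxVersions) {
    Map<String, List<Long>> out = new HashMap<>();
    for (Map.Entry<String, List<Long>> e : timestampsByFamily.entrySet()) {
      List<Long> ts = new ArrayList<>(e.getValue());
      Collections.sort(ts, Collections.reverseOrder());   // newest first
      Integer max = maxVersionsByFamily.get(e.getKey());
      int keep = (max != null) ? max : defaultMaxVersions;
      out.put(e.getKey(), new ArrayList<>(ts.subList(0, Math.min(keep, ts.size()))));
    }
    return out;
  }
}
```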
[jira] [Created] (HBASE-11218) Data loss in HBase standalone mode
Liu Shaohui created HBASE-11218: --- Summary: Data loss in HBase standalone mode Key: HBASE-11218 URL: https://issues.apache.org/jira/browse/HBASE-11218 Project: HBase Issue Type: Bug Reporter: Liu Shaohui Fix For: 0.99.0 Data loss in HBase standalone mode. *How to reproduce it* # Start HBase in standalone mode. # Create a table using the hbase shell. # Scan '.META.' and you will find data in the meta table # Kill the HBase process with the -9 option # Start HBase again # Scan '.META.' and you will find nothing in the meta table. *There are three main reasons.* # FSDataOutputStream.sync should call flush() if the underlying wrapped stream is not Syncable. See HADOOP-8861 # writeChecksum is true in the default LocalFileSystem and the ChecksumFSOutputSummer buffers the data, which means the WAL edits are not written to the OS's filesystem immediately by the sync method, and those edits will be lost in the regionserver's failover. # The MiniZooKeeperCluster deletes the old zk data at startup, which may cause data loss in the meta table. The failover procedure is: split the previous root regionserver's hlog -> assign root -> split the previous meta regionserver's hlog -> assign meta -> split all other regionservers' hlogs -> assign other regions. If there is no data in zookeeper, we will get null for the root regionserver and then assign the root table. Some data in the root table may be lost because some of root's WAL edits have not been split and replayed. The same applies to the meta table. I finished the patch for 0.94 and am working on the patch for trunk. Suggestions are welcomed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-11223) Limit the actions number of a call in the batch
Liu Shaohui created HBASE-11223: --- Summary: Limit the actions number of a call in the batch Key: HBASE-11223 URL: https://issues.apache.org/jira/browse/HBASE-11223 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Liu Shaohui Assignee: Liu Shaohui A huge batch operation can make the regionserver crash due to GC. Extreme code like this: {code} final List<Delete> deletes = new ArrayList<Delete>(); final long rows = 400; for (long i = 0; i < rows; ++i) { deletes.add(new Delete(Bytes.toBytes(i))); } table.delete(deletes); {code} We should limit the number of actions in a single batch call. -- This message was sent by Atlassian JIRA (v6.2#6252)
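A minimal client-side sketch of the proposed limit: split an oversized action list into bounded chunks before submitting them. The helper name and limit value are illustrative assumptions, not the actual patch.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchLimiter {
  /** Split a list of actions into chunks of at most maxPerCall actions each. */
  public static <T> List<List<T>> partition(List<T> actions, int maxPerCall) {
    List<List<T>> chunks = new ArrayList<>();
    for (int i = 0; i < actions.size(); i += maxPerCall) {
      int end = Math.min(i + maxPerCall, actions.size());
      // copy the view so each chunk is independent of the source list
      chunks.add(new ArrayList<>(actions.subList(i, end)));
    }
    return chunks;
  }
}
```

The client would then issue one RPC per chunk instead of one call carrying the whole batch.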
[jira] [Created] (HBASE-11232) Region fails to release the update lock for an illegal CF in multi-row mutations
Liu Shaohui created HBASE-11232: --- Summary: Region fails to release the update lock for an illegal CF in multi-row mutations Key: HBASE-11232 URL: https://issues.apache.org/jira/browse/HBASE-11232 Project: HBase Issue Type: Bug Components: regionserver Reporter: Liu Shaohui Assignee: Liu Shaohui Fix For: 0.99.0 The rollback code in processRowsWithLocks does not check the column family. If there is an illegal CF in the mutation, it will throw a NullPointerException and the update lock will not be released, so the region cannot be flushed or compacted. HRegion #4946 {code} if (!mutations.isEmpty() && !walSyncSuccessful) { LOG.warn("Wal sync failed. Roll back " + mutations.size() + " memstore keyvalues for row(s):" + processor.getRowsToLock().iterator().next() + "..."); for (KeyValue kv : mutations) { stores.get(kv.getFamily()).rollback(kv); } } // 11. Roll mvcc forward if (writeEntry != null) { mvcc.completeMemstoreInsert(writeEntry); writeEntry = null; } if (locked) { this.updatesLock.readLock().unlock(); locked = false; } {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-11240) Print hdfs pipeline when hlog's sync is slow
Liu Shaohui created HBASE-11240: --- Summary: Print hdfs pipeline when hlog's sync is slow Key: HBASE-11240 URL: https://issues.apache.org/jira/browse/HBASE-11240 Project: HBase Issue Type: Improvement Components: wal Reporter: Liu Shaohui Assignee: Liu Shaohui Sometimes a slow sync of the hlog writer is caused by an abnormal datanode in the pipeline. So it will be helpful to print the pipeline on a slow sync to diagnose those problems. The ultimate solution is to join the traces of HBase and HDFS. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-11255) Negative request num in region load
Liu Shaohui created HBASE-11255: --- Summary: Negative request num in region load Key: HBASE-11255 URL: https://issues.apache.org/jira/browse/HBASE-11255 Project: HBase Issue Type: Bug Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor We found that the request number of a region can be negative in a long-running hbase cluster. This is because of an improper cast in HRegionServer#createRegionLoad {code} ... .setReadRequestsCount((int)r.readRequestsCount.get()) .setWriteRequestsCount((int) r.writeRequestsCount.get()) {code} The patch is simple and just removes the cast. -- This message was sent by Atlassian JIRA (v6.2#6252)
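A small demonstration of why the narrowing cast is the bug: once the long counter passes Integer.MAX_VALUE, the narrowed int wraps negative. The class below just isolates the cast for illustration.

```java
public class CastOverflow {
  /** The problematic narrowing cast from the snippet above, isolated. */
  public static int narrowed(long requestCount) {
    // wraps negative once the count exceeds 2^31 - 1
    return (int) requestCount;
  }
}
```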
[jira] [Created] (HBASE-11263) Share the open/close store file thread pool for all stores in a region
Liu Shaohui created HBASE-11263: --- Summary: Share the open/close store file thread pool for all stores in a region Key: HBASE-11263 URL: https://issues.apache.org/jira/browse/HBASE-11263 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Liu Shaohui Priority: Minor Currently, the open/close store file thread pool is divided equally among all stores of a region. {code} protected ThreadPoolExecutor getStoreFileOpenAndCloseThreadPool( final String threadNamePrefix) { int numStores = Math.max(1, this.htableDescriptor.getFamilies().size()); int maxThreads = Math.max(1, conf.getInt(HConstants.HSTORE_OPEN_AND_CLOSE_THREADS_MAX, HConstants.DEFAULT_HSTORE_OPEN_AND_CLOSE_THREADS_MAX) / numStores); return getOpenAndCloseThreadPool(maxThreads, threadNamePrefix); } {code} This is not optimal in the following scenarios: # The data of some column families are very large and there are many hfiles in those stores, while others may be very small or in-memory column families. # Usually we reserve some column families for later needs. The threads for these column families are wasted. The simple way is to share one big thread pool across all stores to open/close hfiles. Suggestions are welcomed. Thanks. -- This message was sent by Atlassian JIRA (v6.2#6252)
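The per-store division can be made concrete. Assuming a configured max of 10 threads (the value is an assumption for illustration, not the actual default), a region with 8 column families gives each store only one thread, no matter how unevenly the hfiles are distributed across stores:

```java
public class StorePoolSizing {
  /** Mirrors the per-store division in getStoreFileOpenAndCloseThreadPool above. */
  public static int threadsPerStore(int maxThreads, int numStores) {
    return Math.max(1, maxThreads / Math.max(1, numStores));
  }
}
```

With a single shared pool, all 10 threads would instead be available to whichever store has the most hfiles to open.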
[jira] [Created] (HBASE-11274) More general single-row Condition Mutation
Liu Shaohui created HBASE-11274: --- Summary: More general single-row Condition Mutation Key: HBASE-11274 URL: https://issues.apache.org/jira/browse/HBASE-11274 Project: HBase Issue Type: Improvement Reporter: Liu Shaohui Priority: Minor Currently, the checkAndDelete and checkAndPut interfaces only support an atomic mutation with a single condition. But in actual apps, we need a more general condition-mutation that supports multiple conditions and logical expressions over those conditions. For example, to support the following sql {quote} insert row where (column A == 'X' and column B == 'Y') or (column C == 'z') {quote} Suggestions are welcomed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-11300) Wrong permission check for checkAndPut in AccessController
Liu Shaohui created HBASE-11300: --- Summary: Wrong permission check for checkAndPut in AccessController Key: HBASE-11300 URL: https://issues.apache.org/jira/browse/HBASE-11300 Project: HBase Issue Type: Bug Components: security Affects Versions: 0.99.0 Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor For the checkAndPut operation, the AccessController only checks the read and write permissions for the family and qualifier to check, but ignores the write permission for the family map of the "put". What's more, we don't need the write permission for the family and qualifier to check. See the code AccessController.java #1538 {code} Map<byte[], ? extends Collection<byte[]>> families = makeFamilyMap(family, qualifier); User user = getActiveUser(); AuthResult authResult = permissionGranted(OpType.CHECK_AND_PUT, user, env, families, Action.READ, Action.WRITE); {code} The same problem exists for the checkAndDelete operation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-11369) The procedure of interrupting the current split task should be updated after HBASE-9736.
Liu Shaohui created HBASE-11369: --- Summary: The procedure of interrupting the current split task should be updated after HBASE-9736. Key: HBASE-11369 URL: https://issues.apache.org/jira/browse/HBASE-11369 Project: HBase Issue Type: Bug Components: wal Reporter: Liu Shaohui Priority: Minor Before HBASE-9736, SplitLogWorker only split one hlog at a time. When the data of the znode for this task is changed (the task has timed out and been resigned by the SplitLogManager), zookeeper will notify the SplitLogWorker. If this log task is owned by another regionserver, the SplitLogWorker will interrupt the current task and try to get another task. HBASE-9736 allows multiple log splitters per RS, so there will be multiple current tasks running in the thread pool in SplitLogWorker. So the procedure of interrupting the current split task needs to be updated. [~jeffreyz] [~stack] -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-11373) hbase-protocol compile failed for name conflict of RegionTransition
Liu Shaohui created HBASE-11373: --- Summary: hbase-protocol compile failed for name conflict of RegionTransition Key: HBASE-11373 URL: https://issues.apache.org/jira/browse/HBASE-11373 Project: HBase Issue Type: Bug Components: Protobufs Reporter: Liu Shaohui Priority: Minor The compile of hbase-protocol fails because there are two messages named RegionTransition, in ZooKeeper.proto and RegionServerStatus.proto {quote} $mvn clean package -Pcompile-protobuf -X \[DEBUG\] RegionServerStatus.proto:81:9: "RegionTransition" is already defined in file "ZooKeeper.proto". \[DEBUG\] RegionServerStatus.proto:114:12: "RegionTransition" seems to be defined in "ZooKeeper.proto", which is not imported by "RegionServerStatus.proto". To use it here, please add the necessary import. \[ERROR\] protoc compiler error {quote} Though it would be OK if we compiled ZooKeeper.proto and RegionServerStatus.proto separately, that is not very convenient. The new RegionTransition is in RegionServerStatus.proto and was introduced in HBASE-11059. [~jxiang] What's your suggestion about this issue? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-11410) A tool for deleting data of a column using BulkDeleteEndpoint
Liu Shaohui created HBASE-11410: --- Summary: A tool for deleting data of a column using BulkDeleteEndpoint Key: HBASE-11410 URL: https://issues.apache.org/jira/browse/HBASE-11410 Project: HBase Issue Type: Improvement Components: Coprocessors Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Sometimes we need a tool to delete unused or wrongly formatted data in some columns, so we added a tool using BulkDeleteEndpoint. Usage: delete column f1:c1 in table t1 {quote} ./hbase org.apache.hadoop.hbase.coprocessor.example.BulkDeleteTool t1 f1:c1 {quote} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-11536) Puts of region location to Meta may be out of order, which causes inconsistent region locations
Liu Shaohui created HBASE-11536: --- Summary: Puts of region location to Meta may be out of order, which causes inconsistent region locations Key: HBASE-11536 URL: https://issues.apache.org/jira/browse/HBASE-11536 Project: HBase Issue Type: Bug Components: Region Assignment Reporter: Liu Shaohui Priority: Critical In a production hbase cluster, we found an inconsistency of region location in the meta table. Region cdfa2ed711bbdf054d9733a92fd43eb5 is online on regionserver 10.237.12.13:11600 but the region location in the Meta table is 10.237.12.15:11600. This is because of out-of-order puts to the meta table. # HMaster tries to assign the region to 10.237.12.15:11600. # RegionServer 10.237.12.15:11600: While opening the region, the put of the region location (10.237.12.15:11600) to the meta table times out (60s) and the htable retries a second time. (The regionserver serving meta has received the request of the put. The timeout is because there is a bad disk in this regionserver and the sync of the hlog is very slow.) During the retry in the htable, the OpenRegionHandler times out (100s) and the PostOpenDeployTasksThread is interrupted. Though the htable is finally closed in MetaEditor, the shared connection the htable used is not closed and the call of the put for the meta table is in flight on the connection. Assume this in-flight put to meta is named call A. # RegionServer 10.237.12.15:11600: Because of the timeout of the OpenRegionHandler, the OpenRegionHandler marks the assign state of this region as FAILED_OPEN. # HMaster watches this FAILED_OPEN event and assigns the region to another regionserver: 10.237.12.13:11600 # RegionServer 10.237.12.13:11600: This regionserver opens the region successfully. Assume the put of the region location (10.237.12.13:11600) to the meta table from this regionserver is named call B. There is no order guarantee for calls A and B. If call A is processed after call B in the regionserver serving the meta region, the region location in the meta table will be wrong. 
From the raw scan of the meta table we found: {code} scan '.META.', {RAW => true, LIMIT => 1, VERSIONS => 10, STARTROW => 'xxx.adfa2ed711bbdf054d9733a92fd43eb5.'} {code} {quote} xxx.adfa2ed711bbdf054d9733a92fd43eb5. column=info:server, timestamp=1404885460553(=> Wed Jul 09 13:57:40 +0800 2014), value=10.237.12.15:11600 --> Retry put from 10.237.12.15 xxx.adfa2ed711bbdf054d9733a92fd43eb5. column=info:server, timestamp=1404885456731(=> Wed Jul 09 13:57:36 +0800 2014), value=10.237.12.13:11600 --> put from 10.237.12.13 xxx.adfa2ed711bbdf054d9733a92fd43eb5. column=info:server, timestamp=1404885353122( Wed Jul 09 13:55:53 +0800 2014), value=10.237.12.15:11600 --> First put from 10.237.12.15 {quote} The related hbase log is attached to this issue and discussions are welcomed. Since there is no order guarantee for puts from different htables, one solution for this issue is to give an increasing id to each assignment of a region and use this id as the timestamp of the put of the region location to the meta table. The region location with the largest assign id will then be returned to hbase clients. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-11541) Wrong result when scanning meta with startRow
Liu Shaohui created HBASE-11541: --- Summary: Wrong result when scanning meta with startRow Key: HBASE-11541 URL: https://issues.apache.org/jira/browse/HBASE-11541 Project: HBase Issue Type: Bug Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor When we scan the meta with the STARTROW option, a wrong result may be returned. For example: if there are two tables named "a" and "b" in hbase, when we scan the meta with startrow = 'b', the region location of table "a" is returned but we expect the region location of table "b". {code} > create 'a', {NAME => 'f'} > create 'b', {NAME => 'f'} > scan '.META.', {STARTROW => 'b', LIMIT => 1} a,,1405655897758.f8b547476b6dc80545e6413c31396, {code} The reason is a wrong assumption in MetaKeyComparator. See: KeyValue.java#2011 {code} int leftDelimiter = getDelimiter(left, loffset, llength, HRegionInfo.DELIMITER); int rightDelimiter = getDelimiter(right, roffset, rlength, HRegionInfo.DELIMITER); if (leftDelimiter < 0 && rightDelimiter >= 0) { // Nothing between .META. and regionid. Its first key. return -1; } else if (rightDelimiter < 0 && leftDelimiter >= 0) { return 1; } else if (leftDelimiter < 0 && rightDelimiter < 0) { return 0; } {code} It's a little troublesome to fix this problem: given a start row that contains more than two "," for meta, it's not easy to extract the startKey of the region, eg: STARTROW => 'aaa,bbb,ccc,xxx'. Comments and suggestions are welcomed. Thanks -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-11648) Typo of config: hbase.hstore.compaction.ratio in book.xml
Liu Shaohui created HBASE-11648: --- Summary: Typo of config: hbase.hstore.compaction.ratio in book.xml Key: HBASE-11648 URL: https://issues.apache.org/jira/browse/HBASE-11648 Project: HBase Issue Type: Bug Components: Compaction Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor When looking at the parameters used by the compaction algorithm in http://hbase.apache.org/book/regions.arch.html, we found there is a typo. In the hbase code, the config key for the compaction ratio is hbase.hstore.compaction.ratio, but in the hbase book it's hbase.store.compaction.ratio. CompactSelection.java#66 {code} this.conf = conf; this.compactRatio = conf.getFloat("hbase.hstore.compaction.ratio", 1.2F); this.compactRatioOffPeak = conf.getFloat("hbase.hstore.compaction.ratio.offpeak", 5.0F); {code} Just fix it to avoid misleading readers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-11685) Incr/decr on the reference count of HConnectionImplementation need to be atomic
Liu Shaohui created HBASE-11685: --- Summary: Incr/decr on the reference count of HConnectionImplementation need to be atomic Key: HBASE-11685 URL: https://issues.apache.org/jira/browse/HBASE-11685 Project: HBase Issue Type: Bug Components: Client Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Currently, the incr/decr operations on the ref count of HConnectionImplementation are not atomic. This may cause the ref count to stay larger than 0 forever, so the connection is never closed. {code} /** * Increment this client's reference count. */ void incCount() { ++refCount; } /** * Decrement this client's reference count. */ void decCount() { if (refCount > 0) { --refCount; } } {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
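A fix along these lines would use an AtomicInteger so concurrent increments and decrements cannot lose updates. The method names mirror the quoted incCount/decCount, but this is a standalone illustration, not the actual patch:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: lock-free, thread-safe reference counting.
public class RefCounter {
    private final AtomicInteger refCount = new AtomicInteger();

    /** Increment this client's reference count atomically. */
    void incCount() {
        refCount.incrementAndGet();
    }

    /** Decrement this client's reference count, never going below zero. */
    void decCount() {
        // updateAndGet retries the CAS loop internally, so the
        // "don't go below zero" check and the decrement are atomic.
        refCount.updateAndGet(c -> c > 0 ? c - 1 : 0);
    }

    int getCount() {
        return refCount.get();
    }
}
```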
[jira] [Created] (HBASE-11707) Using Map instead of list in FailedServers of RpcClient
Liu Shaohui created HBASE-11707: --- Summary: Using Map instead of list in FailedServers of RpcClient Key: HBASE-11707 URL: https://issues.apache.org/jira/browse/HBASE-11707 Project: HBase Issue Type: Improvement Components: Client Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Currently, FailedServers uses a list to record the black list of servers and iterates the list to check whether a server is in the list. This is not efficient when the list is very large, and the list is not thread safe for concurrent add and iteration operations. RpcClient.java#175 {code} // iterate, looking for the search entry and cleaning expired entries Iterator<Pair<Long, String>> it = failedServers.iterator(); while (it.hasNext()) { Pair<Long, String> cur = it.next(); if (cur.getFirst() < now) { it.remove(); } else { if (lookup.equals(cur.getSecond())) { return true; } } } {code} A simple change is to replace this list with a ConcurrentHashMap. -- This message was sent by Atlassian JIRA (v6.2#6252)
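The proposed map-based FailedServers can be sketched with a ConcurrentHashMap mapping a server address to its expiry time: lookups become O(1) and are safe under concurrent add/check. Names and the lazy-expiry policy here are illustrative, not the actual patch:

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch: blacklist of failed servers keyed by address, with each entry
// carrying the time at which it stops counting as "failed".
public class FailedServersMap {
    private final ConcurrentHashMap<String, Long> failed = new ConcurrentHashMap<>();
    private final long retryPauseMillis;

    public FailedServersMap(long retryPauseMillis) {
        this.retryPauseMillis = retryPauseMillis;
    }

    public void addToFailedServers(String address, long now) {
        failed.put(address, now + retryPauseMillis);
    }

    public boolean isFailedServer(String address, long now) {
        Long expiry = failed.get(address);
        if (expiry == null) {
            return false;
        }
        if (expiry < now) {
            // Entry expired: clean it up lazily; remove(key, value) only
            // removes if no concurrent writer refreshed the entry.
            failed.remove(address, expiry);
            return false;
        }
        return true;
    }
}
```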
[jira] [Created] (HBASE-14237) Meta region may be onlined on multiple regionservers due to bugs in assigning meta
Liu Shaohui created HBASE-14237: --- Summary: Meta region may be onlined on multiple regionservers due to bugs in assigning meta Key: HBASE-14237 URL: https://issues.apache.org/jira/browse/HBASE-14237 Project: HBase Issue Type: Bug Affects Versions: 0.94.11 Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Critical When a regionserver fails to open the meta region and crashes after setting the RS_ZK_REGION_FAILED_OPEN state of the meta region in zookeeper, the master will handle the RS_ZK_REGION_FAILED_OPEN event and try to assign the meta region again in AssignmentManager#handleRegion. But at the same time, the master will handle the regionserver-expired event and start a MetaServerShutdownHandler for the regionserver, because the servername of the regionserver is the same as the servername in the unassigned node of the meta region. In the MetaServerShutdownHandler, the meta region may be assigned a second time. [~heliangliang] We have encountered this problem in our production cluster, which resulted in an inconsistent region location in the meta table. You can see the log in the attachment. The code of AssignmentManager is so complex that I have not yet found a solution to fix this problem. Could someone kindly give some suggestions? Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-14247) Separate the old WALs into different regionserver directories
Liu Shaohui created HBASE-14247: --- Summary: Separate the old WALs into different regionserver directories Key: HBASE-14247 URL: https://issues.apache.org/jira/browse/HBASE-14247 Project: HBase Issue Type: Improvement Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Currently all old WALs of regionservers are archived into the single directory oldWALs. In big clusters, because of a long WAL TTL or disabled replication, the number of files under oldWALs may reach the max-directory-items limit of HDFS, which will crash the hbase cluster. {code} Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException): The directory item limit of /hbase/lgprc-xiaomi/.oldlogs is exceeded: limit=1048576 items=1048576 {code} A simple solution is to separate the old WALs into different directories according to the server name of the WAL. Suggestions are welcome~ Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332)
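The proposed layout amounts to inserting a per-regionserver subdirectory between oldWALs and the archived file, so no single directory accumulates every WAL. A minimal sketch of the path construction, with hypothetical method names and paths:

```java
// Sketch: build the archive path oldWALs/<serverName>/<walFile> instead
// of the flat oldWALs/<walFile>, bounding each directory's item count by
// the number of WALs of one regionserver.
public class OldWalLayout {
    public static String archivedPath(String oldWalsDir, String serverName, String walFileName) {
        return oldWalsDir + "/" + serverName + "/" + walFileName;
    }
}
```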
[jira] [Created] (HBASE-14254) Wrong error message when throwing NamespaceNotFoundException in shell
Liu Shaohui created HBASE-14254: --- Summary: Wrong error message when throwing NamespaceNotFoundException in shell Key: HBASE-14254 URL: https://issues.apache.org/jira/browse/HBASE-14254 Project: HBase Issue Type: Improvement Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Wrong error message when throwing NamespaceNotFoundException in shell {code} hbase(main):004:0> create 'ns:t1', {NAME => 'f1'} ERROR: Unknown namespace ns:t1! {code} The namespace should be {color:red}ns{color}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-14277) TestRegionServerHostname.testRegionServerHostname may fail on a host with a case-sensitive name
Liu Shaohui created HBASE-14277: --- Summary: TestRegionServerHostname.testRegionServerHostname may fail on a host with a case-sensitive name Key: HBASE-14277 URL: https://issues.apache.org/jira/browse/HBASE-14277 Project: HBase Issue Type: Test Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor After HBASE-13995, the hostname is converted to lower case in ServerName. This may cause the test TestRegionServerHostname.testRegionServerHostname to fail on a host with a case-sensitive name. Just fix it in the test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (HBASE-14404) Backport HBASE-14098 (Allow dropping caches behind compactions) to 0.98
[ https://issues.apache.org/jira/browse/HBASE-14404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liu Shaohui reopened HBASE-14404: - [~apurtell] There are typos in patch v2, which made the tests fail. All failing tests pass with patch v3. You can see the diff of v2 and v3 in the file: v3-v2.diff > Backport HBASE-14098 (Allow dropping caches behind compactions) to 0.98 > --- > > Key: HBASE-14404 > URL: https://issues.apache.org/jira/browse/HBASE-14404 > Project: HBase > Issue Type: Task >Reporter: Andrew Purtell > Attachments: HBASE-14404-0.98.patch, HBASE-14404-0.98.patch > > > HBASE-14098 adds a new configuration toggle - > "hbase.hfile.drop.behind.compaction" - which if set to "true" tells > compactions to drop pages from the OS blockcache after write. It's on by > default where committed so far but a backport to 0.98 would default it to > off. (The backport would also retain compat methods to LimitedPrivate > interface StoreFileScanner.) What could make it a controversial change in > 0.98 is it changes the default setting of > 'hbase.regionserver.compaction.private.readers' from "false" to "true". I > think it's fine, we use private readers in production. They're stable and do > not present perf issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-14517) Show regionserver's version in master status page
Liu Shaohui created HBASE-14517: --- Summary: Show regionserver's version in master status page Key: HBASE-14517 URL: https://issues.apache.org/jira/browse/HBASE-14517 Project: HBase Issue Type: Improvement Components: monitoring Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor In production environments, regionservers may be removed from the cluster for hardware problems and rejoin the cluster after repair. There is a potential risk that the version of a rejoined regionserver may differ from the others, because the cluster has been upgraded through many versions. To solve this, we can show every regionserver's version in the server list of the master's status page, and highlight a regionserver when its version differs from the master's version, similar to HDFS-3245. Suggestions are welcome~ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-14591) Region with reference hfile may split after a forced split in IncreasingToUpperBoundRegionSplitPolicy
Liu Shaohui created HBASE-14591: --- Summary: Region with reference hfile may split after a forced split in IncreasingToUpperBoundRegionSplitPolicy Key: HBASE-14591 URL: https://issues.apache.org/jira/browse/HBASE-14591 Project: HBase Issue Type: Bug Affects Versions: 0.98.15 Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Fix For: 2.0.0 In IncreasingToUpperBoundRegionSplitPolicy, a region with a store containing an hfile reference may split after a forced split. This breaks many design assumptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-15312) Update the dependencies of pom for mini cluster in HBase Book
Liu Shaohui created HBASE-15312: --- Summary: Update the dependencies of pom for mini cluster in HBase Book Key: HBASE-15312 URL: https://issues.apache.org/jira/browse/HBASE-15312 Project: HBase Issue Type: Improvement Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor In the HBase book, the pom dependencies for the mini cluster are outdated after version 0.96. See: http://hbase.apache.org/book.html#_integration_testing_with_an_hbase_mini_cluster -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-15338) Add an option to disable the data block cache for testing the performance of the underlying file system
Liu Shaohui created HBASE-15338: --- Summary: Add an option to disable the data block cache for testing the performance of the underlying file system Key: HBASE-15338 URL: https://issues.apache.org/jira/browse/HBASE-15338 Project: HBase Issue Type: Improvement Components: integration tests Reporter: Liu Shaohui Assignee: Liu Shaohui When testing and comparing the performance of different file systems (HDFS, Azure Blob storage, AWS S3 and so on) for HBase, it's better to avoid the effect of the HBase BlockCache and get the actual random read latency when a data block is read from the underlying file system. (Usually, the index blocks and meta blocks should still be cached in memory during the testing.) So we add an option in CacheConfig to disable the data block cache. Suggestions are welcome~ Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-15385) A failed atomic folder rename operation can never recover when the destination file is deleted in the Wasb filesystem
Liu Shaohui created HBASE-15385: --- Summary: A failed atomic folder rename operation can never recover when the destination file is deleted in the Wasb filesystem Key: HBASE-15385 URL: https://issues.apache.org/jira/browse/HBASE-15385 Project: HBase Issue Type: Bug Components: hadoop-azure Reporter: Liu Shaohui Priority: Critical Fix For: 3.0.0 When using the Wasb file system, we found that a failed atomic folder rename operation can never recover when the destination file has been deleted. {quote} ls: Attempting to complete rename of file hbase/azurtst-xiaomi/data/default/YCSBTest/.tabledesc during folder rename redo, and file was not found in source or destination. {quote} The reason is that the file was renamed to the destination file before the crash, and the destination file was deleted by another process after the crash. So the recovery is blocked while finishing the rename operation of this file, because it finds that neither the source nor the destination file exists. See: NativeAzureFileSystem.java #finishSingleFileRename Another serious problem is that the recovery of an atomic rename operation may delete a newly created file with the same name as the source file, because the file system doesn't check whether there is a rename operation that needs to be redone. Suggestions are welcome~ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HBASE-15385) A failed atomic folder rename operation can never recover when the destination file is deleted in the Wasb filesystem
[ https://issues.apache.org/jira/browse/HBASE-15385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liu Shaohui resolved HBASE-15385. - Resolution: Invalid Fix Version/s: (was: 3.0.0) > A failed atomic folder rename operation can never recover when the > destination file is deleted in the Wasb filesystem > - > > Key: HBASE-15385 > URL: https://issues.apache.org/jira/browse/HBASE-15385 > Project: HBase > Issue Type: Bug > Components: hadoop-azure >Reporter: Liu Shaohui >Priority: Critical > > When using the Wasb file system, we found that a failed atomic folder rename > operation can never recover when the destination file has been deleted. > {quote} > ls: Attempting to complete rename of file > hbase/azurtst-xiaomi/data/default/YCSBTest/.tabledesc during folder rename > redo, and file was not found in source or destination. > {quote} > The reason is that the file was renamed to the destination file before the > crash, and the destination file was deleted by another process after the crash. > So the recovery is blocked while finishing the rename operation of this file, > because it finds that neither the source nor the destination file exists. > See: NativeAzureFileSystem.java #finishSingleFileRename > Another serious problem is that the recovery of an atomic rename operation may > delete a newly created file with the same name as the source file, because the > file system doesn't check whether there is a rename operation that needs to be > redone. > Suggestions are welcome~ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-15391) Avoid too large "deleted from META" info log
Liu Shaohui created HBASE-15391: --- Summary: Avoid too large "deleted from META" info log Key: HBASE-15391 URL: https://issues.apache.org/jira/browse/HBASE-15391 Project: HBase Issue Type: Improvement Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Fix For: 2.0.0 When deleting a large table in HBase, there will be a very large info log entry in the HMaster. {code} 2016-02-29,05:58:45,920 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Deleted [{ENCODED => 4b54572150941cd03f5addfdeab0a754, NAME => 'YCSBTest,,1453186492932.4b54572150941cd03f5addfdeab0a754.', STARTKEY => '', ENDKEY => 'user01'}, {ENCODED => 715e142bcd6a31d7842abf286ef8a5fe, NAME => 'YCSBTest,user01,1453186492933.715e142bcd6a31d7842abf286ef8a5fe.', STARTKEY => 'user01', ENDKEY => 'user02'}, {ENCODED => 5f9cef5714973f13baa63fba29a68d70, NAME => 'YCSBTest,user02,1453186492933.5f9cef5714973f13baa63fba29a68d70.', STARTKEY => 'user02', ENDKEY => 'user03'}, {ENCODED => 86cf3fa4c0a6b911275512c1d4b78533, NAME => 'YCSBTest,user0... {code} The reason is that MetaTableAccessor logs all regions when deleting them from meta. See MetaTableAccessor.java#deleteRegions {code} public static void deleteRegions(Connection connection, List<HRegionInfo> regionsInfo, long ts) throws IOException { List<Delete> deletes = new ArrayList<Delete>(regionsInfo.size()); for (HRegionInfo hri: regionsInfo) { Delete e = new Delete(hri.getRegionName()); e.addFamily(getCatalogFamily(), ts); deletes.add(e); } deleteFromMetaTable(connection, deletes); LOG.info("Deleted " + regionsInfo); } {code} Just change the info log to debug and add an info log with the number of deleted regions. Other suggestions are welcome~ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-15409) TestHFileBackedByBucketCache failed randomly on jdk8
Liu Shaohui created HBASE-15409: --- Summary: TestHFileBackedByBucketCache failed randomly on jdk8 Key: HBASE-15409 URL: https://issues.apache.org/jira/browse/HBASE-15409 Project: HBase Issue Type: Test Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor When running the small tests, we found that TestHFileBackedByBucketCache fails randomly {code} mvn clean package install -DrunSmallTests -Dtest=TestHFileBackedByBucketCache Running org.apache.hadoop.hbase.io.hfile.TestHFileBackedByBucketCache Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.262 sec <<< FAILURE! - in org.apache.hadoop.hbase.io.hfile.TestHFileBackedByBucketCache testBucketCacheCachesAndPersists(org.apache.hadoop.hbase.io.hfile.TestHFileBackedByBucketCache) Time elapsed: 0.69 sec <<< FAILURE! java.lang.AssertionError: expected:<5> but was:<4> at org.apache.hadoop.hbase.io.hfile.TestHFileBackedByBucketCache.testBucketCacheCachesAndPersists(TestHFileBackedByBucketCache.java:161) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-15420) TestCacheConfig failed after HBASE-15338
Liu Shaohui created HBASE-15420: --- Summary: TestCacheConfig failed after HBASE-15338 Key: HBASE-15420 URL: https://issues.apache.org/jira/browse/HBASE-15420 Project: HBase Issue Type: Test Components: test Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Fix For: 2.0.0 TestCacheConfig failed after HBASE-15338. Fix it in this issue~ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-8675) Two active HMasters for AUTH_FAILED in secure hbase cluster
Liu Shaohui created HBASE-8675: -- Summary: Two active HMasters for AUTH_FAILED in secure hbase cluster Key: HBASE-8675 URL: https://issues.apache.org/jira/browse/HBASE-8675 Project: HBase Issue Type: Bug Components: master Reporter: Liu Shaohui Priority: Critical In our production cluster, because of a network problem reaching the Kerberos server, the ZooKeeperWatcher in the active hmaster fails to authenticate, gets a connection event of AUTH_FAILED and loses the master lock. But the zookeeper watcher ignores the event, so the old active hmaster remains active. After the network problem is fixed, the backup hmaster gets the master lock and becomes active. There are then two active hmasters in the cluster. 2013-05-30 09:44:21,004 ERROR org.apache.zookeeper.client.ZooKeeperSaslClient: An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: krb1.xiaomi.net)]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state. 
2013-05-30 09:54:07,755 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: hconnection-0x3e10d98be405bc Unable to set watcher on znode /hbase/master org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /hbase/master at org.apache.zookeeper.KeeperException.create(KeeperException.java:123) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1036) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:166) at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:231) at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:76) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.ensureZookeeperTrackers(HConnectionManager.java:595) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:850) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:825) at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:286) at org.apache.hadoop.hbase.client.HTable.(HTable.java:201) at org.apache.hadoop.hbase.catalog.MetaReader.getHTable(MetaReader.java:200) at org.apache.hadoop.hbase.catalog.MetaReader.getMetaHTable(MetaReader.java:226) at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:705) at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:183) at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:168) at org.apache.hadoop.hbase.master.CatalogJanitor.getSplitParents(CatalogJanitor.java:123) at org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:134) at org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:92) at org.apache.hadoop.hbase.Chore.run(Chore.java:67) at java.lang.Thread.run(Thread.java:662) I want to just abort the hmaster server 
if AuthFailed or SaslAuthenticated occurs. Any better ideas about this issue? Since ZooKeeperWatcher is used in many classes, will the aborting bring more problems? Any other problems we need to consider? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-8690) Reduce unnecessary getFileStatus hdfs calls in TTL hfile and hlog cleaners
Liu Shaohui created HBASE-8690: -- Summary: Reduce unnecessary getFileStatus hdfs calls in TTL hfile and hlog cleaners Key: HBASE-8690 URL: https://issues.apache.org/jira/browse/HBASE-8690 Project: HBase Issue Type: Improvement Components: master Reporter: Liu Shaohui Priority: Minor For each file in the archive dir, TimeToLiveHFileCleaner needs to call getFileStatus to get the modification time of the file. Actually, the CleanerChore already has the file status from listing the parent dir. When we set the TTL to 7 days in our cluster for data security, the number of files left in the archive dir is up to 65 thousand. In each clean period, TimeToLiveHFileCleaner will generate tens of thousands of getFileStatus calls in a short time, which is very heavy for the hdfs namenode. Fix: change the path param to FileStatus in the isFileDeletable method and reduce unnecessary getFileStatus hdfs calls in the TTL cleaners. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HBASE-8707) Add LongComparator for filter
Liu Shaohui created HBASE-8707: -- Summary: Add LongComparator for filter Key: HBASE-8707 URL: https://issues.apache.org/jira/browse/HBASE-8707 Project: HBase Issue Type: New Feature Reporter: Liu Shaohui Priority: Minor Add LongComparator for filter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
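The requested comparator can be sketched in a few lines: decode the cell value as a big-endian 8-byte long (the encoding produced by Bytes.toBytes(long)) and compare it with a fixed operand. This is a standalone illustration with a hypothetical class name, not the comparator committed to HBase:

```java
import java.nio.ByteBuffer;

// Minimal sketch of a long comparator for filters.
public class LongComparatorSketch {
    private final long value;

    public LongComparatorSketch(long value) {
        this.value = value;
    }

    // Negative if the stored operand is less than the decoded cell value,
    // zero if equal, positive otherwise.
    public int compareTo(byte[] bytes, int offset, int length) {
        long other = ByteBuffer.wrap(bytes, offset, length).getLong();
        return Long.compare(this.value, other);
    }

    // Helper mirroring Bytes.toBytes(long) for the demo: big-endian 8 bytes.
    public static byte[] toBytes(long v) {
        return ByteBuffer.allocate(8).putLong(v).array();
    }
}
```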