HBase client hangs after upgrade to 0.20.4 when used from reducer

2010-05-12 Thread Friso van Vollenhoven
Hi all,

I am using Hadoop (0.20.2) and HBase to periodically import data (every 15 
minutes). There are a number of import processes, but generally they all create 
a sequence file on HDFS, which is then run through a MapReduce job. The 
MapReduce uses the identity mapper (the input file is a Hadoop sequence file) 
and a specialized reducer that does the following:
- Combine the values for a key into one value
- Do a Get from HBase to retrieve existing values for the same key
- Combine the existing value from HBase and the new one into one value again
- Put the final value into HBase under the same key (thus 'overwriting' the existing row; I keep only one version)
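
To illustrate, here is a minimal sketch of that reducer pattern against the 0.20-era HBase client API. The table name ('records'), the column coordinates ('data:v'), the key/value types, and the merge() helper are hypothetical placeholders, not our actual code:

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer showing the read-merge-write pattern described above.
// Table name, column coordinates, and merge() are made up for the sketch.
public class MergeIntoHBaseReducer
        extends Reducer<Text, BytesWritable, Text, BytesWritable> {

    private HTable table;

    @Override
    protected void setup(Context context) throws IOException {
        table = new HTable(new HBaseConfiguration(), "records");
    }

    @Override
    protected void reduce(Text key, Iterable<BytesWritable> values,
                          Context context) throws IOException {
        // 1. Combine the new values for this key into one value.
        byte[] combined = null;
        for (BytesWritable v : values) {
            combined = merge(combined, Arrays.copyOf(v.getBytes(), v.getLength()));
        }

        // 2. Get the existing value for the same key from HBase.
        //    (This Get is the call that hangs in the thread dump below.)
        byte[] row = Bytes.toBytes(key.toString());
        Result existing = table.get(new Get(row));
        byte[] old = existing.getValue(Bytes.toBytes("data"), Bytes.toBytes("v"));

        // 3. Combine the existing value and the new one into one value again.
        // 4. Put the result back under the same key; the row is effectively
        //    overwritten because the table keeps only one version.
        Put put = new Put(row);
        put.add(Bytes.toBytes("data"), Bytes.toBytes("v"), merge(old, combined));
        table.put(put);
    }

    // Placeholder merge; the real combination logic is application-specific.
    private byte[] merge(byte[] a, byte[] b) {
        if (a == null) return b;
        if (b == null) return a;
        byte[] out = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }
}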

After I upgraded HBase to the 0.20.4 release, the reducers sometimes start hanging on a Get. When the jobs start, some reducers run to completion fine, but after a while the last reducers start to hang. Eventually the reducers are killed off by Hadoop (after 600 seconds).

I did a thread dump for one of the hanging reducers. It looks like this:
"main" prio=10 tid=0x48083800 nid=0x4c93 in Object.wait() [0x420ca000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x2eb50d70> (a org.apache.hadoop.hbase.ipc.HBaseClient$Call)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:721)
        - locked <0x2eb50d70> (a org.apache.hadoop.hbase.ipc.HBaseClient$Call)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
        at $Proxy2.get(Unknown Source)
        at org.apache.hadoop.hbase.client.HTable$4.call(HTable.java:450)
        at org.apache.hadoop.hbase.client.HTable$4.call(HTable.java:448)
        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1050)
        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:447)
        at net.ripe.inrdb.hbase.accessor.real.HBaseTableAccessor.get(HBaseTableAccessor.java:36)
        at net.ripe.inrdb.hbase.store.HBaseStoreUpdater.getExistingRecords(HBaseStoreUpdater.java:101)
        at net.ripe.inrdb.hbase.store.HBaseStoreUpdater.mergeTimelinesWithExistingRecords(HBaseStoreUpdater.java:60)
        at net.ripe.inrdb.hbase.store.HBaseStoreUpdater.doInsert(HBaseStoreUpdater.java:40)
        at net.ripe.inrdb.core.store.SinglePartitionStore$Updater.insert(SinglePartitionStore.java:92)
        at net.ripe.inrdb.core.store.CompositeStore$CompositeStoreUpdater.insert(CompositeStore.java:142)
        at net.ripe.inrdb.importer.StoreInsertReducer.reduce(StoreInsertReducer.java:70)
        at net.ripe.inrdb.importer.StoreInsertReducer.reduce(StoreInsertReducer.java:17)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

So the client hangs in a wait() call, waiting on an HBaseClient$Call object. I looked at the code: the wait() is inside a while loop and has no timeout, so it follows that it never gets out of there if no notify() is ever called on the object. I am not sure exactly what condition it is waiting for, however.
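
For reference, the shape of the blocking code is the standard monitor-wait idiom; below is a simplified sketch of it (my reconstruction for illustration, not the actual HBase source):

// Minimal sketch of the monitor-wait pattern behind HBaseClient.call()
// in 0.20.x (reconstructed shape, not the actual HBase source).
public class CallWaitSketch {
    static class Call {
        boolean done;   // set by the connection thread when a response arrives
        Object value;
    }

    // The client thread blocks here with no timeout: if the region server
    // never answers and no notify() arrives, it waits forever, which is
    // exactly the WAITING state in the reducer thread dump above.
    static Object awaitResult(Call call) throws InterruptedException {
        synchronized (call) {
            while (!call.done) {
                call.wait();    // releases the monitor until notified
            }
            return call.value;
        }
    }

    // The connection thread would complete the call like this:
    static void complete(Call call, Object value) {
        synchronized (call) {
            call.value = value;
            call.done = true;
            call.notify();      // wakes the waiting client thread
        }
    }
}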

Meanwhile, after this has happened, I cannot shut down the master server normally; I have to kill -9 it to make it go away. Before this problem occurs, the master shuts down just fine. (Sorry, I didn't do a thread dump of the master, and I have since downgraded to 0.20.3 again.)

I cannot reproduce this error on my local setup (developer machine). It only occurs on our (currently modest) cluster: one machine running the master, NameNode, and ZooKeeper, plus four datanodes that also act as task trackers and region servers. The inputs to the periodic MapReduce jobs are very small (ranging from a few KB to several MB) and thus do not contain many records. I know that running this through MapReduce is not very efficient and that, because of the job startup overhead, it would be faster to insert in-process from the importer, but we are setting up this architecture of importers and insertion for anticipated larger loads (up to 80 million records per day).

Does anyone have a clue about what is happening, or where to look for further investigation?

Thanks a lot!


Cheers,
Friso



Re: HBase client hangs after upgrade to 0.20.4 when used from reducer

2010-05-12 Thread Jean-Daniel Cryans
Friso,

Unfortunately it's hard to determine the cause from the provided
information; the client call you pasted is pretty much normal, i.e. the
client is waiting to receive a result from a region server.

The fact that you can't shut down the master when this happens is very
concerning. Do you still have those logs around? Same for the region
servers? Can you post this in pastebin or on a web server?

Also, feel free to come chat with us on IRC, it's always easier to
debug when live. #hbase on freenode

J-D

On Wed, May 12, 2010 at 8:31 AM, Friso van Vollenhoven wrote:
> [snip]

Re: HBase client hangs after upgrade to 0.20.4 when used from reducer

2010-05-12 Thread Todd Lipcon
Hi Friso,

Also, if you can capture a jstack of the regionservers at the time,
that would be great.

-Todd

On Wed, May 12, 2010 at 9:26 AM, Jean-Daniel Cryans wrote:
> [snip]

Re: HBase client hangs after upgrade to 0.20.4 when used from reducer

2010-05-13 Thread Friso van Vollenhoven
Hi,

I was kind of hoping that this was a known thing and I was just overlooking 
something. Apparently it requires more investigation.

I reproduced the situation and gathered some more info. I still have to look through everything myself a bit more (today is a national holiday in the Netherlands, so not much work is getting done), but I posted some info to pastebin.

I did the following sequence (with HBase 0.20.4):
- start up HBase (waited for all the regions to come online and let it settle)
- start up our application
- wait for the importer job to hang (it only happened on the second run, which started 15 reducers; the first run was really small and only one key was generated, so just one reducer)
- kill the hanging importer job (hadoop job -kill)
- try to shut down HBase (as I type this, it is still producing dots on my console)

The HBase master logs are here (includes shutdown attempt): 
http://pastebin.com/PYpPVcyK
The jstacks are here:
- HMaster: http://pastebin.com/Da6jCAuA (this includes two thread dumps, one 
during operation with the hanging clients and one during hanging shutdown)
- RegionServer 1: http://pastebin.com/5dQXfxCn
- RegionServer 2: http://pastebin.com/XWwBGXYC
- RegionServer 3: http://pastebin.com/mDgWbYGV
- RegionServer 4: http://pastebin.com/XDR14bth

As you can see in the master logs, the shutdown cannot get a thread called 
Thread-10 to stop running. The trace for that thread looks like this:
"Thread-10" prio=10 tid=0x4d218800 nid=0x1e73 in Object.wait() 
[0x427a7000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x2aaab364c9d0> (a java.lang.Object)
at org.apache.hadoop.hbase.util.Sleeper.sleep(Sleeper.java:89)
- locked <0x2aaab364c9d0> (a java.lang.Object)
at org.apache.hadoop.hbase.Chore.run(Chore.java:76)

I still have no clue what happened, but I will investigate a bit more tomorrow.


Thanks for the responses.


Friso



On May 12, 2010, at 9:02 PM, Todd Lipcon wrote:

> [snip]

Re: HBase client hangs after upgrade to 0.20.4 when used from reducer

2010-05-13 Thread Todd Lipcon
Hi Friso,

When did you take the jstack dumps of the region servers? Was it when the
reduce tasks were still hanging?

Do all of the reduce tasks hang or is it just one that gets stuck?

If, once the reduce tasks are hung, you open the hbase shell and run "count
'mytable', 10", does it successfully count the rows?

(I'm trying to determine if the client is locked up, or one of the RSes is
locked up)

Enjoy your holiday!

-Todd

On Thu, May 13, 2010 at 12:38 AM, Friso van Vollenhoven <fvanvollenho...@xebia.com> wrote:

> [snip]

Re: HBase client hangs after upgrade to 0.20.4 when used from reducer

2010-05-14 Thread Friso van Vollenhoven
Hi Todd,

The row counting works fine. It is quite slow, I have to say, but I have never used row counting from the shell before, so I don't know what performance to expect from it or how it is implemented; it's just that a regular scan from software is way faster.

Also, from our application we do a full table scan to populate an in-memory index of row keys, because we need to be able to quickly determine whether a certain key exists or not (roughly the pattern sketched below). I triggered this scan from our UI while there were hanging reducers, and it also works fine. There are close to 5 million records in the table, and I checked in the web interface that the table is divided across all 4 region servers, so this process should hit them all.
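
A rough sketch of what that index-loading scan amounts to (the table name is a placeholder, not our real schema; 0.20-era client API):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of a full-table scan that collects all row keys into an
// in-memory set for fast existence checks (hypothetical table name).
public class RowKeyIndexLoader {
    public static Set<String> loadKeys() throws IOException {
        HTable table = new HTable(new HBaseConfiguration(), "records");
        Scan scan = new Scan();
        scan.setCaching(1000);                 // fetch rows in batches per RPC
        Set<String> keys = new HashSet<String>();
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result r : scanner) {
                keys.add(Bytes.toString(r.getRow()));
            }
        } finally {
            scanner.close();                   // always release the scanner lease
        }
        return keys;
    }
}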

The earlier jstacks of the region servers were taken when the reducers 
(clients) were hanging, before the shutdown was requested. Some 3 to 5 reduce 
tasks hang, not all of them, but surely more than just one.

Because of your question about what is locked up (client or region server), I SSH'ed into each of the worker machines after giving HBase the shutdown signal (so the shutdown sequence had started) to see whether the region server was still running and, if so, to shut down each individual region server manually (doing 'hbase-daemon.sh stop regionserver' on each; I'm glad there are only 4 nodes). I found that:
- one of the region servers actually shut down normally (worker3)
- two region servers shut down normally after the hbase-daemon.sh command 
(worker4 and worker1)
- one region server does not shut down (worker2)

I put some additional info on pastebin.

Here is the jstack of worker2 (the hanging one): http://pastebin.com/5V0UZi7N
There are two jstack outputs, one from before the shutdown command was given 
and one (starting at line 946) from after the shutdown command was given.

Here are the logs of that region server: http://pastebin.com/qCXSKR2A
I set the log level for org.apache.hadoop.hbase to DEBUG before doing all this, 
so it's more verbose (I don't know if this helps).

So, it appears that one of the region servers is locked up, but only for some connections, while it can still serve other connections normally. From the locked-up region server's logs, it looks like the shutdown sequence runs completely, but the server just won't die afterwards (because of running non-daemon threads; maybe it should just call System.exit() if all cleanup is successful). At least nothing gets corrupted, which is nice. Of course I am still trying to find out why things get locked up in the first place.

I did this test twice today. During the first run it was a different region server that was hanging, so I don't think the problem is related to that specific machine.

My next step is to go through code (including HBase's, so it will take me some time...) and see what exactly happens in our scenario, because from my current knowledge the jstack outputs don't mean enough to me.



Friso




On May 13, 2010, at 7:09 PM, Todd Lipcon wrote:

> [snip]

Re: HBase client hangs after upgrade to 0.20.4 when used from reducer

2010-05-14 Thread Todd Lipcon
It appears like we might be stuck in an infinite loop here:

"IPC Server handler 9 on 60020" daemon prio=10 tid=0x2aaeb42f7800
nid=0x6508 runnable [0x445bb000]
   java.lang.Thread.State: RUNNABLE
at
org.apache.hadoop.hbase.regionserver.ExplicitColumnTracker.checkColumn(ExplicitColumnTracker.java:128)
at
org.apache.hadoop.hbase.regionserver.ScanQueryMatcher.match(ScanQueryMatcher.java:165)
at
org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:176)

It's holding a lock that some other threads are blocked on. In both of your
pastes, there are some threads stuck here.
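
To illustrate the bug class (a schematic only, not the actual ExplicitColumnTracker source): checkColumn() walks an index over the explicitly requested columns, and the enclosing scan loop assumes that every call either consumes the current cell or advances that index. If some path returns without doing either, the handler thread spins forever while holding a lock:

import java.util.List;

// Schematic of the failure mode: a column-matching loop that relies on the
// tracker making progress on every call. If checkColumn() ever returns
// without advancing the index or consuming the input cell, the enclosing
// scan loop calls it again with the same arguments and spins on one IPC
// handler thread. This is an illustration, not the real HBase code.
public class ColumnTrackerSketch {
    private final List<byte[]> columns;  // columns explicitly requested by the Get
    private int index;                   // position of the current tracked column

    public ColumnTrackerSketch(List<byte[]> columns) {
        this.columns = columns;
    }

    enum MatchCode { INCLUDE, SKIP, DONE }

    public MatchCode checkColumn(byte[] qualifier) {
        while (true) {
            if (index >= columns.size()) {
                return MatchCode.DONE;   // no requested columns left to match
            }
            int cmp = compare(columns.get(index), qualifier);
            if (cmp == 0) {
                index++;                 // progress: consume the matched column
                return MatchCode.INCLUDE;
            }
            if (cmp > 0) {
                return MatchCode.SKIP;   // current cell sorts before the target
            }
            index++;  // BUG CLASS: if this advance is skipped on some code
                      // path, the same column is re-tested forever.
        }
    }

    private static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }
}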

JD, any thoughts? Looks like you made some changes to this code for 0.20.4.

-Todd

On Fri, May 14, 2010 at 6:56 AM, Friso van Vollenhoven <fvanvollenho...@xebia.com> wrote:

> [snip]

Re: HBase client hangs after upgrade to 0.20.4 when used from reducer

2010-05-14 Thread Jean-Daniel Cryans
It could be, though that would mean it's not covered by the unit tests,
and that all the testing done on that branch since late February didn't
catch it.

J-D

On Fri, May 14, 2010 at 4:07 PM, Todd Lipcon wrote:
> [snip]

Re: HBase client hangs after upgrade to 0.20.4 when used from reducer

2010-05-14 Thread Stack
The below looks the same as the blockage seen in the logs posted to
hbase-2545 by Kris Jirapinyo.

Todd has made a fix and added a test.

We'll roll a 0.20.5 hbase with this fix and a fix for hbase-2541
(missing licenses from the head of some source files) as soon as we get
confirmation that Todd's fix works for Kris Jirapinyo's seemingly
similar issue.

Thanks,
St.Ack

On Fri, May 14, 2010 at 9:07 AM, Todd Lipcon wrote:
> [snip]

Re: HBase client hangs after upgrade to 0.20.4 when used from reducer

2010-05-14 Thread Andrew Purtell
That has been a problem before. But I see Jon has already filed a jira for this.


> From: Todd Lipcon
> Subject: Re: HBase client hangs after upgrade to 0.20.4 when used from reducer
>
> It appears like we might be stuck in an infinite loop here:
> [snip]

Re: HBase client hangs after upgrade to 0.20.4 when used from reducer

2010-05-17 Thread Friso van Vollenhoven
I dropped the patched jar (from http://people.apache.org/~stack/hbase-0.20.4-2524.jar) into my HBase installation. All reducers run to completion (for multiple consecutive jobs) and everything appears to be fixed now. Also, from the thread dumps, I think it is safe to say that mine is the same problem as hbase-2545.

I am going back to 0.20.3 for now and will wait for 0.20.5 to become available, then upgrade to that version.

Thanks a lot for all the help!


Best regards,
Friso



On May 14, 2010, at 8:33 PM, Stack wrote:

> [snip]