So I have another symptom which is quite odd. When trying to take a snapshot of 
the table with no writes coming in (I stopped Thrift), it continually times 
out while trying to flush (I don't believe I have the option of a non-flush 
snapshot in 0.94). Every single time I get:

ERROR: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: 
org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot { 
ss=backup_weaver table=weaver_events type=FLUSH } had an error.  Procedure 
backup_weaver { 
waiting=[ip-10-227-42-142.us-west-2.compute.internal,60020,1415302661297, 
ip-10-227-42-252.us-west-2.compute.internal,60020,1415304752318, 
ip-10-231-21-106.us-west-2.compute.internal,60020,1415306503049, 
ip-10-230-130-102.us-west-2.compute.internal,60020,1415296951057, 
ip-10-231-138-119.us-west-2.compute.internal,60020,1415303920176, 
ip-10-224-53-183.us-west-2.compute.internal,60020,1415311138483, 
ip-10-250-1-140.us-west-2.compute.internal,60020,1415311984665, 
ip-10-227-40-150.us-west-2.compute.internal,60020,1415313275623, 
ip-10-231-139-198.us-west-2.compute.internal,60020,1415295324957, 
ip-10-250-77-76.us-west-2.compute.internal,60020,1415297345932, 
ip-10-248-42-35.us-west-2.compute.internal,60020,1415312717768, 
ip-10-227-45-74.us-west-2.compute.internal,60020,1415296135484, 
ip-10-227-43-49.us-west-2.compute.internal,60020,1415303176867, 
ip-10-230-130-121.us-west-2.compute.internal,60020,1415294726028, 
ip-10-224-49-168.us-west-2.compute.internal,60020,1415312488614, 
ip-10-227-0-82.us-west-2.compute.internal,60020,1415301974178, 
ip-10-224-0-167.us-west-2.compute.internal,60020,1415309549108] done=[] }
        at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:362)
        at org.apache.hadoop.hbase.master.HMaster.isSnapshotDone(HMaster.java:2313)
        at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:354)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1434)
Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@239e8159:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1415319201016, End:1415319261016, diff:60000, max:60000 ms
        at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
        at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:285)
        at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:352)
        ... 6 more
Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1415319201016, End:1415319261016, diff:60000, max:60000 ms
        at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
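
For context, the snapshot attempt is just the standard hbase shell command below (a sketch; the table and snapshot names match the error above, and the manual flush beforehand is only an idea I plan to try, not something I know will help):

    # from the hbase shell
    flush 'weaver_events'                       # force a manual flush of the table first (idea, unverified)
    snapshot 'weaver_events', 'backup_weaver'   # then retry the snapshot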


I do not have a single write coming in, so how in the world could this table 
not be flushed? I could understand an error the first time or two, but how 
could it still not be flushed after several attempts? Now I can't even get the 
data off the node to a new cluster. Any help would be greatly appreciated.
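
The 60000 ms in the trace looks like the default snapshot timeout, so one thing I'm considering is raising it, roughly as sketched below. The property names here are my assumption (I still need to confirm them against hbase-default.xml for 0.94), so please correct me if they're wrong:

    <!-- hbase-site.xml (sketch, unverified): raise the master- and region-side
         snapshot timeouts from the 60s default. Property names are an assumption
         and need to be checked against hbase-default.xml for this version. -->
    <property>
      <name>hbase.snapshot.master.timeout.millis</name>
      <value>300000</value>
    </property>
    <property>
      <name>hbase.snapshot.region.timeout</name>
      <value>300000</value>
    </property>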

Thanks,
-Pere



On Nov 6, 2014, at 2:09 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:

> One other thought: you might try tracing your requests to see where the
> slowness happens. Recent versions of PerformanceEvaluation support this
> feature and can be used directly or as an example for adding tracing to
> your application.
> 
> On Thursday, November 6, 2014, Pere Kyle <p...@whisper.sh> wrote:
> 
>> Bryan,
>> 
>> Thanks again for the incredibly useful reply.
>> 
>> I have confirmed that the callQueueLen is in fact 0, with a max value of 2 in the last week (in Ganglia).
>> 
>> hbase.hstore.compaction.max was set to 15 on the nodes, up from its previous value of 7.
>> 
>> Freezes (laggy responses) on the cluster are frequent and affect both reads and writes. I have also noticed iowait spikes on the nodes.
>> 
>> The cluster swings between working at 100% and serving nothing but timeouts, for no discernible reason.
>> 
>> Looking through the logs, I see tons of responseTooSlow warnings; they are the only regular occurrence:
>> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06 03:54:31,640 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler 39 on 60020): (responseTooSlow): {"processingtimems":14573,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@c67b2ac), rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"10.231.139.198:57223","starttimems":1415246057066,"queuetimems":20640,"class":"HRegionServer","responsesize":0,"method":"multi"}
>> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06 03:54:31,640 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler 42 on 60020): (responseTooSlow): {"processingtimems":45660,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@6c034090), rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"10.231.21.106:41126","starttimems":1415246025979,"queuetimems":202,"class":"HRegionServer","responsesize":0,"method":"multi"}
>> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06 03:54:31,642 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler 46 on 60020): (responseTooSlow): {"processingtimems":14620,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@4fc3bb1f), rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"10.230.130.102:54068","starttimems":1415246057021,"queuetimems":27565,"class":"HRegionServer","responsesize":0,"method":"multi"}
>> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06 03:54:31,642 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler 35 on 60020): (responseTooSlow): {"processingtimems":13431,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@3b321922), rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"10.227.42.252:60493","starttimems":1415246058210,"queuetimems":1134,"class":"HRegionServer","responsesize":0,"method":"multi"}
>> On Nov 6, 2014, at 12:38 PM, Bryan Beaudreault <bbeaudrea...@hubspot.com> wrote:
>> 
>>> blockingStoreFiles
>> 
>> 
