So I have another symptom which is quite odd. When trying to take a snapshot of
the table with no writes coming in (I stopped Thrift), it continually times out
while trying to flush (I don’t believe I have the option of a non-flush snapshot
in 0.94). Every single time I get:
ERROR: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException:
org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot {
ss=backup_weaver table=weaver_events type=FLUSH } had an error. Procedure
backup_weaver {
waiting=[ip-10-227-42-142.us-west-2.compute.internal,60020,1415302661297,
ip-10-227-42-252.us-west-2.compute.internal,60020,1415304752318,
ip-10-231-21-106.us-west-2.compute.internal,60020,1415306503049,
ip-10-230-130-102.us-west-2.compute.internal,60020,1415296951057,
ip-10-231-138-119.us-west-2.compute.internal,60020,1415303920176,
ip-10-224-53-183.us-west-2.compute.internal,60020,1415311138483,
ip-10-250-1-140.us-west-2.compute.internal,60020,1415311984665,
ip-10-227-40-150.us-west-2.compute.internal,60020,1415313275623,
ip-10-231-139-198.us-west-2.compute.internal,60020,1415295324957,
ip-10-250-77-76.us-west-2.compute.internal,60020,1415297345932,
ip-10-248-42-35.us-west-2.compute.internal,60020,1415312717768,
ip-10-227-45-74.us-west-2.compute.internal,60020,1415296135484,
ip-10-227-43-49.us-west-2.compute.internal,60020,1415303176867,
ip-10-230-130-121.us-west-2.compute.internal,60020,1415294726028,
ip-10-224-49-168.us-west-2.compute.internal,60020,1415312488614,
ip-10-227-0-82.us-west-2.compute.internal,60020,1415301974178,
ip-10-224-0-167.us-west-2.compute.internal,60020,1415309549108] done=[] }
        at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:362)
        at org.apache.hadoop.hbase.master.HMaster.isSnapshotDone(HMaster.java:2313)
        at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:354)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1434)
Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@239e8159:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1415319201016, End:1415319261016, diff:60000, max:60000 ms
        at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
        at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:285)
        at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:352)
        ... 6 more
Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1415319201016, End:1415319261016, diff:60000, max:60000 ms
        at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)
I do not have a single write coming in, so how in the world could these tables
not be flushed? I could understand an error the first time or two, but how
could the table still not be flushed after a couple of requests? Now I can’t
even get the data off this node to a new cluster. Any help would be greatly
appreciated.
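
In case it’s useful, the trace looks like the master’s snapshot procedure hitting its 60,000 ms limit (diff:60000, max:60000). The next thing I plan to try is forcing a flush myself and raising the snapshot timeouts before retrying. This is a rough sketch only; the two timeout property names are my assumption from the snapshot documentation and may differ in 0.94.x, and I gather they really belong in hbase-site.xml on the master and region servers rather than in client code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class RetrySnapshot {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Assumed snapshot timeout keys (default 60000 ms); normally these are set in
        // hbase-site.xml on the master and region servers, not only client-side.
        conf.setLong("hbase.snapshot.master.timeout.millis", 300000);
        conf.setLong("hbase.snapshot.region.timeout", 300000);

        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
          // Flush up front so the FLUSH-type snapshot has less work to do per region.
          admin.flush("weaver_events");
          Thread.sleep(60000); // crude wait for the flushes to land
          admin.snapshot("backup_weaver", "weaver_events");
        } finally {
          admin.close();
        }
      }
    }

If that still times out with zero writes coming in, I’ll take it to mean the flush itself is hanging rather than the snapshot machinery, and fall back to CopyTable or Export to move the data to the new cluster, since those just scan the table and skip the snapshot path entirely.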
Thanks,
-Pere
On Nov 6, 2014, at 2:09 PM, Nick Dimiduk <[email protected]> wrote:
> One other thought: you might try tracing your requests to see where the
> slowness happens. Recent versions of PerformanceEvaluation support this
> feature and can be used directly or as an example for adding tracing to
> your application.
>
> On Thursday, November 6, 2014, Pere Kyle <[email protected]> wrote:
>
>> Bryan,
>>
>> Thanks again for the incredibly useful reply.
>>
>> I have confirmed that the callQueueLen is in fact 0, with a max value of 2
>> in the last week (in Ganglia).
>>
>> hbase.hstore.compaction.max was set to 15 on the nodes, up from the previous 7.
>>
>> Freezes (laggy responses) on the cluster are frequent and affect both
>> reads and writes. I noticed iowait spikes on the nodes.
>>
>> The cluster swings between working at 100% and serving nothing but
>> timeouts, for no discernible reason.
>>
>> Looking through the logs I see tons of responseTooSlow warnings; they are
>> the only regular occurrence in the logs:
>> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06 03:54:31,640 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler 39 on 60020): (responseTooSlow): {"processingtimems":14573,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@c67b2ac), rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"10.231.139.198:57223","starttimems":1415246057066,"queuetimems":20640,"class":"HRegionServer","responsesize":0,"method":"multi"}
>> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06 03:54:31,640 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler 42 on 60020): (responseTooSlow): {"processingtimems":45660,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@6c034090), rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"10.231.21.106:41126","starttimems":1415246025979,"queuetimems":202,"class":"HRegionServer","responsesize":0,"method":"multi"}
>> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06 03:54:31,642 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler 46 on 60020): (responseTooSlow): {"processingtimems":14620,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@4fc3bb1f), rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"10.230.130.102:54068","starttimems":1415246057021,"queuetimems":27565,"class":"HRegionServer","responsesize":0,"method":"multi"}
>> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06 03:54:31,642 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler 35 on 60020): (responseTooSlow): {"processingtimems":13431,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@3b321922), rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"10.227.42.252:60493","starttimems":1415246058210,"queuetimems":1134,"class":"HRegionServer","responsesize":0,"method":"multi"}
>> On Nov 6, 2014, at 12:38 PM, Bryan Beaudreault <[email protected]> wrote:
>>
>>> blockingStoreFiles
>>
>>