So I have another symptom which is quite odd. When trying to take a snapshot of the table with no writes coming in (I stopped Thrift), it continually times out while trying to flush (I don’t believe I have the skip-flush snapshot option in 0.94).
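For reference, the snapshot attempt is nothing exotic, roughly the stock shell command below; the table and snapshot names are the ones that appear in the error that follows:

    hbase(main):001:0> snapshot 'weaver_events', 'backup_weaver'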
Every single time I get:

ERROR: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot { ss=backup_weaver table=weaver_events type=FLUSH } had an error. Procedure backup_weaver { waiting=[ip-10-227-42-142.us-west-2.compute.internal,60020,1415302661297, ip-10-227-42-252.us-west-2.compute.internal,60020,1415304752318, ip-10-231-21-106.us-west-2.compute.internal,60020,1415306503049, ip-10-230-130-102.us-west-2.compute.internal,60020,1415296951057, ip-10-231-138-119.us-west-2.compute.internal,60020,1415303920176, ip-10-224-53-183.us-west-2.compute.internal,60020,1415311138483, ip-10-250-1-140.us-west-2.compute.internal,60020,1415311984665, ip-10-227-40-150.us-west-2.compute.internal,60020,1415313275623, ip-10-231-139-198.us-west-2.compute.internal,60020,1415295324957, ip-10-250-77-76.us-west-2.compute.internal,60020,1415297345932, ip-10-248-42-35.us-west-2.compute.internal,60020,1415312717768, ip-10-227-45-74.us-west-2.compute.internal,60020,1415296135484, ip-10-227-43-49.us-west-2.compute.internal,60020,1415303176867, ip-10-230-130-121.us-west-2.compute.internal,60020,1415294726028, ip-10-224-49-168.us-west-2.compute.internal,60020,1415312488614, ip-10-227-0-82.us-west-2.compute.internal,60020,1415301974178, ip-10-224-0-167.us-west-2.compute.internal,60020,1415309549108] done=[] }
    at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:362)
    at org.apache.hadoop.hbase.master.HMaster.isSnapshotDone(HMaster.java:2313)
    at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:354)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1434)
Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@239e8159:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1415319201016, End:1415319261016, diff:60000, max:60000 ms
    at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
    at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:285)
    at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:352)
    ... 6 more
Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1415319201016, End:1415319261016, diff:60000, max:60000 ms
    at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)

I do not have a single write coming in, so how in the world could these tables not be flushed? I could understand an error maybe the first time or two, but how could it still not be flushed after a couple of requests? Now I can’t even get the data off the node to a new cluster.

Any help would be greatly appreciated.

Thanks,
-Pere

On Nov 6, 2014, at 2:09 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:

> One other thought: you might try tracing your requests to see where the
> slowness happens. Recent versions of PerformanceEvaluation support this
> feature and can be used directly or as an example for adding tracing to
> your application.
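As a sketch of what Nick’s suggestion looks like, assuming a 0.98+-style PerformanceEvaluation: the --traceRate flag and the randomRead workload below are illustrative rather than from this thread, and actually collecting the spans additionally requires an HTrace span receiver to be configured.

    # Initiate a trace span roughly every 100 operations, single client, no MapReduce:
    $ hbase org.apache.hadoop.hbase.PerformanceEvaluation \
        --nomapred --traceRate=100 randomRead 1

Two hedged workarounds for the snapshot timeout itself: flush the table by hand before retrying the snapshot, and raise the master-side snapshot timeout, which the "diff:60000, max:60000" in the stack trace suggests is still at its default. The property key below is the one documented for later releases; 0.94 may spell it differently.

    # From the HBase shell: flush first, then retry the snapshot.
    hbase(main):001:0> flush 'weaver_events'
    hbase(main):002:0> snapshot 'weaver_events', 'backup_weaver'

    # In hbase-site.xml on the master (assumed key name; verify for 0.94):
    #   hbase.snapshot.master.timeout.millis = 300000   (default 60000)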
>
> On Thursday, November 6, 2014, Pere Kyle <p...@whisper.sh> wrote:
>
>> Bryan,
>>
>> Thanks again for the incredibly useful reply.
>>
>> I have confirmed that callQueueLen is in fact 0, with a max value of 2
>> in the last week (in Ganglia).
>>
>> hbase.hstore.compaction.max was set to 15 on the nodes, up from a previous 7.
>>
>> Freezes (laggy responses) on the cluster are frequent and affect both
>> reads and writes. I noticed iowait spikes on the nodes.
>>
>> The cluster swings between working 100% and serving nothing/timing out,
>> for no discernible reason.
>>
>> Looking through the logs I have tons of responseTooSlow; it is the only
>> regular occurrence in the logs:
>>
>> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06 03:54:31,640 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler 39 on 60020): (responseTooSlow): {"processingtimems":14573,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@c67b2ac), rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"10.231.139.198:57223","starttimems":1415246057066,"queuetimems":20640,"class":"HRegionServer","responsesize":0,"method":"multi"}
>> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06 03:54:31,640 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler 42 on 60020): (responseTooSlow): {"processingtimems":45660,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@6c034090), rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"10.231.21.106:41126","starttimems":1415246025979,"queuetimems":202,"class":"HRegionServer","responsesize":0,"method":"multi"}
>> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06 03:54:31,642 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler 46 on 60020): (responseTooSlow): {"processingtimems":14620,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@4fc3bb1f), rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"10.230.130.102:54068","starttimems":1415246057021,"queuetimems":27565,"class":"HRegionServer","responsesize":0,"method":"multi"}
>> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06 03:54:31,642 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler 35 on 60020): (responseTooSlow): {"processingtimems":13431,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@3b321922), rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"10.227.42.252:60493","starttimems":1415246058210,"queuetimems":1134,"class":"HRegionServer","responsesize":0,"method":"multi"}
>>
>> On Nov 6, 2014, at 12:38 PM, Bryan Beaudreault <bbeaudrea...@hubspot.com> wrote:
>>
>>> blockingStoreFiles
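A hedged one-liner for ranking those responseTooSlow entries by queue time across the region server logs; the file-name pattern is taken from the lines above, and the JSON field layout is assumed to match them exactly:

    $ grep -h '(responseTooSlow)' hbase-hadoop-regionserver-*.log \
        | grep -o '"queuetimems":[0-9]*' \
        | cut -d: -f2 | sort -n | tail

Queue times like the 20640 ms and 27565 ms above generally mean calls were waiting for a free handler rather than executing slowly themselves, which is consistent with the blockingStoreFiles line of investigation.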