[ 
https://issues.apache.org/jira/browse/ACCUMULO-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968616#comment-13968616
 ] 

Keith Turner commented on ACCUMULO-2388:
----------------------------------------

I ran continuous ingest on a 20 node AWS cluster using 1.6.0-RC1  and 6 of my 
17 ingest clients died with this issue.  About 11 hours after stating 6 of the 
clients died around the same time because of an internal error on the same 
tserver.  The tserver had many  
{{org.apache.accumulo.tserver.HoldTimeoutException}} around that time. 

{noformat}
[cluster@ip-10-1-2-24 logs]$ grep "Commits held" * | egrep 
[0-9][0-9][0-9][.][0-9][0-9]
tserver_ip-10-1-2-24.ec2.internal.debug.log:2014-04-12 11:08:24,800 
[tserver.TabletServerResourceManager] DEBUG: Commits held for 184.93 secs
{noformat}

Looking at the tserver logs it seems that walog writes taking too long were the 
culprit.  The tablet server ran for hours, but excessively long walog times 
only occurred around this short period of time.

{noformat}
[cluster@ip-10-1-2-24 logs]$ grep writeTime * | egrep -v 
"writeTime:[1-2]?[0-9]?[0-9]?[0-9]?[0-9]m"
tserver_ip-10-1-2-24.ec2.internal.debug.log:2014-04-12 11:03:21,790 
[log.TabletServerLogger] DEBUG:  wrote MinC finish  60370: writeTime:39247ms 
tserver_ip-10-1-2-24.ec2.internal.debug.log:2014-04-12 11:08:24,665 
[log.TabletServerLogger] DEBUG:  wrote MinC finish  60410: writeTime:58963ms 
tserver_ip-10-1-2-24.ec2.internal.debug.log:2014-04-12 11:08:24,748 
[log.TabletServerLogger] DEBUG:  wrote MinC finish  60412: writeTime:58104ms 
tserver_ip-10-1-2-24.ec2.internal.debug.log:2014-04-12 11:08:24,748 
[log.TabletServerLogger] DEBUG:  wrote MinC finish  60413: writeTime:57303ms 
tserver_ip-10-1-2-24.ec2.internal.debug.log:2014-04-12 11:08:24,748 
[log.TabletServerLogger] DEBUG:  wrote MinC finish  60411: writeTime:58922ms 
{noformat}

{noformat}
[cluster@ip-10-1-2-24 logs]$ grep UpSess 
tserver_ip-10-1-2-24.ec2.internal.debug.log | egrep -v "lt=[0-9]?[0-9][.][0-9]"
2014-04-12 11:04:21,370 [tserver.TabletServer] DEBUG: UpSess 10.1.2.24:41076 
16,893 in 113.972s, at=[0 0 0.00 32] ft=113.917s(pt=0.005s lt=113.564s 
ct=0.348s)
2014-04-12 11:04:21,528 [tserver.TabletServer] DEBUG: UpSess 10.1.2.18:45239 
25,402 in 114.076s, at=[0 0 0.00 32] ft=113.998s(pt=0.009s lt=113.485s 
ct=0.504s)
2014-04-12 11:04:34,246 [tserver.TabletServer] DEBUG: UpSess 10.1.2.14:34925 
25,495 in 127.707s, at=[0 0 0.00 32] ft=127.524s(pt=0.008s lt=127.188s 
ct=0.328s)
2014-04-12 11:05:12,730 [tserver.TabletServer] DEBUG: UpSess 10.1.2.27:47808 
41,559 in 110.530s, at=[0 1 0.03 32] ft=110.297s(pt=0.027s lt=109.579s 
ct=0.691s)
2014-04-12 11:05:12,731 [tserver.TabletServer] DEBUG: UpSess 10.1.2.16:36129 
42,009 in 110.808s, at=[0 0 0.00 32] ft=110.699s(pt=0.027s lt=109.802s 
ct=0.870s)
2014-04-12 11:05:13,091 [tserver.TabletServer] DEBUG: UpSess 10.1.2.19:38228 
41,187 in 114.117s, at=[0 1 0.03 32] ft=114.022s(pt=0.050s lt=113.574s 
ct=0.398s)
2014-04-12 11:05:39,362 [tserver.TabletServer] DEBUG: UpSess 10.1.2.22:37561 
41,799 in 124.262s, at=[0 0 0.00 32] ft=124.169s(pt=0.019s lt=123.414s 
ct=0.736s)
2014-04-12 11:06:41,348 [tserver.TabletServer] DEBUG: UpSess 10.1.2.17:57689 
33,547 in 130.710s, at=[0 1 0.06 32] ft=130.631s(pt=0.014s lt=130.119s 
ct=0.498s)
2014-04-12 11:06:53,563 [tserver.TabletServer] DEBUG: UpSess 10.1.2.14:35186 
33,503 in 146.692s, at=[0 1 0.03 32] ft=146.587s(pt=0.017s lt=146.300s 
ct=0.270s)
2014-04-12 11:06:58,482 [tserver.TabletServer] DEBUG: UpSess 10.1.2.18:45239 
33,394 in 156.948s, at=[0 1 0.03 32] ft=156.845s(pt=0.017s lt=156.575s 
ct=0.253s)
2014-04-12 11:07:05,899 [tserver.TabletServer] DEBUG: UpSess 10.1.2.13:40356 
41,804 in 226.959s, at=[0 1 0.03 32] ft=226.786s(pt=0.020s lt=226.168s 
ct=0.598s)
2014-04-12 11:07:06,025 [tserver.TabletServer] DEBUG: UpSess 10.1.2.16:36129 
7,962 in 113.265s, at=[0 1 0.03 32] ft=113.243s(pt=0.002s lt=113.082s ct=0.159s)
2014-04-12 11:07:06,044 [tserver.TabletServer] DEBUG: UpSess 10.1.2.19:38228 
8,314 in 111.536s, at=[0 0 0.00 32] ft=111.497s(pt=0.003s lt=111.318s ct=0.176s)
2014-04-12 11:07:06,143 [tserver.TabletServer] DEBUG: UpSess 10.1.2.27:47808 
16,831 in 110.390s, at=[0 0 0.00 32] ft=110.359s(pt=0.005s lt=110.081s 
ct=0.273s)
2014-04-12 11:07:14,716 [tserver.TabletServer] DEBUG: UpSess 10.1.2.23:54944 
8,485 in 138.654s, at=[0 0 0.00 32] ft=138.638s(pt=0.003s lt=138.596s ct=0.039s)
2014-04-12 11:08:25,465 [tserver.TabletServer] DEBUG: UpSess 10.1.2.24:41076 
41,923 in 244.077s, at=[0 1 0.03 32] ft=243.878s(pt=78.546s lt=164.889s 
ct=0.443s)
2014-04-12 11:08:25,891 [tserver.TabletServer] DEBUG: UpSess 10.1.2.20:41926 
41,910 in 235.374s, at=[0 1 0.03 32] ft=235.277s(pt=98.752s lt=135.996s 
ct=0.529s)
2014-04-12 11:08:25,891 [tserver.TabletServer] DEBUG: UpSess 10.1.2.29:41553 
41,691 in 228.196s, at=[0 1 0.03 32] ft=228.041s(pt=107.973s lt=119.670s 
ct=0.398s)
2014-04-12 11:08:26,307 [tserver.TabletServer] DEBUG: UpSess 10.1.2.15:50332 
42,013 in 228.612s, at=[0 0 0.00 32] ft=228.434s(pt=118.835s lt=109.381s 
ct=0.218s)
{noformat}

Since this was running on AWS maybe these walog hiccups are expected, not sure. 
  In any case the clients should not error.  I am going to retry w/ setting  
{{general.rpc.timout=240s}} in {{accumulo-site.xml}}

> Continuous Ingest clients die
> -----------------------------
>
>                 Key: ACCUMULO-2388
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2388
>             Project: Accumulo
>          Issue Type: Bug
>          Components: test, tserver
>         Environment: 1.6.0-SNAPSHOT (sha-1: 0da9a56)
> cdh4.5.0
>            Reporter: Mike Drob
>            Assignee: Mike Drob
>            Priority: Minor
>              Labels: 16_qa_bug
>             Fix For: 1.6.1
>
>         Attachments: tracer.debug.log, tserver1.log
>
>
> I was running continuous ingest on a 7 node cluster (5 slaves) and after 
> enabling HDFS agitation, my clients died.
> {code:title=ingest.err}
> Thread "org.apache.accumulo.test.continuous.ContinuousIngest" died 
> java.lang.reflect.InvocationTargetException
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.accumulo.start.Main$1.run(Main.java:137)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: java.lang.reflect.UndeclaredThrowableException
> at $Proxy9.addMutation(Unknown Source)
> at 
> org.apache.accumulo.test.continuous.ContinuousIngest.main(ContinuousIngest.java:212)
> ... 6 more
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at 
> org.apache.accumulo.trace.instrument.TraceProxy$2.invoke(TraceProxy.java:43)
> ... 8 more
> Caused by: org.apache.accumulo.core.client.MutationsRejectedException: # 
> constraint violations : 0 security codes: {} # server errors 1 # exceptions 0
> at 
> org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures(TabletServerBatchWriter.java:537)
> at 
> org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(TabletServerBatchWriter.java:258)
> at 
> org.apache.accumulo.core.client.impl.BatchWriterImpl.addMutation(BatchWriterImpl.java:43)
> ... 12 more
> {code}
> {code:title=ingest.out}
> UUID 1392844086463 f822a6a9-9592-4b3a-ab3b-1c172be20b96
> FLUSH 1392844135523 49047 6165 1000000 1000000
> FLUSH 1392844165594 30071 7787 2000000 1000000
> FLUSH 1392844195875 30281 7816 3000000 1000000
> FLUSH 1392844226787 30912 8086 4000000 1000000
> FLUSH 1392844257194 30407 7989 5000000 1000000
> FLUSH 1392844287518 30324 7743 6000000 1000000
> FLUSH 1392844325833 38315 10933 7000000 1000000
> FLUSH 1392844364708 38875 7916 8000000 1000000
> FLUSH 1392844395818 31110 8104 9000000 1000000
> 2014-02-19 13:16:57,444 [impl.TabletServerBatchWriter] ERROR: Server side 
> error on tserver1:10011: org.apache.thrift.TApplicationException: Internal 
> error processing applyUpdates
> 2014-02-19 13:16:57,446 [impl.TabletServerBatchWriter] ERROR: Failed to send 
> tablet server tserver1:10011 its batch : Error on server tserver1:10011
> org.apache.accumulo.core.client.impl.AccumuloServerException: Error on server 
> tserver1:10011
> at 
> org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:937)
> at 
> org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.access$1600(TabletServerBatchWriter.java:616)
> at 
> org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter$SendTask.send(TabletServerBatchWriter.java:801)
> at 
> org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter$SendTask.run(TabletServerBatchWriter.java:765)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at 
> org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
> at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.thrift.TApplicationException: Internal error processing 
> applyUpdates
> at 
> org.apache.thrift.TApplicationException.read(TApplicationException.java:108)
> at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
> at 
> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_closeUpdate(TabletClientService.java:431)
> at 
> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.closeUpdate(TabletClientService.java:417)
> at 
> org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:899)
> ... 11 more
> {code}
> {code:title=tserver.log}
> 2014-02-19 13:16:56,156 [util.TServerUtils$THsHaServer] WARN : Got an 
> IOException in internalRead!
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcher.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
> at sun.nio.ch.IOUtil.read(IOUtil.java:171)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
> at 
> org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:141)
> at 
> org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:515)
> at 
> org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:355)
> at 
> org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:202)
> at 
> org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.select(TNonblockingServer.java:198)
> at 
> org.apache.thrift.server.TNonblockingServer$SelectAcceptThread.run(TNonblockingServer.java:154)
> {code}
> Note that this last message was not propagated to the monitor for some 
> reason, but that is likely a different issue. (I had been seeing other WARN 
> messages show up earlier.)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to