This looks like a network issue. Is there a way to check whether the cluster had a network glitch while the app was running?
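
In the meantime, enabling the DEBUG logging that Vlad suggests below should surface the underlying cause behind the "Client does not own the socket any longer!" errors. A minimal sketch, assuming the log4j 1.x API in use on the 3.3.x platform; the logger name follows Vlad's suggestion, and the class wrapper is purely illustrative:

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class NetletDebugLevel {
    public static void main(String[] args) {
        // Illustrative sketch: raise the com.datatorrent.netlet loggers to
        // DEBUG so netlet logs more detail before a socket-ownership failure.
        Logger.getLogger("com.datatorrent.netlet").setLevel(Level.DEBUG);
    }
}

If your containers pick up a log4j.properties file instead, the equivalent setting there would be:

log4j.logger.com.datatorrent.netlet=DEBUG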
On Thu, May 12, 2016 at 2:19 PM, Vlad Rozov <[email protected]> wrote:

> Hi Alex,
>
> After starting the application, set "com.datatorrent.netlet.*" to DEBUG
> and see if this will provide more info on the actual cause of the failure.
>
> Thank you,
> Vlad
>
> On 5/11/16 09:47, McCullough, Alex wrote:
>
>> Hi Ram,
>>
>> Here are the details:
>>
>> rtsBuildRevision: rev: 2d66c73 branch: refs/tags/v3.3.0-incubating
>> rtsBuildUser: DataTorrent CI
>> rtsBuildVersion: 3.3.0-RC5 from rev: 2d66c73 branch:
>> refs/tags/v3.3.0-incubating by DataTorrent CI on 11.03.2016 @ 01:46:41 PST
>> rtsVersion: 3.3.0-RC5
>> version: 3.3.1-dt20160309
>>
>> Thanks,
>> Alex
>>
>> On 5/11/16, 11:07 AM, "Munagala Ramanath" <[email protected]> wrote:
>>
>>> Alex, what version of the platform are you running?
>>>
>>> Ram
>>>
>>> On Wed, May 11, 2016 at 7:05 AM, McCullough, Alex
>>> <[email protected]> wrote:
>>>
>>>> Hey Everyone,
>>>>
>>>> I have an application that is “failing” after running for a number of
>>>> hours, and I was wondering if there is a standard way to determine the
>>>> cause of the failure.
>>>>
>>>> In STRAM events I see some final exceptions on containers at the end,
>>>> related to loss of socket ownership; when I click on the operator and
>>>> look at the logs, the last thing logged is a different error. Both are
>>>> listed below.
>>>>
>>>> In the app master logs I see yet another error.
>>>>
>>>> Is there a best practice for determining why an application becomes
>>>> “Failed”? And any insight on the exceptions below?
>>>>
>>>> Thanks,
>>>> Alex
>>>>
>>>> App Master log, final lines:
>>>>
>>>> 2016-05-10 20:20:33,862 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-88483743-10.24.28.46-1443641081815:blk_1167862818_94520797
>>>> java.io.IOException: Bad response ERROR for block BP-88483743-10.24.28.46-1443641081815:blk_1167862818_94520797 from datanode DatanodeInfoWithStorage[10.24.28.58:50010,DS-d0329c7e-59b4-4c6b-b321-59f8c013f113,DISK]
>>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:1002)
>>>> 2016-05-10 20:20:33,862 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-88483743-10.24.28.46-1443641081815:blk_1167862818_94520797 in pipeline DatanodeInfoWithStorage[10.24.28.56:50010,DS-3664dd2d-8bf2-402a-badb-2016bce2c642,DISK], DatanodeInfoWithStorage[10.24.28.63:50010,DS-6c2824a3-a9f1-4cef-b3f2-4069e3a596e7,DISK], DatanodeInfoWithStorage[10.24.28.58:50010,DS-d0329c7e-59b4-4c6b-b321-59f8c013f113,DISK]: bad datanode DatanodeInfoWithStorage[10.24.28.58:50010,DS-d0329c7e-59b4-4c6b-b321-59f8c013f113,DISK]
>>>> 2016-05-10 20:20:37,646 ERROR org.apache.hadoop.hdfs.DFSClient: Failed to close inode 96057235
>>>> java.io.EOFException: Premature EOF: no length prefix available
>>>>     at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2241)
>>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1264)
>>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1234)
>>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1375)
>>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1119)
>>>>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:622)
>>>>
>>>> Error displayed for several HDHT operators in STRAM events:
>>>>
>>>> Stopped running due to an exception.
>>>> com.datatorrent.netlet.NetletThrowable$NetletRuntimeException: java.lang.UnsupportedOperationException: Client does not own the socket any longer!
>>>>     at com.datatorrent.netlet.AbstractClient$1.offer(AbstractClient.java:364)
>>>>     at com.datatorrent.netlet.AbstractClient$1.offer(AbstractClient.java:354)
>>>>     at com.datatorrent.netlet.AbstractClient.send(AbstractClient.java:300)
>>>>     at com.datatorrent.netlet.AbstractLengthPrependerClient.write(AbstractLengthPrependerClient.java:236)
>>>>     at com.datatorrent.netlet.AbstractLengthPrependerClient.write(AbstractLengthPrependerClient.java:190)
>>>>     at com.datatorrent.stram.stream.BufferServerPublisher.put(BufferServerPublisher.java:135)
>>>>     at com.datatorrent.api.DefaultOutputPort.emit(DefaultOutputPort.java:51)
>>>>     at com.capitalone.vault8.citadel.operators.AbstractTimedHdhtRecordWriter$2.emit(AbstractTimedHdhtRecordWriter.java:92)
>>>>     at com.capitalone.vault8.citadel.operators.AbstractTimedHdhtRecordWriter$2.emit(AbstractTimedHdhtRecordWriter.java:89)
>>>>     at com.capitalone.vault8.citadel.operators.AbstractTimedHdhtRecordWriter.processTuple(AbstractTimedHdhtRecordWriter.java:78)
>>>>     at com.capitalone.vault8.citadel.operators.AbstractTimedHdhtRecordWriter$1.process(AbstractTimedHdhtRecordWriter.java:85)
>>>>     at com.capitalone.vault8.citadel.operators.AbstractTimedHdhtRecordWriter$1.process(AbstractTimedHdhtRecordWriter.java:82)
>>>>     at com.datatorrent.api.DefaultInputPort.put(DefaultInputPort.java:79)
>>>>     at com.datatorrent.stram.stream.BufferServerSubscriber$BufferReservoir.sweep(BufferServerSubscriber.java:265)
>>>>     at com.datatorrent.stram.engine.GenericNode.run(GenericNode.java:252)
>>>>     at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1388)
>>>> Caused by: java.lang.UnsupportedOperationException: Client does not own the socket any longer!
>>>>     ... 16 more
>>>>
>>>> Last lines in one of the stopped containers with the above exception:
>>>>
>>>> 2016-05-11 08:09:43,044 WARN com.datatorrent.stram.RecoverableRpcProxy: RPC failure, attempting reconnect after 10000 ms (remaining 29498 ms)
>>>> java.lang.reflect.UndeclaredThrowableException
>>>>     at com.sun.proxy.$Proxy18.processHeartbeat(Unknown Source)
>>>>     at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>>>     at com.datatorrent.stram.RecoverableRpcProxy.invoke(RecoverableRpcProxy.java:138)
>>>>     at com.sun.proxy.$Proxy18.processHeartbeat(Unknown Source)
>>>>     at com.datatorrent.stram.engine.StreamingContainer.heartbeatLoop(StreamingContainer.java:693)
>>>>     at com.datatorrent.stram.engine.StreamingContainer.main(StreamingContainer.java:312)
>>>> Caused by: java.io.EOFException: End of File Exception between local host is: "mdcilabpdn04.kdc.capitalone.com/10.24.28.53"; destination host is: "mdcilabpdn06.kdc.capitalone.com":49859; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
>>>>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>     at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>>     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
>>>>     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
>>>>     at org.apache.hadoop.ipc.Client.call(Client.java:1476)
>>>>     at org.apache.hadoop.ipc.Client.call(Client.java:1403)
>>>>     at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:243)
>>>>     ... 8 more
>>>> Caused by: java.io.EOFException
>>>>     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>>     at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1075)
>>>>     at org.apache.hadoop.ipc.Client$Connection.run(Client.java:970)
