This looks like a network issue. Is there a way to check whether the cluster
had a network glitch while the app was running?
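
At the OS level, something like the following might show it (a rough sketch;
the host names are placeholders and the exact driver log messages vary by NIC):

  # Placeholder host list -- substitute the actual cluster nodes.
  for host in node01 node02 node03; do
    echo "== $host =="
    # The errs/drop columns in /proc/net/dev flag interface-level problems.
    ssh "$host" 'cat /proc/net/dev'
    # NIC drivers usually log link flaps to the kernel ring buffer.
    ssh "$host" 'dmesg | grep -i "link" | tail -n 20'
  done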

On Thu, May 12, 2016 at 2:19 PM, Vlad Rozov <[email protected]> wrote:

> Hi Alex,
>
> After starting the application, set the log level for
> "com.datatorrent.netlet.*" to DEBUG and see whether that provides more
> information on the actual cause of the failure.
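> For example, with a plain log4j.properties this would be something like the
> following (a sketch; adjust to however your containers pick up their logging
> configuration):
>
>   # A package-level logger covers all com.datatorrent.netlet.* classes.
>   log4j.logger.com.datatorrent.netlet=DEBUG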
>
> Thank you,
> Vlad
>
>
> On 5/11/16 09:47, McCullough, Alex wrote:
>
>> Hi Ram,
>>
>> Here are the details:
>>
>> rtsBuildRevision: rev: 2d66c73 branch: refs/tags/v3.3.0-incubating
>> rtsBuildUser: DataTorrent CI
>> rtsBuildVersion: 3.3.0-RC5 from rev: 2d66c73 branch:
>> refs/tags/v3.3.0-incubating by DataTorrent CI on 11.03.2016 @ 01:46:41 PST
>> rtsVersion: 3.3.0-RC5
>> version: 3.3.1-dt20160309
>>
>>
>> Thanks,
>> Alex
>>
>>
>>
>>
>> On 5/11/16, 11:07 AM, "Munagala Ramanath" <[email protected]> wrote:
>>
>>> Alex, what version of the platform are you running?
>>>
>>> Ram
>>>
>>> On Wed, May 11, 2016 at 7:05 AM, McCullough, Alex <
>>> [email protected]> wrote:
>>>
>>>> Hey Everyone,
>>>>
>>>> I have an application that is “failing” after running for a number of
>>>> hours, and I was wondering whether there is a standard way to determine
>>>> the cause of the failure.
>>>>
>>>> In STRAM events I see some final exceptions on the containers at the end,
>>>> related to loss of socket ownership. When I click on the operator and
>>>> look at the logs, the last thing logged is a different error; both are
>>>> listed below.
>>>>
>>>> In the app master logs I see yet another error.
>>>>
>>>> Is there a best practice for determining why an application becomes
>>>> “Failed”? And does anyone have insight on the exceptions below?
>>>>
>>>> Thanks,
>>>> Alex
>>>>
>>>>
>>>> App Master log, final lines:
>>>>
>>>> 2016-05-10 20:20:33,862 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-88483743-10.24.28.46-1443641081815:blk_1167862818_94520797
>>>> java.io.IOException: Bad response ERROR for block BP-88483743-10.24.28.46-1443641081815:blk_1167862818_94520797 from datanode DatanodeInfoWithStorage[10.24.28.58:50010,DS-d0329c7e-59b4-4c6b-b321-59f8c013f113,DISK]
>>>> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:1002)
>>>> 2016-05-10 20:20:33,862 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-88483743-10.24.28.46-1443641081815:blk_1167862818_94520797 in pipeline DatanodeInfoWithStorage[10.24.28.56:50010,DS-3664dd2d-8bf2-402a-badb-2016bce2c642,DISK], DatanodeInfoWithStorage[10.24.28.63:50010,DS-6c2824a3-a9f1-4cef-b3f2-4069e3a596e7,DISK], DatanodeInfoWithStorage[10.24.28.58:50010,DS-d0329c7e-59b4-4c6b-b321-59f8c013f113,DISK]: bad datanode DatanodeInfoWithStorage[10.24.28.58:50010,DS-d0329c7e-59b4-4c6b-b321-59f8c013f113,DISK]
>>>> 2016-05-10 20:20:37,646 ERROR org.apache.hadoop.hdfs.DFSClient: Failed to close inode 96057235
>>>> java.io.EOFException: Premature EOF: no length prefix available
>>>> at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2241)
>>>> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1264)
>>>> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1234)
>>>> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1375)
>>>> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1119)
>>>> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:622)
>>>>
>>>>
>>>>
>>>> Error displayed for several HDHT operators in STRAM Events:
>>>>
>>>> Stopped running due to an exception.
>>>> com.datatorrent.netlet.NetletThrowable$NetletRuntimeException: java.lang.UnsupportedOperationException: Client does not own the socket any longer!
>>>>          at com.datatorrent.netlet.AbstractClient$1.offer(AbstractClient.java:364)
>>>>          at com.datatorrent.netlet.AbstractClient$1.offer(AbstractClient.java:354)
>>>>          at com.datatorrent.netlet.AbstractClient.send(AbstractClient.java:300)
>>>>          at com.datatorrent.netlet.AbstractLengthPrependerClient.write(AbstractLengthPrependerClient.java:236)
>>>>          at com.datatorrent.netlet.AbstractLengthPrependerClient.write(AbstractLengthPrependerClient.java:190)
>>>>          at com.datatorrent.stram.stream.BufferServerPublisher.put(BufferServerPublisher.java:135)
>>>>          at com.datatorrent.api.DefaultOutputPort.emit(DefaultOutputPort.java:51)
>>>>          at com.capitalone.vault8.citadel.operators.AbstractTimedHdhtRecordWriter$2.emit(AbstractTimedHdhtRecordWriter.java:92)
>>>>          at com.capitalone.vault8.citadel.operators.AbstractTimedHdhtRecordWriter$2.emit(AbstractTimedHdhtRecordWriter.java:89)
>>>>          at com.capitalone.vault8.citadel.operators.AbstractTimedHdhtRecordWriter.processTuple(AbstractTimedHdhtRecordWriter.java:78)
>>>>          at com.capitalone.vault8.citadel.operators.AbstractTimedHdhtRecordWriter$1.process(AbstractTimedHdhtRecordWriter.java:85)
>>>>          at com.capitalone.vault8.citadel.operators.AbstractTimedHdhtRecordWriter$1.process(AbstractTimedHdhtRecordWriter.java:82)
>>>>          at com.datatorrent.api.DefaultInputPort.put(DefaultInputPort.java:79)
>>>>          at com.datatorrent.stram.stream.BufferServerSubscriber$BufferReservoir.sweep(BufferServerSubscriber.java:265)
>>>>          at com.datatorrent.stram.engine.GenericNode.run(GenericNode.java:252)
>>>>          at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1388)
>>>> Caused by: java.lang.UnsupportedOperationException: Client does not own the socket any longer!
>>>>          ... 16 more
>>>>
>>>> Last lines in one of the stopped containers with the above exception:
>>>> 2016-05-11 08:09:43,044 WARN com.datatorrent.stram.RecoverableRpcProxy: RPC failure, attempting reconnect after 10000 ms (remaining 29498 ms)
>>>> java.lang.reflect.UndeclaredThrowableException
>>>> at com.sun.proxy.$Proxy18.processHeartbeat(Unknown Source)
>>>> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>> at com.datatorrent.stram.RecoverableRpcProxy.invoke(RecoverableRpcProxy.java:138)
>>>> at com.sun.proxy.$Proxy18.processHeartbeat(Unknown Source)
>>>> at com.datatorrent.stram.engine.StreamingContainer.heartbeatLoop(StreamingContainer.java:693)
>>>> at com.datatorrent.stram.engine.StreamingContainer.main(StreamingContainer.java:312)
>>>> Caused by: java.io.EOFException: End of File Exception between local host is: "mdcilabpdn04.kdc.capitalone.com/10.24.28.53"; destination host is: "mdcilabpdn06.kdc.capitalone.com":49859; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
>>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
>>>> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1476)
>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1403)
>>>> at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:243)
>>>> ... 8 more
>>>> Caused by: java.io.EOFException
>>>> at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>> at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1075)
>>>> at org.apache.hadoop.ipc.Client$Connection.run(Client.java:970)
>>
>>
>
>
