Keep us posted once you have caught the problem in the act. That would help
tremendously in debugging and understanding it.

Cheers,
Till

On Wed, Apr 15, 2020 at 8:44 AM Zhu Zhu <reed...@gmail.com> wrote:

> Sorry, I made a mistake. Even if my guess is right, you will not get the
> log line "Task {} is already in state FAILED.", because the task had
> already been unregistered before the attempt to report the state to the JM.
> Unfortunately, we currently have no log that could be used to prove it.
> Just to confirm: the line "FOG_PREDICTION_FUNCTION (15/20)
> (3086efd0e57612710d0ea74138c01090) switched from RUNNING to FAILED" does
> not appear in the JM log, right? This might mean that the message was lost
> on the network, which should be a rare case. Do you encounter it often?
>
> Thanks,
> Zhu Zhu
>
> On Wed, Apr 15, 2020 at 9:16 AM Hanson, Bruce <bruce.han...@here.com> wrote:
>
>> Hi Zhu Zhu (and Till),
>>
>>
>>
>> Thanks for your thoughts on this problem. I do not see a message like the
>> one you mention "Task {} is already in state FAILED." I have attached a
>> file with all the task manager logs that we received at the time this
>> happened. As you see, there aren’t many. We turned on debug logging for
>> “org.apache.flink” on this job this afternoon so maybe we’ll find something
>> interesting if/when the issue happens again. I do hope we can catch it in
>> the act.
>>
>>
>>
>> -Bruce
>>
>>
>>
>> --
>>
>>
>>
>>
>>
>> *From: *Zhu Zhu <reed...@gmail.com>
>> *Date: *Monday, April 13, 2020 at 9:29 PM
>> *To: *Till Rohrmann <trohrm...@apache.org>
>> *Cc: *Aljoscha Krettek <aljos...@apache.org>, user <user@flink.apache.org>,
>> Gary Yao <g...@apache.org>
>> *Subject: *Re: Flink job didn't restart when a task failed
>>
>>
>>
>> Sorry for not following this ML earlier.
>>
>>
>>
>> I think the cause might be that the final state ('FAILED') update message
>> to the JM was lost. The TaskExecutor will simply fail the task (which has
>> no effect here, since the task is already FAILED) and will not report the
>> task state again.
>>
>> @Bruce would you take a look at the TM log? If the guess is right, in
>> task manager logs there will be one line "Task {} is already in state
>> FAILED."
>>
>>
>>
>> Thanks,
>>
>> Zhu Zhu
>>
>>
>>
>> On Fri, Apr 10, 2020 at 12:51 AM Till Rohrmann <trohrm...@apache.org> wrote:
>>
>> For future reference, here is the issue to track the reconciliation logic
>> [1].
>>
>>
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-17075
>>
>>
>>
>> Cheers,
>>
>> Till
>>
>>
>>
>> On Thu, Apr 9, 2020 at 6:47 PM Till Rohrmann <trohrm...@apache.org>
>> wrote:
>>
>> Hi Bruce,
>>
>>
>>
>> What you are describing does indeed sound quite bad. It is hard to say
>> whether we fixed such an issue in 1.10. It is definitely worth a try to
>> upgrade, though.
>>
>>
>>
>> In order to further debug the problem, it would be really great if you
>> could provide us with the log files of the JobMaster and the TaskExecutor.
>> Ideally on debug log level if you have them.
>>
>>
>>
>> One thing we wanted to add is sending the current task statuses as part
>> of the heartbeat from the TM to the JM. Having this information would
>> allow us to reconcile a situation like the one you are describing.
>>
>>
>>
>> Cheers,
>>
>> Till
>>
>>
>>
>> On Thu, Apr 9, 2020 at 1:57 PM Aljoscha Krettek <aljos...@apache.org>
>> wrote:
>>
>> Hi,
>>
>> this indeed seems very strange!
>>
>> @Gary Could you maybe have a look at this since you work/worked quite a
>> bit on the scheduler?
>>
>> Best,
>> Aljoscha
>>
>> On 09.04.20 05:46, Hanson, Bruce wrote:
>> > Hello Flink folks:
>> >
>> > We had a problem with a Flink job the other day that I haven’t seen
>> > before. One task encountered a failure and switched to FAILED (see the
>> > full exception below). After the failure, the task said it was notifying
>> > the Job Manager:
>> >
>> > 2020-04-06 08:21:04.329 [flink-akka.actor.default-dispatcher-55283] level=INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Un-registering task and sending final execution state FAILED to JobManager for task FOG_PREDICTION_FUNCTION 3086efd0e57612710d0ea74138c01090.
>> >
>> > But I see no evidence that the Job Manager got the message. With this
>> > type of failure, I would expect the Job Manager to restart the job. In
>> > this case, the job hobbled along until it stopped processing data and our
>> > user had to restart the job manually. The job also started experiencing
>> > checkpoint timeouts on every checkpoint because this operator had stopped.
>> >
>> > Had the job restarted when this happened, I believe everything would
>> > have been ok as the job had an appropriate restart strategy in place.
>> > The Task Manager that this task was running on remained healthy and was
>> > actively processing other tasks.
>> >
>> > It seems like this is some kind of bug. Is this something anyone has
>> > seen before? Could it be something that has already been fixed in Flink
>> > 1.10?
>> >
>> > We are running Flink 1.7.2. I know it’s rather old now. We run a
>> > managed environment where users can run their jobs, and are in the
>> > process of upgrading to 1.10.
>> >
>> > This is the full exception that started the problem:
>> >
>> > 2020-04-06 08:21:04.297 [FOG_PREDICTION_FUNCTION (15/20)] level=INFO org.apache.flink.runtime.taskmanager.Task  - FOG_PREDICTION_FUNCTION (15/20) (3086efd0e57612710d0ea74138c01090) switched from RUNNING to FAILED.
>> > org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: Connection timed out (connection to '/100.112.98.121:36256')
>> >         at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:165)
>> >         at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
>> >         at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264)
>> >         at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:256)
>> >         at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
>> >         at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
>> >         at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264)
>> >         at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:256)
>> >         at org.apache.flink.shaded.netty4.io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:87)
>> >         at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
>> >         at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264)
>> >         at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:256)
>> >         at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1401)
>> >         at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
>> >         at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264)
>> >         at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:953)
>> >         at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:125)
>> >         at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:174)
>> >         at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
>> >         at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
>> >         at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
>> >         at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
>> >         at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
>> >         at java.lang.Thread.run(Thread.java:748)
>> > Caused by: java.io.IOException: Connection timed out
>> >         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>> >         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>> >         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>> >         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>> >         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>> >         at org.apache.flink.shaded.netty4.io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
>> >         at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1108)
>> >         at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:345)
>> >         at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
>> >         ... 6 common frames omitted
>> >
>> >
>> >
>> >
>> >
>> >
>> > Bruce Hanson
>> > Principal Engineer
>> > M: +1 425 681 0422
>> >
>> > HERE Technologies
>> > 701 Pike Street, Suite 2000
>> > Seattle, WA 98101 USA
>> > 47° 36' 41" N 122° 19' 57" W
>> >
>> >
>> >
>>
>>
