Hi Averell,
I have not used AWS products, but if EMR behaves like a standard YARN
setup, or if you have visited YARN's web UI before: look at the YARN
ApplicationMaster log to find the JM log, and the container logs are the
TM logs.
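If EMR gives you a stock YARN underneath, running
yarn logs -applicationId <your-application-id> on the master node should
also dump the ApplicationMaster (JM) log and all container (TM) logs in
one go; the application id is the one shown in the YARN UI.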
Thanks, vino.
Averell wrote on Mon, Aug 27, 2018 at 4:09 PM:
> Hi Vino,
>
> Could you please
Hi Vino,
Could you please tell me where I can find the JM and TM logs? I'm running
on AWS EMR using YARN.
Thanks and best regards,
Averell
Hi Averell,
This problem is caused by a heartbeat timeout between the JM and the TM.
You can locate it by:
1) checking the network status of the node at the time, for example
whether connections to other systems were equally problematic;
2) checking the TM log to see whether it gives a more specific reason;
3) View
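If the timeouts turn out to be caused by load or GC pauses rather than a
real network problem, it may also help to raise heartbeat.timeout in
flink-conf.yaml; as far as I know it defaults to 50000 ms, so for example
heartbeat.timeout: 120000 gives the TMs more slack.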
Thank you Vino.
I put the message in a tag, and I don't know why it was not shown in the
email thread. I have pasted the error message below in this email.
Anyway, it seems it was an issue with enabling checkpointing. Now I am
able to get it turned on properly, and my job is getting restored
Hi Averell,
What is the error message? It seems you may have forgotten to post it.
As far as I know, if you enable checkpointing, the job will automatically
resume when it fails.
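For reference, enabling it is a one-liner in the DataStream API; a minimal
sketch (the 60-second interval is only an example):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
    // take a checkpoint of the job's state every 60 seconds
    env.enableCheckpointing(60000);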
Thanks, vino.
Averell wrote on Mon, Aug 27, 2018 at 1:21 PM:
> Thank you Vino.
>
> I sometimes get an error message like the one below. It
Thank you Vino.
I sometimes get an error message like the one below. It looks like my
executors got overloaded. This leads me to another question: is there any
existing solution that allows me to have the job restored automatically?
Thanks and best regards,
Averell
Hi Averell,
Checkpoints are triggered automatically and periodically, according to the
checkpoint interval set by the user; I believe you have no doubt about
this.
There are many possible reasons for a job failure. The technical
definition is that the job does not enter its final (terminal) state
normally.
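For the "restored automatically" part of the question: as far as I know,
enabling checkpointing also switches the default restart strategy from
"no restart" to a fixed-delay one, and you can set it explicitly as well.
A sketch (the attempt count and delay are only examples):

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(60000); // needed so state can be restored
    // on failure, retry the job up to 3 times, 10 seconds apart
    env.setRestartStrategy(
        RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));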
Hi Vino,
Regarding the statement "Checkpoints are taken automatically and are used
for automatically restarting a job in case of a failure": I do not quite
understand the definition of a failure, or how to simulate one while
testing my application. Possible scenarios that I can think of:
(1)
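One easy way to force such a failure while testing, if it helps: kill a
TaskManager process (or its YARN container) and watch the job come back,
or put a "poison pill" into a user function. A sketch, assuming the
DataStream API and that stream is a DataStream<String> (the trigger value
is made up):

    import org.apache.flink.api.common.functions.MapFunction;

    // pass-through map that throws on a trigger record, so the job
    // fails and (with checkpointing + a restart strategy) recovers
    stream.map((MapFunction<String, String>) value -> {
        if ("poison".equals(value)) {
            throw new RuntimeException("simulated failure for recovery testing");
        }
        return value;
    });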
This thread is also useful in this context:
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/difference-between-checkpoints-amp-savepoints-td14787.html
Hi Henry,
In addition to Vino's answer, there are several things to keep in mind
about "checkpoints vs savepoints".
Checkpoints are designed mostly for fault tolerance of a running Flink job
and automatic recovery;
that is why, by default, Flink manages their storage itself. Though it is correct
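On the "Flink manages their storage itself" point: you can also ask Flink
to retain checkpoints so they survive the job, which makes them usable as
manual restore points, much like savepoints. A sketch of the 1.6-era API
(it also requires state.checkpoints.dir to be set in flink-conf.yaml):

    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(60000);
    // keep the latest checkpoint even when the job is cancelled,
    // instead of deleting it as part of job cleanup
    env.getCheckpointConfig().enableExternalizedCheckpoints(
        CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);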
Hi Henry,
A good answer from Stack Overflow:
Apache Flink's checkpoints and savepoints are similar in that they are
both mechanisms for preserving the internal state of Flink applications.
Checkpoints are taken automatically and are used for automatically
restarting a job in case of a failure.
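As far as I know, if a checkpoint has been retained (externalized), you
can also resume from it manually the same way as from a savepoint, by
passing its path to the CLI, e.g.
flink run -s <path-to-checkpoint-metadata> ...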
Hi All,
I checked the documentation of Flink release 1.6 and found that I can also
use checkpoints to resume the program. As I encountered some problems when
using savepoints, I have the following questions:
1. Can I use checkpoints only, and not savepoints, since they can also be
used to resume