Question on getting the last successfully externalized checkpoint path for crashed jobs

2021-01-11 Thread DONG, Weike
Hi community, We are currently using *Externalized Checkpoints* to prevent abrupt YARN application failures, as it saves a "_metadata" file within the checkpoint folder which is essential for the job's cold recovery. As it is designed in Flink, the completed checkpoint paths are like *hdfs:///fli

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

2020-10-16 Thread DONG, Weike
wrote: > Great, thanks a lot Weike. I think the first step would be to open a JIRA > issue, get assigned and then start on fixing it and opening a PR. > > Cheers, > Till > > On Fri, Oct 16, 2020 at 10:02 AM DONG, Weike > wrote: > >> Hi all, >> >> Than

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

2020-10-16 Thread DONG, Weike
plit > assignments and for the LocationPreferenceSlotSelectionStrategy to > calculate how many TMs run on the same machine). > > Do you want to fix this issue? > > Cheers, > Till > > On Thu, Oct 15, 2020 at 11:38 AM DONG, Weike > wrote: > >> Hi Till and community, >> >>

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

2020-10-15 Thread DONG, Weike
high variance, i.e. normally it completes fast but occasionally some slow results would block the thread. So an unstable DNS server might have a great impact on the performance of Flink job startup. Best, Weike On Thu, Oct 15, 2020 at 5:19 PM DONG, Weike wrote: > Hi Till and commun

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

2020-10-15 Thread DONG, Weike
k at them. My suspicion >> would be that there is some operation blocking the JobMaster's main thread >> which causes the registrations from the TMs to time out. Maybe the logs >> allow me to validate/falsify this suspicion. >> >> Cheers, >> Till >> >> O

Re: TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

2020-10-12 Thread DONG, Weike
://gist.github.com/kylemeow/740c470d9b5a1ab3552376193920adce TaskManager-1-1: https://gist.github.com/kylemeow/41b9a8fe91975875c40afaf58276c2fe Thanks : ) Best regards, Weike On Mon, Oct 12, 2020 at 4:14 PM DONG, Weike wrote: > Hi community, > > Recently we have noticed a strange behavior

TaskManager takes abnormally long time to register with JobManager on Kubernetes for Flink 1.11.0

2020-10-12 Thread DONG, Weike
Hi community, Recently we have noticed a strange behavior for Flink jobs on Kubernetes per-job mode: when the parallelism increases, the time it takes for the TaskManagers to register with *JobManager* becomes abnormally long (for a task with parallelism of 50, it could take 60 ~ 120 seconds or ev

Re: Flink YARN app terminated before the client receives the result

2020-03-20 Thread DONG, Weike
>> remember whether a request is currently ongoing or not. >> >> Cheers, >> Till >> >> On Tue, Mar 17, 2020 at 9:01 AM DONG, Weike >> wrote: >> >>> Hi Tison & Till and all, >>> >>> I have uploaded the client, taskmanager an

Re: Flink YARN app terminated before the client receives the result

2020-03-17 Thread DONG, Weike
>> RestServer which then is not able to serve the response to the client. I'm >>> pulling in Aljoscha and Tison who introduced this change. They might be >>> able to verify my theory and propose a solution for it. >>> >>> [1] https://issues.apa

Re: Flink YARN app terminated before the client receives the result

2020-03-12 Thread DONG, Weike
hy the task executor > is killed? If it is killed by Yarn, you might get such info in Yarn > NM/RM logs. > > Best, > Yangze Guo > > Best, > Yangze Guo > > > On Fri, Mar 13, 2020 at 12:31 PM DONG, Weike > wrote: > > > > Hi, > > > > Recently

Flink YARN app terminated before the client receives the result

2020-03-12 Thread DONG, Weike
Hi, Recently I have encountered a strange behavior of Flink on YARN, which is that when I try to cancel a Flink job running in per-job mode on YARN using commands like "cancel -m yarn-cluster -yid application_1559388106022_9412 ed7e2e0ab0a7316c1b65df6047bc6aae" the client happily found and conne

Question on the SQL "GROUPING SETS" and "CUBE" syntax usability

2020-03-09 Thread DONG, Weike
Hi, From the Flink 1.10 official document (https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/sql/queries.html), we could see that GROUPING SETS is only supported in Batch mode. However, we also found that in https://issues.apache.org/jira/browse/FLINK-1