Thank you, Kelly!

On Thu, Nov 21, 2019 at 4:06 PM Kelly Smith <kell...@zillowgroup.com> wrote:
> Hi Piper,
>
> The repro is pretty simple:
>
> - Submit a job with parallelism set higher than YARN has resources to
>   support
>
> What this ends up looking like in the Flink UI is this:
>
> The Job is in a “RUNNING” state, but all of the tasks are in the
> “SCHEDULED” state. The `jobmanager.numRunningJobs` metric that Flink emits
> by default will increase by 1, but none of the tasks actually get scheduled
> on any TM.
>
> What I’m looking for is a way to detect when I am in this state using
> Flink metrics (ideally the count of tasks in each state for better
> observability).
>
> Does that make sense?
>
> Thanks,
> Kelly
>
> *From: *Piper Piper <piperfl...@gmail.com>
> *Date: *Thursday, November 21, 2019 at 12:59 PM
> *To: *Kelly Smith <kell...@zillowgroup.com>
> *Cc: *"user@flink.apache.org" <user@flink.apache.org>
> *Subject: *Re: Metrics for Task States
>
> Hello Kelly,
>
> I thought that Flink scheduler only starts a job if all requested
> containers/TMs are available and allotted to that job.
>
> How can I reproduce your issue on Flink with YARN?
>
> Thank you,
>
> Piper
>
> On Thu, Nov 21, 2019, 1:48 PM Kelly Smith <kell...@zillowgroup.com> wrote:
>
> I’ve been running Flink in production on EMR (YARN) for some time and have
> found the metrics system to be quite useful, but there is one specific case
> where I’m missing a signal for this scenario:
>
> - When a job has been submitted, but YARN does not have enough
>   resources to provide
>
> Observed:
>
> - Job is in RUNNING state
> - All of the tasks for the job are in the (I believe) DEPLOYING state
>
> Is there a way to access these as metrics for monitoring the number of
> tasks in each state for a given job (image below)? The metric I’m currently
> using is the number of running jobs, but it misses this “unhealthy”
> scenario. I realize that I could use application-level metrics (record
> counts, etc) as a proxy for this, but I’m working on providing a streaming
> platform and need all of my monitoring to be application agnostic.
>
> [image: cid:image001.png@01D5A059.19DB3EB0]
>
> I can’t find anything on it in the documentation.
>
> Thanks,
> Kelly
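
For illustration only, here is a minimal Python sketch of the check described in the thread: a job that reports RUNNING while none of its tasks are running and some are still waiting. It assumes the per-state task counts are already available from somewhere (one possibility, not confirmed in this thread, is the per-job `tasks` object in the Flink REST API's job overview). The function name, the constant, and the shape of the input dict are hypothetical.

```python
from typing import Mapping

# Task states treated as "not yet running" in this sketch (an assumption;
# the thread mentions SCHEDULED and, tentatively, DEPLOYING).
PENDING_STATES = ("CREATED", "SCHEDULED", "DEPLOYING")


def is_stuck_waiting_for_resources(job_state: str,
                                   task_counts: Mapping[str, int]) -> bool:
    """Return True if the job is RUNNING but no task is running yet.

    task_counts maps a task state name (any letter case) to the number of
    tasks in that state. A "total" entry, if present, is ignored.
    """
    counts = {
        state.upper(): n
        for state, n in task_counts.items()
        if state.upper() != "TOTAL"
    }
    pending = sum(counts.get(state, 0) for state in PENDING_STATES)
    running = counts.get("RUNNING", 0)
    return job_state.upper() == "RUNNING" and running == 0 and pending > 0


if __name__ == "__main__":
    # The scenario from the thread: job RUNNING, every task SCHEDULED.
    print(is_stuck_waiting_for_resources("RUNNING", {"SCHEDULED": 8}))
```

A platform-level monitor could evaluate this per job on a timer and alert only if the condition holds for longer than a normal startup takes, since a healthy job also passes through these states briefly.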