Re: [DISCUSS] FLIP-118: Improve Flink’s ID system

Steven Wu Mon, 30 Mar 2020 09:26:12 -0700

+1 on allowing user defined resourceId for taskmanager
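
To make the idea concrete, here is a minimal sketch of the lookup Yang describes below: use the "RESOURCE_ID" environment variable if it is set, otherwise fall back to a random UUID as today. The class and method names are hypothetical and only for illustration; this is not Flink's actual code.

```java
import java.util.Map;
import java.util.UUID;

// Hypothetical helper, not part of Flink: picks a resource ID for a taskmanager.
final class ResourceIdResolver {

    // Environment variable name as suggested in this thread.
    static final String ENV_KEY = "RESOURCE_ID";

    // Returns the user-defined ID if present and non-blank, else a random UUID.
    static String resolve(Map<String, String> env) {
        String fromEnv = env.get(ENV_KEY);
        if (fromEnv != null && !fromEnv.trim().isEmpty()) {
            return fromEnv.trim();
        }
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // Reads the real process environment when run directly.
        System.out.println(resolve(System.getenv()));
    }
}
```

On K8s, the variable could be populated from the pod's metadata.name via the Downward API, so the resolved ID would match the pod name.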

On Sun, Mar 29, 2020 at 7:24 PM Yang Wang <danrtsey...@gmail.com> wrote:


> Hi Konstantin,
>
> I think it is a good idea. Currently, our users also report a similar issue
> with the resourceId of a standalone cluster. When we start a standalone
> cluster now, the `TaskManagerRunner` always generates a UUID for the
> resourceId. It is used to register with the jobmanager, and it is not
> convenient to match with the real taskmanager, especially in a container
> environment.
>
> I think a possible solution is to support a user-defined resourceId.
> We could get it from the environment. For standalone on K8s, we could set
> the "RESOURCE_ID" env to the pod name so that it is easier to match the
> taskmanager with the K8s pod.
>
> Moreover, I am afraid we cannot set the pod name to match the resourceId.
> I think you could set the "deployment.meta.name", since the pod name is
> generated by K8s in the pattern {deployment.meta.name}-{rc.uuid}-{uuid}.
> Conversely, we will set the resourceId to the pod name.
>
>
> Best,
> Yang
>
> On Sun, Mar 29, 2020 at 8:06 PM Konstantin Knauf <konstan...@ververica.com> wrote:
>
> > Hi Yangze, Hi Till,
> >
> > thank you for working on this topic. I believe it will make debugging
> > large Apache Flink deployments much more feasible.
> >
> > I was wondering whether it would make sense to allow the user to specify
> > the Resource ID in standalone setups? For example, many users still
> > implicitly use standalone clusters on Kubernetes (the native support is
> > still experimental) and in these cases it would be interesting to also
> > set the PodName as the ResourceID. What do you think?
> >
> > Cheers,
> >
> > Konstantin
> >
> > On Thu, Mar 26, 2020 at 6:49 PM Till Rohrmann <trohrm...@apache.org>
> > wrote:
> >
> > > Hi Yangze,
> > >
> > > thanks for creating this FLIP. I think it is a very good improvement
> > > that helps our users and ourselves better understand what's going on in
> > > Flink.
> > >
> > > Creating the ResourceIDs with host information/pod name is a good idea.
> > >
> > > Also deriving ExecutionGraph IDs from their superset ID is a good idea.
> > >
> > > The InstanceID is used for fencing purposes. I would not make it a
> > > composition of the ResourceID + a monotonically increasing number. The
> > > problem is that in case of an RM failure the InstanceIDs would start
> > > from 0 again and this could lead to collisions.
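
The collision Till describes can be illustrated with a small sketch. It assumes the counter lives only in the RM's memory; `CounterInstanceIds` is a hypothetical stand-in, not a Flink class.

```java
import java.util.UUID;

// Hypothetical stand-in for an RM that builds IDs as ResourceID + counter.
final class CounterInstanceIds {
    private long next = 0; // in-memory only, so it is lost on RM failover

    String nextId(String resourceId) {
        return resourceId + "-" + next++;
    }
}

final class InstanceIdCollisionDemo {
    public static void main(String[] args) {
        // ID handed out before the RM fails.
        String oldId = new CounterInstanceIds().nextId("tm-1");
        // A fresh RM after failover starts counting from 0 again.
        String newId = new CounterInstanceIds().nextId("tm-1");
        System.out.println(oldId + " vs " + newId); // prints "tm-1-0 vs tm-1-0"

        // A randomly generated ID does not depend on state that survives failover.
        System.out.println(UUID.randomUUID());
    }
}
```

Because both registrations get the same ID, a stale message from the old registration could no longer be fenced off from the new one.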
> > >
> > > Logging more information on how the different runtime IDs are
> > > correlated is also a good idea.
> > >
> > > Two other ideas for simplifying the ids are the following:
> > >
> > > * The SlotRequestID was introduced because the SlotPool was a separate
> > > RpcEndpoint a while ago. With this no longer being the case I think we
> > > could remove the SlotRequestID and replace it with the AllocationID.
> > > * Instead of creating new SlotRequestIDs for multi task slots one could
> > > derive them from the SlotRequestID used for requesting the underlying
> > > AllocatedSlot.
> > >
> > > Given that the slot sharing logic will most likely be reworked with the
> > > pipelined region scheduling, we might be able to resolve these two
> > > points as part of the pipelined region scheduling effort.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Thu, Mar 26, 2020 at 10:51 AM Yangze Guo <karma...@gmail.com>
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > We would like to start a discussion thread on "FLIP-118: Improve
> > > > Flink’s ID system"[1].
> > > >
> > > > This FLIP mainly discusses the following issues, aiming to enhance
> > > > the readability of IDs in logs and help users debug failures:
> > > >
> > > > - Enhance the readability of the string literals of IDs. Most of them
> > > > are hashcodes, e.g. ExecutionAttemptID, which do not provide much
> > > > meaningful information and are hard to recognize and compare for
> > > > users.
> > > > - Log the ID’s lineage information to make debugging more convenient.
> > > > Currently, the log does not always show the lineage information
> > > > between IDs. Finding out relationships between entities identified by
> > > > given IDs is a common demand, e.g., which slot (AllocationID) is
> > > > assigned to satisfy which slot request (SlotRequestID). In the absence
> > > > of such lineage information, it is impossible to track the end-to-end
> > > > lifecycle of an Execution or a Task, which makes debugging
> > > > difficult.
> > > >
> > > > Key changes proposed in the FLIP are as follows:
> > > >
> > > > - Add location information to distributed components
> > > > - Add topology information to graph components
> > > > - Log the ID’s lineage information
> > > > - Expose the identifiers of distributed components to the user
> > > >
> > > > Please find more details in the FLIP wiki document [1]. Looking
> > > > forward to your feedback.
> > > >
> > > > [1]
> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=148643521
> > > >
> > > > Best,
> > > > Yangze Guo
> > > >
> > >
> >
> >
> > --
> >
> > Konstantin Knauf | Head of Product
> >
> >
>
