Re: [DISCUSS] FLIP-108: Add GPU support in Flink

Yangze Guo Wed, 01 Apr 2020 02:15:39 -0700

Thank you all for your participation! I'll start voting for this FLIP.

Best,
Yangze Guo


On Wed, Apr 1, 2020 at 4:55 PM Stephan Ewen <[email protected]> wrote:
>
> Sounds good!
>
> On Tue, Mar 31, 2020 at 4:32 AM Yangze Guo <[email protected]> wrote:
>
> > Hi everyone,
> > I've updated the FLIP accordingly. The key change is replacing the two
> > resource allocation interfaces with config options.
> >
> > If there are no further comments, I would like to start a voting
> > thread by tomorrow.
> >
> > Best,
> > Yangze Guo
> >
> > On Mon, Mar 30, 2020 at 9:15 PM Till Rohrmann <[email protected]>
> > wrote:
> > >
> > > If there is no need for the ExternalResourceDriver on the RM side, then
> > > it is always a good idea to keep it simple and not introduce it. One can
> > > always change things once one realizes that there is a need for it.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Mon, Mar 30, 2020 at 12:00 PM Yangze Guo <[email protected]> wrote:
> > >
> > > > Hi @Till, @Xintong
> > > >
> > > > I think that, even without the credential concerns, replacing the
> > > > interfaces with configuration options is a good idea.
> > > > - Currently, I don't see any external resource that is incompatible
> > > > with this mechanism.
> > > > - It relieves users of the burden of implementing a plugin themselves.
> > > > WDYT?
> > > >
> > > > Best,
> > > > Yangze Guo
> > > >
> > > > On Mon, Mar 30, 2020 at 5:44 PM Xintong Song <[email protected]>
> > > > wrote:
> > > > >
> > > > > I also agree that the pluggable ExternalResourceDriver should be
> > > > > loaded by the cluster class loader. Although the plugin might be
> > > > > implemented by users, external resources (as part of task executor
> > > > > resources) should be cluster configurations, unlike job-level user
> > > > > code such as UDFs, because the task executors belong to the cluster
> > > > > rather than to jobs.
> > > > >
> > > > >
> > > > > IIUC, the concern Stephan raised is about the potential credential
> > > > > problem when executing user code on the RM with the cluster class
> > > > > loader. The concern makes sense to me, and I think what Yangze
> > > > > suggested is a good approach to preventing such credential problems.
> > > > > The only reason we tried to execute user code (i.e.
> > > > > getKubernetes/YarnExternalResource) on the RM was that we need to set
> > > > > these key-value pairs on pod/container requests. By replacing the
> > > > > interfaces getKubernetes/YarnExternalResource with the configuration
> > > > > options 'external-resource.{resourceName}.yarn/kubernetes.key/amount',
> > > > > we can still fulfill that purpose, without the credential risks.
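
A minimal sketch of how the RM side could derive the pod/container key-value pair from configuration alone, assuming the option names `external-resource.{resourceName}.kubernetes.key`, `external-resource.{resourceName}.yarn.key` and `external-resource.{resourceName}.amount` discussed in this thread. The class `ExternalResourceConfigReader`, its methods, and the plain `Map` standing in for the Flink configuration are hypothetical and only for illustration:

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of Flink. */
final class ExternalResourceConfigReader {
    private ExternalResourceConfigReader() {}

    /** Builds the option name holding the platform-specific resource key. */
    static String keyOption(String resourceName, String platform) {
        return "external-resource." + resourceName + "." + platform + ".key";
    }

    /** Builds the option name holding the requested amount. */
    static String amountOption(String resourceName) {
        return "external-resource." + resourceName + ".amount";
    }

    /**
     * Returns the (key, amount) pair to set on the pod/container request,
     * or empty if either option is missing for the given platform.
     */
    static Optional<Map.Entry<String, Long>> resolve(
            Map<String, String> conf, String resourceName, String platform) {
        String key = conf.get(keyOption(resourceName, platform));
        String amount = conf.get(amountOption(resourceName));
        if (key == null || amount == null) {
            return Optional.empty();
        }
        return Optional.of(
                new AbstractMap.SimpleImmutableEntry<>(key, Long.parseLong(amount)));
    }
}
```

With `external-resource.gpu.amount` set to `2` and `external-resource.gpu.kubernetes.key` set to `nvidia.com/gpu`, `resolve(conf, "gpu", "kubernetes")` yields the pair `nvidia.com/gpu` / `2`, and no user code has to run on the RM.
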
> > > > >
> > > > >
> > > > > Thank you~
> > > > >
> > > > > Xintong Song
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Mar 30, 2020 at 5:17 PM Till Rohrmann <[email protected]> wrote:
> > > > >
> > > > > > At the moment the RM does not have a user code class loader and I
> > > > > > agree with Stephan that it should stay like this. This, however,
> > > > > > does not mean that we cannot support pluggable components in the RM.
> > > > > > As long as the plugins are on the system's class path, it should be
> > > > > > fine for the RM to load them. For example, we could add external
> > > > > > resources via Flink's plugin mechanism or something similar.
> > > > > >
> > > > > > A very simple implementation of such an ExternalResourceDriver
> > > > > > could be a class which simply returns what is written in the
> > > > > > flink-conf.yaml under a given key.
> > > > > >
> > > > > > Cheers,
> > > > > > Till
> > > > > >
> > > > > > On Mon, Mar 30, 2020 at 5:39 AM Yangze Guo <[email protected]> wrote:
> > > > > >
> > > > > > > Hi, Stephan,
> > > > > > >
> > > > > > > I see your concern and I totally agree with you.
> > > > > > >
> > > > > > > The interface on the RM side is now `Map<String key, String/Long
> > > > > > > value> getYarn/KubernetesExternalResource()`. The only valid
> > > > > > > information the RM gets from it is the configuration key of that
> > > > > > > external resource in Yarn/K8s. The "String/Long value" would be the
> > > > > > > same as external-resource.{resourceName}.amount.
> > > > > > > So, I think it makes sense to replace these two interfaces with two
> > > > > > > configs, i.e. external-resource.{resourceName}.yarn/kubernetes.key.
> > > > > > > We may lose some extensibility, but AFAIK it could work with common
> > > > > > > external resources like GPU and FPGA. WDYT?
> > > > > > >
> > > > > > > Best,
> > > > > > > Yangze Guo
> > > > > > >
> > > > > > > On Fri, Mar 27, 2020 at 7:59 PM Stephan Ewen <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Maybe one final comment: It is probably not an issue, but let's
> > > > > > > > try and keep user code (via user code classloader) out of the
> > > > > > > > ResourceManager, if possible.
> > > > > > > >
> > > > > > > > As background:
> > > > > > > >
> > > > > > > > There were thoughts in the past to support setups where the RM
> > > > > > > > must run with "superuser" credentials, but we cannot run JM/TM
> > > > > > > > with these credentials, as the user code might access them
> > > > > > > > otherwise.
> > > > > > > > This is actually possible today: you can run the RM in a
> > > > > > > > different JVM or in a different container, and give it more
> > > > > > > > credentials than JMs / TMs. But for this to be feasible, we cannot
> > > > > > > > allow any user-defined code to be in the JVM, because that
> > > > > > > > instantaneously breaks the isolation of credentials.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Mar 27, 2020 at 4:01 AM Yangze Guo <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Thanks for the feedback, @Till and @Xintong.
> > > > > > > > >
> > > > > > > > > Regarding separating the interface, I'm also +1 on it.
> > > > > > > > >
> > > > > > > > > Regarding the resource allocation interface, true, it's
> > > > > > > > > dangerous to give too much access to user code. Changing the
> > > > > > > > > return type to Map<String key, String/Long value> makes sense to
> > > > > > > > > me. AFAIK, it is compatible with all the first-party supported
> > > > > > > > > resources for Yarn/Kubernetes. It could also free us from the
> > > > > > > > > potential dependency issue.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Yangze Guo
> > > > > > > > >
> > > > > > > > > On Fri, Mar 27, 2020 at 10:42 AM Xintong Song <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > Thanks for updating the FLIP, Yangze.
> > > > > > > > > >
> > > > > > > > > > I agree with Till that we probably want to separate the
> > > > > > > > > > K8s/Yarn decorator calls. Users can still configure one driver
> > > > > > > > > > class, and we can use `instanceof` to check whether the driver
> > > > > > > > > > implements the K8s/Yarn-specific interfaces.
> > > > > > > > > >
> > > > > > > > > > Moreover, I'm not sure about exposing the entire
> > > > > > > > > > `ContainerRequest` / `Pod` (`AbstractKubernetesStepDecorator`
> > > > > > > > > > directly manipulates the `Pod`) to user code. It gives user
> > > > > > > > > > code more access than is needed for defining an external
> > > > > > > > > > resource, which might cause problems. Instead, I would suggest
> > > > > > > > > > having an interface like `Map<String key, String value>
> > > > > > > > > > getYarn/KubernetesExternalResource()` and assembling the
> > > > > > > > > > entries into the `ContainerRequest` / `Pod` in
> > > > > > > > > > Yarn/KubernetesResourceManager.
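
A rough sketch of the two ideas above combined: deployment-specific interfaces that return only key-value pairs, and an `instanceof` check on the RM side. All type and method names here (`KubernetesExternalResourceProvider`, `YarnExternalResourceProvider`, `GpuDriver`, `ResourceRequestAssembler`) are hypothetical stand-ins, not the FLIP's API, and the returned maps stand in for what would be assembled into the `ContainerRequest` / `Pod`:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical base interface; deliberately free of K8s/Yarn types.
interface ExternalResourceDriver {}

// Hypothetical deployment-specific interfaces, implemented only where supported.
interface KubernetesExternalResourceProvider {
    Map<String, String> getKubernetesExternalResource();
}

interface YarnExternalResourceProvider {
    Map<String, String> getYarnExternalResource();
}

/** Example driver that supports Kubernetes only. */
final class GpuDriver implements ExternalResourceDriver, KubernetesExternalResourceProvider {
    @Override
    public Map<String, String> getKubernetesExternalResource() {
        return Collections.singletonMap("nvidia.com/gpu", "1");
    }
}

/** RM-side helper: collects key-value pairs without handing Pod/ContainerRequest to the driver. */
final class ResourceRequestAssembler {
    private ResourceRequestAssembler() {}

    static Map<String, String> forKubernetes(ExternalResourceDriver driver) {
        if (driver instanceof KubernetesExternalResourceProvider) {
            return new HashMap<>(
                    ((KubernetesExternalResourceProvider) driver).getKubernetesExternalResource());
        }
        return Collections.emptyMap(); // driver does not support Kubernetes
    }

    static Map<String, String> forYarn(ExternalResourceDriver driver) {
        if (driver instanceof YarnExternalResourceProvider) {
            return new HashMap<>(
                    ((YarnExternalResourceProvider) driver).getYarnExternalResource());
        }
        return Collections.emptyMap(); // driver does not support Yarn
    }
}
```

Because the driver interfaces only mention `Map<String, String>`, a driver like this would not need Hadoop or Kubernetes client classes on its classpath.
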
> > > > > > > > > >
> > > > > > > > > > Thank you~
> > > > > > > > > >
> > > > > > > > > > Xintong Song
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Fri, Mar 27, 2020 at 1:10 AM Till Rohrmann <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi everyone,
> > > > > > > > > > >
> > > > > > > > > > > I'm a bit late to the party. I think the current proposal
> > > > > > > > > > > looks good.
> > > > > > > > > > >
> > > > > > > > > > > Concerning the ExternalResourceDriver interface defined in the
> > > > > > > > > > > FLIP [1], I would suggest not including the decorator calls
> > > > > > > > > > > for Kubernetes and Yarn in the base interface. Instead, I
> > > > > > > > > > > would suggest segregating the deployment-specific decorator
> > > > > > > > > > > calls into separate interfaces. That way an
> > > > > > > > > > > ExternalResourceDriver does not have to support all
> > > > > > > > > > > deployments from the very beginning. Moreover, some resources
> > > > > > > > > > > might not be supported by a specific deployment target, and
> > > > > > > > > > > the natural way to express this would be to not implement the
> > > > > > > > > > > respective deployment-specific interface.
> > > > > > > > > > >
> > > > > > > > > > > Moreover, having void
> > > > > > > > > > > addExternalResourceToRequest(AMRMClient.ContainerRequest
> > > > > > > > > > > containerRequest) in the ExternalResourceDriver interface
> > > > > > > > > > > would require Hadoop on Flink's classpath whenever the
> > > > > > > > > > > external resource driver is being used.
> > > > > > > > > > >
> > > > > > > > > > > [1]
> > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-108%3A+Add+GPU+support+in+Flink
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Till
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Mar 26, 2020 at 12:45 PM Stephan Ewen <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Nice, thanks a lot!
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Mar 26, 2020 at 10:21 AM Yangze Guo <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the suggestions, @Stephan, @Becket and @Xintong.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I've updated the FLIP accordingly. I did not add a
> > > > > > > > > > > > > ResourceInfoProvider. Instead, I introduced the
> > > > > > > > > > > > > ExternalResourceDriver, which takes responsibility for all
> > > > > > > > > > > > > relevant operations on both the RM and TM sides.
> > > > > > > > > > > > > After rethinking the decoupling of external resource
> > > > > > > > > > > > > management from the TaskExecutor, I think we could do the
> > > > > > > > > > > > > same thing on the ResourceManager side. We do not need to add
> > > > > > > > > > > > > specific allocation logic to the ResourceManager each time
> > > > > > > > > > > > > we add a specific external resource.
> > > > > > > > > > > > > - For Yarn, we need the ExternalResourceDriver to edit the
> > > > > > > > > > > > > containerRequest.
> > > > > > > > > > > > > - For Kubernetes, the ExternalResourceDriver could provide a
> > > > > > > > > > > > > decorator for the TM pod.
> > > > > > > > > > > > >
> > > > > > > > > > > > > In this way, just like MetricReporter, we allow users to
> > > > > > > > > > > > > define their custom ExternalResourceDriver. It is more
> > > > > > > > > > > > > extensible and fits the separation of concerns. For more
> > > > > > > > > > > > > details, please take a look at [1].
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1]
> > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-108%3A+Add+GPU+support+in+Flink
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best,
> > > > > > > > > > > > > Yangze Guo
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Mar 25, 2020 at 7:32 PM Stephan Ewen <[email protected]> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This sounds good to go ahead from my side.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I like the approach that Becket suggested - in that case the core
> > > > > > > > > > > > > > abstraction that everyone would need to understand would be
> > > > > > > > > > > > > > "external resource allocation" and the "ResourceInfoProvider", and
> > > > > > > > > > > > > > the GPU specific code would be a specific implementation only
> > > > > > > > > > > > > > known to that component that allocates the external resource. That
> > > > > > > > > > > > > > fits the separation of concerns well.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I also understand that it should not be over-engineered in the
> > > > > > > > > > > > > > first version, so some simplification makes sense, and then
> > > > > > > > > > > > > > gradually expand from there.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > So +1 to go ahead with what was suggested above (Xintong / Becket)
> > > > > > > > > > > > > > from my side.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, Mar 23, 2020 at 6:55 AM Xintong Song <[email protected]> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for the comments, Stephan & Becket.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > @Stephan
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I see your concern, and I completely agree with you that we should
> > > > > > > > > > > > > > > first think about the "library" / "plugin" / "extension" style if
> > > > > > > > > > > > > > > possible.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > If GPUs are sliced and assigned during scheduling, there may be
> > > > > > > > > > > > > > > > reason, although it looks that it would belong to the slot then. Is
> > > > > > > > > > > > > > > > that what we are doing here?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > In the current proposal, we do not have the GPUs sliced and assigned
> > > > > > > > > > > > > > > to slots, because it could be problematic without dynamic slot
> > > > > > > > > > > > > > > allocation. E.g., the number of GPUs might not be evenly divisible
> > > > > > > > > > > > > > > by the number of slots.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I think it makes sense to eventually have the GPUs assigned to
> > > > > > > > > > > > > > > slots. Even then, we might still need a TM-level GPUManager (or
> > > > > > > > > > > > > > > ResourceProvider, as Becket suggested). For memory, in each slot we
> > > > > > > > > > > > > > > can simply request the amount of memory, leaving it to the JVM / OS
> > > > > > > > > > > > > > > to decide which memory (address) should be assigned. For GPUs, and
> > > > > > > > > > > > > > > potentially other resources like FPGAs, we need to explicitly
> > > > > > > > > > > > > > > specify which GPU (index) should be used. Therefore, we need some
> > > > > > > > > > > > > > > component at the TM level to coordinate which slot uses which GPU.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > IMO, unless we say Flink will not support slot-level GPU slicing, at
> > > > > > > > > > > > > > > least in the foreseeable future, I don't see a good way to avoid
> > > > > > > > > > > > > > > touching the TM core. To that end, I think Becket's suggestion
> > > > > > > > > > > > > > > points in a good direction: it supports more features (GPU, FPGA,
> > > > > > > > > > > > > > > etc.) with less coupling to the TM core (which only needs to
> > > > > > > > > > > > > > > understand the general interfaces). The detailed implementation for
> > > > > > > > > > > > > > > specific resource types can even be encapsulated as a library.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > @Becket
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for sharing your thoughts on the final state. Regardless of
> > > > > > > > > > > > > > > the details of how the interfaces should look, I think this is a
> > > > > > > > > > > > > > > really good abstraction for supporting general resource types.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I'd like to further clarify that the following three things are all
> > > > > > > > > > > > > > > that the "Flink core" needs to understand.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >    - The *amount* of resource, for scheduling. Actually, we already
> > > > > > > > > > > > > > >    have the Resource class in ResourceProfile and ResourceSpec for
> > > > > > > > > > > > > > >    extended resources. It's just not really used.
> > > > > > > > > > > > > > >    - The *info* that Flink provides to the operators / user code.
> > > > > > > > > > > > > > >    - The *provider*, which generates the info based on the amount.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The "core" does not need to understand the specific implementation
> > > > > > > > > > > > > > > details of the above three. They can even be implemented in a
> > > > > > > > > > > > > > > 3rd-party library, similar to how we allow users to define their
> > > > > > > > > > > > > > > custom MetricReporter.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Mon, Mar 23, 2020 at 8:45 AM Becket Qin <[email protected]> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for the comment, Stephan.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >   - If everything becomes a "core feature", it will make the project
> > > > > > > > > > > > > > > > > hard to develop in the future. Thinking "library" / "plugin" /
> > > > > > > > > > > > > > > > > "extension" style where possible helps.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Completely agree. It is much more important to design a mechanism
> > > > > > > > > > > > > > > > than to focus on a specific case. Here is what I am thinking of to
> > > > > > > > > > > > > > > > fully support custom resource management:
> > > > > > > > > > > > > > > > 1. On the JM / RM side, use ResourceProfile and ResourceSpec to
> > > > > > > > > > > > > > > > define the resource and the amount required. They will be used to
> > > > > > > > > > > > > > > > find suitable TM slots to run the tasks. At this point, the
> > > > > > > > > > > > > > > > resources are only measured by amount, i.e. they do not have
> > > > > > > > > > > > > > > > individual IDs.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 2. On the TM side, have something like a *"ResourceInfoProvider"* to
> > > > > > > > > > > > > > > > identify and provide the detailed information of the individual
> > > > > > > > > > > > > > > > resource, e.g. the GPU ID. It is important because the operator may
> > > > > > > > > > > > > > > > have to explicitly interact with the physical resource it uses. The
> > > > > > > > > > > > > > > > ResourceInfoProvider might look something like the following.
> > > > > > > > > > > > > > > > interface ResourceInfoProvider<INFO> {
> > > > > > > > > > > > > > > >     Map<AbstractID, INFO> retrieveResourceInfo(OperatorId opId,
> > > > > > > > > > > > > > > >             ResourceProfile resourceProfile);
> > > > > > > > > > > > > > > > }
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > - There could be several "*ResourceInfoProvider*"s configured on the
> > > > > > > > > > > > > > > > TM to retrieve the information for different resources.
> > > > > > > > > > > > > > > > - The TM will be responsible for assigning those individual
> > > > > > > > > > > > > > > > resources to each operator according to their requested amount.
> > > > > > > > > > > > > > > > - The operators will be able to get the ResourceInfo from their
> > > > > > > > > > > > > > > > RuntimeContext.
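
To make the TM-side idea concrete, here is a simplified sketch of a GPU provider that hands out distinct GPU indexes according to the requested amount. It is an illustration under assumptions, not the proposed interface: the operator ID and map key are plain `String`s and the amount is an `int` instead of the `OperatorId` / `AbstractID` / `ResourceProfile` types mentioned above, and `GpuInfo` and `GpuInfoProvider` are hypothetical names:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for the interface discussed above (types reduced to String/int).
interface ResourceInfoProvider<INFO> {
    Map<String, INFO> retrieveResourceInfo(String operatorId, int amount);
}

/** Hypothetical info object: just the index of one GPU on the machine/container. */
final class GpuInfo {
    final int index;

    GpuInfo(int index) {
        this.index = index;
    }
}

/** Hands out each GPU index at most once, so operators never share a GPU. */
final class GpuInfoProvider implements ResourceInfoProvider<GpuInfo> {
    private final Deque<Integer> freeIndexes = new ArrayDeque<>();

    GpuInfoProvider(int gpuCount) {
        for (int i = 0; i < gpuCount; i++) {
            freeIndexes.add(i);
        }
    }

    @Override
    public synchronized Map<String, GpuInfo> retrieveResourceInfo(String operatorId, int amount) {
        if (amount > freeIndexes.size()) {
            throw new IllegalStateException(
                    "Requested " + amount + " GPUs but only " + freeIndexes.size() + " are free");
        }
        Map<String, GpuInfo> assigned = new LinkedHashMap<>();
        for (int i = 0; i < amount; i++) {
            int index = freeIndexes.poll(); // lowest free index first
            assigned.put(operatorId + "-gpu-" + index, new GpuInfo(index));
        }
        return assigned;
    }
}
```

The core would only need to know the interface and the amount; which GPU index an operator ends up with stays inside the provider.
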
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > If we agree this is a reasonable final state, we can adapt the
> > > > > > > > > > > > > > > > current FLIP to it. In fact, it does not sound like a big change to
> > > > > > > > > > > > > > > > me. All the proposed configuration options can stay as is; it is
> > > > > > > > > > > > > > > > just that Flink itself won't care about them. Instead, a
> > > > > > > > > > > > > > > > GPUInfoProvider implementing the ResourceInfoProvider will use them.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Jiangjie (Becket) Qin
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Mon, Mar 23, 2020 at 1:47 AM Stephan Ewen <[email protected]> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi all!
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > The main point I wanted to throw into the discussion is the
> > > > > > > > > > > > > > > > > following:
> > > > > > > > > > > > > > > > >   - With more and more use cases, more and more tools go into Flink
> > > > > > > > > > > > > > > > >   - If everything becomes a "core feature", it will make the project
> > > > > > > > > > > > > > > > > hard to develop in the future. Thinking "library" / "plugin" /
> > > > > > > > > > > > > > > > > "extension" style where possible helps.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >   - A good thought experiment is always: How many future developers
> > > > > > > > > > > > > > > > > have to interact with this code (and possibly understand it
> > > > > > > > > > > > > > > > > partially), even if the features they touch have nothing to do with
> > > > > > > > > > > > > > > > > GPU support? If many contributors to unrelated features will have to
> > > > > > > > > > > > > > > > > touch it and understand it, then let's think if there is a different
> > > > > > > > > > > > > > > > > solution. Maybe there is not, but then we should be sure why.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >   - That led me to raising this issue: If the GPU manager becomes a
> > > > > > > > > > > > > > > > > core service in the TaskManager, Environment, RuntimeContext, etc.,
> > > > > > > > > > > > > > > > > then everyone developing TM and streaming tasks needs to understand
> > > > > > > > > > > > > > > > > the GPU manager. That seems oddly specific, is my impression.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Access to configuration seems not the right reason to do that. We
> > > > > > > > > > > > > > > > > should expose the Flink configuration from the RuntimeContext
> > > > > > > > > > > > > > > > > anyways.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > If GPUs are sliced and assigned during scheduling, there may be
> > > > > > > > > > > > > > > > > reason, although it looks that it would belong to the slot then. Is
> > > > > > > > > > > > > > > > > that what we are doing here?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > > Stephan
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Fri, Mar 20, 2020 at 2:58 AM Xintong Song <[email protected]> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >  Thanks for the feedback, Becket.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > IMO, eventually an operator should only see info of GPUs that are
> > > > > > > > > > > > > > > > > > dedicated to it, instead of all GPUs on the machine/container as in
> > > > > > > > > > > > > > > > > > the current design. It does not make sense to let the user who
> > > > > > > > > > > > > > > > > > writes a UDF worry about coordination among multiple operators
> > > > > > > > > > > > > > > > > > running on the same machine. And if we want to limit the GPU info an
> > > > > > > > > > > > > > > > > > operator sees, we should not let the operator instantiate the
> > > > > > > > > > > > > > > > > > GPUManager, which means we have to expose something through the
> > > > > > > > > > > > > > > > > > runtime context, either GPU info or some kind of limited access to
> > > > > > > > > > > > > > > > > > the GPUManager.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Thu, Mar 19, 2020 at 5:48 PM Becket Qin <[email protected]> wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > It probably makes sense for us to first agree on the final state.
> > > > > > > > > > > > > > > > > > > More specifically, will the resource info be exposed through the
> > > > > > > > > > > > > > > > > > > runtime context eventually?
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > If that is the final state and we have a seamless migration story
> > > > > > > > > > > > > > > > > > > from this FLIP to that final state, personally I think it is OK to
> > > > > > > > > > > > > > > > > > > expose the GPU info in the runtime context.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Jiangjie (Becket) Qin
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > On Mon, Mar 16, 2020 at 11:21 AM Xintong Song <[email protected]> wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > @Yangze,
> > > > > > > > > > > > > > > > > > > > > I think what Stephan means (@Stephan, please correct me if I'm wrong) is
> > > > > > > > > > > > > > > > > > > > > that, we might not need to hold and maintain the GPUManager as a service
> > > > > > > > > > > > > > > > > > > > > in TaskManagerServices or RuntimeContext. An alternative is to create /
> > > > > > > > > > > > > > > > > > > > > retrieve the GPUManager only in the operators that need it, e.g., with a
> > > > > > > > > > > > > > > > > > > > > static method `GPUManager.get()`.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > @Stephan,
> > > > > > > > > > > > > > > > > > > > > I agree with you on excluding GPUManager from TaskManagerServices.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >    - For the first step, where we provide unified TM-level GPU information
> > > > > > > > > > > > > > > > > > > > >    to all operators, it should be fine to have operators access /
> > > > > > > > > > > > > > > > > > > > >    lazily initialize GPUManager by themselves.
> > > > > > > > > > > > > > > > > > > > >    - In the future, we might have some more fine-grained GPU management,
> > > > > > > > > > > > > > > > > > > > >    where we need to maintain GPUManager as a service and put GPU info in
> > > > > > > > > > > > > > > > > > > > >    slot profiles. But at least for now it's not necessary to introduce
> > > > > > > > > > > > > > > > > > > > >    such complexity.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > However, I have some concerns about excluding GPUManager from
> > > > > > > > > > > > > > > > > > > > > RuntimeContext and letting operators access it directly.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >    - Configurations needed for creating the GPUManager are not always
> > > > > > > > > > > > > > > > > > > > >    available to operators.
> > > > > > > > > > > > > > > > > > > > >    - If later we want to have fine-grained control over GPU (e.g.,
> > > > > > > > > > > > > > > > > > > > >    operators in each slot can only see GPUs reserved for that slot), the
> > > > > > > > > > > > > > > > > > > > >    approach cannot be easily extended.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > I would suggest wrapping the GPUManager behind RuntimeContext and only
> > > > > > > > > > > > > > > > > > > > > exposing the GPUInfo to users. For now, we can declare a method
> > > > > > > > > > > > > > > > > > > > > `getGPUInfo()` in RuntimeContext, with a default definition that calls
> > > > > > > > > > > > > > > > > > > > > `GPUManager.get()` to get the lazily-created GPUManager. If later we want
> > > > > > > > > > > > > > > > > > > > > to create / retrieve GPUManager in a different way, we can simply change
> > > > > > > > > > > > > > > > > > > > > how `getGPUInfo` is implemented, without needing to change any public
> > > > > > > > > > > > > > > > > > > > > interfaces.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Xintong Song
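
One way to picture the suggestion above is the minimal Java sketch below. It is not actual Flink code: `GPUInfo`, `GPUManager` and `SketchRuntimeContext` are simplified stand-ins, the hard-coded GPU indexes replace whatever discovery a real manager would perform, and only the names `getGPUInfo()` and `GPUManager.get()` are taken from the discussion.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Read-only GPU description handed to user code (hypothetical shape). */
final class GPUInfo {
    private final List<String> indexes;

    GPUInfo(List<String> indexes) {
        this.indexes = Collections.unmodifiableList(indexes);
    }

    List<String> getIndexes() {
        return indexes;
    }
}

/** Lazily created, process-wide manager; user code never sees it directly. */
final class GPUManager {
    private static GPUManager instance;
    private final GPUInfo info;

    private GPUManager() {
        // Assumption: a real manager would run the configured discovery script here.
        this.info = new GPUInfo(Arrays.asList("0", "1"));
    }

    /** Creates the manager on first use and returns the same instance afterwards. */
    static synchronized GPUManager get() {
        if (instance == null) {
            instance = new GPUManager();
        }
        return instance;
    }

    GPUInfo getGPUInfo() {
        return info;
    }
}

/** Stand-in for RuntimeContext, reduced to the single method under discussion. */
interface SketchRuntimeContext {
    // The default body is the only place that knows how the manager is obtained,
    // so it can change later without touching the public method signature.
    default GPUInfo getGPUInfo() {
        return GPUManager.get().getGPUInfo();
    }
}
```

In this sketch an operator would only ever call `getGPUInfo()` on its context; whether the manager stays a lazily created singleton or later becomes a per-slot service is hidden behind the default method.
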
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > On Sat, Mar 14, 2020 at 10:09 AM Yangze Guo <[email protected]> wrote:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > @Stephan
> > > > > > > > > > > > > > > > > > > > > > Do you mean Minicluster? Yes, it makes sense to share the GPU Manager
> > > > > > > > > > > > > > > > > > > > > > in such a scenario.
> > > > > > > > > > > > > > > > > > > > > > If that's what you worry about, I'm +1 for holding
> > > > > > > > > > > > > > > > > > > > > > GPUManager(ExternalResourceManagers) in TaskExecutor instead of
> > > > > > > > > > > > > > > > > > > > > > TaskManagerServices.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Regarding the RuntimeContext/FunctionContext, it just holds the GPU
> > > > > > > > > > > > > > > > > > > > > > info instead of the GPU Manager. AFAIK, it's the only place we could
> > > > > > > > > > > > > > > > > > > > > > pass GPU info to the RichFunction/UserDefinedFunction.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > > > > > > Yangze Guo
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > On Sat, Mar 14, 2020 at 4:06 AM Isaac Godfried <[email protected]> wrote:
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > ---- On Fri, 13 Mar 2020 15:58:20 +0000 [email protected] wrote ----
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > Can we somehow keep this out of the TaskManager services
> > > > > > > > > > > > > > > > > > > > > > > > I fear that we could not. IMO, the GPUManager(or
> > > > > > > > > > > > > > > > > > > > > > > > ExternalServicesManagers in future) is conceptually one of the task
> > > > > > > > > > > > > > > > > > > > > > > > manager services, just like MemoryManager before 1.10.
> > > > > > > > > > > > > > > > > > > > > > > > - It maintains/holds the GPU resource at TM level and all of the
> > > > > > > > > > > > > > > > > > > > > > > > operators allocate the GPU resources from it. So, it should be
> > > > > > > > > > > > > > > > > > > > > > > > exclusive to a single TaskExecutor.
> > > > > > > > > > > > > > > > > > > > > > > > - We could add a collection called ExternalResourceManagers to hold
> > > > > > > > > > > > > > > > > > > > > > > > all managers of other external resources in the future.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Can you help me understand why this needs the addition in
> > > > > > > > > > > > > > > > > > > > > > > TaskManagerServices or in the RuntimeContext?
> > > > > > > > > > > > > > > > > > > > > > > Are you worried about the case when multiple Task Executors run in the
> > > > > > > > > > > > > > > > > > > > > > > same JVM? That's not common, but wouldn't it actually be good in that
> > > > > > > > > > > > > > > > > > > > > > > case to share the GPU Manager, given that the GPU is shared?
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > > > > > > Stephan
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > ---------------------------
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > What parts need information about
> > > > this?
> > > > > > > > > > > > > > > > > > > > > > > > In this FLIP, operators need the information. Thus, we expose GPU
> > > > > > > > > > > > > > > > > > > > > > > > information to the RuntimeContext/FunctionContext. The slot profile
> > > > > > > > > > > > > > > > > > > > > > > > is not aware of GPU resources, as GPU is a TM-level resource for now.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > Can the GPU Manager be a "self contained" thing that simply takes the
> > > > > > > > > > > > > > > > > > > > > > > > > configuration, and then abstracts everything internally?
> > > > > > > > > > > > > > > > > > > > > > > > Yes, we just pass the path/args of the discovery script and how many
> > > > > > > > > > > > > > > > > > > > > > > > GPUs per TM to it. It takes the responsibility to get the GPU
> > > > > > > > > > > > > > > > > > > > > > > > information and expose it to the RuntimeContext/FunctionContext of
> > > > > > > > > > > > > > > > > > > > > > > > operators. Meanwhile, we'd better not allow operators to directly
> > > > > > > > > > > > > > > > > > > > > > > > access GPUManager; they should get what they need from the Context.
> > > > > > > > > > > > > > > > > > > > > > > > We could then decouple the interface/implementation of GPUManager
> > > > > > > > > > > > > > > > > > > > > > > > from the Public API.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > > > > > > > > Yangze Guo
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > On Fri, Mar 13, 2020 at 7:26 PM Stephan Ewen <[email protected]> wrote:
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > It sounds fine to initially start with GPU specific support and think
> > > > > > > > > > > > > > > > > > > > > > > > > about generalizing this once we better understand the space.
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > About the implementation suggested in FLIP-108:
> > > > > > > > > > > > > > > > > > > > > > > > > - Can we somehow keep this out of the TaskManager services? Anything
> > > > > > > > > > > > > > > > > > > > > > > > > we have to pull through all layers of the TM makes the TM components
> > > > > > > > > > > > > > > > > > > > > > > > > yet more complex and harder to maintain.
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > - What parts need information about this?
> > > > > > > > > > > > > > > > > > > > > > > > > -> do the slot profiles need information about the GPU?
> > > > > > > > > > > > > > > > > > > > > > > > > -> Can the GPU Manager be a "self contained" thing that simply takes
> > > > > > > > > > > > > > > > > > > > > > > > > the configuration, and then abstracts everything internally? Operators
> > > > > > > > > > > > > > > > > > > > > > > > > can access it via "GPUManager.get()" or so?
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Mar 4, 2020 at 4:19 AM Yangze Guo <[email protected]> wrote:
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for all the feedback.
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > @Becket
> > > > > > > > > > > > > > > > > > > > > > > > > > Regarding the WebUI and GPUInfo, you're right, I'll add them to
> > > > > > > > > > > > > > > > > > > > > > > > > > the Public API section.
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > @Stephan @Becket
> > > > > > > > > > > > > > > > > > > > > > > > > > Regarding the general extended resource mechanism, I second
> > > > > > > > > > > > > > > > > > > > > > > > > > Xintong's suggestion.
> > > > > > > > > > > > > > > > > > > > > > > > > > - It's better to leverage ResourceProfile and ResourceSpec once we
> > > > > > > > > > > > > > > > > > > > > > > > > > support fine-grained GPU scheduling. As a first step proposal, I
> > > > > > > > > > > > > > > > > > > > > > > > > > prefer to not include it in the scope of this FLIP.
> > > > > > > > > > > > > > > > > > > > > > > > > > - Regarding the "Extended Resource Manager", if I understand
> > > > > > > > > > > > > > > > > > > > > > > > > > correctly, it is just a code refactoring atm; we could extract the
> > > > > > > > > > > > > > > > > > > > > > > > > > open/close/allocateExtendResources of GPUManager to that
> > > > > > > > > > > > > > > > > > > > > > > > > > interface. If that is the case, +1 to do it during implementation.
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > @Xingbo
> > > > > > > > > > > > > > > > > > > > > > > > > > As Xintong said, we looked into how Spark supports a general
> > > > > > > > > > > > > > > > > > > > > > > > > > "Custom Resource Scheduling" before and decided to introduce a
> > > > > > > > > > > > > > > > > > > > > > > > > > common resource configuration
> > > > > > > > > > > > > > > > > > > > > > > > > > schema(taskmanager.resource.{resourceName}.amount/discovery-script)
> > > > > > > > > > > > > > > > > > > > > > > > > > to make it more extensible. I think the "resource" is a proper
> > > > > > > > > > > > > > > > > > > > > > > > > > level to contain all the configs of extended resources.
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > > > > > > > > > > Yangze Guo
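
For GPUs, the schema `taskmanager.resource.{resourceName}.amount/discovery-script` mentioned above might translate into configuration entries like the fragment below. This is only a sketch of the keys as proposed at this point in the thread: the values and the script path are made up, and the option names may differ from what the final FLIP adopts.

```yaml
# Sketch only: keys follow the schema proposed in this thread;
# the amount and the script path are hypothetical values.
taskmanager.resource.gpu.amount: 2
taskmanager.resource.gpu.discovery-script.path: /opt/flink/scripts/discover-gpus.sh
```
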
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Mar 4, 2020 at 10:48 AM Xingbo Huang <[email protected]> wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks a lot for the FLIP, Yangze.
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > There is no doubt that GPU resource management support will
> > > > > > > > > > > > > > > > > > > > > > > > > > > greatly facilitate the development of AI-related applications by
> > > > > > > > > > > > > > > > > > > > > > > > > > > PyFlink users.
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > I have only one comment about this wiki:
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > Regarding the names of several GPU configurations, I think it is
> > > > > > > > > > > > > > > > > > > > > > > > > > > better to delete the "resource" field to make them consistent
> > > > > > > > > > > > > > > > > > > > > > > > > > > with the names of other resource-related configurations in
> > > > > > > > > > > > > > > > > > > > > > > > > > > TaskManagerOption.
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > e.g. taskmanager.resource.gpu.discovery-script.path ->
> > > > > > > > > > > > > > > > > > > > > > > > > > > taskmanager.gpu.discovery-script.path
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > Xingbo
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > Xintong Song <[email protected]> wrote on Wed, Mar 4, 2020 at 10:39 AM:
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > @Stephan, @Becket,
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Actually, Yangze, Yang and I also had an offline discussion
> > > > > > > > > > > > > > > > > > > > > > > > > > > > about turning the "GPU Support" into a general "Extended
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Resource Support". We believe supporting extended resources
> > > > > > > > > > > > > > > > > > > > > > > > > > > > through a general mechanism is definitely a good and extensible
> > > > > > > > > > > > > > > > > > > > > > > > > > > > way. The reason we propose this FLIP with its scope narrowed
> > > > > > > > > > > > > > > > > > > > > > > > > > > > down to GPU alone is mainly the concern about the extra effort
> > > > > > > > > > > > > > > > > > > > > > > > > > > > and review capacity needed for a general mechanism.
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > To come up with a good design for a general extended resource
> > > > > > > > > > > > > > > > > > > > > > > > > > > > management mechanism, we would need to investigate more on how
> > > > > > > > > > > > > > > > > > > > > > > > > > > > people use different kinds of resources in practice. For GPU, we
> > > > > > > > > > > > > > > > > > > > > > > > > > > > learnt such knowledge from the experts, Becket and his team
> > > > > > > > > > > > > > > > > > > > > > > > > > > > members. But for FPGA, or other potential extended resources, we
> > > > > > > > > > > > > > > > > > > > > > > > > > > > don't have such convenient information sources, so the
> > > > > > > > > > > > > > > > > > > > > > > > > > > > investigation would require more effort, which I tend to think
> > > > > > > > > > > > > > > > > > > > > > > > > > > > is not necessary atm.
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > On the other hand, we also looked into how Spark supports a
> > > > > > > > > > > > > > > > > > > > > > > > > > > > general "Custom Resource Scheduling". Assuming we want to have a
> > > > > > > > > > > > > > > > > > > > > > > > > > > > similar general extended resource mechanism in the future, we
> > > > > > > > > > > > > > > > > > > > > > > > > > > > believe that the current GPU support design can be easily
> > > > > > > > > > > > > > > > > > > > > > > > > > > > extended, in an incremental way without too much rework.
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > - The most important part is probably user interfaces. Spark
> > > > > > > > > > > > > > > > > > > > > > > > > > > > offers configuration options to define the amount, discovery
> > > > > > > > > > > > > > > > > > > > > > > > > > > > script and vendor (on k8s) on a per resource type basis [1],
> > > > > > > > > > > > > > > > > > > > > > > > > > > > which is very similar to what we proposed in this FLIP. I think
> > > > > > > > > > > > > > > > > > > > > > > > > > > > it's not necessary to expose config options in the general way
> > > > > > > > > > > > > > > > > > > > > > > > > > > > atm, since we do not have support for other resource types now.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > If later we decide to have per resource type config options, we
> > > > > > > > > > > > > > > > > > > > > > > > > > > > can keep backwards compatibility with the currently proposed
> > > > > > > > > > > > > > > > > > > > > > > > > > > > options through simple key mapping.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > - For the GPU Manager, if later needed we can change it to an
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > "Extended Resource Manager" (or whatever it is called). That should
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > be a pure component-internal refactoring.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > - For ResourceProfile and ResourceSpec, there are already fields for
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > general extended resources. We can of course leverage them when
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > supporting fine-grained GPU scheduling. That is also not in the
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > scope of this first-step proposal, and would require FLIP-56 to be
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > finished first.
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > To sum up, I agree with Becket that we should have a separate FLIP
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > for the general extended resource mechanism, and keep it in mind
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > when discussing and implementing the current one.
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > [1] https://spark.apache.org/docs/3.0.0-preview/configuration.html#custom-resource-scheduling-and-configuration-overview
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Mar 4, 2020 at 9:18 AM Becket Qin <[email protected]> wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > That's a good point, Stephan. It makes total sense to generalize
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > the resource management to support custom resources. Having that
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > allows users to add new resources by themselves. The general
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > resource management may involve two different aspects:
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1. The custom resource type definition. It is supported by the
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > extended resources in ResourceProfile and ResourceSpec. This will
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > likely cover the majority of the cases.
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. The custom resource allocation logic, i.e. how to assign the
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > resources to different tasks, operators, and so on. This may
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > require two levels / steps:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > a. Subtask level - make sure the subtasks are put into suitable
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > slots. It is done by the global RM and is not customizable right
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > now.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > b. Operator level - map the exact resource to the operators in
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > TM, e.g. GPU 1 for operator A, GPU 2 for operator B. This step is
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > needed assuming the global RM does not distinguish individual
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > resources of the same type. It is true for memory, but not for
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > GPU.
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The GPU manager is designed to do 2.b here. So it should discover
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > the physical GPU information and bind/match the GPUs to each
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > operator. Making this general will fill in the missing piece to
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > support custom resource type definition. But I'd avoid calling it
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > an "External Resource Manager" to avoid confusion with RM; maybe
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > something like "Operator Resource Assigner" would be more
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > accurate. So for each resource type users can have an optional
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > "Operator Resource Assigner" in the TM. For memory, users don't
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > need this, but for other extended resources, users may need that.
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Personally I think a pluggable "Operator Resource Assigner" is
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > achievable in this FLIP. But I am also OK with having that in a
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > separate FLIP because the interface between the "Operator
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Resource Assigner" and operator may take a while to settle down
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > if we want to make it generic. But I think our implementation
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > should take this future work into consideration so that we don't
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > need to break backwards compatibility once we have that.
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Jiangjie (Becket) Qin
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Mar 4, 2020 at 12:27 AM Stephan Ewen <[email protected]> wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you for writing this FLIP.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I cannot really give much input into the mechanics of GPU-aware
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > scheduling and GPU allocation, as I have no experience with that.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > One thought I had when reading the proposal is whether it makes
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > sense to look at the "GPU Manager" as an "External Resource
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Manager", with GPU being one such resource.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The way I understand the ResourceProfile and ResourceSpec, that
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > is how it is done there.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > It has the advantage that it looks more extensible. Maybe there
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > is a GPU Resource, a specialized NVIDIA GPU Resource, an FPGA
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Resource, an Alibaba TPU Resource, etc.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Stephan
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Mar 3, 2020 at 7:57 AM Becket Qin <[email protected]> wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the FLIP, Yangze. GPU resource management support is
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > a must-have for machine learning use cases. Actually, it is one
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > of the most frequently asked questions from users who are
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > interested in using Flink for ML.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Some quick comments / questions on the wiki.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1. The WebUI / REST API should probably also be mentioned in
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > the public interface section.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. Is the data structure that holds GPU info also a public API?
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Jiangjie (Becket) Qin
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Mar 3, 2020 at 10:15 AM Xintong Song <[email protected]> wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for drafting the FLIP and kicking off the discussion,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Yangze.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Big +1 for this feature. Supporting the use of GPUs in Flink is
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > significant, especially for the ML scenarios.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I've reviewed the FLIP wiki doc and it looks good to me. I
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > think it's a very good first step for Flink's GPU support.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Mar 2, 2020 at 12:06 PM Yangze Guo <[email protected]> wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi everyone,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > We would like to start a discussion thread on "FLIP-108: Add
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > GPU support in Flink"[1].
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This FLIP mainly discusses the following issues:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Enable users to configure how many GPUs a task executor has
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > and forward such requirements to the external resource
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > managers (for Kubernetes/Yarn/Mesos setups).
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Provide information about available GPU resources to
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > operators.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Key changes proposed in the FLIP are as follows:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Forward GPU resource requirements to Yarn/Kubernetes.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Introduce GPUManager as one of the task manager services to
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > discover and expose GPU resource information to the context
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > of functions.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - Introduce the default script for GPU discovery, in which we
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > provide the privilege mode to help users achieve
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > worker-level isolation in standalone mode.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Please find more details in the FLIP wiki document [1].
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Looking forward to your feedback.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-108%3A+Add+GPU+support+in+Flink
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Yangze Guo
