Thanks for the comment, Stephan.

>   - If everything becomes a "core feature", it will make the project hard
> to develop in the future. Thinking "library" / "plugin" / "extension" style
> where possible helps.


Completely agree. It is much more important to design a mechanism than to
focus on a specific case. Here is what I am thinking for fully supporting
custom resource management:
1. On the JM / RM side, use ResourceProfile and ResourceSpec to define the
resources and the amounts required. They will be used to find suitable TM
slots to run the tasks. At this point, the resources are only measured by
amount, i.e. they do not have individual IDs.

2. On the TM side, have something like *"ResourceInfoProvider"* to identify
and provide the detailed information of the individual resources, e.g. the
GPU ID. This is important because the operator may have to explicitly
interact with the physical resource it uses. The ResourceInfoProvider might
look something like the following:
interface ResourceInfoProvider<INFO> {
    Map<AbstractID, INFO> retrieveResourceInfo(OperatorId opId, ResourceProfile resourceProfile);
}

- There could be several *ResourceInfoProvider*s configured on the TM to
retrieve the information for different resources.
- The TM will be responsible for assigning those individual resources to
each operator according to their requested amount.
- The operators will be able to get the ResourceInfo from their
RuntimeContext, e.g. as in the sketch below.
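As a purely illustrative sketch (the getResourceInfo() accessor and the
GPUInfo type below are placeholders for whatever we end up exposing, not
existing API), an operator could then do something like:

// Inside a RichFunction; getResourceInfo() and GPUInfo are hypothetical.
@Override
public void open(Configuration parameters) throws Exception {
    Map<AbstractID, GPUInfo> gpus =
        getRuntimeContext().getResourceInfo(GPUInfo.class);
    for (GPUInfo gpu : gpus.values()) {
        // e.g. hand gpu.getId() to the ML library / CUDA context.
    }
}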

If we agree this is a reasonable final state, we can adapt the current FLIP
to it. In fact it does not sound like a big change to me. All the proposed
configurations can stay as they are; it is just that Flink itself won't care
about them. Instead, a GPUInfoProvider implementing the ResourceInfoProvider
will use them, roughly as sketched below.
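To make it concrete, here is a rough sketch of how such a GPU-specific
provider could plug into the interface above. The class shape and method
bodies are illustrative only; the config keys are the ones proposed in the
FLIP (taskmanager.resource.gpu.amount and
taskmanager.resource.gpu.discovery-script.path):

// Illustrative only: reads the proposed taskmanager.resource.gpu.* options
// and uses the discovery script to map individual GPUs to their info.
public class GPUInfoProvider implements ResourceInfoProvider<GPUInfo> {

    private final Configuration config;

    public GPUInfoProvider(Configuration config) {
        this.config = config;
    }

    @Override
    public Map<AbstractID, GPUInfo> retrieveResourceInfo(
            OperatorId opId, ResourceProfile resourceProfile) {
        String script =
            config.getString("taskmanager.resource.gpu.discovery-script.path", null);
        long amount = config.getLong("taskmanager.resource.gpu.amount", 0L);
        Map<AbstractID, GPUInfo> result = new HashMap<>();
        // Run the discovery script, parse the GPU indexes it reports, and
        // return the subset that the TM has assigned to this operator.
        // ... discovery + assignment logic elided ...
        return result;
    }
}

Flink itself would only know the ResourceInfoProvider interface; everything
GPU-specific (script invocation, ID parsing) stays inside this plugin.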

Thanks,

Jiangjie (Becket) Qin

On Mon, Mar 23, 2020 at 1:47 AM Stephan Ewen <se...@apache.org> wrote:

> Hi all!
>
> The main point I wanted to throw into the discussion is the following:
>   - With more and more use cases, more and more tools go into Flink
>   - If everything becomes a "core feature", it will make the project hard
> to develop in the future. Thinking "library" / "plugin" / "extension" style
> where possible helps.
>
>   - A good thought experiment is always: How many future developers have to
> interact with this code (and possibly understand it partially), even if the
> features they touch have nothing to do with GPU support. If many
> contributors to unrelated features will have to touch it and understand it,
> then let's think if there is a different solution. Maybe there is not, but
> then we should be sure why.
>
>   - That led me to raising this issue: If the GPU manager becomes a core
> service in the TaskManager, Environment, RuntimeContext, etc. then everyone
> developing TM and streaming tasks need to understand the GPU manager. That
> seems oddly specific, is my impression.
>
> Access to configuration seems not the right reason to do that. We should
> expose the Flink configuration from the RuntimeContext anyways.
>
> If GPUs are sliced and assigned during scheduling, there may be reason,
> although it looks that it would belong to the slot then. Is that what we
> are doing here?
>
> Best,
> Stephan
>
>
> On Fri, Mar 20, 2020 at 2:58 AM Xintong Song <tonysong...@gmail.com>
> wrote:
>
> >  Thanks for the feedback, Becket.
> >
> > IMO, eventually an operator should only see info of GPUs that are
> dedicated
> > for it, instead of all GPUs on the machine/container in the current
> design.
> > It does not make sense to let the user who writes a UDF to worry about
> > coordination among multiple operators running on the same machine. And if
> > we want to limit the GPU info an operator sees, we should not let the
> > operator to instantiate GPUManager, which means we have to expose
> something
> > through runtime context, either GPU info or some kind of limited access
> to
> > the GPUManager.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Thu, Mar 19, 2020 at 5:48 PM Becket Qin <becket....@gmail.com> wrote:
> >
> > > It probably make sense for us to first agree on the final state. More
> > > specifically, will the resource info be exposed through runtime context
> > > eventually?
> > >
> > > If that is the final state and we have a seamless migration story from
> > this
> > > FLIP to that final state, Personally I think it is OK to expose the GPU
> > > info in the runtime context.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On Mon, Mar 16, 2020 at 11:21 AM Xintong Song <tonysong...@gmail.com>
> > > wrote:
> > >
> > > > @Yangze,
> > > > I think what Stephan means (@Stephan, please correct me if I'm wrong)
> > is
> > > > that, we might not need to hold and maintain the GPUManager as a
> > service
> > > in
> > > > TaskManagerServices or RuntimeContext. An alternative is to create /
> > > > retrieve the GPUManager only in the operators that need it, e.g.,
> with
> > a
> > > > static method `GPUManager.get()`.
> > > >
> > > > @Stephan,
> > > > I agree with you on excluding GPUManager from TaskManagerServices.
> > > >
> > > >    - For the first step, where we provide unified TM-level GPU
> > > information
> > > >    to all operators, it should be fine to have operators access /
> > > >    lazy-initiate GPUManager by themselves.
> > > >    - In future, we might have some more fine-grained GPU management,
> > > where
> > > >    we need to maintain GPUManager as a service and put GPU info in
> slot
> > > >    profiles. But at least for now it's not necessary to introduce
> such
> > > >    complexity.
> > > >
> > > > However, I have some concerns on excluding GPUManager from
> > RuntimeContext
> > > > and let operators access it directly.
> > > >
> > > >    - Configurations needed for creating the GPUManager is not always
> > > >    available for operators.
> > > >    - If later we want to have fine-grained control over GPU (e.g.,
> > > >    operators in each slot can only see GPUs reserved for that slot),
> > the
> > > >    approach cannot be easily extended.
> > > >
> > > > I would suggest to wrap the GPUManager behind RuntimeContext and only
> > > > expose the GPUInfo to users. For now, we can declare a method
> > > > `getGPUInfo()` in RuntimeContext, with a default definition that
> calls
> > > > `GPUManager.get()` to get the lazily-created GPUManager. If later we
> > want
> > > > to create / retrieve GPUManager in a different way, we can simply
> > change
> > > > how `getGPUInfo` is implemented, without needing to change any public
> > > > interfaces.
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > >
> > > > On Sat, Mar 14, 2020 at 10:09 AM Yangze Guo <karma...@gmail.com>
> > wrote:
> > > >
> > > > > @Shephan
> > > > > Do you mean Minicluster? Yes, it makes sense to share the GPU
> Manager
> > > > > in such scenario.
> > > > > If that's what you worry about, I'm +1 for holding
> > > > > GPUManager(ExternalResourceManagers) in TaskExecutor instead of
> > > > > TaskManagerServices.
> > > > >
> > > > > Regarding the RuntimeContext/FunctionContext, it just holds the GPU
> > > > > info instead of the GPU Manager. AFAIK, it's the only place we
> could
> > > > > pass GPU info to the RichFunction/UserDefinedFunction.
> > > > >
> > > > > Best,
> > > > > Yangze Guo
> > > > >
> > > > > On Sat, Mar 14, 2020 at 4:06 AM Isaac Godfried <
> is...@paddlesoft.net
> > >
> > > > > wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > ---- On Fri, 13 Mar 2020 15:58:20 +0000 se...@apache.org wrote
> > ----
> > > > > >
> > > > > > > > Can we somehow keep this out of the TaskManager services
> > > > > > > I fear that we could not. IMO, the GPUManager(or
> > > > > > > ExternalServicesManagers in future) is conceptually one of the
> > task
> > > > > > > manager services, just like MemoryManager before 1.10.
> > > > > > > - It maintains/holds the GPU resource at TM level and all of
> the
> > > > > > > operators allocate the GPU resources from it. So, it should be
> > > > > > > exclusive to a single TaskExecutor.
> > > > > > > - We could add a collection called ExternalResourceManagers to
> > hold
> > > > > > > all managers of other external resources in the future.
> > > > > > >
> > > > > >
> > > > > > Can you help me understand why this needs the addition in
> > > > > TaskMagerServices
> > > > > > or in the RuntimeContext?
> > > > > > Are you worried about the case when multiple Task Executors run
> in
> > > the
> > > > > same
> > > > > > JVM? That's not common, but wouldn't it actually be good in that
> > case
> > > > to
> > > > > > share the GPU Manager, given that the GPU is shared?
> > > > > >
> > > > > > Thanks,
> > > > > > Stephan
> > > > > >
> > > > > > ---------------------------
> > > > > >
> > > > > >
> > > > > > > What parts need information about this?
> > > > > > > In this FLIP, operators need the information. Thus, we expose
> GPU
> > > > > > > information to the RuntimeContext/FunctionContext. The slot
> > profile
> > > > is
> > > > > > > not aware of GPU resources as GPU is TM level resource now.
> > > > > > >
> > > > > > > > Can the GPU Manager be a "self contained" thing that simply
> > takes
> > > > the
> > > > > > > configuration, and then abstracts everything internally?
> > > > > > > Yes, we just pass the path/args of the discover script and how
> > many
> > > > > > > GPUs per TM to it. It takes the responsibility to get the GPU
> > > > > > > information and expose them to the
> RuntimeContext/FunctionContext
> > > of
> > > > > > > Operators. Meanwhile, we'd better not allow operators to
> directly
> > > > > > > access GPUManager, it should get what they want from Context.
> We
> > > > could
> > > > > > > then decouple the interface/implementation of GPUManager and
> > Public
> > > > > > > API.
> > > > > > >
> > > > > > > Best,
> > > > > > > Yangze Guo
> > > > > > >
> > > > > > > On Fri, Mar 13, 2020 at 7:26 PM Stephan Ewen <se...@apache.org
> >
> > > > wrote:
> > > > > > > >
> > > > > > > > It sounds fine to initially start with GPU specific support
> and
> > > > think
> > > > > > > about
> > > > > > > > generalizing this once we better understand the space.
> > > > > > > >
> > > > > > > > About the implementation suggested in FLIP-108:
> > > > > > > > - Can we somehow keep this out of the TaskManager services?
> > > > Anything
> > > > > we
> > > > > > > > have to pull through all layers of the TM makes the TM
> > components
> > > > yet
> > > > > > > more
> > > > > > > > complex and harder to maintain.
> > > > > > > >
> > > > > > > > - What parts need information about this?
> > > > > > > > -> do the slot profiles need information about the GPU?
> > > > > > > > -> Can the GPU Manager be a "self contained" thing that
> simply
> > > > takes
> > > > > > > > the configuration, and then abstracts everything internally?
> > > > > Operators
> > > > > > > can
> > > > > > > > access it via "GPUManager.get()" or so?
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Mar 4, 2020 at 4:19 AM Yangze Guo <
> karma...@gmail.com>
> > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks for all the feedbacks.
> > > > > > > > >
> > > > > > > > > @Becket
> > > > > > > > > Regarding the WebUI and GPUInfo, you're right, I'll add
> them
> > to
> > > > the
> > > > > > > > > Public API section.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > @Stephan @Becket
> > > > > > > > > Regarding the general extended resource mechanism, I second
> > > > > Xintong's
> > > > > > > > > suggestion.
> > > > > > > > > - It's better to leverage ResourceProfile and ResourceSpec
> > > after
> > > > we
> > > > > > > > > supporting fine-grained GPU scheduling. As a first step
> > > > proposal, I
> > > > > > > > > prefer to not include it in the scope of this FLIP.
> > > > > > > > > - Regarding the "Extended Resource Manager", if I
> understand
> > > > > > > > > correctly, it just a code refactoring atm, we could extract
> > the
> > > > > > > > > open/close/allocateExtendResources of GPUManager to that
> > > > > interface. If
> > > > > > > > > that is the case, +1 to do it during implementation.
> > > > > > > > >
> > > > > > > > > @Xingbo
> > > > > > > > > As Xintong said, we looked into how Spark supports a
> general
> > > > > "Custom
> > > > > > > > > Resource Scheduling" before and decided to introduce a
> common
> > > > > resource
> > > > > > > > > configuration
> > > > > > > > >
> > > > schema(taskmanager.resource.{resourceName}.amount/discovery-script)
> > > > > > > > > to make it more extensible. I think the "resource" is a
> > proper
> > > > > level
> > > > > > > > > to contain all the configs of extended resources.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Yangze Guo
> > > > > > > > >
> > > > > > > > > On Wed, Mar 4, 2020 at 10:48 AM Xingbo Huang <
> > > hxbks...@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Thanks a lot for the FLIP, Yangze.
> > > > > > > > > >
> > > > > > > > > > There is no doubt that GPU resource management support
> will
> > > > > greatly
> > > > > > > > > > facilitate the development of AI-related applications by
> > > > PyFlink
> > > > > > > users.
> > > > > > > > > >
> > > > > > > > > > I have only one comment about this wiki:
> > > > > > > > > >
> > > > > > > > > > Regarding the names of several GPU configurations, I
> think
> > it
> > > > is
> > > > > > > better
> > > > > > > > > to
> > > > > > > > > > delete the resource field makes it consistent with the
> > names
> > > of
> > > > > other
> > > > > > > > > > resource-related configurations in TaskManagerOption.
> > > > > > > > > >
> > > > > > > > > > e.g. taskmanager.resource.gpu.discovery-script.path ->
> > > > > > > > > > taskmanager.gpu.discovery-script.path
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > >
> > > > > > > > > > Xingbo
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Xintong Song <tonysong...@gmail.com> 于2020年3月4日周三
> > 上午10:39写道:
> > > > > > > > > >
> > > > > > > > > > > @Stephan, @Becket,
> > > > > > > > > > >
> > > > > > > > > > > Actually, Yangze, Yang and I also had an offline
> > discussion
> > > > > about
> > > > > > > > > making
> > > > > > > > > > > the "GPU Support" as some general "Extended Resource
> > > > Support".
> > > > > We
> > > > > > > > > believe
> > > > > > > > > > > supporting extended resources in a general mechanism is
> > > > > definitely
> > > > > > > a
> > > > > > > > > good
> > > > > > > > > > > and extensible way. The reason we propose this FLIP
> > > narrowing
> > > > > its
> > > > > > > scope
> > > > > > > > > > > down to GPU alone, is mainly for the concern on extra
> > > efforts
> > > > > and
> > > > > > > > > review
> > > > > > > > > > > capacity needed for a general mechanism.
> > > > > > > > > > >
> > > > > > > > > > > To come up with a well design on a general extended
> > > resource
> > > > > > > management
> > > > > > > > > > > mechanism, we would need to investigate more on how
> > people
> > > > use
> > > > > > > > > different
> > > > > > > > > > > kind of resources in practice. For GPU, we learnt such
> > > > > knowledge
> > > > > > > from
> > > > > > > > > the
> > > > > > > > > > > experts, Becket and his team members. But for FPGA, or
> > > other
> > > > > > > potential
> > > > > > > > > > > extended resources, we don't have such convenient
> > > information
> > > > > > > sources,
> > > > > > > > > > > making the investigation requires more efforts, which I
> > > tend
> > > > to
> > > > > > > think
> > > > > > > > > is
> > > > > > > > > > > not necessary atm.
> > > > > > > > > > >
> > > > > > > > > > > On the other hand, we also looked into how Spark
> > supports a
> > > > > general
> > > > > > > > > "Custom
> > > > > > > > > > > Resource Scheduling". Assuming we want to have a
> similar
> > > > > general
> > > > > > > > > extended
> > > > > > > > > > > resource mechanism in the future, we believe that the
> > > current
> > > > > GPU
> > > > > > > > > support
> > > > > > > > > > > design can be easily extended, in an incremental way
> > > without
> > > > > too
> > > > > > > many
> > > > > > > > > > > reworks.
> > > > > > > > > > >
> > > > > > > > > > > - The most important part is probably user interfaces.
> > > Spark
> > > > > > > offers
> > > > > > > > > > > configuration options to define the amount, discovery
> > > script
> > > > > and
> > > > > > > > > vendor
> > > > > > > > > > > (on
> > > > > > > > > > > k8s) in a per resource type bias [1], which is very
> > similar
> > > > to
> > > > > > > what
> > > > > > > > > we
> > > > > > > > > > > proposed in this FLIP. I think it's not necessary to
> > expose
> > > > > > > config
> > > > > > > > > > > options
> > > > > > > > > > > in the general way atm, since we do not have supports
> for
> > > > other
> > > > > > > > > resource
> > > > > > > > > > > types now. If later we decided to have per resource
> type
> > > > config
> > > > > > > > > > > options, we
> > > > > > > > > > > can have backwards compatibility on the current
> proposed
> > > > > options
> > > > > > > > > with
> > > > > > > > > > > simple key mapping.
> > > > > > > > > > > - For the GPU Manager, if later needed we can change it
> > to
> > > a
> > > > > > > > > "Extended
> > > > > > > > > > > Resource Manager" (or whatever it is called). That
> should
> > > be
> > > > a
> > > > > > > pure
> > > > > > > > > > > component-internal refactoring.
> > > > > > > > > > > - For ResourceProfile and ResourceSpec, there are
> already
> > > > > > > fields for
> > > > > > > > > > > general extended resource. We can of course leverage
> them
> > > > when
> > > > > > > > > > > supporting
> > > > > > > > > > > fine grained GPU scheduling. That is also not in the
> > scope
> > > of
> > > > > > > this
> > > > > > > > > first
> > > > > > > > > > > step proposal, and would require FLIP-56 to be finished
> > > > first.
> > > > > > > > > > >
> > > > > > > > > > > To summary up, I agree with Becket that have a separate
> > > FLIP
> > > > > for
> > > > > > > the
> > > > > > > > > > > general extended resource mechanism, and keep it in
> mind
> > > when
> > > > > > > > > discussing
> > > > > > > > > > > and implementing the current one.
> > > > > > > > > > >
> > > > > > > > > > > Thank you~
> > > > > > > > > > >
> > > > > > > > > > > Xintong Song
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > [1]
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
> https://spark.apache.org/docs/3.0.0-preview/configuration.html#custom-resource-scheduling-and-configuration-overview
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Mar 4, 2020 at 9:18 AM Becket Qin <
> > > > > becket....@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > That's a good point, Stephan. It makes total sense to
> > > > > generalize
> > > > > > > the
> > > > > > > > > > > > resource management to support custom resources.
> Having
> > > > that
> > > > > > > allows
> > > > > > > > > users
> > > > > > > > > > > > to add new resources by themselves. The general
> > resource
> > > > > > > management
> > > > > > > > > may
> > > > > > > > > > > > involve two different aspects:
> > > > > > > > > > > >
> > > > > > > > > > > > 1. The custom resource type definition. It is
> supported
> > > by
> > > > > the
> > > > > > > > > extended
> > > > > > > > > > > > resources in ResourceProfile and ResourceSpec. This
> > will
> > > > > likely
> > > > > > > cover
> > > > > > > > > > > > majority of the cases.
> > > > > > > > > > > >
> > > > > > > > > > > > 2. The custom resource allocation logic, i.e. how to
> > > assign
> > > > > the
> > > > > > > > > resources
> > > > > > > > > > > > to different tasks, operators, and so on. This may
> > > require
> > > > > two
> > > > > > > > > levels /
> > > > > > > > > > > > steps:
> > > > > > > > > > > > a. Subtask level - make sure the subtasks are put
> into
> > > > > > > suitable
> > > > > > > > > > > slots.
> > > > > > > > > > > > It is done by the global RM and is not customizable
> > right
> > > > > now.
> > > > > > > > > > > > b. Operator level - map the exact resource to the
> > > operators
> > > > > > > in
> > > > > > > > > TM.
> > > > > > > > > > > e.g.
> > > > > > > > > > > > GPU 1 for operator A, GPU 2 for operator B. This step
> > is
> > > > > needed
> > > > > > > > > assuming
> > > > > > > > > > > > the global RM does not distinguish individual
> resources
> > > of
> > > > > the
> > > > > > > same
> > > > > > > > > type.
> > > > > > > > > > > > It is true for memory, but not for GPU.
> > > > > > > > > > > >
> > > > > > > > > > > > The GPU manager is designed to do 2.b here. So it
> > should
> > > > > > > discover the
> > > > > > > > > > > > physical GPU information and bind/match them to each
> > > > > operators.
> > > > > > > > > Making
> > > > > > > > > > > this
> > > > > > > > > > > > general will fill in the missing piece to support
> > custom
> > > > > resource
> > > > > > > > > type
> > > > > > > > > > > > definition. But I'd avoid calling it a "External
> > Resource
> > > > > > > Manager" to
> > > > > > > > > > > avoid
> > > > > > > > > > > > confusion with RM, maybe something like "Operator
> > > Resource
> > > > > > > Assigner"
> > > > > > > > > > > would
> > > > > > > > > > > > be more accurate. So for each resource type users can
> > > have
> > > > an
> > > > > > > > > optional
> > > > > > > > > > > > "Operator Resource Assigner" in the TM. For memory,
> > users
> > > > > don't
> > > > > > > need
> > > > > > > > > > > this,
> > > > > > > > > > > > but for other extended resources, users may need
> that.
> > > > > > > > > > > >
> > > > > > > > > > > > Personally I think a pluggable "Operator Resource
> > > Assigner"
> > > > > is
> > > > > > > > > achievable
> > > > > > > > > > > > in this FLIP. But I am also OK with having that in a
> > > > separate
> > > > > > > FLIP
> > > > > > > > > > > because
> > > > > > > > > > > > the interface between the "Operator Resource
> Assigner"
> > > and
> > > > > > > operator
> > > > > > > > > may
> > > > > > > > > > > > take a while to settle down if we want to make it
> > > generic.
> > > > > But I
> > > > > > > > > think
> > > > > > > > > > > our
> > > > > > > > > > > > implementation should take this future work into
> > > > > consideration so
> > > > > > > > > that we
> > > > > > > > > > > > don't need to break backwards compatibility once we
> > have
> > > > > that.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > >
> > > > > > > > > > > > Jiangjie (Becket) Qin
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Mar 4, 2020 at 12:27 AM Stephan Ewen <
> > > > > se...@apache.org>
> > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thank you for writing this FLIP.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I cannot really give much input into the mechanics
> of
> > > > > GPU-aware
> > > > > > > > > > > > scheduling
> > > > > > > > > > > > > and GPU allocation, as I have no experience with
> > that.
> > > > > > > > > > > > >
> > > > > > > > > > > > > One thought I had when reading the proposal is if
> it
> > > > makes
> > > > > > > sense to
> > > > > > > > > > > look
> > > > > > > > > > > > at
> > > > > > > > > > > > > the "GPU Manager" as an "External Resource
> Manager",
> > > and
> > > > > GPU
> > > > > > > is one
> > > > > > > > > > > such
> > > > > > > > > > > > > resource.
> > > > > > > > > > > > > The way I understand the ResourceProfile and
> > > > ResourceSpec,
> > > > > > > that is
> > > > > > > > > how
> > > > > > > > > > > it
> > > > > > > > > > > > > is done there.
> > > > > > > > > > > > > It has the advantage that it looks more extensible.
> > > Maybe
> > > > > > > there is
> > > > > > > > > a
> > > > > > > > > > > GPU
> > > > > > > > > > > > > Resource, a specialized NVIDIA GPU Resource, and
> FPGA
> > > > > > > Resource, a
> > > > > > > > > > > Alibaba
> > > > > > > > > > > > > TPU Resource, etc.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best,
> > > > > > > > > > > > > Stephan
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Mar 3, 2020 at 7:57 AM Becket Qin <
> > > > > > > becket....@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the FLIP Yangze. GPU resource
> management
> > > > > support
> > > > > > > is a
> > > > > > > > > > > > > must-have
> > > > > > > > > > > > > > for machine learning use cases. Actually it is
> one
> > of
> > > > the
> > > > > > > mostly
> > > > > > > > > > > asked
> > > > > > > > > > > > > > question from the users who are interested in
> using
> > > > Flink
> > > > > > > for ML.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Some quick comments / questions to the wiki.
> > > > > > > > > > > > > > 1. The WebUI / REST API should probably also be
> > > > > mentioned in
> > > > > > > the
> > > > > > > > > > > public
> > > > > > > > > > > > > > interface section.
> > > > > > > > > > > > > > 2. Is the data structure that holds GPU info
> also a
> > > > > public
> > > > > > > API?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Jiangjie (Becket) Qin
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Mar 3, 2020 at 10:15 AM Xintong Song <
> > > > > > > > > tonysong...@gmail.com>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for drafting the FLIP and kicking off
> the
> > > > > > > discussion,
> > > > > > > > > > > Yangze.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Big +1 for this feature. Supporting using of
> GPU
> > in
> > > > > Flink
> > > > > > > is
> > > > > > > > > > > > > significant,
> > > > > > > > > > > > > > > especially for the ML scenarios.
> > > > > > > > > > > > > > > I've reviewed the FLIP wiki doc and it looks
> good
> > > to
> > > > > me. I
> > > > > > > > > think
> > > > > > > > > > > > it's a
> > > > > > > > > > > > > > > very good first step for Flink's GPU supports.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Mon, Mar 2, 2020 at 12:06 PM Yangze Guo <
> > > > > > > karma...@gmail.com
> > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi everyone,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > We would like to start a discussion thread on
> > > > > "FLIP-108:
> > > > > > > Add
> > > > > > > > > GPU
> > > > > > > > > > > > > > > > support in Flink"[1].
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This FLIP mainly discusses the following
> > issues:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > - Enable user to configure how many GPUs in a
> > > task
> > > > > > > executor
> > > > > > > > > and
> > > > > > > > > > > > > > > > forward such requirements to the external
> > > resource
> > > > > > > managers
> > > > > > > > > (for
> > > > > > > > > > > > > > > > Kubernetes/Yarn/Mesos setups).
> > > > > > > > > > > > > > > > - Provide information of available GPU
> > resources
> > > to
> > > > > > > > > operators.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Key changes proposed in the FLIP are as
> > follows:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > - Forward GPU resource requirements to
> > > > > Yarn/Kubernetes.
> > > > > > > > > > > > > > > > - Introduce GPUManager as one of the task
> > manager
> > > > > > > services to
> > > > > > > > > > > > > discover
> > > > > > > > > > > > > > > > and expose GPU resource information to the
> > > context
> > > > of
> > > > > > > > > functions.
> > > > > > > > > > > > > > > > - Introduce the default script for GPU
> > discovery,
> > > > in
> > > > > > > which we
> > > > > > > > > > > > provide
> > > > > > > > > > > > > > > > the privilege mode to help user to achieve
> > > > > worker-level
> > > > > > > > > isolation
> > > > > > > > > > > > in
> > > > > > > > > > > > > > > > standalone mode.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Please find more details in the FLIP wiki
> > > document
> > > > > [1].
> > > > > > > > > Looking
> > > > > > > > > > > > > forward
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > your feedbacks.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-108%3A+Add+GPU+support+in+Flink
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > Yangze Guo
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
