Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-04-01 Thread Yangze Guo
Thank you all for your participation! I'll start voting for this FLIP. Best, Yangze Guo On Wed, Apr 1, 2020 at 4:55 PM Stephan Ewen wrote: > > Sounds good! > > On Tue, Mar 31, 2020 at 4:32 AM Yangze Guo wrote: > > > Hi everyone, > > I've updated the FLIP accordingly. The key change is replacing

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-04-01 Thread Stephan Ewen
Sounds good! On Tue, Mar 31, 2020 at 4:32 AM Yangze Guo wrote: > Hi everyone, > I've updated the FLIP accordingly. The key change is replacing two > resource allocation interfaces to config options. > > If there are no further comments, I would like to start a voting > thread by tomorrow. > > Be

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-30 Thread Yangze Guo
Hi everyone, I've updated the FLIP accordingly. The key change is replacing two resource allocation interfaces with config options. If there are no further comments, I would like to start a voting thread by tomorrow. Best, Yangze Guo On Mon, Mar 30, 2020 at 9:15 PM Till Rohrmann wrote: > > If the
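
A minimal sketch of what "config options instead of allocation interfaces" could look like on the ResourceManager side: the RM reads only plain string keys and never loads user code. The class `ExternalResourceConfig`, its methods, and the key pattern `external-resource.<name>....` are assumptions made for illustration; the thread does not spell out the final option names.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: resolves external-resource settings from flat configuration.
final class ExternalResourceConfig {

    // Assumed key prefix; not taken from the thread.
    private static final String PREFIX = "external-resource.";

    // Returns the name under which the resource is requested from the given
    // cluster manager (for example "yarn" or "kubernetes"), if configured.
    static Optional<String> clusterConfigKey(
            Map<String, String> conf, String resourceName, String cluster) {
        return Optional.ofNullable(
                conf.get(PREFIX + resourceName + "." + cluster + ".config-key"));
    }

    // Returns the configured amount per task executor, or 0 if unset.
    static long amount(Map<String, String> conf, String resourceName) {
        String raw = conf.get(PREFIX + resourceName + ".amount");
        return raw == null ? 0L : Long.parseLong(raw.trim());
    }
}
```

With this shape, supporting a new external resource on the RM side would only need new configuration entries, which matches the motivation of keeping user code out of the ResourceManager.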

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-30 Thread Till Rohrmann
If there is no need for the ExternalResourceDriver on the RM side, then it is always a good idea to keep it simple and not introduce it. One can always change things once one realizes that there is a need for it. Cheers, Till On Mon, Mar 30, 2020 at 12:00 PM Yangze Guo wrote: > Hi @Till, @Xin

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-30 Thread Yangze Guo
Hi @Till, @Xintong I think even without the credential concerns, replacing the interfaces with configuration options is a good idea from my side. - Currently, I don't see any external resource that is incompatible with this mechanism - It reduces the burden on users of implementing a plugin themselves

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-30 Thread Xintong Song
I also agree that the pluggable ExternalResourceDriver should be loaded by the cluster class loader. Although the plugin might be implemented by users, external resources (as part of task executor resources) should be cluster configurations, unlike job-level user code such as UDFs, because the task

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-30 Thread Till Rohrmann
At the moment the RM does not have a user code class loader and I agree with Stephan that it should stay like this. This, however, does not mean that we cannot support pluggable components in the RM. As long as the plugins are on the system's class path, it should be fine for the RM to load them. F

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-29 Thread Yangze Guo
Hi, Stephan, I see your concern and I totally agree with you. The interface on the RM side is now `Map getYarn/KubernetesExternalResource()`. The only valid information the RM gets from it is the configuration key of that external resource in Yarn/K8s. The "String/Long value" would be the same as the exte

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-27 Thread Stephan Ewen
Maybe one final comment: It is probably not an issue, but let's try and keep user code (via user code classloader) out of the ResourceManager, if possible. As background: There were thoughts in the past to support setups where the RM must run with "superuser" credentials, but we cannot run JM/TM

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-26 Thread Yangze Guo
Thanks for the feedback, @Till and @Xintong. Regarding separating the interface, I'm also +1 for it. Regarding the resource allocation interface, true, it's dangerous to give much access to user code. Changing the return type to Map makes sense to me. AFAIK, it is compatible with all the first-

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-26 Thread Xintong Song
Thanks for updating the FLIP, Yangze. I agree with Till that we probably want to separate the K8s/Yarn decorator calls. Users can still configure one driver class, and we can use `instanceof` to check whether the driver implemented K8s/Yarn specific interfaces. Moreover, I'm not sure about exposi
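
The `instanceof` idea can be sketched as follows: the base driver interface stays free of cluster-specific calls, and a driver opts into Kubernetes or Yarn support by implementing an extra interface. All type and method names below (`KubernetesResourceDecorator`, `YarnResourceDecorator`, `GpuDriver`, `FpgaDriver`, `DriverInspector`) are hypothetical and only illustrate the pattern; they are not the interfaces defined in the FLIP. The resource keys are example values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base interface: only what every driver must provide.
interface ExternalResourceDriver {
    String resourceName();
}

// Hypothetical opt-in interfaces for cluster-specific behavior.
interface KubernetesResourceDecorator {
    String kubernetesResourceKey();
}

interface YarnResourceDecorator {
    String yarnResourceKey();
}

// A single configured driver class may implement both extensions.
final class GpuDriver
        implements ExternalResourceDriver, KubernetesResourceDecorator, YarnResourceDecorator {
    public String resourceName() { return "gpu"; }
    public String kubernetesResourceKey() { return "nvidia.com/gpu"; }
    public String yarnResourceKey() { return "yarn.io/gpu"; }
}

// A driver without any cluster-specific support.
final class FpgaDriver implements ExternalResourceDriver {
    public String resourceName() { return "fpga"; }
}

final class DriverInspector {
    // Collects the cluster-specific keys a driver exposes, checked via instanceof.
    static List<String> describe(ExternalResourceDriver driver) {
        List<String> keys = new ArrayList<>();
        if (driver instanceof KubernetesResourceDecorator) {
            keys.add("k8s:" + ((KubernetesResourceDecorator) driver).kubernetesResourceKey());
        }
        if (driver instanceof YarnResourceDecorator) {
            keys.add("yarn:" + ((YarnResourceDecorator) driver).yarnResourceKey());
        }
        return keys;
    }
}
```

Users would still configure one driver class; the framework decides at runtime which cluster-specific hooks apply.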

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-26 Thread Till Rohrmann
Hi everyone, I'm a bit late to the party. I think the current proposal looks good. Concerning the ExternalResourceDriver interface defined in the FLIP [1], I would suggest not including the decorator calls for Kubernetes and Yarn in the base interface. Instead I would suggest segregating the de

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-26 Thread Stephan Ewen
Nice, thanks a lot! On Thu, Mar 26, 2020 at 10:21 AM Yangze Guo wrote: > Thanks for the suggestion, @Stephan, @Becket and @Xintong. > > I've updated the FLIP accordingly. I do not add a > ResourceInfoProvider. Instead, I introduce the ExternalResourceDriver, > which takes the responsibility of a

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-26 Thread Yangze Guo
Thanks for the suggestion, @Stephan, @Becket and @Xintong. I've updated the FLIP accordingly. I do not add a ResourceInfoProvider. Instead, I introduce the ExternalResourceDriver, which takes the responsibility of all relevant operations on both RM and TM sides. After a rethink about decoupling th

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-25 Thread Stephan Ewen
This sounds good to go ahead from my side. I like the approach that Becket suggested - in that case the core abstraction that everyone would need to understand would be "external resource allocation" and the "ResourceInfoProvider", and the GPU specific code would be a specific implementation only

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-22 Thread Xintong Song
Thanks for the comments, Stephan & Becket. @Stephan I see your concern, and I completely agree with you that we should first think about the "library" / "plugin" / "extension" style if possible. If GPUs are sliced and assigned during scheduling, there may be reason, > although it looks that it w

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-22 Thread Becket Qin
Thanks for the comment, Stephan. - If everything becomes a "core feature", it will make the project hard > to develop in the future. Thinking "library" / "plugin" / "extension" style > where possible helps. Completely agree. It is much more important to design a mechanism than focusing on a sp

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-22 Thread Stephan Ewen
Hi all! The main point I wanted to throw into the discussion is the following: - With more and more use cases, more and more tools go into Flink - If everything becomes a "core feature", it will make the project hard to develop in the future. Thinking "library" / "plugin" / "extension" style w

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-19 Thread Xintong Song
Thanks for the feedback, Becket. IMO, eventually an operator should only see info of GPUs that are dedicated to it, instead of all GPUs on the machine/container in the current design. It does not make sense to let the user who writes a UDF worry about coordination among multiple operators run

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-19 Thread Becket Qin
It probably makes sense for us to first agree on the final state. More specifically, will the resource info be exposed through runtime context eventually? If that is the final state and we have a seamless migration story from this FLIP to that final state, personally I think it is OK to expose the

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-15 Thread Xintong Song
@Yangze, I think what Stephan means (@Stephan, please correct me if I'm wrong) is that, we might not need to hold and maintain the GPUManager as a service in TaskManagerServices or RuntimeContext. An alternative is to create / retrieve the GPUManager only in the operators that need it, e.g., with a

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-13 Thread Yangze Guo
@Stephan Do you mean Minicluster? Yes, it makes sense to share the GPU Manager in such a scenario. If that's what you worry about, I'm +1 for holding GPUManager(ExternalResourceManagers) in TaskExecutor instead of TaskManagerServices. Regarding the RuntimeContext/FunctionContext, it just holds the G

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-13 Thread Isaac Godfried
On Fri, 13 Mar 2020 15:58:20 + se...@apache.org wrote > > Can we somehow keep this out of the TaskManager services > I fear that we could not. IMO, the GPUManager(or > ExternalServicesManagers in future) is conceptually one of the task > manager services, just like MemoryManage

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-13 Thread Stephan Ewen
> > Can we somehow keep this out of the TaskManager services > I fear that we could not. IMO, the GPUManager(or > ExternalServicesManagers in future) is conceptually one of the task > manager services, just like MemoryManager before 1.10. > - It maintains/holds the GPU resource at TM level and all

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-13 Thread Yangze Guo
Thanks for the feedback, Stephan. > Can we somehow keep this out of the TaskManager services I fear that we could not. IMO, the GPUManager(or ExternalServicesManagers in future) is conceptually one of the task manager services, just like MemoryManager before 1.10. - It maintains/holds the GPU reso

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-13 Thread Stephan Ewen
It sounds fine to initially start with GPU specific support and think about generalizing this once we better understand the space. About the implementation suggested in FLIP-108: - Can we somehow keep this out of the TaskManager services? Anything we have to pull through all layers of the TM mak

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-03 Thread Yangze Guo
Thanks for all the feedback. @Becket Regarding the WebUI and GPUInfo, you're right, I'll add them to the Public API section. @Stephan @Becket Regarding the general extended resource mechanism, I second Xintong's suggestion. - It's better to leverage ResourceProfile and ResourceSpec after we sup

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-03 Thread Xingbo Huang
Thanks a lot for the FLIP, Yangze. There is no doubt that GPU resource management support will greatly facilitate the development of AI-related applications by PyFlink users. I have only one comment about this wiki: Regarding the names of several GPU configurations, I think it is better to delet

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-03 Thread Xintong Song
@Stephan, @Becket, Actually, Yangze, Yang and I also had an offline discussion about making the "GPU Support" a more general "Extended Resource Support". We believe supporting extended resources through a general mechanism is definitely a good and extensible way. The reason we propose this FLIP narrow

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-03 Thread Becket Qin
That's a good point, Stephan. It makes total sense to generalize the resource management to support custom resources. Having that allows users to add new resources by themselves. The general resource management may involve two different aspects: 1. The custom resource type definition. It is suppor

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-03 Thread Stephan Ewen
Thank you for writing this FLIP. I cannot really give much input into the mechanics of GPU-aware scheduling and GPU allocation, as I have no experience with that. One thought I had when reading the proposal is if it makes sense to look at the "GPU Manager" as an "External Resource Manager", and G

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-02 Thread Becket Qin
Thanks for the FLIP, Yangze. GPU resource management support is a must-have for machine learning use cases. Actually it is one of the most frequently asked questions from users who are interested in using Flink for ML. Some quick comments / questions on the wiki. 1. The WebUI / REST API should probably a

Re: [DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-02 Thread Xintong Song
Thanks for drafting the FLIP and kicking off the discussion, Yangze. Big +1 for this feature. Supporting the use of GPUs in Flink is significant, especially for the ML scenarios. I've reviewed the FLIP wiki doc and it looks good to me. I think it's a very good first step for Flink's GPU support. Th

[DISCUSS] FLIP-108: Add GPU support in Flink

2020-03-01 Thread Yangze Guo
Hi everyone, We would like to start a discussion thread on "FLIP-108: Add GPU support in Flink"[1]. This FLIP mainly discusses the following issues: - Enable users to configure how many GPUs a task executor has and forward such requirements to the external resource managers (for Kubernetes/Yarn/Me