Fanoid commented on code in PR #210: URL: https://github.com/apache/flink-ml/pull/210#discussion_r1206315147
########## flink-ml-core/src/main/java/org/apache/flink/ml/common/sharedobjects/SharedObjectsStreamOperator.java: ########## @@ -0,0 +1,38 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.flink.ml.common.sharedobjects; + +/** Interface for all operators that need to access the shared objects. */ +public interface SharedObjectsStreamOperator { Review Comment: #237 proposes a brilliant solution about abstracting iterative computations inspired by Parameter Server. I believe both solutions can work well in many algorithms/scenarios. Before comparing two solutions, there are two facts about PS infra I must emphasize: One fact is the PS infra is built on the DataStream APIs, which means there will be no performance improvement compared to implement with raw DataStream APIs. So we mainly discuss its usability with aspect to developers. The other fact is the current status of functionalities shown in #237 cannot fully meet the requirements of GBDT implementation. `MessageType`, model format, reduce logic of messages, etc. are all fixed/hard-coded with respect to gradient-based algorithms. The usability will drop significantly if forcing GBDT implementation to use current APIs. Therefore, to make a reasonable comparison between two solutions, I assume an extended version of current PS infra which supports POJO message types and POJO model data, user-defined `reduce` function, etc. Here are my thoughts under this assumption: 1. Framework Intrusiveness Using PS infra means developers cannot use DataStream APIs in iterations anymore. Then, there are cases where PS infra cannot implement: - side outputs: evaluation result streams; prediction and model streams in online cases. - partition/join/coGroup of training data sets: AUC calculation after model update, ALS, SimRank. As for SharedObjects, it is an augment to DataStream APIs. There is no extra limitation to developers. The intrusiveness also influences the observation of operators when job running as, in PS infra, multiple computations are merged in to one operator, like in/out stats, checkpoint status. This decreases usability to both developers and end-users. 2. Applicable scenarios Besides inapplicable cases mentioned above, PS infra cannot work in non-iteration cases. But SharedObjects can work. One possible case is to improve consecutive joins with a same dataset by reducing a copy of dataset. 3. Learning curve PS infra provides a whole set of concepts and interfaces such as Message, ModelUpdater, ProcessStage, TrainingUtils, etc., which are not related to the existing DataStream API and have a steeper learning curve. SharedObjects provides two interfaces, SharedObjectsUtils and SharedObjectsContext, and can be developed directly based on the existing DataStream API code, making it easier for developers to accept. Overall speaking, I think both solutions can coexist because they are on different levels of APIs and have no conflicts. How about you? @zhipeng93 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org