Fanoid commented on code in PR #210:
URL: https://github.com/apache/flink-ml/pull/210#discussion_r1206315147


##########
flink-ml-core/src/main/java/org/apache/flink/ml/common/sharedobjects/SharedObjectsStreamOperator.java:
##########
@@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.common.sharedobjects;
+
+/** Interface for all operators that need to access the shared objects. */
+public interface SharedObjectsStreamOperator {

Review Comment:
   #237 proposes a brilliant solution about abstracting iterative computations 
inspired by Parameter Server. I believe both solutions can work well in many 
algorithms/scenarios.
   
   Before comparing two solutions, there are two facts about PS infra I must 
emphasize:
   
   One fact is the PS infra is built on the DataStream APIs, which means there 
will be no performance improvement compared to implement with raw DataStream 
APIs. So we mainly discuss its usability with aspect to developers.
   
   The other fact is the current status of functionalities shown in #237 cannot 
fully meet the requirements of GBDT implementation.  `MessageType`, model 
format, reduce logic of messages, etc. are all fixed/hard-coded with respect to 
gradient-based algorithms. The usability will drop significantly if forcing 
GBDT implementation to use current APIs.
   
   Therefore, to make a reasonable comparison between two solutions, I assume 
an extended version of current PS infra which supports POJO message types and 
POJO model data, user-defined `reduce` function, etc. Here are my thoughts 
under this assumption:
   
   1. Framework Intrusiveness
   
   Using PS infra means developers cannot use DataStream APIs in iterations 
anymore. Then, there are cases where PS infra cannot implement: 
     - side outputs: evaluation result streams; prediction and model streams in 
online cases.
     - partition/join/coGroup of training data sets: AUC calculation after 
model update, ALS, SimRank.
   
   As for SharedObjects, it is an augment to DataStream APIs. There is no extra 
limitation to developers.
   
   The intrusiveness also influences the observation of operators when job 
running as, in PS infra, multiple computations are merged in to one operator, 
like in/out stats, checkpoint status. This decreases usability to both 
developers and end-users.
   
   2. Applicable scenarios
   
   Besides inapplicable cases mentioned above, PS infra cannot work in 
non-iteration cases. But SharedObjects can work. One possible case is to 
improve consecutive joins with a same dataset by reducing a copy of dataset.
   
   3. Learning curve
   
   PS infra provides a whole set of concepts and interfaces such as Message, 
ModelUpdater, ProcessStage, TrainingUtils, etc., which are not related to the 
existing DataStream API and have a steeper learning curve.
   
   SharedObjects provides two interfaces, SharedObjectsUtils and 
SharedObjectsContext, and can be developed directly based on the existing 
DataStream API code, making it easier for developers to accept.
   
   
   Overall speaking, I think both solutions can coexist because they are on 
different levels of APIs and have no conflicts. How about you? @zhipeng93 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to