[GitHub] [flink-ml] yunfengzhou-hub commented on a diff in pull request #97: [FLINK-27096] Improve DataCache and KMeans Performance

GitBox Sun, 05 Jun 2022 02:09:56 -0700


yunfengzhou-hub commented on code in PR #97:
URL: https://github.com/apache/flink-ml/pull/97#discussion_r889669585



##########
flink-ml-iteration/src/main/java/org/apache/flink/iteration/datacache/nonkeyed/DataCacheIterator.java:
##########
@@ -0,0 +1,132 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.iteration.datacache.nonkeyed;
+
+import org.apache.flink.annotation.Internal;
+import org.apache.flink.api.common.typeutils.TypeSerializer;
+import org.apache.flink.api.java.tuple.Tuple2;
+import org.apache.flink.util.Preconditions;
+
+import java.io.IOException;
+import java.util.Iterator;
+import java.util.List;
+
+/** Reads the cached data from a list of segments. */
+@Internal
+public class DataCacheIterator<T> implements Iterator<T> {

Review Comment:
   The original API is coupled together, so whenever the users use 
`DataCacheWriter`, they would also use `DataCacheSnapshot` and 
`DataCacheReader`. In this case I think it is better to merge them into one 
class.
   
   Another advantage of the current API is that it better hides implementation 
details from developers.  If we still use the original API. we would have to 
preserve the `Segment` class in order to pass information between 
`DataCacheWriter`, `DataCacheSnapshot` and `DataCacheReader`, and developers 
need to learn the notion of `Segment` in order to use the data cache mechanism. 
While in previous offline discussion, we agreed that there might not be a 
unified `Segment` concept and might keep `FileSegment` and `MemorySegment` 
separately in internal implementation. We might also find that there is a 
better internal representation of the cache unit than `xxxSegment` in future, 
but it would be hard for us to make such improvement then because `Segment` has 
become a public API.
   
   Since the `DataCache` holds most functions of previous `DataCacheWriter` and 
`DataCacheSnapshot`, naming it as `DataCacheWriter` might not be suitable now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [flink-ml] yunfengzhou-hub commented on a diff in pull request #97: [FLINK-27096] Improve DataCache and KMeans Performance

Reply via email to