[GitHub] [flink-ml] yunfengzhou-hub commented on a diff in pull request #97: [FLINK-27096] Improve DataCache and KMeans Performance

GitBox Fri, 03 Jun 2022 01:49:40 -0700


yunfengzhou-hub commented on code in PR #97:
URL: https://github.com/apache/flink-ml/pull/97#discussion_r888753223



##########
flink-ml-iteration/src/main/java/org/apache/flink/iteration/datacache/nonkeyed/DataCacheSnapshot.java:
##########
@@ -90,18 +90,18 @@ public void writeTo(OutputStream checkpointOutputStream) throws IOException {
             }
 
             dos.writeBoolean(fileSystem.isDistributedFS());
+            for (Segment segment : segments) {
+                persistSegmentToDisk(segment);
+            }
             if (fileSystem.isDistributedFS()) {
                 // We only need to record the segments itself
                 serializeSegments(segments, dos);
             } else {
                 // We have to copy the whole streams.
-                int totalRecords = segments.stream().mapToInt(Segment::getCount).sum();
-                long totalSize = segments.stream().mapToLong(Segment::getSize).sum();
-                checkState(totalRecords >= 0, "overflowed: " + totalRecords);
-                dos.writeInt(totalRecords);
-                dos.writeLong(totalSize);
-
+                dos.writeInt(segments.size());
                 for (Segment segment : segments) {
+                    dos.writeInt(segment.getCount());

Review Comment:
   Because the maximum size of a segment is limited, for example by the 
maximum file size that the underlying filesystem allows. If we merged all 
segments into one during the snapshot, we could exceed such a limit and 
trigger errors. Treating each segment separately avoids these errors without 
adding much overhead to the snapshot process.
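
To illustrate the per-segment layout discussed here, the following is a minimal, self-contained sketch and not the code from this PR. `PerSegmentSnapshotSketch`, `SegmentData`, `writeSegments` and `readSegments` are hypothetical names, segments are modeled as in-memory byte arrays, and the `int` length field written after each count is an assumption, since the diff above is cut off right after `dos.writeInt(segment.getCount())`.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PerSegmentSnapshotSketch {

    /** Hypothetical stand-in for a cached segment: a record count plus its raw bytes. */
    static final class SegmentData {
        final int count;
        final byte[] bytes;

        SegmentData(int count, byte[] bytes) {
            this.count = count;
            this.bytes = bytes;
        }
    }

    /**
     * Writes the number of segments, then each segment as its own unit:
     * record count, byte length (assumed field), and the bytes themselves.
     * No segment is merged with another, so segment boundaries survive the snapshot.
     */
    static void writeSegments(List<SegmentData> segments, DataOutputStream dos)
            throws IOException {
        dos.writeInt(segments.size());
        for (SegmentData segment : segments) {
            dos.writeInt(segment.count);
            dos.writeInt(segment.bytes.length);
            dos.write(segment.bytes);
        }
    }

    /** Reads the segments back one by one, so each can be restored separately. */
    static List<SegmentData> readSegments(DataInputStream dis) throws IOException {
        int numSegments = dis.readInt();
        List<SegmentData> segments = new ArrayList<>(numSegments);
        for (int i = 0; i < numSegments; i++) {
            int count = dis.readInt();
            byte[] bytes = new byte[dis.readInt()];
            dis.readFully(bytes);
            segments.add(new SegmentData(count, bytes));
        }
        return segments;
    }
}
```

On restore, a reader of this format can recreate one file per segment instead of a single large one, which is what keeps each restored file within the filesystem's size limit. The extra cost compared with the merged layout is one count and one length header per segment.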



