pan3793 commented on code in PR #8461:
URL: https://github.com/apache/hadoop/pull/8461#discussion_r3146376526


##########
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.java:
##########
@@ -209,7 +229,16 @@ public int compress(byte[] b, int off, int len) throws 
IOException {
     compressedDirectBuf.position(0);
     compressedDirectBuf.limit(directBufferSize);
 
-    EndDirective endOp = shouldEnd ? EndDirective.END : EndDirective.FLUSH;
+    // CONTINUE should be used for non-end case, to support multi-threaded:
+    // 1. CONTINUE + workers ≥ 1: non-blocking. The call copies as much input
+    //      as it can into a job, dispatches to workers, drains whatever output
+    //      is ready, and returns. Multiple jobs can be in flight in parallel.
+    // 2. FLUSH + workers ≥ 1: multi-threaded compression will block to flush
+    //      as much output as possible. The call won't return until every 
queued
+    //      job has finished and its output has been drained to the dst buffer.
+    // 3. END + workers ≥ 1: same as FLUSH but also closes the frame. Same
+    //      blocking behavior.
+    EndDirective endOp = shouldEnd ? EndDirective.END : EndDirective.CONTINUE;

Review Comment:
   my spark integration test shows it has no effect when using `FLUSH` - 
setting workers to 4 has the same cpu usage and wall-clock time as the default 
workers 0. while after the change to `CONTINUE`, the cpu average usage takes 
~3.5x



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to