FrankChen021 commented on code in PR #19357:
URL: https://github.com/apache/druid/pull/19357#discussion_r3189545609


##########
processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/SpillingGrouper.java:
##########
@@ -345,14 +362,105 @@ public Entry<KeyType> apply(Entry<KeyType> entry)
 
   private void spill() throws IOException
   {
+    // Stream directly to a temp file first, then check the file size. If the file is small
+    // (serialized size much smaller than the pre-allocated buffer, e.g. HLL sketches in List mode),
+    // read it back into memory for batching to avoid creating thousands of tiny disk files.
+    // If the file is already large enough, keep it on disk as-is.
+    final File file;
     try (CloseableIterator<Entry<KeyType>> iterator = grouper.iterator(true)) {
-      files.add(spill(iterator));
-      dictionaryFiles.add(spill(keySerde.getDictionary().iterator()));
+      file = spill(iterator);
+    }
+    pendingDictionaryEntries.addAll(keySerde.getDictionary());
+    grouper.reset();
+
+    final long fileSize = file.length();
+    if (fileSize < MIN_SPILL_FILE_BYTES) {
+      pendingSpillRuns.add(Files.readAllBytes(file.toPath()));

Review Comment:
   [P1] Deleted staging spills still consume the disk quota
   
   This path writes every small spill through LimitedTemporaryStorage, reads it
   back into heap, then deletes the temp file. LimitedTemporaryStorage.delete
   only removes the file from the file set; it does not decrement bytesUsed, and
   LimitedOutputStream.grab has already charged those bytes against
   maxOnDiskStorage.

   As a result, high-cardinality small-spill queries can hit
   TemporaryStorageFullException even though the staging files were deleted and
   no persistent spill file exists yet. Later flushes then charge the same data
   a second time when writing the merged file.

   This undermines the batching optimization and can fail queries well below
   their configured on-disk limit. Either small runs should avoid charging
   LimitedTemporaryStorage, or the accounting needs to refund deleted staging
   bytes.
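
   To make the accounting concern concrete, here is a minimal sketch using a
   simplified quota tracker. QuotaSketch, its methods, and the refundOnDelete
   flag are hypothetical stand-ins invented for illustration; they are not
   Druid's LimitedTemporaryStorage API. The sketch only assumes the behavior
   described above: bytes are charged on write, and delete removes the file
   entry without necessarily refunding the charge.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified quota tracker; NOT Druid's LimitedTemporaryStorage. */
final class QuotaSketch {
  private final long maxBytes;
  private final boolean refundOnDelete;
  private final Map<String, Long> files = new HashMap<>();
  private long bytesUsed = 0;

  QuotaSketch(long maxBytes, boolean refundOnDelete) {
    this.maxBytes = maxBytes;
    this.refundOnDelete = refundOnDelete;
  }

  /** Charges the quota for a write, failing once the limit would be exceeded. */
  void write(String name, long size) {
    if (bytesUsed + size > maxBytes) {
      throw new IllegalStateException("temporary storage full");
    }
    bytesUsed += size;
    files.merge(name, size, Long::sum);
  }

  /** Removes the file entry; refunds its bytes only when refundOnDelete is set. */
  void delete(String name) {
    Long size = files.remove(name);
    if (size != null && refundOnDelete) {
      bytesUsed -= size;
    }
  }

  long getBytesUsed() {
    return bytesUsed;
  }

  /**
   * Simulates the staging pattern: write a small spill, read it back (elided),
   * then delete it. Returns how many spills succeed before the quota is hit,
   * capped at the number of attempts.
   */
  static int stagedSpillsBeforeFull(QuotaSketch storage, int attempts, long spillSize) {
    for (int i = 0; i < attempts; i++) {
      try {
        storage.write("staging-" + i, spillSize);
      } catch (IllegalStateException e) {
        return i;
      }
      storage.delete("staging-" + i);
    }
    return attempts;
  }
}
```

   With a 1000-byte limit and 100-byte staging spills, the non-refunding
   variant fails on the eleventh spill even though no file remains on disk,
   while the refunding variant completes every attempt with zero bytes charged.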



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

