FrankChen021 commented on code in PR #19357:
URL: https://github.com/apache/druid/pull/19357#discussion_r3189545609
##########
processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/SpillingGrouper.java:
##########
@@ -345,14 +362,105 @@ public Entry<KeyType> apply(Entry<KeyType> entry)
private void spill() throws IOException
{
+ // Stream directly to a temp file first, then check the file size. If the file is small
+ // (serialized size much smaller than the pre-allocated buffer, e.g. HLL sketches in List mode),
+ // read it back into memory for batching to avoid creating thousands of tiny disk files.
+ // If the file is already large enough, keep it on disk as-is.
+ final File file;
try (CloseableIterator<Entry<KeyType>> iterator = grouper.iterator(true)) {
- files.add(spill(iterator));
- dictionaryFiles.add(spill(keySerde.getDictionary().iterator()));
+ file = spill(iterator);
+ }
+ pendingDictionaryEntries.addAll(keySerde.getDictionary());
+ grouper.reset();
+
+ final long fileSize = file.length();
+ if (fileSize < MIN_SPILL_FILE_BYTES) {
+ pendingSpillRuns.add(Files.readAllBytes(file.toPath()));
Review Comment:
[P1] Deleted staging spills still consume the disk quota
This path writes every small spill through LimitedTemporaryStorage, reads it
back into heap, then deletes the temp file. LimitedTemporaryStorage.delete only
removes the file from the file set; it does not decrement bytesUsed, even though
LimitedOutputStream.grab has already charged those bytes against
maxOnDiskStorage. As a result, high-cardinality small-spill queries can hit
TemporaryStorageFullException even though the staging files were deleted and
no persistent spill file exists yet. Later flushes then charge the same data a
second time when writing the merged file. This undermines the batching
optimization and can fail queries well below their configured on-disk limit.
Either small runs should avoid charging LimitedTemporaryStorage, or the
accounting needs to refund the bytes of deleted staging files.
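To illustrate the second option (refunding on delete), here is a minimal
sketch. It uses a simplified stand-in class, not Druid's actual
LimitedTemporaryStorage: the names QuotaTrackingStorage, charge, and delete are
hypothetical, and real code would also need to handle the file I/O and the
stream that does the charging.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical stand-in for a size-limited temporary storage.
 * This is NOT Druid's LimitedTemporaryStorage; it only models the accounting.
 */
final class QuotaTrackingStorage
{
  private final long maxBytes;
  private long bytesUsed;
  // Bytes charged per file, so that a delete knows how much to refund.
  private final Map<String, Long> fileSizes = new HashMap<>();

  QuotaTrackingStorage(long maxBytes)
  {
    this.maxBytes = maxBytes;
  }

  /** Charges numBytes against the quota on behalf of the named file. */
  synchronized void charge(String name, long numBytes)
  {
    if (bytesUsed + numBytes > maxBytes) {
      throw new IllegalStateException(
          "quota exceeded: " + (bytesUsed + numBytes) + " > " + maxBytes
      );
    }
    bytesUsed += numBytes;
    fileSizes.merge(name, numBytes, Long::sum);
  }

  /** Forgets the named file and refunds every byte charged to it. */
  synchronized void delete(String name)
  {
    final Long size = fileSizes.remove(name);
    if (size != null) {
      bytesUsed -= size;
    }
  }

  synchronized long bytesUsed()
  {
    return bytesUsed;
  }
}
```

With this accounting, a staging file that is charged and then deleted leaves
bytesUsed unchanged, so only the merged file written at flush time counts
against the limit.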
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.