jtuglu1 commented on code in PR #19357:
URL: https://github.com/apache/druid/pull/19357#discussion_r3196625224


##########
processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/SpillingGrouper.java:
##########
@@ -345,12 +354,103 @@ public Entry<KeyType> apply(Entry<KeyType> entry)
 
   private void spill() throws IOException
   {
+    // Stream directly to a temp file first, then check the file size. If the file is small
+    // (serialized size much smaller than the pre-allocated buffer, e.g. HLL sketches in List mode),
+    // read it back into memory for batching to avoid creating thousands of tiny disk files.
+    // If the file is already large enough, keep it on disk as-is.
+    final File file;
     try (CloseableIterator<Entry<KeyType>> iterator = grouper.iterator(true)) {
-      files.add(spill(iterator));
-      dictionaryFiles.add(spill(keySerde.getDictionary().iterator()));
+      file = spill(iterator);
+    }
+    pendingDictionaryEntries.addAll(keySerde.getDictionary());
+    grouper.reset();
+
+    final long fileSize = file.length();

Review Comment:
   That's fine. I already mentioned point #1 as another reason against this, but 
I think #2 is really carrying the weight here. #3 isn't too relevant IMO 
because you can just call `pendingSpillRuns.ensureCapacity(fileSize);` (using the 
heuristic that files will be approximately the same size).
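
   To make the pre-sizing idea concrete, here is a rough, standalone sketch of 
what I have in mind. It is not the PR's code: `SpillSketch`, `pendingRuns`, 
`minFileSize`, and `flushPending` are hypothetical names, and I am assuming the 
pending runs live in an `ArrayList<byte[]>`, which may not match the actual type 
of `pendingSpillRuns`.

```java
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from SpillingGrouper.
class SpillSketch
{
  // Small spill runs held in memory until they are batched into one larger file.
  final ArrayList<byte[]> pendingRuns = new ArrayList<>();
  // Spill files that are kept on disk.
  final List<File> files = new ArrayList<>();
  // Assumed threshold in bytes below which a spill file counts as "small".
  final long minFileSize;
  long pendingBytes = 0;

  SpillSketch(long minFileSize)
  {
    this.minFileSize = minFileSize;
  }

  void spill(byte[] serialized) throws IOException
  {
    // Stream to a temp file first, then decide based on the on-disk size.
    File file = File.createTempFile("spill", ".tmp");
    file.deleteOnExit();
    Files.write(file.toPath(), serialized);

    long fileSize = file.length();
    if (fileSize >= minFileSize) {
      // Already large enough: keep it on disk as-is.
      files.add(file);
      return;
    }

    // Heuristic: later runs will be roughly as large as this one, so reserve
    // enough slots to reach minFileSize before the list has to grow.
    if (fileSize > 0) {
      long expectedRuns = (minFileSize + fileSize - 1) / fileSize;
      pendingRuns.ensureCapacity((int) Math.min(expectedRuns, 1 << 16));
    }

    // Small file: read it back into memory and drop the tiny file.
    pendingRuns.add(Files.readAllBytes(file.toPath()));
    pendingBytes += fileSize;
    Files.delete(file.toPath());

    if (pendingBytes >= minFileSize) {
      flushPending();
    }
  }

  // Write all pending runs into a single batched file.
  void flushPending() throws IOException
  {
    File batch = File.createTempFile("spill-batch", ".tmp");
    batch.deleteOnExit();
    try (OutputStream out = Files.newOutputStream(batch.toPath())) {
      for (byte[] run : pendingRuns) {
        out.write(run);
      }
    }
    files.add(batch);
    pendingRuns.clear();
    pendingBytes = 0;
  }
}
```

   The point is only that the capacity can be reserved up front from the first 
observed file size, so growth of the pending collection should not be a 
deciding factor.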



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

