maytasm commented on code in PR #19357:
URL: https://github.com/apache/druid/pull/19357#discussion_r3193316749
##########
processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/SpillingGrouper.java:
##########
@@ -345,12 +354,103 @@ public Entry<KeyType> apply(Entry<KeyType> entry)
private void spill() throws IOException
{
+ // Stream directly to a temp file first, then check the file size. If the
file is small
+ // (serialized size much smaller than the pre-allocated buffer, e.g. HLL
sketches in List mode),
+ // read it back into memory for batching to avoid creating thousands of
tiny disk files.
+ // If the file is already large enough, keep it on disk as-is.
+ final File file;
try (CloseableIterator<Entry<KeyType>> iterator = grouper.iterator(true)) {
- files.add(spill(iterator));
- dictionaryFiles.add(spill(keySerde.getDictionary().iterator()));
+ file = spill(iterator);
+ }
+ pendingDictionaryEntries.addAll(keySerde.getDictionary());
+ grouper.reset();
+
+ final long fileSize = file.length();
Review Comment:
I don't think it's worth optimizing.
1. ArrayList.clear() sets size to 0 but keeps the backing array, so after
the first batch flushes, the capacity is already there for subsequent batches.
2. Even on the first batch, the cost is trivial. Say we have ~170 small
spills per 1MB batch — that's about 7-8 doublings from the default capacity of
10. Each doubling copies an array of object references (8 bytes each), so the
total copy overhead is on the order of a few KB. Negligible compared to the MB
of actual spill data being handled.
3. Pre-allocating would require estimating how many spills fit in a batch
(minSpillFileSize / avgSpillSize), but the spill size isn't known ahead of time
— it depends on the data. So the estimate would be a guess anyway.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]