Ext3h commented on code in PR #48192:
URL: https://github.com/apache/arrow/pull/48192#discussion_r2569505334


##########
cpp/src/arrow/util/compression_zstd.cc:
##########
@@ -187,9 +202,14 @@ class ZSTDCodec : public Codec {
       DCHECK_EQ(output_buffer_len, 0);
       output_buffer = &empty_buffer;
     }
-
-    size_t ret = ZSTD_decompress(output_buffer, 
static_cast<size_t>(output_buffer_len),
-                                 input, static_cast<size_t>(input_len));
+    // Decompression context for ZSTD contains several large heap allocations.

Review Comment:
   > Arrow uses its internal thread pool extensively, so that doesn't apply 
here.
   
   One global thread pool, and the size of that pool is bounded to the number 
of CPU cores. From my perspective, this means that there is already a sane 
limit in the number of threads.
   
   Meanwhile, for any application I can think of, it should be more or less be 
safe to assume that any thread that did require a `thread_local` scoped context 
once, will also either need it again during the lifetime of the process, or the 
thread will have a limited life time itself. It's not like someone is going to 
spawn thousands of threads for single-shot compression, and then somehow 
additionally keeps all of them around indefinitely without ever using them for 
the same task again. (While also not having enough RAM to pay for that 
allocated context, while still having enough RAM to pay for the remaining 
overhead of an OS thread...)
   
   > By the way, a simpler alternative to this non-static ThreadLocal class is 
to manage a bounded-size freelist of contexts at the Codec level. This would 
probably cut down on implementation complexity, though it would also limit 
context reuse if the number of decompressing threads goes above the limit.
   
   That would imply a global synchronization primitive on the freelist, and 
also completely messes up the cache locality in case the cached context is 
jumping cores due to the non-local properties of any naively implemented 
freelist. You'd at least have to implement an instance stealing pattern in 
order to preserve locality in the good path. Also "bounded-size" - assuming 
that you had wanted to block until an instance became free - means there's now 
a source of priority inversion, and the only way to avoid is to 
"coincidentally" set the upper bound exactly to the number of threads.
   
   The only potential benefit was to be had in the case of potentially 
short-lived threads - but the overhead of creating a fresh thread dominates the 
cost for that ZSTD context allocation bound to the TLS by magnitudes so 
certainly not worth the effort to optimize for.
   
   
   By all means, `thread_local` is appropriate here for holding the context.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to