Re: [PR] GH-48187: [C++] Cache ZSTD compression/decompression context [arrow]
Ext3h commented on code in PR #48192:
URL: https://github.com/apache/arrow/pull/48192#discussion_r2569505334
##
cpp/src/arrow/util/compression_zstd.cc:
##
@@ -187,9 +202,14 @@ class ZSTDCodec : public Codec {
       DCHECK_EQ(output_buffer_len, 0);
       output_buffer = &empty_buffer;
     }
-
-    size_t ret = ZSTD_decompress(output_buffer, static_cast<size_t>(output_buffer_len),
-                                 input, static_cast<size_t>(input_len));
+    // Decompression context for ZSTD contains several large heap allocations.
Review Comment:
> Arrow uses its internal thread pool extensively, so that doesn't apply
here.
One global thread pool, and the size of that pool is bounded by the number
of CPU cores. From my perspective, this means that there is already a sane
limit on the number of threads.
Meanwhile, for any application I can think of, it should be more or less
safe to assume that any thread that required a `thread_local` scoped context
once will either need it again during the lifetime of the process, or the
thread will have a limited lifetime itself. It's not like someone is going to
spawn thousands of threads for single-shot compression, and then somehow
additionally keep all of them around indefinitely without ever using them for
the same task again. (While also not having enough RAM to pay for that
allocated context, but still having enough RAM to pay for the remaining
overhead of an OS thread...)
> By the way, a simpler alternative to this non-static ThreadLocal class is
to manage a bounded-size freelist of contexts at the Codec level. This would
probably cut down on implementation complexity, though it would also limit
context reuse if the number of decompressing threads goes above the limit.
That would imply a global synchronization primitive on the freelist, and it
also completely messes up cache locality when the cached context jumps cores,
due to the non-local properties of any naively implemented freelist. You'd at
least have to implement an instance-stealing pattern in order to preserve
locality on the good path. Also, "bounded-size" - assuming that you had wanted
to block until an instance became free - means there's now a source of priority
inversion, and the only way to avoid it is to "coincidentally" set the upper
bound exactly to the number of threads.
The only potential benefit would be in the case of potentially short-lived
threads - but the overhead of creating a fresh thread dominates the cost of
that ZSTD context allocation bound to the TLS by orders of magnitude, so it's
certainly not worth the effort to optimize for.
By all means, `thread_local` is appropriate here for holding the context.
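[Editor's sketch] The `thread_local` pattern argued for above can be illustrated as follows. This is a minimal sketch, not the PR's actual code: `DCtx` is a dummy stand-in for `ZSTD_DCtx`, and `GetThreadLocalDCtx` is a hypothetical helper name.

```cpp
#include <cassert>
#include <memory>

// Dummy stand-in for ZSTD_DCtx; the real context carries several large
// heap allocations, which is exactly why caching it pays off.
struct DCtx {
  int scratch = 0;
};

// One cached context per thread, created lazily on first use and kept
// alive until the thread itself exits.
DCtx* GetThreadLocalDCtx() {
  thread_local std::unique_ptr<DCtx> ctx;
  if (!ctx) {
    ctx = std::make_unique<DCtx>();
  }
  return ctx.get();
}
```

Every call from the same thread returns the same context, so repeated single-shot decompressions on a pool thread pay the allocation cost only once.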
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] GH-48187: [C++] Cache ZSTD compression/decompression context [arrow]
pitrou commented on code in PR #48192:
URL: https://github.com/apache/arrow/pull/48192#discussion_r2569353687
##
cpp/src/arrow/util/compression_zstd.cc:
##
@@ -187,9 +202,14 @@ class ZSTDCodec : public Codec {
       DCHECK_EQ(output_buffer_len, 0);
       output_buffer = &empty_buffer;
     }
-
-    size_t ret = ZSTD_decompress(output_buffer, static_cast<size_t>(output_buffer_len),
-                                 input, static_cast<size_t>(input_len));
+    // Decompression context for ZSTD contains several large heap allocations.
Review Comment:
By the way, a simpler alternative to this non-static `ThreadLocal` class is
to manage a bounded-size freelist of contexts at the Codec level. This would
probably cut down on implementation complexity, though it would also limit
context reuse if the number of decompressing threads goes above the limit.
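[Editor's sketch] The bounded-size freelist suggested here could look roughly like this, assuming a non-blocking design where threads above the bound allocate a fresh context and discard it afterwards. `DecompressionContext` and the class name are illustrative, not Arrow API.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the real ZSTD decompression context.
struct DecompressionContext {
  int scratch = 0;
};

// Bounded-size freelist owned by the codec. Acquire() never blocks: when the
// freelist is empty, a fresh context is allocated. Release() drops contexts
// beyond the bound instead of blocking, avoiding priority inversion.
class ContextFreelist {
 public:
  explicit ContextFreelist(std::size_t max_cached) : max_cached_(max_cached) {}

  std::unique_ptr<DecompressionContext> Acquire() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (!cached_.empty()) {
        auto ctx = std::move(cached_.back());
        cached_.pop_back();
        return ctx;
      }
    }
    // Freelist exhausted: allocate outside the lock.
    return std::make_unique<DecompressionContext>();
  }

  void Release(std::unique_ptr<DecompressionContext> ctx) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (cached_.size() < max_cached_) {
      cached_.push_back(std::move(ctx));
    }
    // Otherwise `ctx` is simply destroyed on scope exit.
  }

 private:
  const std::size_t max_cached_;
  std::vector<std::unique_ptr<DecompressionContext>> cached_;
  std::mutex mutex_;
};
```

Note that the single mutex is a global synchronization point, which is the cache-locality concern raised elsewhere in this thread.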
Re: [PR] GH-48187: [C++] Cache ZSTD compression/decompression context [arrow]
pitrou commented on code in PR #48192:
URL: https://github.com/apache/arrow/pull/48192#discussion_r2569344430
##
cpp/src/arrow/util/compression_zstd.cc:
##
@@ -187,9 +202,14 @@ class ZSTDCodec : public Codec {
       DCHECK_EQ(output_buffer_len, 0);
       output_buffer = &empty_buffer;
     }
-
-    size_t ret = ZSTD_decompress(output_buffer, static_cast<size_t>(output_buffer_len),
-                                 input, static_cast<size_t>(input_len));
+    // Decompression context for ZSTD contains several large heap allocations.
Review Comment:
> Can't really say I like the idea, though: bad use of that interface
(i.e. not _**explicitly**_ sharing the factory as the user of the outermost
interface!) would still inevitably result in carrying multiple instances of the
thread-local scope.
Two non-exclusive answers: 1) documentation 2) make caching optional in the
`Codec` constructor
> Makes it harder to use, and potentially even backfires in terms of peak
memory consumption. Even worse, that map in the TLS is accumulating a memory
leak of expired shared ptrs - which due to extensive use of `std::make_shared`
are usually actually fused allocations.
Right, but the underlying compression context will be destroyed anyway,
which is what matters.
(and a mitigation for this is to scrub expired weak_ptrs depending on
heuristics)
> assuming that threads are usually a resource developers are good at
tracking and tearing down when no longer intended for re-use
Arrow uses its internal thread pool extensively, so that doesn't apply here.
Re: [PR] GH-48187: [C++] Cache ZSTD compression/decompression context [arrow]
Ext3h commented on code in PR #48192:
URL: https://github.com/apache/arrow/pull/48192#discussion_r2569244891
##
cpp/src/arrow/util/compression_zstd.cc:
##
@@ -187,9 +202,14 @@ class ZSTDCodec : public Codec {
       DCHECK_EQ(output_buffer_len, 0);
       output_buffer = &empty_buffer;
     }
-
-    size_t ret = ZSTD_decompress(output_buffer, static_cast<size_t>(output_buffer_len),
-                                 input, static_cast<size_t>(input_len));
+    // Decompression context for ZSTD contains several large heap allocations.
Review Comment:
Can't really say I like the idea, though: bad use of that interface
(i.e. not ***explicitly*** sharing the factory as the user of the outermost
interface!) would still inevitably result in carrying multiple instances of the
thread-local scope.
Makes it harder to use, and potentially even backfires in terms of peak
memory consumption. Even worse, that map in the TLS is accumulating a memory
leak of expired shared ptrs - which due to extensive use of `std::make_shared`
are usually actually fused allocations.
I would rather take the risk of leaking a few MB of memory per thread,
assuming that threads are usually a resource developers are good at tracking
and tearing down when no longer intended for re-use.
Re: [PR] GH-48187: [C++] Cache ZSTD compression/decompression context [arrow]
Ext3h commented on code in PR #48192:
URL: https://github.com/apache/arrow/pull/48192#discussion_r2569095045
##
cpp/src/arrow/util/compression_zstd.cc:
##
@@ -187,9 +202,14 @@ class ZSTDCodec : public Codec {
       DCHECK_EQ(output_buffer_len, 0);
       output_buffer = &empty_buffer;
     }
-
-    size_t ret = ZSTD_decompress(output_buffer, static_cast<size_t>(output_buffer_len),
-                                 input, static_cast<size_t>(input_len));
+    // Decompression context for ZSTD contains several large heap allocations.
Review Comment:
You mean like:
```c++
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <set>

class LocalContext {
};

// Container, only updated in the rare case. Doubles as factory for `LocalContext`.
class ContextFactory {
public:
    // To be called by `LocalContextReference`.
    // Returned context is to be explicitly released if the external reference expires first.
    std::shared_ptr<LocalContext> Create() {
        auto instance = std::make_shared<LocalContext>();
        std::scoped_lock lock(door_);
        instances_.emplace(instance);
        return instance;
    }
    void Release(const std::shared_ptr<LocalContext>& instance) {
        assert(instance);
        // This will result in LocalContext expiring outside of the mutex to
        // avoid unnecessary contention.
        std::scoped_lock lock(door_);
        instances_.erase(instance);
    }
private:
    // Strong references are held here, but expiry of TLS can clear them out.
    std::set<std::shared_ptr<LocalContext>> instances_;
    std::mutex door_;
};

// Thread-local cache at the usage location.
class LocalContextReference {
public:
    ~LocalContextReference() {
        // Need to inform all `ContextFactory` that are still alive that they may
        // release the strong reference to the `LocalContext` associated with this thread.
        for (const auto& instance : instances_) {
            // Factory can be expired before this.
            if (auto factory = instance.first.lock()) {
                factory->Release(instance.second.lock());
            }
        }
    }
    std::shared_ptr<LocalContext> Get(const std::shared_ptr<ContextFactory>& factory) {
        assert(factory);
        // Not possible until C++26...
        // Need std::weak_ptr::owner_equal from C++26.
        auto it = instances_.find(factory);
        if (it == instances_.end()) {
            it = instances_.emplace(std::piecewise_construct,
                                    std::forward_as_tuple(factory),
                                    std::forward_as_tuple(factory->Create())).first;
        }
        return it->second.lock();
    }
private:
    std::map<std::weak_ptr<ContextFactory>, std::weak_ptr<LocalContext>,
             std::owner_less<>> instances_;
};

// At usage location.
thread_local LocalContextReference reference;
```
At ***minimum*** this would require a (reasonable...) backport of
`std::weak_ptr::owner_equal` to work with older C++ standards in the line
`instances_.find(factory)`. We cannot use `std::unordered_map` with
`std::weak_ptr` because `std::weak_ptr::owner_hash` - unlike `owner_equal` -
is ***not*** possible to backport, as it requires private interfaces of
`std::weak_ptr`.
Apart from that, it should behave acceptably. `std::map` in uncontested cache
lines / TLS is surprisingly fast. Any creation and destruction of instances is
kept strictly out of contested code locations.
Re: [PR] GH-48187: [C++] Cache ZSTD compression/decompression context [arrow]
pitrou commented on code in PR #48192:
URL: https://github.com/apache/arrow/pull/48192#discussion_r2568843279
##
cpp/src/arrow/util/compression_zstd.cc:
##
@@ -187,9 +202,14 @@ class ZSTDCodec : public Codec {
       DCHECK_EQ(output_buffer_len, 0);
       output_buffer = &empty_buffer;
     }
-
-    size_t ret = ZSTD_decompress(output_buffer, static_cast<size_t>(output_buffer_len),
-                                 input, static_cast<size_t>(input_len));
+    // Decompression context for ZSTD contains several large heap allocations.
Review Comment:
> Unsure about that - the problematic free-threaded case there is the use of
thread pools within feather/ipc. They'd need a `thread_local`-like pattern in
any case. Which means instead of one central `thread_local` there would simply
be one at 3+ code locations instead.
The problem is all those thread locals are as long-lived as the threads
themselves.
What we would need is a non-static `ThreadLocal` class that allows tying the
context lifetime to the Codec instance lifetime, for example.
(this could actually be useful for other purposes)
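[Editor's sketch] One minimal shape such a non-static `ThreadLocal` could take, as a deliberately simplified, hypothetical sketch (not an Arrow class): each instance owns a distinct slot in a per-thread map keyed by instance identity. This version does not yet tie slot lifetime to the instance - a destroyed instance's slot lingers in the map until the thread exits, which is exactly the leak concern discussed in this thread.

```cpp
#include <cassert>
#include <memory>
#include <unordered_map>

// Sketch of a non-static ThreadLocal<T>: each instance owns a distinct
// per-thread slot, looked up by instance identity in a thread_local map.
// Caveat: a slot whose owning instance was destroyed lingers in the map
// until the thread exits.
template <typename T>
class ThreadLocal {
 public:
  T& Get() {
    // One map per thread, shared by all ThreadLocal<T> instances.
    thread_local std::unordered_map<const void*, std::unique_ptr<T>> slots;
    std::unique_ptr<T>& slot = slots[this];
    if (!slot) {
      slot = std::make_unique<T>();
    }
    return *slot;
  }
};
```

Two `ThreadLocal<T>` instances on the same thread thus hold independent values, while each instance still hands out one value per thread.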
Re: [PR] GH-48187: [C++] Cache ZSTD compression/decompression context [arrow]
Ext3h commented on code in PR #48192:
URL: https://github.com/apache/arrow/pull/48192#discussion_r2559600380
##
cpp/src/arrow/util/compression_zstd.cc:
##
@@ -187,9 +202,14 @@ class ZSTDCodec : public Codec {
       DCHECK_EQ(output_buffer_len, 0);
       output_buffer = &empty_buffer;
     }
-
-    size_t ret = ZSTD_decompress(output_buffer, static_cast<size_t>(output_buffer_len),
-                                 input, static_cast<size_t>(input_len));
+    // Decompression context for ZSTD contains several large heap allocations.
Review Comment:
... after checking the Rust implementation, it really should have used a
`thread_local!` scoped context as well. That went badly: it's now creating
that ZSTD context even if LZ4 is selected, it's creating one distinct context
per usage location, and it's still creating a new context for a lot of
potentially short-lived objects. It also missed that there is a need not just
for the `CompressionContext` but also the `DecompressionContext`, specifically
for the IPC library, which uses compression in both directions...
Re: [PR] GH-48187: [C++] Cache ZSTD compression/decompression context [arrow]
Ext3h commented on code in PR #48192:
URL: https://github.com/apache/arrow/pull/48192#discussion_r2559488106
##
cpp/src/arrow/util/compression_zstd.cc:
##
@@ -187,9 +202,14 @@ class ZSTDCodec : public Codec {
       DCHECK_EQ(output_buffer_len, 0);
       output_buffer = &empty_buffer;
     }
-
-    size_t ret = ZSTD_decompress(output_buffer, static_cast<size_t>(output_buffer_len),
-                                 input, static_cast<size_t>(input_len));
+    // Decompression context for ZSTD contains several large heap allocations.
Review Comment:
Unsure about that - the problematic free-threaded case there is the use of
thread pools within feather/ipc. They'd need a `thread_local`-like pattern in
any case. Which means instead of one central `thread_local` there would simply
be one at 3+ code locations instead.
Exposing the context for use with Parquet would require exposing it all the
way out to `parquet::WriterProperties::Builder` - and then you'd possibly still
end up with multiple writer instances wrongly sharing a context, suddenly
rendering threading of those writers impossible. If anything, you'd need to
export a threading-aware "context pool" rather than a context, but that would
be equal to reinventing `thread_local`, except worse in terms of cache locality
and with undesirable extra synchronization primitives.
The Rust implementation did not encounter those issues, as no sharing of the
context is permitted in the first place, being constrained by language
features. Correspondingly, there is also no aggressive threading using a
(potentially) shared state.
Ultimately, having exactly one cached context per thread for the single-shot
compression/decompression API is the recommended usage pattern from the ZSTD
maintainers, and aligns best with the available API:
```c++
/*= Decompression context
* When decompressing many times,
* it is recommended to allocate a context only once,
* and reuse it for each successive compression operation.
* This will make workload friendlier for system's memory.
* Use one context per thread for parallel execution. */
typedef struct ZSTD_DCtx_s ZSTD_DCtx;
```
Re: [PR] GH-48187: [C++] Cache ZSTD compression/decompression context [arrow]
pitrou commented on code in PR #48192:
URL: https://github.com/apache/arrow/pull/48192#discussion_r2559392123
##
cpp/src/arrow/util/compression_zstd.cc:
##
@@ -187,9 +202,14 @@ class ZSTDCodec : public Codec {
       DCHECK_EQ(output_buffer_len, 0);
       output_buffer = &empty_buffer;
     }
-
-    size_t ret = ZSTD_decompress(output_buffer, static_cast<size_t>(output_buffer_len),
-                                 input, static_cast<size_t>(input_len));
+    // Decompression context for ZSTD contains several large heap allocations.
Review Comment:
So that means that reuse of the contexts should be governed at a higher
level, for example the Parquet reader. Perhaps do what the Rust implementation
did and expose some kind of "decompression context" API?
Re: [PR] GH-48187: [C++] Cache ZSTD compression/decompression context [arrow]
Ext3h commented on code in PR #48192:
URL: https://github.com/apache/arrow/pull/48192#discussion_r2559342235
##
cpp/src/arrow/util/compression_zstd.cc:
##
@@ -187,9 +202,14 @@ class ZSTDCodec : public Codec {
       DCHECK_EQ(output_buffer_len, 0);
       output_buffer = &empty_buffer;
     }
-
-    size_t ret = ZSTD_decompress(output_buffer, static_cast<size_t>(output_buffer_len),
-                                 input, static_cast<size_t>(input_len));
+    // Decompression context for ZSTD contains several large heap allocations.
Review Comment:
IIRC 5-10MB in total. Enough to hurt performance with small blocks (i.e.
Parquet with 8kB row groups), both due to memory management and cache
thrashing; not enough to hurt in terms of total memory footprint.
I would have liked to tie those allocations to the Arrow default memory pool
for proper tracing, but that feature is exclusive to static linkage of ZSTD.
I deliberately avoided managing a pool per instance, assuming that there may
be many instances of this class - more than threads in the thread pools.
Re: [PR] GH-48187: [C++] Cache ZSTD compression/decompression context [arrow]
pitrou commented on code in PR #48192:
URL: https://github.com/apache/arrow/pull/48192#discussion_r2559279786
##
cpp/src/arrow/util/compression_zstd.cc:
##
@@ -187,9 +202,14 @@ class ZSTDCodec : public Codec {
       DCHECK_EQ(output_buffer_len, 0);
       output_buffer = &empty_buffer;
     }
-
-    size_t ret = ZSTD_decompress(output_buffer, static_cast<size_t>(output_buffer_len),
-                                 input, static_cast<size_t>(input_len));
+    // Decompression context for ZSTD contains several large heap allocations.
Review Comment:
How large is "large" here? This is proposing to keep those per-thread heap
allocations alive until the threads themselves are joined (which typically
happens at process exit for a thread pool).
Re: [PR] GH-48187: [C++] Cache ZSTD compression/decompression context [arrow]
Ext3h commented on code in PR #48192:
URL: https://github.com/apache/arrow/pull/48192#discussion_r2558834530
##
cpp/src/arrow/util/compression_zstd.cc:
##
@@ -207,8 +226,18 @@ class ZSTDCodec : public Codec {
   Result<int64_t> Compress(int64_t input_len, const uint8_t* input,
                            int64_t output_buffer_len, uint8_t* output_buffer) override {
-    size_t ret = ZSTD_compress(output_buffer, static_cast<size_t>(output_buffer_len),
-                               input, static_cast<size_t>(input_len), compression_level_);
+    if (!compression_context_) {
Review Comment:
`thread_local` it is, no other sensible options. And not going to touch
Brotli - that one does not support context reuse at all.
Re: [PR] GH-48187: [C++] Cache ZSTD compression/decompression context [arrow]
Ext3h commented on code in PR #48192:
URL: https://github.com/apache/arrow/pull/48192#discussion_r2547005014
##
cpp/src/arrow/util/compression_zstd.cc:
##
@@ -207,8 +226,18 @@ class ZSTDCodec : public Codec {
   Result<int64_t> Compress(int64_t input_len, const uint8_t* input,
                            int64_t output_buffer_len, uint8_t* output_buffer) override {
-    size_t ret = ZSTD_compress(output_buffer, static_cast<size_t>(output_buffer_len),
-                               input, static_cast<size_t>(input_len), compression_level_);
+    if (!compression_context_) {
Review Comment:
It appears the issue is common to Brotli and ZSTD - both require more state
than can fit on the stack, so both would benefit from a thread-local context,
as their context is always heap allocated.
The others (LZ4, Snappy, BZip2, etc.) are all properly optimized in the sense
that their context fits entirely on the stack. For them, the single-shot
interfaces work as expected and carry no significant overhead. Only `lz4`
still has a fast path for context reuse, but it applies to small frames only
and lives in the static-only part of the interface.
Re: [PR] GH-48187: [C++] Cache ZSTD compression/decompression context [arrow]
Ext3h commented on code in PR #48192:
URL: https://github.com/apache/arrow/pull/48192#discussion_r2546811701
##
cpp/src/arrow/util/compression_zstd.cc:
##
@@ -207,8 +226,18 @@ class ZSTDCodec : public Codec {
   Result<int64_t> Compress(int64_t input_len, const uint8_t* input,
                            int64_t output_buffer_len, uint8_t* output_buffer) override {
-    size_t ret = ZSTD_compress(output_buffer, static_cast<size_t>(output_buffer_len),
-                               input, static_cast<size_t>(input_len), compression_level_);
+    if (!compression_context_) {
Review Comment:
And there's a weird mismatch in the `Compressor` interface - it does not
support context/state recycling even though all of the backing interfaces
actually do. It always assumes that the context/state is dead after a
single compression pass.
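For illustration, an interface that does allow state recycling could look
roughly like the sketch below. This is a hypothetical design, not Arrow's
actual `Compressor` API: `RecyclableCompressor`, `Reset()`, and
`FakeCompressor` are all made-up names, and the "compression" is a
placeholder:

```cpp
#include <cassert>
#include <vector>

// Hypothetical interface sketch: a Reset() hook lets the caller reuse the
// same object (and its heap allocations) across compression passes, instead
// of assuming the state is dead after a single pass.
class RecyclableCompressor {
 public:
  virtual ~RecyclableCompressor() = default;
  // Reinitialize internal state for another pass without reallocating.
  virtual void Reset() = 0;
  virtual void Compress(const std::vector<char>& input) = 0;
  int passes_served() const { return passes_; }

 protected:
  int passes_ = 0;
};

class FakeCompressor : public RecyclableCompressor {
 public:
  void Reset() override { buffer_.clear(); }
  void Compress(const std::vector<char>& input) override {
    buffer_ = input;  // placeholder for real codec work
    ++passes_;
  }

 private:
  std::vector<char> buffer_;  // stands in for the reusable heavy state
};
```

The point of the sketch is only the lifecycle: one object serves many passes,
with `Reset()` marking the boundary between them.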
Re: [PR] GH-48187: [C++] Cache ZSTD compression/decompression context [arrow]
Ext3h commented on code in PR #48192:
URL: https://github.com/apache/arrow/pull/48192#discussion_r2546797428
##
cpp/src/arrow/util/compression_zstd.cc:
##
@@ -207,8 +226,18 @@ class ZSTDCodec : public Codec {
   Result<int64_t> Compress(int64_t input_len, const uint8_t* input,
                            int64_t output_buffer_len, uint8_t* output_buffer) override {
-    size_t ret = ZSTD_compress(output_buffer, static_cast<size_t>(output_buffer_len),
-                               input, static_cast<size_t>(input_len), compression_level_);
+    if (!compression_context_) {
Review Comment:
`thread_local` is not appropriate here, because there's a conflict on the
compression level.
I need to think carefully about this. It's the same problematic implementation
detail for all the codecs - they all use the "single-shot" APIs for work that
is done at high frequency with usually tiny buffers, and at least within IPC
they are also all used concurrently with the same codec instance...
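One way to resolve the level conflict is to cache the level alongside the
per-thread context and rebuild the context when the requested level changes.
Below is a stdlib-only sketch; `FakeCCtx` is a hypothetical stand-in for
`ZSTD_CCtx`:

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for ZSTD_CCtx, tagged with the level it was built for.
struct FakeCCtx {
  explicit FakeCCtx(int level) : level(level) {}
  int level;
};

// Per-thread cache that remembers which compression level the cached
// context was created for, and rebuilds it on a level change.
FakeCCtx* GetCompressionContext(int level) {
  thread_local std::unique_ptr<FakeCCtx> ctx;
  if (!ctx || ctx->level != level) {
    ctx = std::make_unique<FakeCCtx>(level);
  }
  return ctx.get();
}
```

With the real advanced API one could instead keep a single `ZSTD_CCtx` and
re-apply the level on each call via
`ZSTD_CCtx_setParameter(ctx, ZSTD_c_compressionLevel, level)`, avoiding the
rebuild entirely.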
Re: [PR] GH-48187: [C++] Cache ZSTD compression/decompression context [arrow]
Ext3h commented on code in PR #48192:
URL: https://github.com/apache/arrow/pull/48192#discussion_r2546756165
##
cpp/src/arrow/util/compression_zstd.cc:
##
@@ -207,8 +226,18 @@ class ZSTDCodec : public Codec {
   Result<int64_t> Compress(int64_t input_len, const uint8_t* input,
                            int64_t output_buffer_len, uint8_t* output_buffer) override {
-    size_t ret = ZSTD_compress(output_buffer, static_cast<size_t>(output_buffer_len),
-                               input, static_cast<size_t>(input_len), compression_level_);
+    if (!compression_context_) {
Review Comment:
There's parallelization in the IPC API that I didn't account for, the same
instance of `ZSTDCodec` is used by multiple threads concurrently. Revising and
checking if `thread_local` can be appropriate.
