GitHub user InCerryGit created a discussion: Optimize LZ4 IPC decompression by
avoiding an intermediate frame-reader drain copy
Hi,
I would like to discuss a potential optimization for LZ4-compressed Arrow IPC
buffers in Arrow .NET.
### Background
Arrow IPC compressed buffers already carry the expected uncompressed buffer
length in the IPC metadata. During reading, Arrow .NET allocates the
destination buffer first, then calls the configured compression codec to
decompress into that destination.
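For context, here is a minimal sketch (my own, not Arrow .NET's actual code) of the buffer layout that makes this possible: per the Arrow IPC format, each compressed body buffer is prefixed with its uncompressed length as a little-endian int64, with -1 signalling that the buffer was left uncompressed, so the reader can size the destination before calling the codec. The `IpcCompressedBuffer` helper name is mine:

```csharp
using System;
using System.Buffers.Binary;

static class IpcCompressedBuffer
{
    // Per the Arrow IPC spec, a compressed body buffer starts with the
    // uncompressed length as a little-endian int64; -1 means the buffer
    // was written uncompressed even though a codec is configured.
    public static (long UncompressedLength, ReadOnlyMemory<byte> Payload) Split(
        ReadOnlyMemory<byte> buffer)
    {
        long length = BinaryPrimitives.ReadInt64LittleEndian(buffer.Span);
        return (length, buffer.Slice(sizeof(long)));
    }
}
```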
For LZ4 today, the current path uses K4os' frame reader API roughly like this:
```csharp
using var decoder = LZ4Frame.Decode(source);
return decoder.ReadManyBytes(destination.Span);
```
This is correct and simple, but profiling shows a significant amount of time is
spent in an intermediate drain/copy path inside the K4os frame reader.
### Profiling evidence
Using a deterministic local benchmark that generates a real LZ4-compressed
Arrow IPC stream and replays it without external network variability, the hot
path was dominated by:
- `LZ4_decompress_*`
- `Buffer._Memmove`
- K4os frame-reader drain/copy logic
The important observation is that the `Buffer._Memmove` cost appears to come
largely from K4os decoding into its internal decoder buffer and then
draining/copying into the caller-provided destination span.
### Experiment
I prototyped a restricted fast path in Arrow .NET:
- parse only simple LZ4 frame headers
- support frames with:
- standard LZ4 frame magic
- no block checksum
- no content checksum
- no content size
- no dictionary id
- valid block size descriptor
- decode each LZ4 block directly into the Arrow destination buffer using K4os'
block API:
```csharp
LZ4Codec.Decode(block, destinationSlice);
```
If any unsupported frame feature or malformed input is detected, the code falls
back to the existing K4os frame reader path.
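To make the shape of the fast path concrete, here is a hedged sketch (not the actual patch). The `TryDecodeSimpleFrame` name is mine, and I additionally require block-independent frames, since `LZ4Codec.Decode` has no dictionary support for linked blocks; the header checksum byte is skipped rather than verified:

```csharp
using System;
using System.Buffers.Binary;
using K4os.Compression.LZ4;

static class Lz4SimpleFrame
{
    const uint FrameMagic = 0x184D2204; // standard LZ4 frame magic, little-endian

    // Decode a "simple" LZ4 frame (no checksums, no content size, no dict id,
    // independent blocks) directly into the caller's destination. Returns
    // false on anything unsupported or malformed so the caller can fall back
    // to the K4os frame reader.
    public static bool TryDecodeSimpleFrame(
        ReadOnlySpan<byte> source, Span<byte> destination, out int written)
    {
        written = 0;
        if (source.Length < 11) return false; // magic + FLG + BD + HC + end mark
        if (BinaryPrimitives.ReadUInt32LittleEndian(source) != FrameMagic) return false;

        byte flg = source[4];
        byte bd = source[5];
        // FLG: version must be 01; require independent blocks (bit 5) and no
        // block checksum / content size / content checksum / dict id (bits 4..0).
        if ((flg & 0xC0) != 0x40 || (flg & 0x20) == 0 || (flg & 0x1F) != 0) return false;
        // BD: only the block-max-size field (bits 6..4, legal values 4..7) may be set.
        int blockMaxSize = (bd >> 4) & 0x07;
        if ((bd & 0x8F) != 0 || blockMaxSize < 4) return false;

        var rest = source.Slice(7); // skip header; HC byte is not verified here
        while (true)
        {
            if (rest.Length < 4) return false;
            uint blockHeader = BinaryPrimitives.ReadUInt32LittleEndian(rest);
            rest = rest.Slice(4);
            if (blockHeader == 0) break; // end mark
            bool stored = (blockHeader & 0x8000_0000) != 0; // high bit: uncompressed block
            int blockLen = (int)(blockHeader & 0x7FFF_FFFF);
            if (blockLen > rest.Length) return false;
            var block = rest.Slice(0, blockLen);
            rest = rest.Slice(blockLen);
            var dst = destination.Slice(written);
            if (stored)
            {
                if (blockLen > dst.Length) return false;
                block.CopyTo(dst);
                written += blockLen;
            }
            else
            {
                // Decode straight into the Arrow destination buffer.
                int decoded = LZ4Codec.Decode(block, dst);
                if (decoded < 0) return false;
                written += decoded;
            }
        }
        return written == destination.Length; // IPC metadata gave the exact length
    }
}
```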
This is not a public API change and not a "zero-copy" decompression path. The
better description is:
> direct-to-destination LZ4 frame decode for simple frames
### Benchmark result
On the deterministic LZ4 Arrow IPC replay benchmark:
| Case | Existing K4os frame-reader path | Direct block decode fast path | Improvement |
|---|---:|---:|---:|
| Chunked network-style stream | 2.704 ms | 1.962 ms | ~27% |
| Local memory replay | 2.530 ms | 1.644 ms | ~35% |
The trace also showed `Buffer._Memmove` dropping substantially:
| Hotspot | Existing path | Fast path |
|---|---:|---:|
| memory replay `Buffer._Memmove` | ~30.5% | ~2.4% |
| chunked stream `Buffer._Memmove` | ~28.0% | ~6.1% |
### Applicability
This fast path is expected to help common Arrow IPC LZ4 compressed buffers
where the producer writes simple LZ4 frames.
It does not apply to all LZ4 frames. The implementation falls back to K4os when it encounters:
- a block checksum
- a content checksum
- a content size field
- a dictionary id
- unsupported or reserved frame flags
- malformed block sizes
- a decoded length mismatch
So this is not intended to replace a full LZ4 frame decoder.
### Design question
I see two possible directions:
#### Option A: Add a restricted fast path in Arrow .NET
Arrow .NET can keep a private, conservative parser for simple LZ4 frames and
use K4os' block decoder to write directly into the already-allocated Arrow
destination buffer.
Pros:
- immediate measurable benefit for Arrow IPC LZ4 reads
- no public API change
- fallback keeps compatibility with general LZ4 frames
- optimized specifically for Arrow IPC, where the expected output length is
already known
Cons:
- Arrow .NET would maintain some LZ4 frame parsing logic
- this duplicates part of K4os' responsibility
- future K4os behavior changes may require re-checking this code
#### Option B: Request a new K4os API
A cleaner long-term solution may be an upstream K4os API such as:
```csharp
LZ4Frame.TryDecode(
ReadOnlySpan<byte> source,
Span<byte> destination,
out int bytesWritten);
```
or similar.
The key requirement is that the frame decoder should be able to write directly
into a caller-provided destination span when the caller already knows the
expected output size.
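To make the proposed semantics concrete, here is a sketch of how they can be emulated today on top of the existing frame reader, reusing the exact calls from the snippet above (the `Lz4FrameShim` name is mine, and the shim takes `ReadOnlyMemory<byte>` because the current reader API does). This emulation still pays the internal drain copy; the point of the upstream request is that a native K4os implementation could decode blocks directly into `destination`:

```csharp
using System;
using K4os.Compression.LZ4.Streams;

static class Lz4FrameShim
{
    // Proposed TryDecode semantics, emulated on today's K4os frame reader.
    // A native implementation inside K4os could skip the reader's internal
    // decoder buffer entirely and write each block into `destination`.
    public static bool TryDecode(
        ReadOnlyMemory<byte> source, Span<byte> destination, out int bytesWritten)
    {
        using var decoder = LZ4Frame.Decode(source);
        bytesWritten = decoder.ReadManyBytes(destination);
        return bytesWritten == destination.Length;
    }
}
```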
Pros:
- frame parsing remains owned by K4os
- other projects can benefit
- Arrow .NET avoids maintaining a custom frame fast path
Cons:
- requires K4os API design/release cycle
- may not be available soon
- Arrow .NET still needs a short-term answer for current performance work
### Question
Would maintainers prefer that Arrow .NET:
1. keep a conservative internal fast path for simple LZ4 frames, with fallback
to K4os for everything else; or
2. avoid custom frame parsing and instead wait for / propose a new K4os
direct-to-destination frame decode API?
My current leaning is:
- use an Arrow .NET internal fast path short term, because the benefit is
measurable and the fallback keeps compatibility;
- also open an upstream K4os proposal for a proper direct-to-destination frame
decode API;
- replace the Arrow-side private parser later if K4os exposes such an API.
I would appreciate feedback on which direction is more acceptable for this
project.
GitHub link: https://github.com/apache/arrow-dotnet/discussions/353
----
This is an automatically sent email for [email protected].