github-actions[bot] commented on PR #60374:
URL: https://github.com/apache/doris/pull/60374#issuecomment-4044938718

   ## Code Review Summary
   
   **PR: [fix](parquet) Don't decompress dict page when dict page is empty**
   
   ### Overview
   This PR fixes a bug where parquet files with all-null string columns would 
fail with a ZSTD decompression error because the empty dictionary page was 
being passed to the decompressor. The fix skips decompression when `dict_num == 
0` and properly propagates errors through the 
`convert_dict_column_to_string_column` call chain by changing its return type 
from `MutableColumnPtr` to `Result<MutableColumnPtr>`.
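   The shape of that signature change can be sketched with a minimal stand-in for Doris's `Status`/`Result` types (the `Status` and `Result` definitions below are simplified approximations, not the actual ones from `be/src/common/status.h`, and `std::string` stands in for `MutableColumnPtr`):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>

// Minimal stand-ins for Doris's Status and Result<T> (illustrative only).
struct Status {
    bool is_ok = true;
    std::string msg;
    static Status OK() { return {true, {}}; }
    static Status InternalError(std::string m) { return {false, std::move(m)}; }
    bool ok() const { return is_ok; }
};

template <typename T>
using Result = std::variant<T, Status>;  // value on success, Status on error

// Before the fix the converter returned the column directly and had no way to
// report a corrupt dictionary; returning Result<...> lets it surface an error.
Result<std::string> convert_dict_column_to_string_column(int dict_num,
                                                         int value_count) {
    if (dict_num == 0 && value_count > 0) {
        // Column claims data but the dictionary is empty: corrupt file.
        return Status::InternalError("dict is empty but column has data");
    }
    return std::string("converted column");
}
```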
   
   ### Critical Checkpoint Conclusions
   
   1. **Goal and correctness**: The PR correctly addresses the stated problem. 
When a parquet string column is all-null, the dict page has `num_values=0` and 
decompressing the (empty) compressed payload causes a ZSTD error. The fix 
correctly skips decompression and caches empty data. The fix is verified by 
decoder-level unit tests and by reader-level integration tests covering various 
filter/lazy-read combinations.
   
   2. **Modification size and focus**: The change is focused and well-scoped — 
it touches only the parquet reading pipeline. The three logical parts (skip 
decompression, change return type to Result, add error propagation) are 
cohesive.
   
   3. **Concurrency**: No concurrency concerns. The modified code paths are 
single-threaded per column chunk reader.
   
   4. **Error handling**: Properly addressed. 
`_convert_dict_cols_to_string_cols` changed from `void` to `Status`, and all 6 
callers now use `RETURN_IF_ERROR(...)`. The `DORIS_TRY` macro is used correctly 
to unwrap `Result<MutableColumnPtr>` into a return-based Status propagation 
chain. The previously silently-ignored return value of 
`_convert_dict_cols_to_string_cols` is now checked — this is a good improvement 
beyond the immediate bug fix.
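   The propagation pattern can be approximated with a minimal sketch (the macro below imitates `RETURN_IF_ERROR` but is not Doris's actual definition; likewise, the real `DORIS_TRY` relies on a GCC statement expression, approximated here with an explicit unwrap):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>

// Simplified stand-ins for Doris's Status and Result<T>.
struct Status {
    bool is_ok = true;
    std::string msg;
    static Status OK() { return {true, {}}; }
    static Status InternalError(std::string m) { return {false, std::move(m)}; }
    bool ok() const { return is_ok; }
};

template <typename T>
using Result = std::variant<T, Status>;

// Imitation of RETURN_IF_ERROR: propagate a non-OK Status up the call chain.
#define RETURN_IF_ERROR(stmt)        \
    do {                             \
        Status _st = (stmt);         \
        if (!_st.ok()) return _st;   \
    } while (false)

// std::string stands in for MutableColumnPtr.
Result<std::string> convert_dict_column_to_string_column(bool fail) {
    if (fail) return Status::InternalError("empty dict but column has data");
    return std::string("converted");
}

// Was `void`; now returns Status so the error above is no longer swallowed.
Status _convert_dict_cols_to_string_cols(bool fail) {
    // Portable stand-in for DORIS_TRY: unwrap the Result or return its error.
    auto res = convert_dict_column_to_string_column(fail);
    if (std::holds_alternative<Status>(res)) return std::get<Status>(res);
    std::string col = std::move(std::get<std::string>(res));
    (void)col;  // use the converted column...
    return Status::OK();
}

// Caller updated from a bare call to RETURN_IF_ERROR(...).
Status next_batch(bool fail) {
    RETURN_IF_ERROR(_convert_dict_cols_to_string_cols(fail));
    return Status::OK();
}
```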
   
   5. **Parallel code paths**: All callers of 
`convert_dict_column_to_string_column` have been updated — both 
`ByteArrayDictDecoder` and `FixLengthDictDecoder` (the only two dict decoder 
implementations), plus all passthrough layers (`ColumnChunkReader`, 
`ScalarColumnReader`, `ParquetColumnReader`). The ORC reader has a separate 
private method with a different signature — it is unrelated and correctly left 
unchanged.
   
   6. **Cache path correctness**: When `dict_num == 0`, the code correctly: (a) 
validates that `uncompressed_size` must also be 0, (b) skips `get_page_data` 
and `decompress`, (c) caches empty decompressed data (since 
`should_cache_decompressed` returns `true` when `compressed_page_size <= 0`), 
and (d) calls `skip_page_data()` after cache insertion to advance the page 
reader offset. The ordering of `_insert_page_into_cache` before 
`skip_page_data` is correct per the comment.
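   The caching decision and the insert-then-skip ordering can be sketched as below (`should_cache_decompressed` and `skip_page_data` mirror the names the review references, but the struct and trace are simplified stand-ins):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified page reader: just enough state to show the empty-dict cache path.
struct PageReader {
    long compressed_page_size = 0;
    long offset = 0;  // position in the column chunk stream

    // Pages with compressed_page_size <= 0 (including the empty dict page)
    // are always cached in decompressed form.
    bool should_cache_decompressed() const { return compressed_page_size <= 0; }

    void skip_page_data() { offset += compressed_page_size; }
};

// Returns the order of operations taken for an empty dict page: the page is
// inserted into the cache first, and only then is the page reader advanced.
std::vector<std::string> load_empty_dict(PageReader& reader) {
    std::vector<std::string> trace;
    if (reader.should_cache_decompressed()) {
        trace.push_back("insert_page_into_cache");  // cache first...
    }
    reader.skip_page_data();                        // ...then advance offset
    trace.push_back("skip_page_data");
    return trace;
}
```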
   
   7. **Configuration items**: No new configuration items added. N/A.
   
   8. **Incompatible changes**: The return type change of 
`convert_dict_column_to_string_column` is API-internal only (parquet reader 
internals). No external/storage format incompatibility.
   
   9. **Test coverage**: Good coverage with:
      - Unit tests for both `ByteArrayDictDecoder` and `FixLengthDictDecoder` 
empty dict cases (both error and success paths)
      - Integration tests covering 4 combinations of (filter_all × enable_lazy) 
for all-null string columns
      - Additional tests with partition columns and only-partition-column 
scenarios
      - A real parquet test file (`test_string_null.zst.parquet`) is included
   
   10. **Observability**: The `LOG(ERROR)` in 
`convert_dict_column_to_string_column` when dict is empty but column has data 
provides good diagnostics for the error case.
   
   ### No blocking issues found.
   
   The implementation is correct, well-tested, and improves error handling in 
the existing code path beyond just the bug fix itself.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

