Re: [PR] [fix](iceberg) Avoid dict reads on mixed-encoding position delete files [doris]

via GitHub Tue, 31 Mar 2026 14:42:47 -0700


github-actions[bot] commented on code in PR #61759:
URL: https://github.com/apache/doris/pull/61759#discussion_r3018573516



##########
be/src/format/table/iceberg_reader.cpp:
##########
@@ -107,6 +107,43 @@ class GroupedDeleteRowsVisitor final : public IcebergPositionDeleteVisitor {
 
 const std::string IcebergOrcReader::ICEBERG_ORC_ATTRIBUTE = "iceberg.id";
 
+bool IcebergTableReader::_is_fully_dictionary_encoded(
+        const tparquet::ColumnMetaData& column_metadata) {
+    // A column chunk may have a dictionary page but still contain plain-encoded data pages.
+    // Only treat it as dictionary-coded when all data pages are dictionary encoded.
+    if (column_metadata.__isset.encoding_stats) {
+        for (const tparquet::PageEncodingStats& enc_stat : column_metadata.encoding_stats) {

Review Comment:
   `encoding_stats` can contain `DATA_PAGE_V2` entries as well as `DATA_PAGE`. 
In a chunk that mixes dictionary and plain `DATA_PAGE_V2` pages, this helper 
still returns `true`, so `init_parquet_delete_reader()` will still build a 
dictionary-backed `file_path` column and the plain decoder can hit the same 
`insert_many_strings()` failure this PR is trying to avoid. Please treat both 
page types as data pages here and add a regression/unit case for the v2 variant.
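
   A minimal sketch of the requested check, assuming nothing about the rest of the helper beyond what the hunk above shows. The enums and `PageEncodingStats` struct below are hypothetical stand-ins that mirror the Parquet Thrift definitions (the real code would use the generated `tparquet` types), and `all_data_pages_dictionary_encoded` is an illustrative name, not a function in this PR. Returning `false` when no data-page entries are present is a conservative assumption, not something the PR specifies.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins mirroring the Parquet Thrift enums; the numeric
// values follow parquet.thrift, but the real code uses tparquet::* types.
enum class PageType { DATA_PAGE = 0, INDEX_PAGE = 1, DICTIONARY_PAGE = 2, DATA_PAGE_V2 = 3 };
enum class Encoding { PLAIN = 0, PLAIN_DICTIONARY = 2, RLE = 3, RLE_DICTIONARY = 8 };

struct PageEncodingStats {
    PageType page_type;
    Encoding encoding;
    int32_t count;
};

// Both v1 and v2 data pages carry column values.
inline bool is_data_page(PageType type) {
    return type == PageType::DATA_PAGE || type == PageType::DATA_PAGE_V2;
}

inline bool is_dictionary_encoding(Encoding encoding) {
    return encoding == Encoding::PLAIN_DICTIONARY || encoding == Encoding::RLE_DICTIONARY;
}

// Returns true only when every data page (v1 or v2) is dictionary encoded.
// Assumption: a chunk with no data-page entries is reported as "not fully
// dictionary encoded" so the caller falls back to the plain string path.
bool all_data_pages_dictionary_encoded(const std::vector<PageEncodingStats>& stats) {
    bool saw_data_page = false;
    for (const PageEncodingStats& stat : stats) {
        if (!is_data_page(stat.page_type) || stat.count <= 0) {
            continue; // dictionary/index pages and empty entries are irrelevant here
        }
        saw_data_page = true;
        if (!is_dictionary_encoding(stat.encoding)) {
            return false; // at least one plain (or otherwise non-dictionary) data page
        }
    }
    return saw_data_page;
}
```

   A regression case for the v2 variant could feed this check a chunk whose stats list one `DICTIONARY_PAGE` entry plus both `RLE_DICTIONARY` and `PLAIN` entries for `DATA_PAGE_V2`, and expect `false`.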



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

