github-actions[bot] commented on code in PR #61759:
URL: https://github.com/apache/doris/pull/61759#discussion_r3018573516
##########
be/src/format/table/iceberg_reader.cpp:
##########
@@ -107,6 +107,43 @@ class GroupedDeleteRowsVisitor final : public IcebergPositionDeleteVisitor {
const std::string IcebergOrcReader::ICEBERG_ORC_ATTRIBUTE = "iceberg.id";
+bool IcebergTableReader::_is_fully_dictionary_encoded(
+ const tparquet::ColumnMetaData& column_metadata) {
+ // A column chunk may have a dictionary page but still contain plain-encoded data pages.
+ // Only treat it as dictionary-coded when all data pages are dictionary encoded.
+ if (column_metadata.__isset.encoding_stats) {
+ for (const tparquet::PageEncodingStats& enc_stat : column_metadata.encoding_stats) {
Review Comment:
`encoding_stats` can contain `DATA_PAGE_V2` entries as well as `DATA_PAGE`.
In a chunk that mixes dictionary and plain `DATA_PAGE_V2` pages, this helper
still returns `true`, so `init_parquet_delete_reader()` will still build a
dictionary-backed `file_path` column and the plain decoder can hit the same
`insert_many_strings()` failure this PR is trying to avoid. Please treat both
page types as data pages here and add a regression/unit case for the v2 variant.
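A minimal sketch of the requested check, offered as an illustration rather than the PR's actual code. The enums and `PageEncodingStats` struct below are simplified stand-ins for the Thrift-generated `tparquet` types (their numeric values follow the Parquet format definition), and the helper names `is_data_page`, `is_dictionary_encoding`, and `all_data_pages_dictionary_encoded` are hypothetical. Returning `false` when no data-page entry is present, and skipping entries with a non-positive `count`, are assumptions made here for safety.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in types (assumption): simplified mirrors of the Thrift-generated
// tparquet structs, reduced to the fields this check needs.
enum class PageType { DATA_PAGE = 0, INDEX_PAGE = 1, DICTIONARY_PAGE = 2, DATA_PAGE_V2 = 3 };
enum class Encoding { PLAIN = 0, PLAIN_DICTIONARY = 2, RLE = 3, RLE_DICTIONARY = 8 };

struct PageEncodingStats {
    PageType page_type;
    Encoding encoding;
    int32_t count;
};

// Both v1 and v2 data pages carry column values, so both must be inspected.
inline bool is_data_page(PageType type) {
    return type == PageType::DATA_PAGE || type == PageType::DATA_PAGE_V2;
}

// PLAIN_DICTIONARY is the legacy name, RLE_DICTIONARY the current one.
inline bool is_dictionary_encoding(Encoding encoding) {
    return encoding == Encoding::PLAIN_DICTIONARY || encoding == Encoding::RLE_DICTIONARY;
}

// Hypothetical helper: true only if at least one data page entry exists and
// every data page entry (v1 or v2) uses a dictionary encoding.
inline bool all_data_pages_dictionary_encoded(const std::vector<PageEncodingStats>& stats) {
    bool saw_data_page = false;
    for (const PageEncodingStats& stat : stats) {
        // Ignore dictionary/index pages and entries that describe no pages.
        if (!is_data_page(stat.page_type) || stat.count <= 0) {
            continue;
        }
        saw_data_page = true;
        if (!is_dictionary_encoding(stat.encoding)) {
            return false; // a plain (or other non-dictionary) data page exists
        }
    }
    return saw_data_page;
}
```

With this shape, a chunk whose stats list a `DATA_PAGE_V2` entry with `RLE_DICTIONARY` plus a `DATA_PAGE_V2` entry with `PLAIN` is rejected, which is the mixed v2 case the regression test should cover.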
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]