[ https://issues.apache.org/jira/browse/IMPALA-10186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825042#comment-17825042 ]
ASF subversion and git services commented on IMPALA-10186:
----------------------------------------------------------

Commit 82103101826309138d22864d04137da2df15f0c3 in impala's branch refs/heads/branch-3.4.2 from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=821031018 ]

IMPALA-9952: Fix page index filtering for empty pages

As IMPALA-4371 and IMPALA-10186 point out, Impala might write empty data pages. It usually does that when it has to write a bigger page than the current page size. Whether we really need to write empty data pages is a different question, but we need to handle them correctly as there are already such files out there.

The Parquet offset index entries corresponding to empty data pages are invalid PageLocation objects with 'compressed_page_size' = 0. Before this commit Impala didn't ignore the empty page locations, but generated a warning. Since an invalid page index doesn't fail a scan by default, Impala continued scanning the file with semi-initialized page filtering. This resulted in a 'Top level rows aren't in sync' error, or a crash in DEBUG builds.

With this commit Impala ignores empty data pages and is still able to filter the rest of the pages. Also, if the page index is corrupt for some other reason, Impala correctly resets the page filtering logic and falls back to regular scanning.

Testing:
* Added unit test for empty data pages
* Added e2e test for empty data pages
* Added e2e test for invalid page index

Change-Id: I4db493fc7c383ed5ef492da29c9b15eeb3d17bb0
Reviewed-on: http://gerrit.cloudera.org:8080/16503
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>

> Write invalid parquet PageLocations which table sort by some columns
> --------------------------------------------------------------------
>
>                 Key: IMPALA-10186
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10186
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>   Affects Versions: Impala 4.2.0
>            Reporter: guojingfeng
>            Assignee: Michael Smith
>            Priority: Major
>              Labels: parquet
>             Fix For: Impala 4.3.0
>
>
> The current Parquet writer writes -1 for PageLocation.offset and
> PageLocation.first_row_index when it meets an empty page.
> hdfs-parquet-file-writer.cc Line: 808 ~ 819
> {code:java}
>   // Write data pages
>   for (const DataPage& page : pages_) {
>     if (page.header.data_page_header.num_values == 0) {
>       // Skip empty pages
>       location.offset = -1;
>       location.compressed_page_size = 0;
>       location.first_row_index = -1;
>       AddLocationToOffsetIndex(location);
>       continue;
>     }
> {code}
> But the -1 values may cause the ComputeCandidatePages function to run into
> an unexpected state.
> {code:java}
> bool ComputeCandidatePages(
>     const vector<parquet::PageLocation>& page_locations,
>     const vector<RowRange>& candidate_ranges,
>     const int64_t num_rows, vector<int>* candidate_pages) {
>   if (!ValidatePageLocations(page_locations, num_rows)) return false;
> {code}
> which in turn causes IMPALA-9952.
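
The interaction between the writer's placeholder entries and the reader-side fix can be illustrated with a small standalone sketch. It is not Impala's code: the PageLocation struct below is a hypothetical stand-in that models only the three fields named in the issue, and NonEmptyPageIndexes and ValidateSelected are invented helper names with deliberately simplified checks. The sketch only shows why an entry with offset = -1 fails a naive validation, and how skipping entries with compressed_page_size == 0 lets the remaining pages validate.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for parquet::PageLocation; only the three fields
// mentioned in the issue are modeled.
struct PageLocation {
  int64_t offset;
  int32_t compressed_page_size;
  int64_t first_row_index;
};

// Hypothetical helper: returns the indexes of page locations that describe
// real (non-empty) pages. An empty page is recognized by
// compressed_page_size == 0, as described in the commit message.
std::vector<int> NonEmptyPageIndexes(const std::vector<PageLocation>& locs) {
  std::vector<int> result;
  for (int i = 0; i < static_cast<int>(locs.size()); ++i) {
    if (locs[i].compressed_page_size == 0) continue;  // ignore empty page
    result.push_back(i);
  }
  return result;
}

// Simplified validation (an assumption, not Impala's ValidatePageLocations):
// every selected location needs a non-negative offset and a first_row_index
// that lies in [0, num_rows) and strictly increases from page to page.
bool ValidateSelected(const std::vector<PageLocation>& locs,
                      const std::vector<int>& selected, int64_t num_rows) {
  assert(num_rows >= 0);
  int64_t prev_first_row = -1;
  for (int i : selected) {
    const PageLocation& loc = locs[i];
    if (loc.offset < 0) return false;
    if (loc.first_row_index < 0 || loc.first_row_index >= num_rows) return false;
    if (loc.first_row_index <= prev_first_row) return false;
    prev_first_row = loc.first_row_index;
  }
  return true;
}
```

Validating every entry of an offset index that contains a {-1, 0, -1} placeholder fails under these checks, while validating only the indexes returned by NonEmptyPageIndexes succeeds as long as the remaining entries are well formed.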