Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/18372 )
Change subject: IMPALA-11039: Fix incorrect page jumping in late materialization of Parquet
......................................................................

IMPALA-11039: Fix incorrect page jumping in late materialization of Parquet

The current calculation of LastRowIdxInCurrentPage() is incorrect. It
uses the first row index of the next candidate page instead of the next
valid page. The next candidate page could be far away from the current
page, so the result can be larger than the current page size. Skipping
rows in the current page could then overflow the page boundary. This
patch fixes LastRowIdxInCurrentPage() to use the next valid page.

When skip_row_id is set (>0), the current approach of
SkipRowsInternal<false>() expects to jump to a page containing this row
and then skip rows in that page. However, the expected row might not be
in the candidate pages. When we jump to the next candidate page, the
target row could already have been skipped. In this case, we don't need
to skip rows in the current page.

Tests:
 - Add a test on alltypes_empty_pages to reveal the bug.
 - Add more batch_size values in test_page_index.
 - Pass tests/query_test/test_parquet_stats.py locally.
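
The difference between the two calculations described above can be illustrated with a minimal, standalone sketch. This is not the code from the patch: the PageLocation struct, its is_valid flag, and both function names are hypothetical stand-ins, and it assumes a "valid" page is simply one flagged as non-empty in the page index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a page-index entry (not Impala's actual type).
struct PageLocation {
  int64_t first_row_idx;  // index of the page's first row within the row group
  bool is_valid;          // assumption: false marks an empty page
};

// Old behaviour as described in the commit message: the last row of the
// current page is derived from the next *candidate* page, which may be
// several pages further on.
int64_t LastRowIdxUsingNextCandidate(const std::vector<PageLocation>& pages,
    const std::vector<size_t>& candidates, size_t candidate_pos, int64_t num_rows) {
  assert(candidate_pos < candidates.size());
  if (candidate_pos + 1 >= candidates.size()) return num_rows - 1;
  return pages[candidates[candidate_pos + 1]].first_row_idx - 1;
}

// Fixed behaviour as described: scan forward from the current page to the
// next *valid* page, whether or not it is a candidate.
int64_t LastRowIdxUsingNextValid(const std::vector<PageLocation>& pages,
    const std::vector<size_t>& candidates, size_t candidate_pos, int64_t num_rows) {
  assert(candidate_pos < candidates.size());
  for (size_t i = candidates[candidate_pos] + 1; i < pages.size(); ++i) {
    if (pages[i].is_valid) return pages[i].first_row_idx - 1;
  }
  // No later valid page: the current page runs to the end of the row group.
  return num_rows - 1;
}
```

With pages starting at rows 0, 10 (empty), 10 and 20 in a 30-row row group and candidate pages {0, 3}, the first function reports row 19 as the end of page 0, while the second reports row 9, which is the actual page boundary in this made-up layout.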

Change-Id: I3a783115ba8faf1a276e51087f3a70f79402c21d
Reviewed-on: http://gerrit.cloudera.org:8080/18372
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
---
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M testdata/workloads/functional-query/queries/QueryTest/parquet-page-index.test
M tests/query_test/test_parquet_stats.py
7 files changed, 97 insertions(+), 30 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/18372
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I3a783115ba8faf1a276e51087f3a70f79402c21d
Gerrit-Change-Number: 18372
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>