Ted-Jiang commented on PR #3769: URL: https://github.com/apache/arrow-datafusion/pull/3769#issuecomment-1272454236
@thinkharderdev thanks for the great benchmark. I ran parquet-tools locally and got the following (1.0 GB file):

```
(venv) yangjiang@LM-SHC-15009782 data % parquet-tools column-index ./logs.parquet
row group 0:
column index for column service:
Boudary order: UNORDERED
          null count  min                                    max
page-0             0  backend                                frontend

offset index for column service:
          offset  compressed size  first row index
page-0        62              117                0

column index for column host:
Boudary order: UNORDERED
          null count  min                                    max
page-0             0  i-1ec3ca3151468928.ec2.internal        i-1ec408f54dbd3750.ec2.internal

offset index for column host:
          offset  compressed size  first row index
page-0       566              125                0

column index for column pod:
Boudary order: UNORDERED
          null count  min                                    max
page-0             0  aejowuublavflbbsvlfozigwpmrxldvhaollk  zxxlzhdrucrhpicpdgxtfpyuknvviimggtq

offset index for column pod:
          offset  compressed size  first row index
page-0      6689              602                0

column index for column container:
Boudary order: UNORDERED
          null count  min                                    max
page-0             0  backend_container_0                    frontend_container_1

offset index for column container:
          offset  compressed size  first row index
page-0      7602              593                0
```

There are at most two pages per column. I think that if we adjust the file to have more pages per column (for example, by reducing the page size), we will see a bigger performance gain with `enable_page_index` enabled, because there will be more opportunities to skip whole pages without decoding them! 🤔

FYI, I see that Impala uses a fixed number of rows per page in its benchmarks to get good performance.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
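
To illustrate the point above about page count, here is a toy model in plain Python (not DataFusion or parquet code). It assumes each page stores only a min and max, as in the column index output, and that a page can be skipped when the predicate value falls outside that range. The `service` values, the page sizes, and the helper names `page_stats` and `rows_skipped` are all hypothetical.

```python
from typing import List, Tuple

# (min, max, row_count) for one page, mirroring the column index fields above.
PageStat = Tuple[str, str, int]


def page_stats(values: List[str], rows_per_page: int) -> List[PageStat]:
    """Split a column into fixed-row-count pages and record min/max per page."""
    pages = []
    for start in range(0, len(values), rows_per_page):
        chunk = values[start:start + rows_per_page]
        pages.append((min(chunk), max(chunk), len(chunk)))
    return pages


def rows_skipped(pages: List[PageStat], target: str) -> int:
    """Count rows in pages whose [min, max] range cannot contain `target`."""
    return sum(n for lo, hi, n in pages if target < lo or target > hi)


# Hypothetical column: 8 rows, loosely clustered by value.
service = ["backend"] * 3 + ["frontend"] * 3 + ["backend"] * 2

one_page = page_stats(service, rows_per_page=8)     # a single large page
small_pages = page_stats(service, rows_per_page=2)  # four small pages

# Predicate: service = 'frontend'
print(rows_skipped(one_page, "frontend"))     # 0: the only page spans both values
print(rows_skipped(small_pages, "frontend"))  # 4: two all-"backend" pages are skipped
```

With one page covering `backend` to `frontend`, nothing can be pruned; with four smaller pages, half the rows are skipped for the same predicate. In a real file the gain depends on how the data is clustered, and smaller pages also mean more index entries and per-page overhead.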