Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
alamb commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-3167576633 Update here is that we are actively in the process of adding a metadata cache in the built in ListingTable table provider in DataFusion. Assuming there is enough memory allocated I think that will help this usecase substantially - https://github.com/apache/datafusion/issues/17000 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
zhuqi-lucas commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2980710353 Anyway, i think it's related to data set, i will make datafusion to support benchmark clickbench with page index, tracking ticket: https://github.com/apache/datafusion/issues/16427 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
zhuqi-lucas commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2965622944 It seems no difference about tpch which already with page_index for the testing files, will investigate. ```rust ./benchmarks/bench.sh compare load_page_index load_page_index_with_column_id Comparing load_page_index and load_page_index_with_column_id Benchmark tpch_sf1.json ┏━━┳━┳┳━━┓ ┃ Query┃ load_page_index ┃ load_page_index_with_column_id ┃ Change ┃ ┡━━╇━╇╇━━┩ │ QQuery 1 │57.33 ms │ 56.77 ms │no change │ │ QQuery 2 │18.82 ms │ 18.24 ms │no change │ │ QQuery 3 │29.33 ms │ 28.72 ms │no change │ │ QQuery 4 │20.27 ms │ 19.93 ms │no change │ │ QQuery 5 │38.64 ms │ 38.79 ms │no change │ │ QQuery 6 │17.64 ms │ 18.77 ms │ 1.06x slower │ │ QQuery 7 │48.37 ms │ 48.38 ms │no change │ │ QQuery 8 │37.18 ms │ 38.67 ms │no change │ │ QQuery 9 │45.32 ms │ 45.97 ms │no change │ │ QQuery 10│42.06 ms │ 40.60 ms │no change │ │ QQuery 11│13.36 ms │ 13.02 ms │no change │ │ QQuery 12│31.59 ms │ 30.11 ms │no change │ │ QQuery 13│28.35 ms │ 27.97 ms │no change │ │ QQuery 14│22.29 ms │ 21.75 ms │no change │ │ QQuery 15│32.08 ms │ 32.58 ms │no change │ │ QQuery 16│13.15 ms │ 12.78 ms │no change │ │ QQuery 17│58.29 ms │ 56.34 ms │no change │ │ QQuery 18│66.70 ms │ 65.19 ms │no change │ │ QQuery 19│34.23 ms │ 35.21 ms │no change │ │ QQuery 20│26.83 ms │ 27.17 ms │no change │ │ QQuery 21│57.99 ms │ 56.34 ms │no change │ │ QQuery 22│12.76 ms │ 14.69 ms │ 1.15x slower │ └──┴─┴┴──┘ ┏━━━┳━━┓ ┃ Benchmark Summary ┃ ┃ ┡━━━╇━━┩ │ Total Time (load_page_index) │ 752.58ms │ │ Total Time (load_page_index_with_column_id) │ 747.99ms │ │ Average Time (load_page_index)│ 34.21ms │ │ Average Time (load_page_index_with_column_id) │ 34.00ms │ │ Queries Faster│0 │ │ Queries Slower│2 │ │ Queries with No Change│ 20 │ │ Queries with Failure │0 │ └───┴──┘ Benchmark tpch_sf10.json ┏━━┳━┳┳━━┓ ┃ Query┃ load_page_index ┃ load_page_index_with_column_id ┃ Change ┃ ┡━━╇━╇╇━━┩ │ QQuery 1 │ 450.45 ms │ 450.69 ms │no change │ │ QQuery 2 │97.15 ms │ 97.04 ms │no change │ │ QQuery 3 │ 242.00 ms │ 244.22 ms │no change │ │ QQuery 4 │ 216.88 ms │ 214.72 ms │no change │ │ QQuery 5 │ 342.74 ms │ 361.20 ms │ 1.05x slower │ │ QQuery 6 │ 112.64 ms │ 114.86 ms │no change │ │ QQuery 7 │ 493.75 ms │ 492.92 ms │no change │ │ QQuery 8 │ 371.66 ms │ 369.60 ms │no change │ │ QQuery 9 │ 570.78 ms │ 557.16 ms │no change │ │ QQuery 10│ 335.92 ms │ 335.63 ms │no change │ │ QQuery 11│79.16 ms │ 78.40 ms │no change │ │ QQuery 12│ 219.40 ms │ 221.33 ms │no change │ │ QQuery 13│ 351.92 ms │ 352.46 ms │no change │ │ QQuery 14│ 153.00 m
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
alamb commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2954343017 Looks great @zhuqi-lucas -- I will find time to review the arrow changes -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
adriangb commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2954080922 > +1.98x faster 🚀 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
zhuqi-lucas commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2953664649 Test experiment for datafusion: https://github.com/apache/arrow-rs/pull/7624 https://github.com/apache/datafusion/pull/16329 ```rust python3 ./compare.py /tmp/clickbench_main /tmp/clickbench_load_page_index_with_column_id ┏━━┳┳┳━━━┓ ┃ Query┃tmp ┃tmp ┃Change ┃ ┡━━╇╇╇━━━┩ │ QQuery 0 │ 12.13 ms │ 10.64 ms │ +1.14x faster │ │ QQuery 1 │ 89.03 ms │ 46.27 ms │ +1.92x faster │ │ QQuery 2 │ 36.80 ms │ 35.80 ms │ no change │ │ QQuery 3 │ 35.67 ms │ 35.99 ms │ no change │ │ QQuery 4 │ 268.92 ms │ 256.73 ms │ no change │ │ QQuery 5 │ 310.28 ms │ 313.01 ms │ no change │ │ QQuery 6 │ 13.22 ms │ 12.86 ms │ no change │ │ QQuery 7 │ 92.93 ms │ 46.94 ms │ +1.98x faster │ │ QQuery 8 │ 314.49 ms │ 340.63 ms │ 1.08x slower │ │ QQuery 9 │ 483.32 ms │ 481.19 ms │ no change │ │ QQuery 10│ 190.48 ms │ 137.51 ms │ +1.39x faster │ │ QQuery 11│ 202.86 ms │ 151.79 ms │ +1.34x faster │ │ QQuery 12│ 405.74 ms │ 360.43 ms │ +1.13x faster │ │ QQuery 13│ 536.49 ms │ 488.00 ms │ +1.10x faster │ │ QQuery 14│ 397.92 ms │ 346.56 ms │ +1.15x faster │ │ QQuery 15│ 327.45 ms │ 323.72 ms │ no change │ │ QQuery 16│ 699.47 ms │ 734.16 ms │ no change │ │ QQuery 17│ 632.90 ms │ 654.54 ms │ no change │ │ QQuery 18│ 2018.58 ms │ 2052.34 ms │ no change │ │ QQuery 19│ 99.19 ms │ 67.29 ms │ +1.47x faster │ │ QQuery 20│ 885.77 ms │ 1047.55 ms │ 1.18x slower │ │ QQuery 21│ 470.21 ms │ 420.91 ms │ +1.12x faster │ │ QQuery 22│ 2994.52 ms │ 2677.85 ms │ +1.12x faster │ │ QQuery 23│ 9516.60 ms │ 9347.08 ms │ no change │ │ QQuery 24│ 245.20 ms │ 189.26 ms │ +1.30x faster │ │ QQuery 25│ 234.46 ms │ 191.13 ms │ +1.23x faster │ │ QQuery 26│ 281.25 ms │ 203.97 ms │ +1.38x faster │ │ QQuery 27│ 737.55 ms │ 656.28 ms │ +1.12x faster │ │ QQuery 28│ 6585.31 ms │ 6629.37 ms │ no change │ │ QQuery 29│ 308.20 ms │ 314.17 ms │ no change │ │ QQuery 30│ 378.65 ms │ 315.16 ms │ +1.20x faster │ │ QQuery 31│ 374.84 ms │ 329.12 ms │ +1.14x faster │ │ QQuery 32│ 1125.68 ms │ 1145.51 ms │ no change │ │ QQuery 33│ 1885.61 ms │ 1976.18 ms │ no change │ │ QQuery 34│ 1898.28 ms │ 2030.53 ms │ 1.07x slower │ │ QQuery 35│ 449.56 ms │ 453.16 ms │ no change │ │ QQuery 36│ 130.23 ms │ 173.75 ms │ 1.33x slower │ │ QQuery 37│ 141.73 ms │ 224.22 ms │ 1.58x slower │ │ QQuery 38│ 123.32 ms │ 165.20 ms │ 1.34x slower │ │ QQuery 39│ 167.83 ms │ 310.21 ms │ 1.85x slower │ │ QQuery 40│ 103.24 ms │ 87.30 ms │ +1.18x faster │ │ QQuery 41│ 105.57 ms │ 56.85 ms │ +1.86x faster │ │ QQuery 42│ 90.65 ms │ 65.76 ms │ +1.38x faster │ └──┴┴┴───┘ ┏┳┓ ┃ Benchmark Summary ┃┃ ┡╇┩ │ Total Time (tmp) │ 36402.11ms │ │ Total Time (tmp) │ 35906.88ms │ │ Average Time (tmp) │ 846.56ms │ │ Average Time (tmp) │ 835.04ms │ │ Queries Faster │ 20 │ │ Queries Slower │ 7 │ │ Queries with No Change │ 16 │ │ Queries with Failure │ 0 │ └┴┘ ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
zhuqi-lucas commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2953460233 > Looks like not a big difference to me? Some queries has 30% peformance improvement, will try to mock this in real datafusion benchmark. ``` arrow_reader_clickbench/sync/Q28 1.30 41.6±9.18ms? ?/sec 1.00 32.1±1.86ms? ?/sec ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
adriangb commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2952754881 Looks like not a big difference to me? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
zhuqi-lucas commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2952713037 Do some experiment in: https://github.com/apache/arrow-rs/pull/7624 It looks like the result is not bad, run 2 times, need to check it again: ```rust critcmp mock_customer_format mock_customer_format_reduce_columns_index groupmock_customer_format mock_customer_format_reduce_columns_index - - arrow_reader_clickbench/async/Q1 1.00 883.0±109.45µs? ?/sec 1.06 938.7±53.40µs? ?/sec arrow_reader_clickbench/async/Q101.17 28.7±5.48ms? ?/sec 1.00 24.6±3.92ms? ?/sec arrow_reader_clickbench/async/Q111.15 28.4±6.49ms? ?/sec 1.00 24.7±5.49ms? ?/sec arrow_reader_clickbench/async/Q121.00 16.4±3.45ms? ?/sec 1.07 17.5±3.33ms? ?/sec arrow_reader_clickbench/async/Q131.16 24.9±3.33ms? ?/sec 1.00 21.4±3.06ms? ?/sec arrow_reader_clickbench/async/Q141.12 28.3±4.49ms? ?/sec 1.00 25.3±3.93ms? ?/sec arrow_reader_clickbench/async/Q191.06 5.1±1.00ms? ?/sec 1.00 4.8±0.48ms? ?/sec arrow_reader_clickbench/async/Q201.12 36.1±0.83ms? ?/sec 1.00 32.2±2.58ms? ?/sec arrow_reader_clickbench/async/Q211.16 44.5±3.15ms? ?/sec 1.00 38.5±4.35ms? ?/sec arrow_reader_clickbench/async/Q221.03 55.4±7.86ms? ?/sec 1.00 53.6±6.67ms? ?/sec arrow_reader_clickbench/async/Q231.10134.4±8.08ms? ?/sec 1.00122.6±4.85ms? ?/sec arrow_reader_clickbench/async/Q241.14 25.8±5.71ms? ?/sec 1.00 22.5±3.40ms? ?/sec arrow_reader_clickbench/async/Q271.2144.3±11.05ms? ?/sec 1.00 36.7±5.02ms? ?/sec arrow_reader_clickbench/async/Q281.11 47.6±9.94ms? ?/sec 1.00 43.0±3.91ms? ?/sec arrow_reader_clickbench/async/Q301.14 37.6±5.53ms? ?/sec 1.00 32.8±3.57ms? ?/sec arrow_reader_clickbench/async/Q361.07 44.3±6.08ms? ?/sec 1.00 41.5±5.25ms? ?/sec arrow_reader_clickbench/async/Q371.01 45.3±7.10ms? ?/sec 1.00 44.7±7.40ms? ?/sec arrow_reader_clickbench/async/Q381.00 21.9±2.52ms? ?/sec 1.03 22.6±1.16ms? ?/sec arrow_reader_clickbench/async/Q391.00 23.1±2.04ms? ?/sec 1.02 23.5±1.45ms? ?/sec arrow_reader_clickbench/async/Q401.04 24.1±3.34ms? ?/sec 1.00 23.1±2.17ms? ?/sec arrow_reader_clickbench/async/Q411.11 24.2±6.06ms? ?/sec 1.00 21.7±3.14ms? ?/sec arrow_reader_clickbench/async/Q421.10 17.2±2.25ms? ?/sec 1.00 15.7±2.67ms? ?/sec arrow_reader_clickbench/sync/Q1 1.00 1066.4±26.26µs? ?/sec 1.03 1094.5±46.23µs? ?/sec arrow_reader_clickbench/sync/Q10 1.00 8.0±0.17ms? ?/sec 1.00 8.0±0.10ms? ?/sec arrow_reader_clickbench/sync/Q11 1.00 9.1±0.20ms? ?/sec 1.00 9.1±0.20ms? ?/sec arrow_reader_clickbench/sync/Q12 1.02 11.6±0.18ms? ?/sec 1.00 11.4±0.25ms? ?/sec arrow_reader_clickbench/sync/Q13 1.01 16.3±0.17ms? ?/sec 1.00 16.1±0.25ms? ?/sec arrow_reader_clickbench/sync/Q14 1.00 16.2±0.30ms? ?/sec 1.00 16.2±0.18ms? ?/sec arrow_reader_clickbench/sync/Q19 1.02 1727.9±134.63µs? ?/sec 1.00 1689.3±130.61µs? ?/sec arrow_reader_clickbench/sync/Q20 1.00 29.4±1.21ms? ?/sec 1.00 29.4±1.58ms? ?/sec arrow_reader_clickbench/sync/Q21 1.00 41.5±2.84ms? ?/sec 1.00 41.5±2.51ms? ?/sec arrow_reader_clickbench/sync/Q22 1.03 85.6±8.97ms? ?/sec 1.00 83.4±3.46ms? ?/sec arrow_reader_clickbench/sync/Q23 1.06101.0±5.82ms? ?/sec 1.00 95.6±4.93ms? ?/sec arrow_reader_clickbench/sync/Q24 1.00 20.2±2.71ms? ?/sec 1.05 21.2±4.32ms? ?/sec arrow_reader_clickbench/sync/Q27 1.16 33.2±4.42ms? ?/sec 1.00 28.7±1.94ms? ?/sec arrow_reader_clickbench/sync/Q28 1.36 41.6±9.18ms? ?/sec 1.00 30.6±1.98ms? ?/sec arrow_reader_clickbench/sync/Q30 1.20 33.9±5.95ms? ?/sec 1.00 28.3±2.18ms? ?/sec arrow_reader_clickbench/sync/Q36 1.1947.9±
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
zhuqi-lucas commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2938913313 Thank you @alamb @adriangb @etseidl , i agree, so first step may be to not copying the page index values into the rust structures for columns we don't need in the query. The second step, we can design the adaptive choice for ObjectStore and SSD. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
adriangb commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2930710421 Yeah I see two ways to go about that: 1. We make individual wrappers for each operation that decide how to fetch things. A MetadataLoader, FileReaderFactory, etc. I think this is the direction things are headed in now. 2. Configs/settings like Postgres' `random_page_cost` or some sort of `target_fetch_size`, etc and we have DataFusion determine what to do with that information. We could even add the ability to dynamically measure and set that so that eg DataFusion CLI can adapt itself. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
alamb commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2930452684 I feel like we may need to add some sort of policy as this same tradeoff is coming up when implementing filter_pushdown optimizations. Namely, is it important to minimize IO operations or are more IO operations ok if it reduces CPU/Memory requirements. As @adriangb and @etseidl say, this tradeoff is quite different depending on local vs object store. Maybe we could make some sort of ObjectStore based interface that allows the parquet reader to hint what data might be necessary (e.g. the entire range of metadata / pages before pruning) and then allow the lower level system to decide if it wanted to prefetch, buffer or just pass through the request 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
alamb commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2930459668 FWIW we should still be able to have a significant win by not copying the page index values into the rust structures for columns we don't need in the query (even if we had to fetch the bytes from object store and decode them in thrift) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
adriangb commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2928411444 Sadly I doubt there's a correct answer. It might be the opposite for a local SSD vs object storage. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
etseidl commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2928051534 > One challenge / tradeoff that would be interesting/required is that doing another async load to read more of the metdata will be very bad if that has to actually go to object store again. Yes, this has me very worried. The layout of the column index is by row group, then column. So to read just a single column requires jumping around quite a bit if there are many row groups. Also, if there is no projection involved, the entire offset index will be read as well. This will need some careful testing to see if multiple fetches are worthwhile, or if doing a single fetch with a range large enough to include all column and offset indexes needed (and then only _parsing_ the needed indexes) would be better. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
zhuqi-lucas commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2926814928 Created a arrow-rs issue, we can implement the interface first. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
alamb commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-291747 Thank you @zhuqi-lucas -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
zhuqi-lucas commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2919171527 1. The first step which i have done the experiment to rewrite the clickbench partition to support page_index, details: https://github.com/apache/datafusion/issues/16149#issuecomment-2918761743 2. The second step is to implement the code. 3. The third step is to show performance improvement for our new implementation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
zhuqi-lucas commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2919165981 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
zhuqi-lucas commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2919165846 I'd like to take this issue and try. And feel free to reassign if i don't submit a PR for a long time. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
adriangb commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2913898178 > Though since DataFusion knows what columns are needed for predicates (and thus would be used in pruning) we could easily disable loading the page index for the other columns 🤔 That's exactly my thought. You know which columns you need to filter on and after the ParquetOpener the metadata is discarded anyway, so there's no reason to read _any_ (row group or page) metadata / stats for columns you're not going to filter on. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
alamb commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2913888070 Though since DataFusion knows what columns are needed for predicates (and thus would be used in pruning) we could easily disable loading the page index for the other columns 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
alamb commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2913885348 > Yes very neat. I was actually thinking this would be along the other axis: loading metadata only for the _columns_ that are needed. My gut feeling is that a lot of compute is spent loading metadata for columns that aren't being filtered on. But I don't know if that's possible given the structure of the row group / page metadata. I think we could certainly avoid loading page metadata for columns We would probably have to add some sort of new API to [`ParquetMetadataLoader`](https://docs.rs/parquet/latest/parquet/file/metadata/struct.ParquetMetaDataReader.html) One challenge / tradeoff that would be interesting/required is that doing another async load to read more of the metdata will be very bad if that has to actually go to object store again. Influx has it all cached in memory so it doesn't matter, but in general we need to be careful of adding additional requests -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
alamb commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2913807410 FYI @adriangb -- I think this is something you were interested in too -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
adriangb commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2913868245 Yes very neat. I was actually thinking this would be along the other axis: loading metadata only for the _columns_ that are needed. My gut feeling is that a lot of compute is spent loading metadata for columns that aren't being filtered on. But I don't know if that's possible given the structure of the row group / page metadata. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]
alamb commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2913853034 Thank you @adrian-thurston -- this is a very neat idea For others not following along, most of benchmarks don't have page metadata at the moment (e.g. the ClickBench partitioned set doesn't have any page index metadata) so this wouldn't show up in our existing benchmarks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
