Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
alamb commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4500085124 π -- amazing - Thank you so much @mkleen and @nuno-faria and @kosiew -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
alamb merged PR #20047: URL: https://github.com/apache/datafusion/pull/20047 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
alamb commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4482080876 All right, adding to the merge queue to get this in for 54. Thanks again @mkleen -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
nuno-faria commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4481397868 > @nuno-faria and @kosiew -- is there anything else we need to do before merging this PR? Not on my end. Thanks again @mkleen for the hard work! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
alamb commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4481195950 @nuno-faria and @kosiew -- is there anything else we need to do before merging this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
alamb commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4481189144 > @alamb It looks to me that the slowdowns are not real, but only noise. Agreed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4466796224 @alamb It looks to me the benchmark regression is not real, but only noise. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
adriangbot commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4462833622 π€ Benchmark completed (GKE) | [trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4462702395) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Details ``` Comparing HEAD and file-stats-cache Benchmark clickbench_partitioned.json βββββ³ββββ³ββββ³ββββ β Query β HEAD β file-stats-cache βChange β β‘ββββββββββββββββ© β QQuery 0 β 1.21 / 4.71 Β±6.86 / 18.44 ms β 1.30 / 4.84 Β±6.93 / 18.69 ms β no change β β QQuery 1 β12.53 / 12.78 Β±0.18 / 12.99 ms β12.60 / 13.00 Β±0.21 / 13.17 ms β no change β β QQuery 2 β35.58 / 35.85 Β±0.28 / 36.38 ms β35.98 / 36.25 Β±0.32 / 36.72 ms β no change β β QQuery 3 β31.46 / 31.91 Β±0.46 / 32.79 ms β31.57 / 32.23 Β±0.41 / 32.73 ms β no change β β QQuery 4 β 241.91 / 244.85 Β±1.66 / 246.64 ms β 244.34 / 247.59 Β±4.37 / 256.22 ms β no change β β QQuery 5 β 285.58 / 286.91 Β±1.81 / 290.48 ms β 288.66 / 290.41 Β±1.83 / 293.24 ms β no change β β QQuery 6 β 5.91 / 7.20 Β±0.68 / 7.73 ms β 6.60 / 8.43 Β±2.02 / 12.36 ms β 1.17x slower β β QQuery 7 β13.95 / 14.13 Β±0.16 / 14.37 ms β14.08 / 14.20 Β±0.09 / 14.32 ms β no change β β QQuery 8 β 339.54 / 342.56 Β±2.50 / 347.08 ms β 345.51 / 348.46 Β±2.50 / 351.80 ms β no change β β QQuery 9 β 465.89 / 473.60 Β±5.72 / 480.85 ms β 455.62 / 469.15 Β±8.05 / 480.71 ms β no change β β QQuery 10 β71.54 / 74.50 Β±2.23 / 78.03 ms β70.03 / 73.47 Β±4.08 / 81.48 ms β no change β β QQuery 11 β82.55 / 83.09 Β±0.47 / 83.84 ms β80.75 / 81.63 Β±0.67 / 82.35 ms β no change β β QQuery 12 β 282.26 / 286.51 Β±3.69 / 292.47 ms β 284.36 / 287.10 Β±2.21 / 289.95 ms β no change β β QQuery 13 β 398.79 / 404.83 Β±5.96 / 415.54 ms β 404.20 / 409.11 Β±5.41 / 419.41 ms β no change β β QQuery 14 β 293.48 / 298.78 Β±6.53 / 311.32 ms β 295.85 / 299.17 Β±3.20 / 305.00 ms β no change β β QQuery 15 β
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
adriangbot commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4462717687 π€ Benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4462702395) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-c4462702395-151-55fcl 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux` CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Comparing file-stats-cache (f648bee41c1d50502b9b1fda63dd2e86dc8e41ea) to 9d92944 (merge-base) [diff](https://github.com/apache/datafusion/compare/9d92944900f8a50b688a4bd8685629ebd76999f3..f648bee41c1d50502b9b1fda63dd2e86dc8e41ea) using: clickbench_partitioned Results will be posted here when complete --- [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) against this benchmark runner -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
alamb commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4462706714 I want to see if some of these slowdowns are reproduced ``` β QQuery 35 β277.44 / 311.98 Β±42.77 / 392.86 ms β 309.15 / 317.20 Β±7.72 / 327.72 ms βno change β β QQuery 36 β66.14 / 70.52 Β±2.72 / 74.25 ms β70.52 / 75.12 Β±4.11 / 81.54 ms β 1.07x slower β β QQuery 37 β35.02 / 35.94 Β±0.57 / 36.58 ms β38.04 / 40.48 Β±2.45 / 44.69 ms β 1.13x slower β β QQuery 38 β39.71 / 42.03 Β±1.39 / 43.54 ms β42.75 / 46.11 Β±2.71 / 50.99 ms β 1.10x slower β β QQuery 39 β125.16 / 140.06 Β±10.52 / 149.46 ms β 143.78 / 149.00 Β±7.79 / 164.51 ms β 1.06x slower β β QQuery 40 β13.80 / 14.09 Β±0.27 / 14.59 ms β15.42 / 16.94 Β±2.43 / 21.77 ms β 1.20x slower β β QQuery 41 β13.52 / 13.55 Β±0.03 / 13.60 ms β14.99 / 15.20 Β±0.19 / 15.50 ms β 1.12x slower β β QQuery 42 β12.96 / 13.16 Β±0.14 / 13.28 ms β14.28 / 16.59 Β±4.25 / 25.09 ms β 1.26x slower β ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
alamb commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4462702395 run benchmark clickbench_partitioned -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4459100954 @kosiew Thank you for all the work and the reviews! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
kosiew commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4458125471 @mkleen I will add this to the merge queue after 24 hours if there are no further issues. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
adriangbot commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4458055885 π€ Benchmark completed (GKE) | [trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4457908563) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Details ``` Comparing HEAD and file-stats-cache Benchmark clickbench_partitioned.json βββββ³ββββ³ββββ³βββ β Query β HEAD β file-stats-cache β Change β β‘βββββββββββββββ© β QQuery 0 β 1.21 / 4.58 Β±6.67 / 17.93 ms β 1.27 / 4.69 Β±6.79 / 18.26 ms βno change β β QQuery 1 β12.46 / 12.62 Β±0.12 / 12.78 ms β12.80 / 12.90 Β±0.10 / 13.04 ms βno change β β QQuery 2 β35.22 / 35.57 Β±0.29 / 36.08 ms β35.57 / 35.83 Β±0.18 / 36.12 ms βno change β β QQuery 3 β30.78 / 32.84 Β±1.91 / 35.66 ms β30.74 / 31.63 Β±0.85 / 32.91 ms βno change β β QQuery 4 β 230.28 / 234.55 Β±3.18 / 238.52 ms β 226.34 / 231.59 Β±3.27 / 234.96 ms βno change β β QQuery 5 β 275.85 / 277.08 Β±1.61 / 280.21 ms β 271.34 / 275.08 Β±2.60 / 278.63 ms βno change β β QQuery 6 β 6.78 / 7.05 Β±0.27 / 7.55 ms β 6.94 / 7.18 Β±0.20 / 7.50 ms βno change β β QQuery 7 β13.43 / 13.52 Β±0.06 / 13.59 ms β13.58 / 13.81 Β±0.16 / 14.06 ms βno change β β QQuery 8 β 324.80 / 328.39 Β±2.52 / 331.95 ms β 323.04 / 327.41 Β±2.72 / 331.44 ms βno change β β QQuery 9 β 448.01 / 452.30 Β±3.96 / 457.57 ms β 443.43 / 449.73 Β±6.36 / 460.10 ms βno change β β QQuery 10 β70.36 / 71.07 Β±0.63 / 72.11 ms β69.35 / 70.94 Β±2.41 / 75.74 ms βno change β β QQuery 11 β81.10 / 81.51 Β±0.38 / 81.97 ms β79.23 / 82.35 Β±3.64 / 89.48 ms βno change β β QQuery 12 β 268.41 / 270.50 Β±2.01 / 273.33 ms β 268.41 / 271.12 Β±3.02 / 276.25 ms βno change β β QQuery 13 β 381.12 / 389.06 Β±5.76 / 396.66 ms β 384.90 / 392.02 Β±9.20 / 409.94 ms βno change β β QQuery 14 β 280.97 / 283.81 Β±3.36 / 290.36 ms β 281.68 / 284.83 Β±3.82 / 292.30 ms βno change β β QQuery 15 β 275.82 / 277.7
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4457998524 @kosiew Thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
adriangbot commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4457928339 π€ Benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4457908563) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-c4457908563-122-6mxc6 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux` CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Comparing file-stats-cache (f648bee41c1d50502b9b1fda63dd2e86dc8e41ea) to 9d92944 (merge-base) [diff](https://github.com/apache/datafusion/compare/9d92944900f8a50b688a4bd8685629ebd76999f3..f648bee41c1d50502b9b1fda63dd2e86dc8e41ea) using: clickbench_partitioned Results will be posted here when complete --- [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) against this benchmark runner -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
kosiew commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4457908563 run benchmark clickbench_partitioned -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4457457101 Hi @kosiew This is ready for another review. I fixed all remaining open issues. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3221422772
##
datafusion/core/src/execution/context/mod.rs:
##
@@ -1420,14 +1429,19 @@ impl SessionContext {
&& table_provider.table_type() == table_type
{
schema.deregister_table(&table)?;
-if table_type == TableType::Base
-&& let Some(lfc) =
self.runtime_env().cache_manager.get_list_files_cache()
-{
-lfc.drop_table_entries(&Some(table_ref))?;
+if table_type == TableType::Base {
+if let Some(lfc) =
self.runtime_env().cache_manager.get_list_files_cache()
+{
+lfc.drop_table_entries(&Some(table_ref.clone()))?;
+}
+if let Some(fsc) =
+self.runtime_env().cache_manager.get_file_statistic_cache()
+{
Review Comment:
See
https://github.com/apache/datafusion/pull/20047/changes/63fafe41ba315149dad37980a7370f8995e53499
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3220859959
##
datafusion/core/src/execution/context/mod.rs:
##
@@ -1879,6 +1881,7 @@ impl SessionContext {
) -> Result>> {
let table_ref = table_ref.into();
let table = table_ref.table().to_owned();
+self.invalidate_caches(&Some(table_ref.clone()))?;
Review Comment:
This invalidates the caches when `deregister_table()` is called.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3220859959
##
datafusion/core/src/execution/context/mod.rs:
##
@@ -1879,6 +1881,7 @@ impl SessionContext {
) -> Result>> {
let table_ref = table_ref.into();
let table = table_ref.table().to_owned();
+self.invalidate_caches(&Some(table_ref.clone()))?;
Review Comment:
This invalidates the caches when `deregister_table()` is called.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3220859959
##
datafusion/core/src/execution/context/mod.rs:
##
@@ -1879,6 +1881,7 @@ impl SessionContext {
) -> Result>> {
let table_ref = table_ref.into();
let table = table_ref.table().to_owned();
+self.invalidate_caches(&Some(table_ref.clone()))?;
Review Comment:
This invalidates the caches when `deregister_table` is called.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4413607400 run benchmark clickbench_partitioned -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4413107933 run benchmark clickbench_partitioned -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
kosiew commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3212246881
##
datafusion/core/src/execution/context/mod.rs:
##
@@ -1420,14 +1429,19 @@ impl SessionContext {
&& table_provider.table_type() == table_type
{
schema.deregister_table(&table)?;
-if table_type == TableType::Base
-&& let Some(lfc) =
self.runtime_env().cache_manager.get_list_files_cache()
-{
-lfc.drop_table_entries(&Some(table_ref))?;
+if table_type == TableType::Base {
+if let Some(lfc) =
self.runtime_env().cache_manager.get_list_files_cache()
+{
+lfc.drop_table_entries(&Some(table_ref.clone()))?;
+}
+if let Some(fsc) =
+self.runtime_env().cache_manager.get_file_statistic_cache()
+{
Review Comment:
Nice improvement adding cache cleanup through `find_and_deregister()`. One
thing still missing though is the public `SessionContext::deregister_table()`
path.
Right now `register_listing_table()` can populate table-scoped list-files
and file-statistics cache entries, but calling the public `deregister_table()`
API only removes the provider and leaves those cache entries behind.
That means API users can still end up with stale table-scoped entries
visible in cache introspection until they age out via LRU eviction.
Could we mirror the same cleanup logic used in `find_and_deregister()` here
by removing entries from both `get_list_files_cache()` and
`get_file_statistic_cache()` after a successful deregistration?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
kosiew commented on code in PR #20047: URL: https://github.com/apache/datafusion/pull/20047#discussion_r3212211721 ## datafusion/sqllogictest/test_files/set_variable.slt: ## @@ -605,6 +611,15 @@ SHOW datafusion.runtime.max_temp_directory_size datafusion.runtime.max_temp_directory_size 10G +# Test SET and SHOW runtime.file_statistics_cache_limit +statement ok +SET datafusion.runtime.file_statistics_cache_limit = '42M' Review Comment: This SLT block sets `datafusion.runtime.file_statistics_cache_limit` and verifies it with `SHOW`, but it does not reset the value afterward. To avoid order-dependent behavior in later SLT statements, could we add: ```sql statement ok RESET datafusion.runtime.file_statistics_cache_limit ``` after the `SHOW` verification block? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3208656616
##
datafusion-cli/src/main.rs:
##
@@ -687,68 +682,17 @@ mod tests {
.await?;
}
-// When the cache manager creates a StatisticsCache by default,
-// the contents will show up here
-let sql = "SELECT split_part(path, '/', -1) as filename,
file_size_bytes, num_rows, num_columns, table_size_bytes from
statistics_cache() order by filename";
+let sql = "SELECT split_part(path, '/', -1) as filename, table,
file_size_bytes, num_rows, num_columns, table_size_bytes from
statistics_cache() order by filename";
let df = ctx.sql(sql).await?;
let rbs = df.collect().await?;
assert_snapshot!(batches_to_string(&rbs),@r"
-++
-++
-");
-
-Ok(())
-}
-
-// Can be removed when https://github.com/apache/datafusion/issues/19217
is resolved
-#[tokio::test]
-async fn test_statistics_cache_override() -> Result<(), DataFusionError> {
-// Install a specific StatisticsCache implementation
-let file_statistics_cache =
Arc::new(DefaultFileStatisticsCache::default());
-let cache_config = CacheManagerConfig::default()
-.with_files_statistics_cache(Some(file_statistics_cache.clone()));
-let runtime = RuntimeEnvBuilder::new()
-.with_cache_manager(cache_config)
-.build()?;
-let config = SessionConfig::new().with_collect_statistics(true);
-let ctx = SessionContext::new_with_config_rt(config,
Arc::new(runtime));
-
-ctx.register_udtf(
-"statistics_cache",
-Arc::new(StatisticsCacheFunc::new(
-ctx.task_ctx().runtime_env().cache_manager.clone(),
-)),
-);
-
-for filename in [
-"alltypes_plain",
-"alltypes_tiny_pages",
-"lz4_raw_compressed_larger",
-] {
-ctx.sql(
-format!(
-"create external table {filename}
-stored as parquet
-location '../parquet-testing/data/{filename}.parquet'",
-)
-.as_str(),
-)
-.await?
-.collect()
-.await?;
-}
-
-let sql = "SELECT split_part(path, '/', -1) as filename,
file_size_bytes, num_rows, num_columns, table_size_bytes from
statistics_cache() order by filename";
-let df = ctx.sql(sql).await?;
-let rbs = df.collect().await?;
-assert_snapshot!(batches_to_string(&rbs),@r"
-
+---+-+--+-+--+
-| filename | file_size_bytes | num_rows |
num_columns | table_size_bytes |
-
+---+-+--+-+--+
-| alltypes_plain.parquet| 1851| Exact(8) |
11 | Absent |
-| alltypes_tiny_pages.parquet | 454233 | Exact(7300) |
13 | Absent |
-| lz4_raw_compressed_larger.parquet | 380836 | Exact(1) |
1 | Absent |
-
+---+-+--+-+--+
+
+---+---+-+--+-+--+
+ | filename | table |
file_size_bytes | num_rows | num_columns | table_size_bytes |
+
+---+---+-+--+-+--+
+ | alltypes_plain.parquet| alltypes_plain|
1851| Exact(8) | 11 | Absent |
Review Comment:
Since the file statistics cache is now tablescoped, this should be reflected
in the output. This also tests that the table references are always included in
the key now.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3208656616
##
datafusion-cli/src/main.rs:
##
@@ -687,68 +682,17 @@ mod tests {
.await?;
}
-// When the cache manager creates a StatisticsCache by default,
-// the contents will show up here
-let sql = "SELECT split_part(path, '/', -1) as filename,
file_size_bytes, num_rows, num_columns, table_size_bytes from
statistics_cache() order by filename";
+let sql = "SELECT split_part(path, '/', -1) as filename, table,
file_size_bytes, num_rows, num_columns, table_size_bytes from
statistics_cache() order by filename";
let df = ctx.sql(sql).await?;
let rbs = df.collect().await?;
assert_snapshot!(batches_to_string(&rbs),@r"
-++
-++
-");
-
-Ok(())
-}
-
-// Can be removed when https://github.com/apache/datafusion/issues/19217
is resolved
-#[tokio::test]
-async fn test_statistics_cache_override() -> Result<(), DataFusionError> {
-// Install a specific StatisticsCache implementation
-let file_statistics_cache =
Arc::new(DefaultFileStatisticsCache::default());
-let cache_config = CacheManagerConfig::default()
-.with_files_statistics_cache(Some(file_statistics_cache.clone()));
-let runtime = RuntimeEnvBuilder::new()
-.with_cache_manager(cache_config)
-.build()?;
-let config = SessionConfig::new().with_collect_statistics(true);
-let ctx = SessionContext::new_with_config_rt(config,
Arc::new(runtime));
-
-ctx.register_udtf(
-"statistics_cache",
-Arc::new(StatisticsCacheFunc::new(
-ctx.task_ctx().runtime_env().cache_manager.clone(),
-)),
-);
-
-for filename in [
-"alltypes_plain",
-"alltypes_tiny_pages",
-"lz4_raw_compressed_larger",
-] {
-ctx.sql(
-format!(
-"create external table {filename}
-stored as parquet
-location '../parquet-testing/data/{filename}.parquet'",
-)
-.as_str(),
-)
-.await?
-.collect()
-.await?;
-}
-
-let sql = "SELECT split_part(path, '/', -1) as filename,
file_size_bytes, num_rows, num_columns, table_size_bytes from
statistics_cache() order by filename";
-let df = ctx.sql(sql).await?;
-let rbs = df.collect().await?;
-assert_snapshot!(batches_to_string(&rbs),@r"
-
+---+-+--+-+--+
-| filename | file_size_bytes | num_rows |
num_columns | table_size_bytes |
-
+---+-+--+-+--+
-| alltypes_plain.parquet| 1851| Exact(8) |
11 | Absent |
-| alltypes_tiny_pages.parquet | 454233 | Exact(7300) |
13 | Absent |
-| lz4_raw_compressed_larger.parquet | 380836 | Exact(1) |
1 | Absent |
-
+---+-+--+-+--+
+
+---+---+-+--+-+--+
+ | filename | table |
file_size_bytes | num_rows | num_columns | table_size_bytes |
+
+---+---+-+--+-+--+
+ | alltypes_plain.parquet| alltypes_plain|
1851| Exact(8) | 11 | Absent |
Review Comment:
Since the file statistics cache is now tablescoped, this should be reflected
in the output. This also tests that the tablereferences are always included.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3208663879
##
datafusion/catalog-listing/src/helpers.rs:
##
@@ -416,8 +419,15 @@ pub async fn pruned_partition_list<'a>(
table_path
);
-// if no partition col => simply list all the files
-Ok(objects.map_ok(|object_meta| object_meta.into()).boxed())
+// if no partition col => list all the files
+Ok(objects
+.try_filter_map(|object_meta| {
+futures::future::ready(object_meta_to_partitioned_file(
+object_meta,
+table_path.get_table_ref(),
Review Comment:
Table-reference needs to be always passed on the `PartitionedFile` because
file statistics cache is now table-scoped and need the Table-reference for the
caching for the key.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4406461353 I also made sure https://github.com/apache/datafusion/pull/22014 is now part of this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4406450456 @nuno-faria @kosiew @martin-g https://github.com/apache/datafusion/commit/d09a9195e711abe60d25fdfe3f7b3ddb1d7a78ab revealed up for this pr. The were cache entries which had no table reference and would therefore not be removed when the table is dropped. This is now fixed. Could you do another review please? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3208663879
##
datafusion/catalog-listing/src/helpers.rs:
##
@@ -416,8 +419,15 @@ pub async fn pruned_partition_list<'a>(
table_path
);
-// if no partition col => simply list all the files
-Ok(objects.map_ok(|object_meta| object_meta.into()).boxed())
+// if no partition col => list all the files
+Ok(objects
+.try_filter_map(|object_meta| {
+futures::future::ready(object_meta_to_partitioned_file(
+object_meta,
+table_path.get_table_ref(),
Review Comment:
Table-reference needs to be always passed on the `PartitionedFile` because
file statistics cache is now table-scoped and need the Table-reference for the
caching.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3208656616
##
datafusion-cli/src/main.rs:
##
@@ -687,68 +682,17 @@ mod tests {
.await?;
}
-// When the cache manager creates a StatisticsCache by default,
-// the contents will show up here
-let sql = "SELECT split_part(path, '/', -1) as filename,
file_size_bytes, num_rows, num_columns, table_size_bytes from
statistics_cache() order by filename";
+let sql = "SELECT split_part(path, '/', -1) as filename, table,
file_size_bytes, num_rows, num_columns, table_size_bytes from
statistics_cache() order by filename";
let df = ctx.sql(sql).await?;
let rbs = df.collect().await?;
assert_snapshot!(batches_to_string(&rbs),@r"
-++
-++
-");
-
-Ok(())
-}
-
-// Can be removed when https://github.com/apache/datafusion/issues/19217
is resolved
-#[tokio::test]
-async fn test_statistics_cache_override() -> Result<(), DataFusionError> {
-// Install a specific StatisticsCache implementation
-let file_statistics_cache =
Arc::new(DefaultFileStatisticsCache::default());
-let cache_config = CacheManagerConfig::default()
-.with_files_statistics_cache(Some(file_statistics_cache.clone()));
-let runtime = RuntimeEnvBuilder::new()
-.with_cache_manager(cache_config)
-.build()?;
-let config = SessionConfig::new().with_collect_statistics(true);
-let ctx = SessionContext::new_with_config_rt(config,
Arc::new(runtime));
-
-ctx.register_udtf(
-"statistics_cache",
-Arc::new(StatisticsCacheFunc::new(
-ctx.task_ctx().runtime_env().cache_manager.clone(),
-)),
-);
-
-for filename in [
-"alltypes_plain",
-"alltypes_tiny_pages",
-"lz4_raw_compressed_larger",
-] {
-ctx.sql(
-format!(
-"create external table {filename}
-stored as parquet
-location '../parquet-testing/data/{filename}.parquet'",
-)
-.as_str(),
-)
-.await?
-.collect()
-.await?;
-}
-
-let sql = "SELECT split_part(path, '/', -1) as filename,
file_size_bytes, num_rows, num_columns, table_size_bytes from
statistics_cache() order by filename";
-let df = ctx.sql(sql).await?;
-let rbs = df.collect().await?;
-assert_snapshot!(batches_to_string(&rbs),@r"
-
+---+-+--+-+--+
-| filename | file_size_bytes | num_rows |
num_columns | table_size_bytes |
-
+---+-+--+-+--+
-| alltypes_plain.parquet| 1851| Exact(8) |
11 | Absent |
-| alltypes_tiny_pages.parquet | 454233 | Exact(7300) |
13 | Absent |
-| lz4_raw_compressed_larger.parquet | 380836 | Exact(1) |
1 | Absent |
-
+---+-+--+-+--+
+
+---+---+-+--+-+--+
+ | filename | table |
file_size_bytes | num_rows | num_columns | table_size_bytes |
+
+---+---+-+--+-+--+
+ | alltypes_plain.parquet| alltypes_plain|
1851| Exact(8) | 11 | Absent |
Review Comment:
Since the file statistics cache is now tablescoped, this should be reflected
in the output.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4388401223 The filter pushdown does not work as expected anymore when the stats are cached. The commit https://github.com/apache/datafusion/commit/d09a9195e711abe60d25fdfe3f7b3ddb1d7a78ab changes the stats handling for this case. I will look into it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4388347487 I would say it's a bug which wasn't exposed before. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4388341446 > We'll probably have to figure out what is causing that failure / if we should update the expected output or if there is a bug to fix Rhe query in the test is executed twice, once for the output and once for the plan (EXPLAIN) and now the caching kicks in, which wasn't the case before. I will look into the details. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
alamb commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4388315179 We'll probably have to figure out what is causing that failure / if we should update the expected output or if there is a bug to fix -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
alamb commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4388308017 > This is related to the this commit https://github.com/apache/datafusion/commit/d09a9195e711abe60d25fdfe3f7b3ddb1d7a78ab If I revert this commit everything is passing. How should i proceed here? I recommend merging up from main and hopefully it passes; I'll click the button and see what happens -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4388281109
> @nuno-faria and @kosiew -- shall we merge this PR? Or is it waiting on
anything else?
@alamb No, there is a blocker. While merging in the master branch i noticed
on failing test:
```
at
/Users/mkleen/Test/datafusion/datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt:700
2. query result mismatch:
[SQL] EXPLAIN SELECT b FROM (
SELECT a, b FROM t_union_mem
UNION ALL
SELECT a, b FROM t_union_parquet
) WHERE b > 2;
[Diff] (-expected|+actual)
logical_plan
01)Projection: b
02)--Filter: b > Int64(2)
03)Union
04)--Projection: t_union_mem.a, t_union_mem.b
05)TableScan: t_union_mem
06)--Projection: t_union_parquet.a, t_union_parquet.b
- 07)TableScan: t_union_parquet
- physical_plan
- 01)UnionExec
- 02)--FilterExec: b@0 > 2
- 03)DataSourceExec: partitions=1, partition_sizes=[1]
- 04)--DataSourceExec: file_groups={1 group:
[[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet]]},
projection=[b], file_type=parquet, predicate=b@1 > 2,
pruning_predicate=b_null_count@1 != row_count@2 AND b_max@0 > 2,
required_guarantees=[]
+ 07)TableScan: t_union_parquet
at
/Users/mkleen/Test/datafusion/datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt:713
```
This is related to the this commit
https://github.com/apache/datafusion/commit/d09a9195e711abe60d25fdfe3f7b3ddb1d7a78ab
If revert this commit everything is passing. How should i proceed here?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
alamb commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4388041815 @nuno-faria and @kosiew -- shall we merge this PR? Or is it waiting on anything else? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3187757931
##
datafusion/common/src/heap_size.rs:
##
@@ -0,0 +1,561 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::stats::Precision;
+use crate::{ColumnStatistics, ScalarValue, Statistics, TableReference};
+use arrow::array::{
+Array, FixedSizeListArray, LargeListArray, LargeListViewArray, ListArray,
+ListViewArray, MapArray, StructArray,
+};
+use arrow::datatypes::{
+DataType, Field, Fields, IntervalDayTime, IntervalMonthDayNano,
IntervalUnit,
+TimeUnit, UnionFields, UnionMode, i256,
+};
+use chrono::{DateTime, Utc};
+use half::f16;
+use hashbrown::HashSet;
+use std::collections::HashMap;
+use std::fmt::Debug;
+use std::sync::Arc;
+
+/// This is a temporary solution until
and
+/// are resolved.
+/// Trait for calculating the size of various containers
+pub trait DFHeapSize {
+/// Return the size of any bytes allocated on the heap by this object,
+/// including heap memory in those structures
+///
+/// Note that the size of the type itself is not included in the result --
+/// instead, that size is added by the caller (e.g. container).
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize;
+}
+
+#[derive(Default)]
+pub struct DFHeapSizeCtx {
+seen: HashSet,
+}
+
+impl DFHeapSize for Statistics {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.num_rows.heap_size(ctx)
++ self.total_byte_size.heap_size(ctx)
++ self.column_statistics.heap_size(ctx)
+}
+}
+
+impl DFHeapSize for TableReference {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+match self {
+TableReference::Bare { table } => table.heap_size(ctx),
+TableReference::Partial { schema, table } => {
+schema.heap_size(ctx) + table.heap_size(ctx)
+}
+TableReference::Full {
+catalog,
+schema,
+table,
+} => catalog.heap_size(ctx) + schema.heap_size(ctx) +
table.heap_size(ctx),
+}
+}
+}
+
+impl DFHeapSize
+for Precision
+{
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.get_value().map_or_else(|| 0, |v| v.heap_size(ctx))
+}
+}
+
+impl DFHeapSize for ColumnStatistics {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.null_count.heap_size(ctx)
++ self.max_value.heap_size(ctx)
++ self.min_value.heap_size(ctx)
++ self.sum_value.heap_size(ctx)
++ self.distinct_count.heap_size(ctx)
++ self.byte_size.heap_size(ctx)
+}
+}
+
+impl DFHeapSize for ScalarValue {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+use crate::scalar::ScalarValue::*;
+match self {
+Null => 0,
+Boolean(b) => b.heap_size(ctx),
+Float16(f) => f.heap_size(ctx),
+Float32(f) => f.heap_size(ctx),
+Float64(f) => f.heap_size(ctx),
+Decimal32(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) +
c.heap_size(ctx),
+Decimal64(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) +
c.heap_size(ctx),
+Decimal128(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) +
c.heap_size(ctx),
+Decimal256(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) +
c.heap_size(ctx),
+Int8(i) => i.heap_size(ctx),
+Int16(i) => i.heap_size(ctx),
+Int32(i) => i.heap_size(ctx),
+Int64(i) => i.heap_size(ctx),
+UInt8(u) => u.heap_size(ctx),
+UInt16(u) => u.heap_size(ctx),
+UInt32(u) => u.heap_size(ctx),
+UInt64(u) => u.heap_size(ctx),
+Utf8(u) => u.heap_size(ctx),
+Utf8View(u) => u.heap_size(ctx),
+LargeUtf8(l) => l.heap_size(ctx),
+Binary(b) => b.heap_size(ctx),
+BinaryView(b) => b.heap_size(ctx),
+FixedSizeBinary(a, b) => a.heap_size(ctx) + b.heap_size(ctx),
+Large
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3187716765
##
datafusion/common/src/heap_size.rs:
##
@@ -0,0 +1,561 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::stats::Precision;
+use crate::{ColumnStatistics, ScalarValue, Statistics, TableReference};
+use arrow::array::{
+Array, FixedSizeListArray, LargeListArray, LargeListViewArray, ListArray,
+ListViewArray, MapArray, StructArray,
+};
+use arrow::datatypes::{
+DataType, Field, Fields, IntervalDayTime, IntervalMonthDayNano,
IntervalUnit,
+TimeUnit, UnionFields, UnionMode, i256,
+};
+use chrono::{DateTime, Utc};
+use half::f16;
+use hashbrown::HashSet;
+use std::collections::HashMap;
+use std::fmt::Debug;
+use std::sync::Arc;
+
+/// This is a temporary solution until
and
+/// are resolved.
+/// Trait for calculating the size of various containers
+pub trait DFHeapSize {
+/// Return the size of any bytes allocated on the heap by this object,
+/// including heap memory in those structures
+///
+/// Note that the size of the type itself is not included in the result --
+/// instead, that size is added by the caller (e.g. container).
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize;
+}
+
+#[derive(Default)]
+pub struct DFHeapSizeCtx {
+seen: HashSet,
+}
+
+impl DFHeapSize for Statistics {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.num_rows.heap_size(ctx)
++ self.total_byte_size.heap_size(ctx)
++ self.column_statistics.heap_size(ctx)
+}
+}
+
+impl DFHeapSize for TableReference {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+match self {
+TableReference::Bare { table } => table.heap_size(ctx),
+TableReference::Partial { schema, table } => {
+schema.heap_size(ctx) + table.heap_size(ctx)
+}
+TableReference::Full {
+catalog,
+schema,
+table,
+} => catalog.heap_size(ctx) + schema.heap_size(ctx) +
table.heap_size(ctx),
+}
+}
+}
+
+impl DFHeapSize
+for Precision
+{
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.get_value().map_or_else(|| 0, |v| v.heap_size(ctx))
+}
+}
+
+impl DFHeapSize for ColumnStatistics {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.null_count.heap_size(ctx)
++ self.max_value.heap_size(ctx)
++ self.min_value.heap_size(ctx)
++ self.sum_value.heap_size(ctx)
++ self.distinct_count.heap_size(ctx)
++ self.byte_size.heap_size(ctx)
+}
+}
+
+impl DFHeapSize for ScalarValue {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+use crate::scalar::ScalarValue::*;
+match self {
+Null => 0,
+Boolean(b) => b.heap_size(ctx),
+Float16(f) => f.heap_size(ctx),
+Float32(f) => f.heap_size(ctx),
+Float64(f) => f.heap_size(ctx),
+Decimal32(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) +
c.heap_size(ctx),
+Decimal64(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) +
c.heap_size(ctx),
+Decimal128(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) +
c.heap_size(ctx),
+Decimal256(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) +
c.heap_size(ctx),
+Int8(i) => i.heap_size(ctx),
+Int16(i) => i.heap_size(ctx),
+Int32(i) => i.heap_size(ctx),
+Int64(i) => i.heap_size(ctx),
+UInt8(u) => u.heap_size(ctx),
+UInt16(u) => u.heap_size(ctx),
+UInt32(u) => u.heap_size(ctx),
+UInt64(u) => u.heap_size(ctx),
+Utf8(u) => u.heap_size(ctx),
+Utf8View(u) => u.heap_size(ctx),
+LargeUtf8(l) => l.heap_size(ctx),
+Binary(b) => b.heap_size(ctx),
+BinaryView(b) => b.heap_size(ctx),
+FixedSizeBinary(a, b) => a.heap_size(ctx) + b.heap_size(ctx),
+Large
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3186723956
##
docs/source/library-user-guide/upgrading/54.0.0.md:
##
@@ -460,3 +460,42 @@ impl Default for MyTreeNode {
}
}
```
+
+[20047]: https://github.com/apache/datafusion/pull/20047
+
+### File statistics cache is now memory-limited and managed by the
`CacheManager`
+
+The file statistics cache used by `ListingTable` is now memory-limited and
+centrally managed through the `CacheManager`.
+
+To configure the cache size use the `file_statistics_cache_limit` setting:
+
+```sql
+SET datafusion.runtime.file_statistics_cache_limit = '10MB'
+```
+
+To disable the file statistics cache, set the limit to 0.
+
+The file statistics cache is no longer created inside the `ListingTable`.
+Instead, it is created within the `CacheManager` and must be passed to the
`ListingTable`.
+
+**Who is affected:**
+
+- Users who want to limit the memory usage of the file statistics cache.
+- Users who want to disable the file statistics cache.
+- Users who want to create a `ListingTable` programmatically with a file
statistics cache.
+
+**Migration guide:**
+
+Disable the cache by setting the configuration value to 0:
Review Comment:
See https://github.com/apache/datafusion/pull/22014
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3187716765
##
datafusion/common/src/heap_size.rs:
##
@@ -0,0 +1,561 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::stats::Precision;
+use crate::{ColumnStatistics, ScalarValue, Statistics, TableReference};
+use arrow::array::{
+Array, FixedSizeListArray, LargeListArray, LargeListViewArray, ListArray,
+ListViewArray, MapArray, StructArray,
+};
+use arrow::datatypes::{
+DataType, Field, Fields, IntervalDayTime, IntervalMonthDayNano,
IntervalUnit,
+TimeUnit, UnionFields, UnionMode, i256,
+};
+use chrono::{DateTime, Utc};
+use half::f16;
+use hashbrown::HashSet;
+use std::collections::HashMap;
+use std::fmt::Debug;
+use std::sync::Arc;
+
+/// This is a temporary solution until
and
+/// are resolved.
+/// Trait for calculating the size of various containers
+pub trait DFHeapSize {
+/// Return the size of any bytes allocated on the heap by this object,
+/// including heap memory in those structures
+///
+/// Note that the size of the type itself is not included in the result --
+/// instead, that size is added by the caller (e.g. container).
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize;
+}
+
+#[derive(Default)]
+pub struct DFHeapSizeCtx {
+seen: HashSet,
+}
+
+impl DFHeapSize for Statistics {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.num_rows.heap_size(ctx)
++ self.total_byte_size.heap_size(ctx)
++ self.column_statistics.heap_size(ctx)
+}
+}
+
+impl DFHeapSize for TableReference {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+match self {
+TableReference::Bare { table } => table.heap_size(ctx),
+TableReference::Partial { schema, table } => {
+schema.heap_size(ctx) + table.heap_size(ctx)
+}
+TableReference::Full {
+catalog,
+schema,
+table,
+} => catalog.heap_size(ctx) + schema.heap_size(ctx) +
table.heap_size(ctx),
+}
+}
+}
+
+impl DFHeapSize
+for Precision
+{
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.get_value().map_or_else(|| 0, |v| v.heap_size(ctx))
+}
+}
+
+impl DFHeapSize for ColumnStatistics {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.null_count.heap_size(ctx)
++ self.max_value.heap_size(ctx)
++ self.min_value.heap_size(ctx)
++ self.sum_value.heap_size(ctx)
++ self.distinct_count.heap_size(ctx)
++ self.byte_size.heap_size(ctx)
+}
+}
+
+impl DFHeapSize for ScalarValue {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+use crate::scalar::ScalarValue::*;
+match self {
+Null => 0,
+Boolean(b) => b.heap_size(ctx),
+Float16(f) => f.heap_size(ctx),
+Float32(f) => f.heap_size(ctx),
+Float64(f) => f.heap_size(ctx),
+Decimal32(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) +
c.heap_size(ctx),
+Decimal64(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) +
c.heap_size(ctx),
+Decimal128(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) +
c.heap_size(ctx),
+Decimal256(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) +
c.heap_size(ctx),
+Int8(i) => i.heap_size(ctx),
+Int16(i) => i.heap_size(ctx),
+Int32(i) => i.heap_size(ctx),
+Int64(i) => i.heap_size(ctx),
+UInt8(u) => u.heap_size(ctx),
+UInt16(u) => u.heap_size(ctx),
+UInt32(u) => u.heap_size(ctx),
+UInt64(u) => u.heap_size(ctx),
+Utf8(u) => u.heap_size(ctx),
+Utf8View(u) => u.heap_size(ctx),
+LargeUtf8(l) => l.heap_size(ctx),
+Binary(b) => b.heap_size(ctx),
+BinaryView(b) => b.heap_size(ctx),
+FixedSizeBinary(a, b) => a.heap_size(ctx) + b.heap_size(ctx),
+Large
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3187716765
##
datafusion/common/src/heap_size.rs:
##
@@ -0,0 +1,561 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::stats::Precision;
+use crate::{ColumnStatistics, ScalarValue, Statistics, TableReference};
+use arrow::array::{
+Array, FixedSizeListArray, LargeListArray, LargeListViewArray, ListArray,
+ListViewArray, MapArray, StructArray,
+};
+use arrow::datatypes::{
+DataType, Field, Fields, IntervalDayTime, IntervalMonthDayNano,
IntervalUnit,
+TimeUnit, UnionFields, UnionMode, i256,
+};
+use chrono::{DateTime, Utc};
+use half::f16;
+use hashbrown::HashSet;
+use std::collections::HashMap;
+use std::fmt::Debug;
+use std::sync::Arc;
+
+/// This is a temporary solution until
and
+/// are resolved.
+/// Trait for calculating the size of various containers
+pub trait DFHeapSize {
+/// Return the size of any bytes allocated on the heap by this object,
+/// including heap memory in those structures
+///
+/// Note that the size of the type itself is not included in the result --
+/// instead, that size is added by the caller (e.g. container).
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize;
+}
+
+#[derive(Default)]
+pub struct DFHeapSizeCtx {
+seen: HashSet,
+}
+
+impl DFHeapSize for Statistics {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.num_rows.heap_size(ctx)
++ self.total_byte_size.heap_size(ctx)
++ self.column_statistics.heap_size(ctx)
+}
+}
+
+impl DFHeapSize for TableReference {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+match self {
+TableReference::Bare { table } => table.heap_size(ctx),
+TableReference::Partial { schema, table } => {
+schema.heap_size(ctx) + table.heap_size(ctx)
+}
+TableReference::Full {
+catalog,
+schema,
+table,
+} => catalog.heap_size(ctx) + schema.heap_size(ctx) +
table.heap_size(ctx),
+}
+}
+}
+
+impl DFHeapSize
+for Precision
+{
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.get_value().map_or_else(|| 0, |v| v.heap_size(ctx))
+}
+}
+
+impl DFHeapSize for ColumnStatistics {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.null_count.heap_size(ctx)
++ self.max_value.heap_size(ctx)
++ self.min_value.heap_size(ctx)
++ self.sum_value.heap_size(ctx)
++ self.distinct_count.heap_size(ctx)
++ self.byte_size.heap_size(ctx)
+}
+}
+
+impl DFHeapSize for ScalarValue {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+use crate::scalar::ScalarValue::*;
+match self {
+Null => 0,
+Boolean(b) => b.heap_size(ctx),
+Float16(f) => f.heap_size(ctx),
+Float32(f) => f.heap_size(ctx),
+Float64(f) => f.heap_size(ctx),
+Decimal32(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) +
c.heap_size(ctx),
+Decimal64(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) +
c.heap_size(ctx),
+Decimal128(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) +
c.heap_size(ctx),
+Decimal256(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) +
c.heap_size(ctx),
+Int8(i) => i.heap_size(ctx),
+Int16(i) => i.heap_size(ctx),
+Int32(i) => i.heap_size(ctx),
+Int64(i) => i.heap_size(ctx),
+UInt8(u) => u.heap_size(ctx),
+UInt16(u) => u.heap_size(ctx),
+UInt32(u) => u.heap_size(ctx),
+UInt64(u) => u.heap_size(ctx),
+Utf8(u) => u.heap_size(ctx),
+Utf8View(u) => u.heap_size(ctx),
+LargeUtf8(l) => l.heap_size(ctx),
+Binary(b) => b.heap_size(ctx),
+BinaryView(b) => b.heap_size(ctx),
+FixedSizeBinary(a, b) => a.heap_size(ctx) + b.heap_size(ctx),
+Large
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3186723956
##
docs/source/library-user-guide/upgrading/54.0.0.md:
##
@@ -460,3 +460,42 @@ impl Default for MyTreeNode {
}
}
```
+
+[20047]: https://github.com/apache/datafusion/pull/20047
+
+### File statistics cache is now memory-limited and managed by the
`CacheManager`
+
+The file statistics cache used by `ListingTable` is now memory-limited and
+centrally managed through the `CacheManager`.
+
+To configure the cache size use the `file_statistics_cache_limit` setting:
+
+```sql
+SET datafusion.runtime.file_statistics_cache_limit = '10MB'
+```
+
+To disable the file statistics cache, set the limit to 0.
+
+The file statistics cache is no longer created inside the `ListingTable`.
+Instead, it is created within the `CacheManager` and must be passed to the
`ListingTable`.
+
+**Who is affected:**
+
+- Users who want to limit the memory usage of the file statistics cache.
+- Users who want to disable the file statistics cache.
+- Users who want to create a `ListingTable` programmatically with a file
statistics cache.
+
+**Migration guide:**
+
+Disable the cache by setting the configuration value to 0:
Review Comment:
https://github.com/apache/datafusion/pull/22014
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
martin-g commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3181769626
##
docs/source/library-user-guide/upgrading/54.0.0.md:
##
@@ -460,3 +460,42 @@ impl Default for MyTreeNode {
}
}
```
+
+[20047]: https://github.com/apache/datafusion/pull/20047
+
+### File statistics cache is now memory-limited and managed by the
`CacheManager`
+
+The file statistics cache used by `ListingTable` is now memory-limited and
+centrally managed through the `CacheManager`.
+
+To configure the cache size use the `file_statistics_cache_limit` setting:
+
+```sql
+SET datafusion.runtime.file_statistics_cache_limit = '10MB'
Review Comment:
```suggestion
SET datafusion.runtime.file_statistics_cache_limit = '10M'
```
##
docs/source/library-user-guide/upgrading/54.0.0.md:
##
@@ -460,3 +460,42 @@ impl Default for MyTreeNode {
}
}
```
+
+[20047]: https://github.com/apache/datafusion/pull/20047
+
+### File statistics cache is now memory-limited and managed by the
`CacheManager`
+
+The file statistics cache used by `ListingTable` is now memory-limited and
+centrally managed through the `CacheManager`.
+
+To configure the cache size use the `file_statistics_cache_limit` setting:
+
+```sql
+SET datafusion.runtime.file_statistics_cache_limit = '10MB'
+```
+
+To disable the file statistics cache, set the limit to 0.
+
+The file statistics cache is no longer created inside the `ListingTable`.
+Instead, it is created within the `CacheManager` and must be passed to the
`ListingTable`.
+
+**Who is affected:**
+
+- Users who want to limit the memory usage of the file statistics cache.
+- Users who want to disable the file statistics cache.
+- Users who want to create a `ListingTable` programmatically with a file
statistics cache.
+
+**Migration guide:**
+
+Disable the cache by setting the configuration value to 0:
+
+```sql
+SET datafusion.runtime.file_statistics_cache_limit = '0k'
Review Comment:
```suggestion
SET datafusion.runtime.file_statistics_cache_limit = '0K'
```
##
datafusion/common/src/heap_size.rs:
##
@@ -0,0 +1,561 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::stats::Precision;
+use crate::{ColumnStatistics, ScalarValue, Statistics, TableReference};
+use arrow::array::{
+Array, FixedSizeListArray, LargeListArray, LargeListViewArray, ListArray,
+ListViewArray, MapArray, StructArray,
+};
+use arrow::datatypes::{
+DataType, Field, Fields, IntervalDayTime, IntervalMonthDayNano,
IntervalUnit,
+TimeUnit, UnionFields, UnionMode, i256,
+};
+use chrono::{DateTime, Utc};
+use half::f16;
+use hashbrown::HashSet;
+use std::collections::HashMap;
+use std::fmt::Debug;
+use std::sync::Arc;
+
+/// This is a temporary solution until
and
+/// are resolved.
+/// Trait for calculating the size of various containers
+pub trait DFHeapSize {
+/// Return the size of any bytes allocated on the heap by this object,
+/// including heap memory in those structures
+///
+/// Note that the size of the type itself is not included in the result --
+/// instead, that size is added by the caller (e.g. container).
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize;
+}
+
+#[derive(Default)]
+pub struct DFHeapSizeCtx {
+seen: HashSet,
+}
+
+impl DFHeapSize for Statistics {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.num_rows.heap_size(ctx)
++ self.total_byte_size.heap_size(ctx)
++ self.column_statistics.heap_size(ctx)
+}
+}
+
+impl DFHeapSize for TableReference {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+match self {
+TableReference::Bare { table } => table.heap_size(ctx),
+TableReference::Partial { schema, table } => {
+schema.heap_size(ctx) + table.heap_size(ctx)
+}
+TableReference::Full {
+catalog,
+schema,
+table,
+} => catalog.heap_size(ctx) + schema.heap_size(ctx) +
table.heap_size(ctx),
+}
+
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
github-actions[bot] commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4351882236 Thank you for opening this pull request! Reviewer note: [cargo-semver-checks](https://github.com/obi1kenobi/cargo-semver-checks) reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch). Details ``` Cloning origin/main Building datafusion v53.1.0 (current) Built [ 82.893s] (current) Parsing datafusion v53.1.0 (current) Parsed [ 0.033s] (current) Building datafusion v53.1.0 (baseline) Built [ 81.011s] (baseline) Parsing datafusion v53.1.0 (baseline) Parsed [ 0.033s] (baseline) Checking datafusion v53.1.0 -> v53.1.0 (no change; assume patch) Checked [ 0.750s] 222 checks: 222 pass, 30 skip Summary no semver update required Finished [ 167.572s] datafusion Building datafusion-catalog-listing v53.1.0 (current) Built [ 41.429s] (current) Parsing datafusion-catalog-listing v53.1.0 (current) Parsed [ 0.011s] (current) Building datafusion-catalog-listing v53.1.0 (baseline) Built [ 41.688s] (baseline) Parsing datafusion-catalog-listing v53.1.0 (baseline) Parsed [ 0.011s] (baseline) Checking datafusion-catalog-listing v53.1.0 -> v53.1.0 (no change; assume patch) Checked [ 0.103s] 222 checks: 222 pass, 30 skip Summary no semver update required Finished [ 85.122s] datafusion-catalog-listing Building datafusion-cli v53.1.0 (current) Built [ 138.721s] (current) Parsing datafusion-cli v53.1.0 (current) Parsed [ 0.031s] (current) Building datafusion-cli v53.1.0 (baseline) Built [ 137.963s] (baseline) Parsing datafusion-cli v53.1.0 (baseline) Parsed [ 0.031s] (baseline) Checking datafusion-cli v53.1.0 -> v53.1.0 (no change; assume patch) Checked [ 0.145s] 222 checks: 222 pass, 30 skip Summary no semver update required Finished [ 280.601s] datafusion-cli Building datafusion-common v53.1.0 (current) error: running cargo-doc on crate 'datafusion-common' failed with output: - Compiling proc-macro2 v1.0.106 Compiling unicode-ident v1.0.24 Compiling quote v1.0.45 Compiling libc v0.2.186 Compiling libm v0.2.16 Compiling autocfg v1.5.0 Checking cfg-if v1.0.4 Compiling num-traits v0.2.19 Checking memchr v2.8.0 Compiling syn v2.0.117 Compiling shlex v1.3.0 Compiling find-msvc-tools v0.1.9 Compiling serde_core v1.0.228 Compiling zerocopy v0.8.48 Checking itoa v1.0.18 Checking bytes v1.11.1 Compiling zmij v1.0.21 Compiling jobserver v0.1.34 Compiling serde_json v1.0.149 Compiling cc v1.2.61 Checking num-integer v0.1.46 Checking siphasher v1.0.2 Compiling getrandom v0.3.4 Compiling version_check v0.9.5 Checking iana-time-zone v0.1.65 Checking stable_deref_trait v1.2.1 Checking chrono v0.4.44 Compiling ahash v0.8.12 Checking phf_shared v0.12.1 Checking num-bigint v0.4.6 Compiling chrono-tz v0.10.4 Checking phf v0.12.1 Checking once_cell v1.21.4 Compiling synstructure v0.13.2 Checking arrow-schema v58.1.0 Checking num-complex v0.4.6 Checking hashbrown v0.16.1 Checking writeable v0.6.3 Checking litemap v0.8.2 Checking lexical-util v1.0.7 Compiling pkg-config v0.3.33 Compiling icu_properties_data v2.2.0 Compiling icu_normalizer_data v2.2.0 Compiling zerocopy-derive v0.8.48 Compiling zerofrom-derive v0.1.7 Compiling yoke-derive v0.8.2 Compiling zerovec-derive v0.11.3 Checking zerofrom v0.1.7 Checking yoke v0.8.2 Compiling displaydoc v0.2.5 Checking zerotrie v0.2.4 Compiling zstd-sys v2.0.16+zstd.1.5.7 Checking zerovec v0.11.6 Checking utf8_iter v1.0.4 Compiling object v0.37.3 Checking smallvec v1.15.1 Checking tinystr v0.8.3 Checking potential_utf v0.1.5 Checking icu_collections v2.2.0 Checking icu_locale_core v2.2.0 Checking icu_provider v2.2.0 Compiling semver v1.0.28 Compiling rustc_version v0.4.1 Checking lexical-write-integer v1.0.6 Checking lexical-parse-integer v1.0.6 Compiling zstd-safe v7.2.4 Checking lexical-parse-float v1.0.6 Checking lexical-write-float v1.0.6 Checking icu_normalizer v2.2.0 Checking half v2.7.1 Checking arrow-buffer v58.1.0 Checking icu_properties v2.2.0 Checking arrow-data v58.1.0 Checking arrow-array v58.1.0 Compiling flatbuffers v25.12.19 Checking aho-cora
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4351783039 > hi @mkleen Can you merge with latest `main` to get #21913 which would fix the [CI error](https://github.com/apache/datafusion/actions/runs/25147830630/job/73712482487?pr=20047) Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
kosiew commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4351739294 hi @mkleen Can you merge with latest `main` to get #21913 which would fix the [CI error](https://github.com/apache/datafusion/actions/runs/25147830630/job/73712482487?pr=20047) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4342271621 @kosiew Thanks a lot, for all these excellent review, this was really a fantastic job. I made one more iteration adressing the last outstanding issue and I hope this pr is now ready-to-go. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3159660521
##
datafusion/execution/src/cache/file_statistics_cache.rs:
##
@@ -0,0 +1,745 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::cache::cache_manager::{
+CachedFileMetadata, FileStatisticsCache, FileStatisticsCacheEntry,
+};
+use crate::cache::{CacheAccessor, TableScopedPath};
+use std::collections::HashMap;
+use std::sync::Mutex;
+
+pub use crate::cache::DefaultFilesMetadataCache;
+use crate::cache::lru_queue::LruQueue;
+use datafusion_common::TableReference;
+use datafusion_common::heap_size::{DFHeapSize, DFHeapSizeCtx};
+
+/// Default implementation of [`FileStatisticsCache`]
+///
+/// Stores cached file metadata (statistics and orderings) for files.
+///
+/// The typical usage pattern is:
+/// 1. Call `get(path)` to check for cached value
+/// 2. If `Some(cached)`, validate with `cached.is_valid_for(Β€t_meta)`
+/// 3. If invalid or missing, compute new value and call `put(path, new_value)`
+///
+/// # Internal details
+///
+/// The `memory_limit` controls the maximum size of the cache, which uses a
+/// Least Recently Used eviction algorithm. When adding a new entry, if the
total
+/// size of the cached entries exceeds `memory_limit`, the least recently used
entries
+/// are evicted until the total size is lower than `memory_limit`.
+///
+///
+/// [`FileStatisticsCache`]: crate::cache::cache_manager::FileStatisticsCache
+#[derive(Default)]
+pub struct DefaultFileStatisticsCache {
+state: Mutex,
+}
+
+impl DefaultFileStatisticsCache {
+pub fn new(memory_limit: usize) -> Self {
+Self {
+state:
Mutex::new(DefaultFileStatisticsCacheState::new(memory_limit)),
+}
+}
+
+/// Returns the size of the cached memory, in bytes.
+pub fn memory_used(&self) -> usize {
+let state = self.state.lock().unwrap();
+state.memory_used
+}
+}
+
+struct DefaultFileStatisticsCacheState {
+lru_queue: LruQueue,
+memory_limit: usize,
+memory_used: usize,
+}
+
+pub const DEFAULT_FILE_STATISTICS_MEMORY_LIMIT: usize = 20 * 1024 * 1024; //
20MiB
+
+impl Default for DefaultFileStatisticsCacheState {
+fn default() -> Self {
+Self {
+lru_queue: LruQueue::new(),
+memory_limit: DEFAULT_FILE_STATISTICS_MEMORY_LIMIT,
+memory_used: 0,
+}
+}
+}
+
+impl DefaultFileStatisticsCacheState {
+fn new(memory_limit: usize) -> Self {
+Self {
+lru_queue: LruQueue::new(),
+memory_limit,
+memory_used: 0,
+}
+}
+fn get(&mut self, key: &TableScopedPath) -> Option {
+self.lru_queue.get(key).cloned()
+}
+
+fn put(
+&mut self,
+key: &TableScopedPath,
+value: CachedFileMetadata,
+) -> Option {
+let mut ctx = DFHeapSizeCtx::default();
Review Comment:
Thank you, i made one more iteration to streamline the memory accounting for
inserting.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3159664174
##
datafusion/common/src/heap_size.rs:
##
@@ -0,0 +1,551 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::stats::Precision;
+use crate::{ColumnStatistics, ScalarValue, Statistics, TableReference};
+use arrow::array::{
+Array, FixedSizeListArray, LargeListArray, LargeListViewArray, ListArray,
+ListViewArray, MapArray, StructArray,
+};
+use arrow::datatypes::{
+DataType, Field, Fields, IntervalDayTime, IntervalMonthDayNano,
IntervalUnit,
+TimeUnit, UnionFields, UnionMode, i256,
+};
+use chrono::{DateTime, Utc};
+use half::f16;
+use hashbrown::HashSet;
+use std::collections::HashMap;
+use std::fmt::Debug;
+use std::sync::Arc;
+
+/// This is a temporary solution until
and
+/// are resolved.
+/// Trait for calculating the size of various containers
+pub trait DFHeapSize {
+/// Return the size of any bytes allocated on the heap by this object,
+/// including heap memory in those structures
+///
+/// Note that the size of the type itself is not included in the result --
+/// instead, that size is added by the caller (e.g. container).
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize;
+}
+
+#[derive(Default)]
+pub struct DFHeapSizeCtx {
+seen: HashSet,
+}
+
+impl DFHeapSize for Statistics {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.num_rows.heap_size(ctx)
++ self.total_byte_size.heap_size(ctx)
++ self.column_statistics.heap_size(ctx)
+}
+}
+
+impl DFHeapSize for TableReference {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.table().heap_size(ctx)
Review Comment:
This was a good catch, i improved the heapsize estimation. It was actually
really off.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
kosiew commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4341379343 I ran the recommended `clickbench_partitioned.json` benchmark comparing HEAD with `file-stats-cache`. The total runtime was effectively unchanged: `20045.25ms` on HEAD vs `20005.27ms` with the PR. Most queries showed no change, with 4 reported faster and 2 reported slower. The two slower queries were very small absolute regressions on short-running queries and had relatively high variance, so I donβt think they indicate a meaningful regression. Based on this, I donβt see evidence that further performance investigation is needed before merging. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
adriangbot commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4340870470 π€ Benchmark completed (GKE) | [trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4340783583) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Details ``` Comparing HEAD and file-stats-cache Benchmark clickbench_partitioned.json βββββ³ββββ³ββββ³ββββ β Query β HEAD β file-stats-cache βChange β β‘ββββββββββββββββ© β QQuery 0 β 1.20 / 4.59 Β±6.72 / 18.03 ms β 1.25 / 4.74 Β±6.87 / 18.47 ms β no change β β QQuery 1 β12.57 / 12.84 Β±0.18 / 13.11 ms β12.06 / 12.46 Β±0.20 / 12.60 ms β no change β β QQuery 2 β36.40 / 36.87 Β±0.32 / 37.33 ms β36.41 / 36.58 Β±0.26 / 37.09 ms β no change β β QQuery 3 β31.68 / 32.60 Β±0.73 / 33.90 ms β30.88 / 31.57 Β±0.49 / 32.32 ms β no change β β QQuery 4 β 240.46 / 244.16 Β±3.00 / 248.49 ms β 241.65 / 243.43 Β±1.70 / 246.27 ms β no change β β QQuery 5 β 282.44 / 284.79 Β±2.09 / 287.37 ms β 281.03 / 283.28 Β±1.21 / 284.50 ms β no change β β QQuery 6 β 7.14 / 7.38 Β±0.18 / 7.67 ms β 6.84 / 7.15 Β±0.31 / 7.72 ms β no change β β QQuery 7 β13.73 / 13.83 Β±0.09 / 13.97 ms β13.53 / 15.53 Β±3.92 / 23.37 ms β 1.12x slower β β QQuery 8 β 328.30 / 330.07 Β±1.61 / 332.38 ms β 328.45 / 332.13 Β±2.78 / 335.14 ms β no change β β QQuery 9 β 443.05 / 450.06 Β±8.21 / 465.68 ms β430.37 / 447.80 Β±10.49 / 462.69 ms β no change β β QQuery 10 β74.68 / 76.06 Β±1.08 / 77.86 ms β72.95 / 73.83 Β±0.83 / 75.18 ms β no change β β QQuery 11 β85.29 / 87.00 Β±1.64 / 89.50 ms β83.83 / 84.81 Β±0.85 / 86.37 ms β no change β β QQuery 12 β 272.49 / 276.98 Β±4.74 / 283.90 ms β 274.31 / 277.15 Β±2.31 / 280.87 ms β no change β β QQuery 13 β 394.14 / 402.44 Β±5.78 / 409.35 ms β 396.80 / 404.30 Β±6.96 / 415.39 ms β no change β β QQuery 14 β 284.33 / 291.75 Β±7.44 / 304.65 ms β 285.16 / 286.64 Β±1.49 / 289.31 ms β no change β β QQuery 15 β
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
kosiew commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3158625956
##
datafusion/execution/src/cache/file_statistics_cache.rs:
##
@@ -0,0 +1,745 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::cache::cache_manager::{
+CachedFileMetadata, FileStatisticsCache, FileStatisticsCacheEntry,
+};
+use crate::cache::{CacheAccessor, TableScopedPath};
+use std::collections::HashMap;
+use std::sync::Mutex;
+
+pub use crate::cache::DefaultFilesMetadataCache;
+use crate::cache::lru_queue::LruQueue;
+use datafusion_common::TableReference;
+use datafusion_common::heap_size::{DFHeapSize, DFHeapSizeCtx};
+
+/// Default implementation of [`FileStatisticsCache`]
+///
+/// Stores cached file metadata (statistics and orderings) for files.
+///
+/// The typical usage pattern is:
+/// 1. Call `get(path)` to check for cached value
+/// 2. If `Some(cached)`, validate with `cached.is_valid_for(Β€t_meta)`
+/// 3. If invalid or missing, compute new value and call `put(path, new_value)`
+///
+/// # Internal details
+///
+/// The `memory_limit` controls the maximum size of the cache, which uses a
+/// Least Recently Used eviction algorithm. When adding a new entry, if the
total
+/// size of the cached entries exceeds `memory_limit`, the least recently used
entries
+/// are evicted until the total size is lower than `memory_limit`.
+///
+///
+/// [`FileStatisticsCache`]: crate::cache::cache_manager::FileStatisticsCache
+#[derive(Default)]
+pub struct DefaultFileStatisticsCache {
+state: Mutex,
+}
+
+impl DefaultFileStatisticsCache {
+pub fn new(memory_limit: usize) -> Self {
+Self {
+state:
Mutex::new(DefaultFileStatisticsCacheState::new(memory_limit)),
+}
+}
+
+/// Returns the size of the cached memory, in bytes.
+pub fn memory_used(&self) -> usize {
+let state = self.state.lock().unwrap();
+state.memory_used
+}
+}
+
+struct DefaultFileStatisticsCacheState {
+lru_queue: LruQueue,
+memory_limit: usize,
+memory_used: usize,
+}
+
+pub const DEFAULT_FILE_STATISTICS_MEMORY_LIMIT: usize = 20 * 1024 * 1024; //
20MiB
+
+impl Default for DefaultFileStatisticsCacheState {
+fn default() -> Self {
+Self {
+lru_queue: LruQueue::new(),
+memory_limit: DEFAULT_FILE_STATISTICS_MEMORY_LIMIT,
+memory_used: 0,
+}
+}
+}
+
+impl DefaultFileStatisticsCacheState {
+fn new(memory_limit: usize) -> Self {
+Self {
+lru_queue: LruQueue::new(),
+memory_limit,
+memory_used: 0,
+}
+}
+fn get(&mut self, key: &TableScopedPath) -> Option {
+self.lru_queue.get(key).cloned()
+}
+
+fn put(
+&mut self,
+key: &TableScopedPath,
+value: CachedFileMetadata,
+) -> Option {
+let mut ctx = DFHeapSizeCtx::default();
Review Comment:
This logic is correct today, but it feels a bit fragile.
Right now it works because `TableScopedPath::heap_size` does not interact
with the context, so recomputing `key.heap_size` with a fresh `DFHeapSizeCtx`
gives the same result.
However, if the key ever includes something like an `Arc` that *does*
register in the context, the second computation could return 0 and silently
skew `memory_used`.
It might be safer to reuse the already computed `key_size` instead of
recomputing it with a different context. That would make the intent clearer and
avoid future footguns.
##
datafusion/execution/src/cache/cache_manager.rs:
##
Review Comment:
This comment says "Default is disabled", but in practice the cache is
created when `list_files_cache_limit > 0` (which it is by default).
So effectively this is enabled by default, similar to the earlier issue we
discussed. Might be worth tweaking the comment to reflect the actual behavior
so it does not surprise future readers.
##
datafusion/common/src/heap_size.rs:
##
@@ -0,0 +1,551 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for a
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
kosiew commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3158582525
##
datafusion/execution/src/cache/cache_manager.rs:
##
@@ -410,8 +454,10 @@ pub const DEFAULT_METADATA_CACHE_LIMIT: usize = 50 * 1024
* 1024; // 50M
pub struct CacheManagerConfig {
/// Enable caching of file statistics when listing files.
/// Enabling the cache avoids repeatedly reading file statistics in a
DataFusion session.
-/// Default is disabled. Currently only Parquet files are supported.
-pub table_files_statistics_cache: Option>,
+/// Default is enabled with 1MiB. Currently only Parquet files are
supported.
Review Comment:
Small doc nit here. The comment is a bit out of sync with the current
behavior.
First, "Default is enabled" is misleading. The field itself defaults to
None, and the cache only gets created in CacheManager::try_new when
file_statistics_cache_limit > 0.
Second, the "1MiB" value is stale. The default limit was bumped to 20 MiB,
and the nearby docs already reflect that.
Could you update this comment to match the current behavior and default? It
will save future readers a bit of head scratching π
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
adriangbot commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4340794653 π€ Benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4340783583) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-c4340783583-1903-vbrzn 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux` CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Comparing file-stats-cache (6808feffadc88aa18be810e48ee32a08ee865591) to 54a5515 (merge-base) [diff](https://github.com/apache/datafusion/compare/54a551583bb13356c42e3c1d9e11b63079d2e40b..6808feffadc88aa18be810e48ee32a08ee865591) using: clickbench_partitioned Results will be posted here when complete --- [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) against this benchmark runner -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
kosiew commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4340783583 run benchmark clickbench_partitioned -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3146316357
##
datafusion/common/src/heap_size.rs:
##
@@ -0,0 +1,476 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::stats::Precision;
+use crate::{ColumnStatistics, ScalarValue, Statistics, TableReference};
+use arrow::array::{
+Array, FixedSizeListArray, LargeListArray, LargeListViewArray, ListArray,
+ListViewArray, MapArray, StructArray,
+};
+use arrow::datatypes::{
+DataType, Field, Fields, IntervalDayTime, IntervalMonthDayNano,
IntervalUnit,
+TimeUnit, UnionFields, UnionMode, i256,
+};
+use chrono::{DateTime, Utc};
+use half::f16;
+use std::collections::HashMap;
+use std::fmt::Debug;
+use std::sync::Arc;
+
+/// This is a temporary solution until
and
+/// are resolved.
+/// Trait for calculating the size of various containers
+pub trait DFHeapSize {
+/// Return the size of any bytes allocated on the heap by this object,
+/// including heap memory in those structures
+///
+/// Note that the size of the type itself is not included in the result --
+/// instead, that size is added by the caller (e.g. container).
+fn heap_size(&self) -> usize;
+}
+
+impl DFHeapSize for Statistics {
+fn heap_size(&self) -> usize {
+self.num_rows.heap_size()
++ self.total_byte_size.heap_size()
++ self.column_statistics.heap_size()
+}
+}
+
+impl DFHeapSize for TableReference {
+fn heap_size(&self) -> usize {
+self.table().heap_size()
+}
+}
+
+impl DFHeapSize
+for Precision
+{
+fn heap_size(&self) -> usize {
+self.get_value().map_or_else(|| 0, |v| v.heap_size())
+}
+}
+
+impl DFHeapSize for ColumnStatistics {
+fn heap_size(&self) -> usize {
+self.null_count.heap_size()
++ self.max_value.heap_size()
++ self.min_value.heap_size()
++ self.sum_value.heap_size()
++ self.distinct_count.heap_size()
++ self.byte_size.heap_size()
+}
+}
+
+impl DFHeapSize for ScalarValue {
+fn heap_size(&self) -> usize {
+use crate::scalar::ScalarValue::*;
+match self {
+Null => 0,
+Boolean(b) => b.heap_size(),
+Float16(f) => f.heap_size(),
+Float32(f) => f.heap_size(),
+Float64(f) => f.heap_size(),
+Decimal32(a, b, c) => a.heap_size() + b.heap_size() +
c.heap_size(),
+Decimal64(a, b, c) => a.heap_size() + b.heap_size() +
c.heap_size(),
+Decimal128(a, b, c) => a.heap_size() + b.heap_size() +
c.heap_size(),
+Decimal256(a, b, c) => a.heap_size() + b.heap_size() +
c.heap_size(),
+Int8(i) => i.heap_size(),
+Int16(i) => i.heap_size(),
+Int32(i) => i.heap_size(),
+Int64(i) => i.heap_size(),
+UInt8(u) => u.heap_size(),
+UInt16(u) => u.heap_size(),
+UInt32(u) => u.heap_size(),
+UInt64(u) => u.heap_size(),
+Utf8(u) => u.heap_size(),
+Utf8View(u) => u.heap_size(),
+LargeUtf8(l) => l.heap_size(),
+Binary(b) => b.heap_size(),
+BinaryView(b) => b.heap_size(),
+FixedSizeBinary(a, b) => a.heap_size() + b.heap_size(),
+LargeBinary(l) => l.heap_size(),
+FixedSizeList(f) => f.heap_size(),
+List(l) => l.heap_size(),
+LargeList(l) => l.heap_size(),
+Struct(s) => s.heap_size(),
+Map(m) => m.heap_size(),
+Date32(d) => d.heap_size(),
+Date64(d) => d.heap_size(),
+Time32Second(t) => t.heap_size(),
+Time32Millisecond(t) => t.heap_size(),
+Time64Microsecond(t) => t.heap_size(),
+Time64Nanosecond(t) => t.heap_size(),
+TimestampSecond(a, b) => a.heap_size() + b.heap_size(),
+TimestampMillisecond(a, b) => a.heap_size() + b.heap_size(),
+TimestampMicrosecond(a, b) => a.heap_size() + b.heap_size(),
+TimestampNanosecond(a, b)
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3146316357
##
datafusion/common/src/heap_size.rs:
##
@@ -0,0 +1,476 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::stats::Precision;
+use crate::{ColumnStatistics, ScalarValue, Statistics, TableReference};
+use arrow::array::{
+Array, FixedSizeListArray, LargeListArray, LargeListViewArray, ListArray,
+ListViewArray, MapArray, StructArray,
+};
+use arrow::datatypes::{
+DataType, Field, Fields, IntervalDayTime, IntervalMonthDayNano,
IntervalUnit,
+TimeUnit, UnionFields, UnionMode, i256,
+};
+use chrono::{DateTime, Utc};
+use half::f16;
+use std::collections::HashMap;
+use std::fmt::Debug;
+use std::sync::Arc;
+
+/// This is a temporary solution until
and
+/// are resolved.
+/// Trait for calculating the size of various containers
+pub trait DFHeapSize {
+/// Return the size of any bytes allocated on the heap by this object,
+/// including heap memory in those structures
+///
+/// Note that the size of the type itself is not included in the result --
+/// instead, that size is added by the caller (e.g. container).
+fn heap_size(&self) -> usize;
+}
+
+impl DFHeapSize for Statistics {
+fn heap_size(&self) -> usize {
+self.num_rows.heap_size()
++ self.total_byte_size.heap_size()
++ self.column_statistics.heap_size()
+}
+}
+
+impl DFHeapSize for TableReference {
+fn heap_size(&self) -> usize {
+self.table().heap_size()
+}
+}
+
+impl DFHeapSize
+for Precision
+{
+fn heap_size(&self) -> usize {
+self.get_value().map_or_else(|| 0, |v| v.heap_size())
+}
+}
+
+impl DFHeapSize for ColumnStatistics {
+fn heap_size(&self) -> usize {
+self.null_count.heap_size()
++ self.max_value.heap_size()
++ self.min_value.heap_size()
++ self.sum_value.heap_size()
++ self.distinct_count.heap_size()
++ self.byte_size.heap_size()
+}
+}
+
+impl DFHeapSize for ScalarValue {
+fn heap_size(&self) -> usize {
+use crate::scalar::ScalarValue::*;
+match self {
+Null => 0,
+Boolean(b) => b.heap_size(),
+Float16(f) => f.heap_size(),
+Float32(f) => f.heap_size(),
+Float64(f) => f.heap_size(),
+Decimal32(a, b, c) => a.heap_size() + b.heap_size() +
c.heap_size(),
+Decimal64(a, b, c) => a.heap_size() + b.heap_size() +
c.heap_size(),
+Decimal128(a, b, c) => a.heap_size() + b.heap_size() +
c.heap_size(),
+Decimal256(a, b, c) => a.heap_size() + b.heap_size() +
c.heap_size(),
+Int8(i) => i.heap_size(),
+Int16(i) => i.heap_size(),
+Int32(i) => i.heap_size(),
+Int64(i) => i.heap_size(),
+UInt8(u) => u.heap_size(),
+UInt16(u) => u.heap_size(),
+UInt32(u) => u.heap_size(),
+UInt64(u) => u.heap_size(),
+Utf8(u) => u.heap_size(),
+Utf8View(u) => u.heap_size(),
+LargeUtf8(l) => l.heap_size(),
+Binary(b) => b.heap_size(),
+BinaryView(b) => b.heap_size(),
+FixedSizeBinary(a, b) => a.heap_size() + b.heap_size(),
+LargeBinary(l) => l.heap_size(),
+FixedSizeList(f) => f.heap_size(),
+List(l) => l.heap_size(),
+LargeList(l) => l.heap_size(),
+Struct(s) => s.heap_size(),
+Map(m) => m.heap_size(),
+Date32(d) => d.heap_size(),
+Date64(d) => d.heap_size(),
+Time32Second(t) => t.heap_size(),
+Time32Millisecond(t) => t.heap_size(),
+Time64Microsecond(t) => t.heap_size(),
+Time64Nanosecond(t) => t.heap_size(),
+TimestampSecond(a, b) => a.heap_size() + b.heap_size(),
+TimestampMillisecond(a, b) => a.heap_size() + b.heap_size(),
+TimestampMicrosecond(a, b) => a.heap_size() + b.heap_size(),
+TimestampNanosecond(a, b)
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3146316357
##
datafusion/common/src/heap_size.rs:
##
@@ -0,0 +1,476 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::stats::Precision;
+use crate::{ColumnStatistics, ScalarValue, Statistics, TableReference};
+use arrow::array::{
+Array, FixedSizeListArray, LargeListArray, LargeListViewArray, ListArray,
+ListViewArray, MapArray, StructArray,
+};
+use arrow::datatypes::{
+DataType, Field, Fields, IntervalDayTime, IntervalMonthDayNano,
IntervalUnit,
+TimeUnit, UnionFields, UnionMode, i256,
+};
+use chrono::{DateTime, Utc};
+use half::f16;
+use std::collections::HashMap;
+use std::fmt::Debug;
+use std::sync::Arc;
+
+/// This is a temporary solution until
and
+/// are resolved.
+/// Trait for calculating the size of various containers
+pub trait DFHeapSize {
+/// Return the size of any bytes allocated on the heap by this object,
+/// including heap memory in those structures
+///
+/// Note that the size of the type itself is not included in the result --
+/// instead, that size is added by the caller (e.g. container).
+fn heap_size(&self) -> usize;
+}
+
+impl DFHeapSize for Statistics {
+fn heap_size(&self) -> usize {
+self.num_rows.heap_size()
++ self.total_byte_size.heap_size()
++ self.column_statistics.heap_size()
+}
+}
+
+impl DFHeapSize for TableReference {
+fn heap_size(&self) -> usize {
+self.table().heap_size()
+}
+}
+
+impl DFHeapSize
+for Precision
+{
+fn heap_size(&self) -> usize {
+self.get_value().map_or_else(|| 0, |v| v.heap_size())
+}
+}
+
+impl DFHeapSize for ColumnStatistics {
+fn heap_size(&self) -> usize {
+self.null_count.heap_size()
++ self.max_value.heap_size()
++ self.min_value.heap_size()
++ self.sum_value.heap_size()
++ self.distinct_count.heap_size()
++ self.byte_size.heap_size()
+}
+}
+
+impl DFHeapSize for ScalarValue {
+fn heap_size(&self) -> usize {
+use crate::scalar::ScalarValue::*;
+match self {
+Null => 0,
+Boolean(b) => b.heap_size(),
+Float16(f) => f.heap_size(),
+Float32(f) => f.heap_size(),
+Float64(f) => f.heap_size(),
+Decimal32(a, b, c) => a.heap_size() + b.heap_size() +
c.heap_size(),
+Decimal64(a, b, c) => a.heap_size() + b.heap_size() +
c.heap_size(),
+Decimal128(a, b, c) => a.heap_size() + b.heap_size() +
c.heap_size(),
+Decimal256(a, b, c) => a.heap_size() + b.heap_size() +
c.heap_size(),
+Int8(i) => i.heap_size(),
+Int16(i) => i.heap_size(),
+Int32(i) => i.heap_size(),
+Int64(i) => i.heap_size(),
+UInt8(u) => u.heap_size(),
+UInt16(u) => u.heap_size(),
+UInt32(u) => u.heap_size(),
+UInt64(u) => u.heap_size(),
+Utf8(u) => u.heap_size(),
+Utf8View(u) => u.heap_size(),
+LargeUtf8(l) => l.heap_size(),
+Binary(b) => b.heap_size(),
+BinaryView(b) => b.heap_size(),
+FixedSizeBinary(a, b) => a.heap_size() + b.heap_size(),
+LargeBinary(l) => l.heap_size(),
+FixedSizeList(f) => f.heap_size(),
+List(l) => l.heap_size(),
+LargeList(l) => l.heap_size(),
+Struct(s) => s.heap_size(),
+Map(m) => m.heap_size(),
+Date32(d) => d.heap_size(),
+Date64(d) => d.heap_size(),
+Time32Second(t) => t.heap_size(),
+Time32Millisecond(t) => t.heap_size(),
+Time64Microsecond(t) => t.heap_size(),
+Time64Nanosecond(t) => t.heap_size(),
+TimestampSecond(a, b) => a.heap_size() + b.heap_size(),
+TimestampMillisecond(a, b) => a.heap_size() + b.heap_size(),
+TimestampMicrosecond(a, b) => a.heap_size() + b.heap_size(),
+TimestampNanosecond(a, b)
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3146323961
##
datafusion/execution/src/cache/cache_manager.rs:
##
@@ -330,8 +356,19 @@ pub struct CacheManager {
impl CacheManager {
pub fn try_new(config: &CacheManagerConfig) -> Result> {
-let file_statistic_cache =
-config.table_files_statistics_cache.as_ref().map(Arc::clone);
+let file_statistic_cache = match &config.file_statistics_cache {
+Some(fsc) if config.file_statistics_cache_limit > 0 => {
+fsc.update_cache_limit(config.file_statistics_cache_limit);
+Some(Arc::clone(fsc))
+}
+None if config.file_statistics_cache_limit > 0 => {
Review Comment:
The behaviour change is on purpose see
https://github.com/apache/datafusion/issues/19217 for more details. I
documented now the change in the upgrade guide.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3146316357
##
datafusion/common/src/heap_size.rs:
##
@@ -0,0 +1,476 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::stats::Precision;
+use crate::{ColumnStatistics, ScalarValue, Statistics, TableReference};
+use arrow::array::{
+Array, FixedSizeListArray, LargeListArray, LargeListViewArray, ListArray,
+ListViewArray, MapArray, StructArray,
+};
+use arrow::datatypes::{
+DataType, Field, Fields, IntervalDayTime, IntervalMonthDayNano,
IntervalUnit,
+TimeUnit, UnionFields, UnionMode, i256,
+};
+use chrono::{DateTime, Utc};
+use half::f16;
+use std::collections::HashMap;
+use std::fmt::Debug;
+use std::sync::Arc;
+
+/// This is a temporary solution until
and
+/// are resolved.
+/// Trait for calculating the size of various containers
+pub trait DFHeapSize {
+/// Return the size of any bytes allocated on the heap by this object,
+/// including heap memory in those structures
+///
+/// Note that the size of the type itself is not included in the result --
+/// instead, that size is added by the caller (e.g. container).
+fn heap_size(&self) -> usize;
+}
+
+impl DFHeapSize for Statistics {
+fn heap_size(&self) -> usize {
+self.num_rows.heap_size()
++ self.total_byte_size.heap_size()
++ self.column_statistics.heap_size()
+}
+}
+
+impl DFHeapSize for TableReference {
+fn heap_size(&self) -> usize {
+self.table().heap_size()
+}
+}
+
+impl DFHeapSize
+for Precision
+{
+fn heap_size(&self) -> usize {
+self.get_value().map_or_else(|| 0, |v| v.heap_size())
+}
+}
+
+impl DFHeapSize for ColumnStatistics {
+fn heap_size(&self) -> usize {
+self.null_count.heap_size()
++ self.max_value.heap_size()
++ self.min_value.heap_size()
++ self.sum_value.heap_size()
++ self.distinct_count.heap_size()
++ self.byte_size.heap_size()
+}
+}
+
+impl DFHeapSize for ScalarValue {
+fn heap_size(&self) -> usize {
+use crate::scalar::ScalarValue::*;
+match self {
+Null => 0,
+Boolean(b) => b.heap_size(),
+Float16(f) => f.heap_size(),
+Float32(f) => f.heap_size(),
+Float64(f) => f.heap_size(),
+Decimal32(a, b, c) => a.heap_size() + b.heap_size() +
c.heap_size(),
+Decimal64(a, b, c) => a.heap_size() + b.heap_size() +
c.heap_size(),
+Decimal128(a, b, c) => a.heap_size() + b.heap_size() +
c.heap_size(),
+Decimal256(a, b, c) => a.heap_size() + b.heap_size() +
c.heap_size(),
+Int8(i) => i.heap_size(),
+Int16(i) => i.heap_size(),
+Int32(i) => i.heap_size(),
+Int64(i) => i.heap_size(),
+UInt8(u) => u.heap_size(),
+UInt16(u) => u.heap_size(),
+UInt32(u) => u.heap_size(),
+UInt64(u) => u.heap_size(),
+Utf8(u) => u.heap_size(),
+Utf8View(u) => u.heap_size(),
+LargeUtf8(l) => l.heap_size(),
+Binary(b) => b.heap_size(),
+BinaryView(b) => b.heap_size(),
+FixedSizeBinary(a, b) => a.heap_size() + b.heap_size(),
+LargeBinary(l) => l.heap_size(),
+FixedSizeList(f) => f.heap_size(),
+List(l) => l.heap_size(),
+LargeList(l) => l.heap_size(),
+Struct(s) => s.heap_size(),
+Map(m) => m.heap_size(),
+Date32(d) => d.heap_size(),
+Date64(d) => d.heap_size(),
+Time32Second(t) => t.heap_size(),
+Time32Millisecond(t) => t.heap_size(),
+Time64Microsecond(t) => t.heap_size(),
+Time64Nanosecond(t) => t.heap_size(),
+TimestampSecond(a, b) => a.heap_size() + b.heap_size(),
+TimestampMillisecond(a, b) => a.heap_size() + b.heap_size(),
+TimestampMicrosecond(a, b) => a.heap_size() + b.heap_size(),
+TimestampNanosecond(a, b)
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
kosiew commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3135429326
##
datafusion/execution/src/cache/cache_manager.rs:
##
@@ -330,8 +356,19 @@ pub struct CacheManager {
impl CacheManager {
pub fn try_new(config: &CacheManagerConfig) -> Result> {
-let file_statistic_cache =
-config.table_files_statistics_cache.as_ref().map(Arc::clone);
+let file_statistic_cache = match &config.file_statistics_cache {
+Some(fsc) if config.file_statistics_cache_limit > 0 => {
+fsc.update_cache_limit(config.file_statistics_cache_limit);
+Some(Arc::clone(fsc))
+}
+None if config.file_statistics_cache_limit > 0 => {
Review Comment:
This change subtly alters the behavior of
`with_file_statistics_cache(None)`. It no longer disables the cache if
`file_statistics_cache_limit > 0`, because `try_new` will create a default
cache in that case.
Previously, passing `None` was enough to disable the cache. Now callers also
need to know to set the limit to 0, which is not obvious.
This feels like a silent behavior change and it also contradicts the current
builder documentation. It would be good to either restore the original behavior
or make the contract explicit in both code and docs.
##
datafusion/common/src/heap_size.rs:
##
@@ -0,0 +1,476 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::stats::Precision;
+use crate::{ColumnStatistics, ScalarValue, Statistics, TableReference};
+use arrow::array::{
+Array, FixedSizeListArray, LargeListArray, LargeListViewArray, ListArray,
+ListViewArray, MapArray, StructArray,
+};
+use arrow::datatypes::{
+DataType, Field, Fields, IntervalDayTime, IntervalMonthDayNano,
IntervalUnit,
+TimeUnit, UnionFields, UnionMode, i256,
+};
+use chrono::{DateTime, Utc};
+use half::f16;
+use std::collections::HashMap;
+use std::fmt::Debug;
+use std::sync::Arc;
+
+/// This is a temporary solution until
and
+/// are resolved.
+/// Trait for calculating the size of various containers
+pub trait DFHeapSize {
+/// Return the size of any bytes allocated on the heap by this object,
+/// including heap memory in those structures
+///
+/// Note that the size of the type itself is not included in the result --
+/// instead, that size is added by the caller (e.g. container).
+fn heap_size(&self) -> usize;
+}
+
+impl DFHeapSize for Statistics {
+fn heap_size(&self) -> usize {
+self.num_rows.heap_size()
++ self.total_byte_size.heap_size()
++ self.column_statistics.heap_size()
+}
+}
+
+impl DFHeapSize for TableReference {
+fn heap_size(&self) -> usize {
+self.table().heap_size()
+}
+}
+
+impl DFHeapSize
+for Precision
+{
+fn heap_size(&self) -> usize {
+self.get_value().map_or_else(|| 0, |v| v.heap_size())
+}
+}
+
+impl DFHeapSize for ColumnStatistics {
+fn heap_size(&self) -> usize {
+self.null_count.heap_size()
++ self.max_value.heap_size()
++ self.min_value.heap_size()
++ self.sum_value.heap_size()
++ self.distinct_count.heap_size()
++ self.byte_size.heap_size()
+}
+}
+
+impl DFHeapSize for ScalarValue {
+fn heap_size(&self) -> usize {
+use crate::scalar::ScalarValue::*;
+match self {
+Null => 0,
+Boolean(b) => b.heap_size(),
+Float16(f) => f.heap_size(),
+Float32(f) => f.heap_size(),
+Float64(f) => f.heap_size(),
+Decimal32(a, b, c) => a.heap_size() + b.heap_size() +
c.heap_size(),
+Decimal64(a, b, c) => a.heap_size() + b.heap_size() +
c.heap_size(),
+Decimal128(a, b, c) => a.heap_size() + b.heap_size() +
c.heap_size(),
+Decimal256(a, b, c) => a.heap_size() + b.heap_size() +
c.heap_size(),
+Int8(i) => i.heap_size(),
+Int16(i) => i.heap_size(),
+Int32(i) => i.heap_size(),
+Int64(i) => i.he
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r312248
##
datafusion/execution/src/cache/file_statistics_cache.rs:
##
@@ -0,0 +1,739 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::cache::cache_manager::{
+CachedFileMetadata, FileStatisticsCache, FileStatisticsCacheEntry,
+};
+use crate::cache::{CacheAccessor, TableScopedPath};
+use object_store::path::Path;
+use std::collections::HashMap;
+use std::sync::Mutex;
+
+pub use crate::cache::DefaultFilesMetadataCache;
+use crate::cache::lru_queue::LruQueue;
+use datafusion_common::TableReference;
+use datafusion_common::heap_size::DFHeapSize;
+
+/// Default implementation of [`FileStatisticsCache`]
+///
+/// Stores cached file metadata (statistics and orderings) for files.
+///
+/// The typical usage pattern is:
+/// 1. Call `get(path)` to check for cached value
+/// 2. If `Some(cached)`, validate with `cached.is_valid_for(Β€t_meta)`
+/// 3. If invalid or missing, compute new value and call `put(path, new_value)`
+///
+/// # Internal details
+///
+/// The `memory_limit` controls the maximum size of the cache, which uses a
+/// Least Recently Used eviction algorithm. When adding a new entry, if the
total
+/// size of the cached entries exceeds `memory_limit`, the least recently used
entries
+/// are evicted until the total size is lower than `memory_limit`.
+///
+///
+/// [`FileStatisticsCache`]: crate::cache::cache_manager::FileStatisticsCache
+#[derive(Default)]
+pub struct DefaultFileStatisticsCache {
+state: Mutex,
+}
+
+impl DefaultFileStatisticsCache {
+pub fn new(memory_limit: usize) -> Self {
+Self {
+state:
Mutex::new(DefaultFileStatisticsCacheState::new(memory_limit)),
+}
+}
+
+/// Returns the size of the cached memory, in bytes.
+pub fn memory_used(&self) -> usize {
+let state = self.state.lock().unwrap();
+state.memory_used
+}
+}
+
+struct DefaultFileStatisticsCacheState {
+lru_queue: LruQueue,
+memory_limit: usize,
+memory_used: usize,
+}
+
+pub const DEFAULT_FILE_STATISTICS_MEMORY_LIMIT: usize = 20 * 1024 * 1024; //
20MiB
+
+impl Default for DefaultFileStatisticsCacheState {
+fn default() -> Self {
+Self {
+lru_queue: LruQueue::new(),
+memory_limit: DEFAULT_FILE_STATISTICS_MEMORY_LIMIT,
+memory_used: 0,
+}
+}
+}
+
+impl DefaultFileStatisticsCacheState {
+fn new(memory_limit: usize) -> Self {
+Self {
+lru_queue: LruQueue::new(),
+memory_limit,
+memory_used: 0,
+}
+}
+fn get(&mut self, key: &TableScopedPath) -> Option {
+self.lru_queue.get(key).cloned()
+}
+
+fn put(
+&mut self,
+key: &TableScopedPath,
+value: CachedFileMetadata,
+) -> Option {
+let key_size = key.path.as_ref().heap_size();
+let entry_size = value.heap_size();
+
+if entry_size + key_size > self.memory_limit {
+// Remove potential stale entry
+self.remove(key);
+return None;
+}
+
+let old_value = self.lru_queue.put(key.clone(), value);
+self.memory_used += entry_size;
+self.memory_used += key.path.as_ref().heap_size();
+
+if let Some(old_entry) = &old_value {
+self.memory_used -= old_entry.heap_size();
+self.memory_used -= key.path.as_ref().heap_size();
+}
+
+self.evict_entries();
+
+old_value
+}
+
+fn remove(&mut self, k: &TableScopedPath) -> Option {
+if let Some(old_entry) = self.lru_queue.remove(k) {
+self.memory_used -= k.path.as_ref().heap_size();
+self.memory_used -= old_entry.heap_size();
+Some(old_entry)
+} else {
+None
+}
+}
+
+fn contains_key(&self, k: &TableScopedPath) -> bool {
+self.lru_queue.contains_key(k)
+}
+
+fn len(&self) -> usize {
+self.lru_queue.len()
+}
+
+fn clear(&mut self) {
+self.lru_queue.clear();
+self.memory_used = 0;
+}
+
+fn evict_entries(&mut self) {
+while
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3129697841
##
datafusion/execution/src/cache/file_statistics_cache.rs:
##
@@ -0,0 +1,739 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::cache::cache_manager::{
+CachedFileMetadata, FileStatisticsCache, FileStatisticsCacheEntry,
+};
+use crate::cache::{CacheAccessor, TableScopedPath};
+use object_store::path::Path;
+use std::collections::HashMap;
+use std::sync::Mutex;
+
+pub use crate::cache::DefaultFilesMetadataCache;
+use crate::cache::lru_queue::LruQueue;
+use datafusion_common::TableReference;
+use datafusion_common::heap_size::DFHeapSize;
+
+/// Default implementation of [`FileStatisticsCache`]
+///
+/// Stores cached file metadata (statistics and orderings) for files.
+///
+/// The typical usage pattern is:
+/// 1. Call `get(path)` to check for cached value
+/// 2. If `Some(cached)`, validate with `cached.is_valid_for(Β€t_meta)`
+/// 3. If invalid or missing, compute new value and call `put(path, new_value)`
+///
+/// # Internal details
+///
+/// The `memory_limit` controls the maximum size of the cache, which uses a
+/// Least Recently Used eviction algorithm. When adding a new entry, if the
total
+/// size of the cached entries exceeds `memory_limit`, the least recently used
entries
+/// are evicted until the total size is lower than `memory_limit`.
+///
+///
+/// [`FileStatisticsCache`]: crate::cache::cache_manager::FileStatisticsCache
+#[derive(Default)]
+pub struct DefaultFileStatisticsCache {
+state: Mutex,
+}
+
+impl DefaultFileStatisticsCache {
+pub fn new(memory_limit: usize) -> Self {
+Self {
+state:
Mutex::new(DefaultFileStatisticsCacheState::new(memory_limit)),
+}
+}
+
+/// Returns the size of the cached memory, in bytes.
+pub fn memory_used(&self) -> usize {
+let state = self.state.lock().unwrap();
+state.memory_used
+}
+}
+
+struct DefaultFileStatisticsCacheState {
+lru_queue: LruQueue,
+memory_limit: usize,
+memory_used: usize,
+}
+
+pub const DEFAULT_FILE_STATISTICS_MEMORY_LIMIT: usize = 20 * 1024 * 1024; //
20MiB
+
+impl Default for DefaultFileStatisticsCacheState {
+fn default() -> Self {
+Self {
+lru_queue: LruQueue::new(),
+memory_limit: DEFAULT_FILE_STATISTICS_MEMORY_LIMIT,
+memory_used: 0,
+}
+}
+}
+
+impl DefaultFileStatisticsCacheState {
+fn new(memory_limit: usize) -> Self {
+Self {
+lru_queue: LruQueue::new(),
+memory_limit,
+memory_used: 0,
+}
+}
+fn get(&mut self, key: &TableScopedPath) -> Option {
+self.lru_queue.get(key).cloned()
+}
+
+fn put(
+&mut self,
+key: &TableScopedPath,
+value: CachedFileMetadata,
+) -> Option {
+let key_size = key.path.as_ref().heap_size();
Review Comment:
Good catch, thank you.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047: URL: https://github.com/apache/datafusion/pull/20047#discussion_r3129581031 ## datafusion/core/src/execution/context/mod.rs: ## Review Comment: Yes, thank you. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3129343757
##
datafusion/execution/src/cache/file_statistics_cache.rs:
##
@@ -0,0 +1,739 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::cache::cache_manager::{
+CachedFileMetadata, FileStatisticsCache, FileStatisticsCacheEntry,
+};
+use crate::cache::{CacheAccessor, TableScopedPath};
+use object_store::path::Path;
+use std::collections::HashMap;
+use std::sync::Mutex;
+
+pub use crate::cache::DefaultFilesMetadataCache;
+use crate::cache::lru_queue::LruQueue;
+use datafusion_common::TableReference;
+use datafusion_common::heap_size::DFHeapSize;
+
+/// Default implementation of [`FileStatisticsCache`]
+///
+/// Stores cached file metadata (statistics and orderings) for files.
+///
+/// The typical usage pattern is:
+/// 1. Call `get(path)` to check for cached value
+/// 2. If `Some(cached)`, validate with `cached.is_valid_for(Β€t_meta)`
+/// 3. If invalid or missing, compute new value and call `put(path, new_value)`
+///
+/// # Internal details
+///
+/// The `memory_limit` controls the maximum size of the cache, which uses a
+/// Least Recently Used eviction algorithm. When adding a new entry, if the
total
+/// size of the cached entries exceeds `memory_limit`, the least recently used
entries
+/// are evicted until the total size is lower than `memory_limit`.
+///
+///
+/// [`FileStatisticsCache`]: crate::cache::cache_manager::FileStatisticsCache
+#[derive(Default)]
+pub struct DefaultFileStatisticsCache {
+state: Mutex,
+}
+
+impl DefaultFileStatisticsCache {
+pub fn new(memory_limit: usize) -> Self {
+Self {
+state:
Mutex::new(DefaultFileStatisticsCacheState::new(memory_limit)),
+}
+}
+
+/// Returns the size of the cached memory, in bytes.
+pub fn memory_used(&self) -> usize {
+let state = self.state.lock().unwrap();
+state.memory_used
+}
+}
+
+struct DefaultFileStatisticsCacheState {
+lru_queue: LruQueue,
+memory_limit: usize,
+memory_used: usize,
+}
+
+pub const DEFAULT_FILE_STATISTICS_MEMORY_LIMIT: usize = 20 * 1024 * 1024; //
20MiB
+
+impl Default for DefaultFileStatisticsCacheState {
+fn default() -> Self {
+Self {
+lru_queue: LruQueue::new(),
+memory_limit: DEFAULT_FILE_STATISTICS_MEMORY_LIMIT,
+memory_used: 0,
+}
+}
+}
+
+impl DefaultFileStatisticsCacheState {
+fn new(memory_limit: usize) -> Self {
+Self {
+lru_queue: LruQueue::new(),
+memory_limit,
+memory_used: 0,
+}
+}
+fn get(&mut self, key: &TableScopedPath) -> Option {
+self.lru_queue.get(key).cloned()
+}
+
+fn put(
+&mut self,
+key: &TableScopedPath,
+value: CachedFileMetadata,
+) -> Option {
+let key_size = key.path.as_ref().heap_size();
Review Comment:
Yes, let's do this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047: URL: https://github.com/apache/datafusion/pull/20047#discussion_r3129341511 ## datafusion/core/src/execution/context/mod.rs: ## Review Comment: Yes, this makes sense! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
kosiew commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3128191420
##
datafusion/execution/src/cache/file_statistics_cache.rs:
##
@@ -0,0 +1,739 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::cache::cache_manager::{
+CachedFileMetadata, FileStatisticsCache, FileStatisticsCacheEntry,
+};
+use crate::cache::{CacheAccessor, TableScopedPath};
+use object_store::path::Path;
+use std::collections::HashMap;
+use std::sync::Mutex;
+
+pub use crate::cache::DefaultFilesMetadataCache;
+use crate::cache::lru_queue::LruQueue;
+use datafusion_common::TableReference;
+use datafusion_common::heap_size::DFHeapSize;
+
+/// Default implementation of [`FileStatisticsCache`]
+///
+/// Stores cached file metadata (statistics and orderings) for files.
+///
+/// The typical usage pattern is:
+/// 1. Call `get(path)` to check for cached value
+/// 2. If `Some(cached)`, validate with `cached.is_valid_for(Β€t_meta)`
+/// 3. If invalid or missing, compute new value and call `put(path, new_value)`
+///
+/// # Internal details
+///
+/// The `memory_limit` controls the maximum size of the cache, which uses a
+/// Least Recently Used eviction algorithm. When adding a new entry, if the
total
+/// size of the cached entries exceeds `memory_limit`, the least recently used
entries
+/// are evicted until the total size is lower than `memory_limit`.
+///
+///
+/// [`FileStatisticsCache`]: crate::cache::cache_manager::FileStatisticsCache
+#[derive(Default)]
+pub struct DefaultFileStatisticsCache {
+state: Mutex,
+}
+
+impl DefaultFileStatisticsCache {
+pub fn new(memory_limit: usize) -> Self {
+Self {
+state:
Mutex::new(DefaultFileStatisticsCacheState::new(memory_limit)),
+}
+}
+
+/// Returns the size of the cached memory, in bytes.
+pub fn memory_used(&self) -> usize {
+let state = self.state.lock().unwrap();
+state.memory_used
+}
+}
+
+struct DefaultFileStatisticsCacheState {
+lru_queue: LruQueue,
+memory_limit: usize,
+memory_used: usize,
+}
+
+pub const DEFAULT_FILE_STATISTICS_MEMORY_LIMIT: usize = 20 * 1024 * 1024; //
20MiB
+
+impl Default for DefaultFileStatisticsCacheState {
+fn default() -> Self {
+Self {
+lru_queue: LruQueue::new(),
+memory_limit: DEFAULT_FILE_STATISTICS_MEMORY_LIMIT,
+memory_used: 0,
+}
+}
+}
+
+impl DefaultFileStatisticsCacheState {
+fn new(memory_limit: usize) -> Self {
+Self {
+lru_queue: LruQueue::new(),
+memory_limit,
+memory_used: 0,
+}
+}
+fn get(&mut self, key: &TableScopedPath) -> Option {
+self.lru_queue.get(key).cloned()
+}
+
+fn put(
+&mut self,
+key: &TableScopedPath,
+value: CachedFileMetadata,
+) -> Option {
+let key_size = key.path.as_ref().heap_size();
Review Comment:
The cache key is now `TableScopedPath { table, path }`, but the memory
accounting here still only charges for the path bytes. That leaves the new
`table_reference` portion effectively unaccounted for, so `memory_used` can
undercount in multi-table scenarios.
Would it make sense to include the `TableReference` heap usage as well so
the cache limit is enforced more accurately?
##
datafusion/execution/src/cache/file_statistics_cache.rs:
##
@@ -0,0 +1,739 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing pe
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3123770486
##
datafusion/core/src/execution/context/mod.rs:
##
@@ -1752,7 +1766,9 @@ impl SessionContext {
let config = ListingTableConfig::new(table_path)
.with_listing_options(options)
.with_schema(resolved_schema);
-let table =
ListingTable::try_new(config)?.with_definition(sql_definition);
+let table = ListingTable::try_new(config)?
+.with_definition(sql_definition)
+
.with_cache(self.runtime_env().cache_manager.get_file_statistic_cache());
Review Comment:
@kosiew I now added the cache on the listing table in `_read_type()` and on
all locations where a listing table is created with `with_collect_stat(true)`
where therefore caching makes sense. Could you please do another review?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3093035238
##
datafusion/execution/src/runtime_env.rs:
##
@@ -514,6 +535,7 @@ impl RuntimeEnvBuilder {
Some("50M".to_owned()),
Some("1M".to_owned()),
None,
+Some("1M".to_owned()),
Review Comment:
Yes, good point.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3093032919
##
datafusion/core/src/execution/context/mod.rs:
##
@@ -1752,7 +1766,9 @@ impl SessionContext {
let config = ListingTableConfig::new(table_path)
.with_listing_options(options)
.with_schema(resolved_schema);
-let table =
ListingTable::try_new(config)?.with_definition(sql_definition);
+let table = ListingTable::try_new(config)?
+.with_definition(sql_definition)
+
.with_cache(self.runtime_env().cache_manager.get_file_statistic_cache());
Review Comment:
No this was not intentional. This needs to be fixed. Thanks a lot!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
kosiew commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3092631607
##
datafusion/execution/src/runtime_env.rs:
##
@@ -514,6 +535,7 @@ impl RuntimeEnvBuilder {
Some("50M".to_owned()),
Some("1M".to_owned()),
None,
+Some("1M".to_owned()),
Review Comment:
I see that `DEFAULT_FILE_STATISTICS_MEMORY_LIMIT` was increased to 20 MiB,
but `RuntimeEnvBuilder::entries()` still hard-codes `Some("1M".to_owned())` for
`datafusion.runtime.file_statistics_cache_limit`.
Would it make sense to update `entries()` to match the new default?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
kosiew commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3092631605
##
datafusion/core/src/execution/context/mod.rs:
##
@@ -1752,7 +1766,9 @@ impl SessionContext {
let config = ListingTableConfig::new(table_path)
.with_listing_options(options)
.with_schema(resolved_schema);
-let table =
ListingTable::try_new(config)?.with_definition(sql_definition);
+let table = ListingTable::try_new(config)?
+.with_definition(sql_definition)
+
.with_cache(self.runtime_env().cache_manager.get_file_statistic_cache());
Review Comment:
I noticed that only the registered-listing-table path was updated to call
`.with_cache(...)`. Was it intentional that the generic reader path in
`_read_type()` still uses `ListingTable::try_new(config)?` without attaching
the runtime cache?
With `ListingTable` now defaulting `collected_statistics` to `None`, could
this mean that `SessionContext::read_parquet`, `read_csv`, `read_json`,
`read_avro`, and similar methods no longer benefit from the file statistics
cache?
If so, would this be considered a regression in the default-cache rollout?
How do you see this affecting users who rely on those reader paths?
##
datafusion/execution/src/runtime_env.rs:
##
@@ -514,6 +535,7 @@ impl RuntimeEnvBuilder {
Some("50M".to_owned()),
Some("1M".to_owned()),
None,
+Some("1M".to_owned()),
Review Comment:
I see that `DEFAULT_FILE_STATISTICS_MEMORY_LIMIT` was increased to 20 MiB,
but `RuntimeEnvBuilder::entries()` still hard-codes `Some("1M".to_owned())` for
`datafusion.runtime.file_statistics_cache_limit`.
Should these two defaults be aligned? As it stands, could this create
confusion where generated configuration docs and callers of `entries()` still
reflect the old default while runtime behavior uses the new one?
Would it make sense to update `entries()` to match the new default?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4253678703 Ok, the benchmarks are looking good now again. At the moment we have 20 mb default cache size, with 10 mb the stats for clickhouse_partitioned did not fit yet. I wonder what would be a reasonable default value? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
adriangbot commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4252997551 π€ Benchmark completed (GKE) | [trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4252876807) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Details ``` Comparing HEAD and file-stats-cache Benchmark clickbench_partitioned.json βββββ³ββββ³ββββ³ββββ β Query β HEAD β file-stats-cache βChange β β‘ββββββββββββββββ© β QQuery 0 β 1.20 / 4.41 Β±6.34 / 17.09 ms β 1.23 / 4.47 Β±6.37 / 17.22 ms β no change β β QQuery 1 β14.66 / 14.94 Β±0.24 / 15.25 ms β14.41 / 14.57 Β±0.22 / 15.00 ms β no change β β QQuery 2 β44.00 / 44.14 Β±0.13 / 44.37 ms β43.38 / 43.86 Β±0.28 / 44.25 ms β no change β β QQuery 3 β42.56 / 44.96 Β±1.66 / 47.06 ms β39.38 / 39.80 Β±0.27 / 40.11 ms β +1.13x faster β β QQuery 4 β 277.84 / 285.17 Β±4.27 / 288.65 ms β 277.58 / 283.29 Β±3.88 / 287.93 ms β no change β β QQuery 5 β 337.90 / 341.30 Β±2.18 / 343.91 ms β 331.05 / 336.37 Β±3.60 / 342.10 ms β no change β β QQuery 6 β 5.81 / 6.72 Β±0.74 / 7.97 ms β 5.61 / 6.57 Β±1.06 / 8.58 ms β no change β β QQuery 7 β16.68 / 17.12 Β±0.25 / 17.45 ms β16.62 / 16.71 Β±0.10 / 16.90 ms β no change β β QQuery 8 β 412.71 / 420.72 Β±8.99 / 438.06 ms β401.33 / 417.04 Β±12.01 / 434.91 ms β no change β β QQuery 9 β 628.36 / 639.32 Β±6.84 / 648.31 ms β 621.94 / 634.97 Β±6.93 / 640.84 ms β no change β β QQuery 10 β91.30 / 92.43 Β±1.55 / 95.45 ms β89.73 / 91.25 Β±1.11 / 92.92 ms β no change β β QQuery 11 β 102.80 / 104.42 Β±1.63 / 106.92 ms β 102.82 / 103.85 Β±0.81 / 105.11 ms β no change β β QQuery 12 β 330.37 / 342.33 Β±7.28 / 353.21 ms β 329.29 / 339.39 Β±8.93 / 353.37 ms β no change β β QQuery 13 β 442.00 / 457.80 Β±9.56 / 469.38 ms β446.65 / 468.59 Β±18.24 / 501.15 ms β no change β β QQuery 14 β 340.00 / 343.67 Β±3.22 / 348.81 ms β 338.85 / 341.73 Β±2.69 / 346.04 ms β no change β β QQuery 15 β
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
adriangbot commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4252893228 π€ Benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4252876807) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-c4252876807-1300-wbhhh 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux` CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Comparing file-stats-cache (3b64005590fb19acfdc4ae3e7cf05403733830bb) to 244f891 (merge-base) [diff](https://github.com/apache/datafusion/compare/244f89120b254f824b6fed0f8a88c23a74bf09d7..3b64005590fb19acfdc4ae3e7cf05403733830bb) using: clickbench_partitioned Results will be posted here when complete --- [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) against this benchmark runner -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
adriangb commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4252900371 @mkleen you should be able to trigger benchmark runs now fwiw -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
adriangb commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4252876807 run benchmark clickbench_partitioned -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4252380128 @kosiew I added all the changes you requested. Could you do another review please? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4252322673 This looks much better now: ``` βββββ³ββββ³β³ββββ β Query β HEAD β file-stats-cache βChange β β‘βββββββββββββ© β QQuery 0 β 1.21 / 4.43 Β±6.35 / 17.13 ms β 1.24 / 4.48 Β±6.39 / 17.27 ms β no change β ``` There are a few minor regressions which could be just noise. Could we do a second run? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
adriangbot commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4252033177 π€ Benchmark completed (GKE) | [trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4251865503) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Details ``` Comparing HEAD and file-stats-cache Benchmark clickbench_partitioned.json βββββ³ββββ³β³ββββ β Query β HEAD β file-stats-cache βChange β β‘βββββββββββββ© β QQuery 0 β 1.21 / 4.43 Β±6.35 / 17.13 ms β 1.24 / 4.48 Β±6.39 / 17.27 ms β no change β β QQuery 1 β14.71 / 14.92 Β±0.18 / 15.23 ms β 14.17 / 14.68 Β±0.32 / 15.17 ms β no change β β QQuery 2 β43.45 / 43.82 Β±0.38 / 44.52 ms β 43.72 / 44.07 Β±0.33 / 44.61 ms β no change β β QQuery 3 β40.71 / 44.62 Β±2.98 / 48.23 ms β 42.83 / 45.30 Β±1.54 / 47.53 ms β no change β β QQuery 4 β 279.65 / 290.03 Β±5.81 / 297.44 ms β 278.66 / 297.03 Β±16.87 / 321.40 ms β no change β β QQuery 5 β 334.32 / 342.47 Β±4.70 / 347.62 ms β 341.14 / 347.92 Β±5.80 / 357.84 ms β no change β β QQuery 6 β 5.96 / 6.35 Β±0.28 / 6.69 ms β 6.50 / 8.50 Β±1.55 / 10.30 ms β 1.34x slower β β QQuery 7 β16.21 / 16.97 Β±0.71 / 18.27 ms β 16.43 / 19.59 Β±5.88 / 31.34 ms β 1.15x slower β β QQuery 8 β 407.88 / 418.65 Β±8.08 / 429.99 ms β 413.70 / 422.51 Β±11.66 / 445.28 ms β no change β β QQuery 9 β 630.52 / 639.24 Β±5.76 / 647.14 ms β 636.38 / 644.87 Β±5.68 / 653.84 ms β no change β β QQuery 10 β91.55 / 93.38 Β±2.06 / 97.40 ms β 91.34 / 93.25 Β±2.69 / 98.47 ms β no change β β QQuery 11 β 104.43 / 106.09 Β±1.48 / 108.07 ms β 103.53 / 104.35 Β±0.69 / 105.38 ms β no change β β QQuery 12 β 336.53 / 340.35 Β±2.76 / 342.96 ms β 334.51 / 344.06 Β±10.30 / 363.61 ms β no change β β QQuery 13 β442.59 / 461.29 Β±10.16 / 469.46 ms β 453.08 / 464.30 Β±11.72 / 486.13 ms β no change β β QQuery 14 β 341.31 / 347.52 Β±4.45 / 355.08 ms β 331.97 / 342.74 Β±6.03 / 349.90 ms β no change β
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
adriangbot commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4251910707 π€ Benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4251865503) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-c4251865503-1285-n44mp 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux` CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Comparing file-stats-cache (3b64005590fb19acfdc4ae3e7cf05403733830bb) to 244f891 (merge-base) [diff](https://github.com/apache/datafusion/compare/244f89120b254f824b6fed0f8a88c23a74bf09d7..3b64005590fb19acfdc4ae3e7cf05403733830bb) using: clickbench_partitioned Results will be posted here when complete --- [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) against this benchmark runner -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4251884799 > run benchmark clickbench_partitioned @Dandandan Thank you, i broke the linter. It's fixed now. Does this affect the benchmark run? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
Dandandan commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4251865503 run benchmark clickbench_partitioned -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4251546359 @adriangb Could please do another benchmark run? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
nuno-faria commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4231834426 > There seem to be reproducible slowdowns ``` βββββ³ββββ³ββββ³ββββ β Query β HEAD β file-stats-cache βChange β β‘ββββββββββββββββ© β QQuery 0 β 1.30 / 4.72 Β±6.59 / 17.91 ms β16.67 / 16.91 Β±0.33 / 17.54 ms β 3.58x slower β ``` The first query in clickbench is a simple `count(*)`, which should complete immediately after getting the statistics as far as I'm aware. @mkleen I would start by checking if the statistics are being collected in the benchmark. Maybe the cache is not enabled there. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4231263601 > Thanks @mkleen, LGTM after applying @kosiew suggestions. @alamb or @adriangb could you run some benchmarks here? @nuno-faria Thank you for your review! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
mkleen commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4230402657 > There seem to be reproducible slowdowns @adriangb Thanks, i will look into it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
adriangb commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4229900383 There seem to be reproducible slowdowns -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
adriangbot commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4229898903 π€ Benchmark completed (GKE) | [trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4229873645) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Details ``` Comparing HEAD and file-stats-cache Benchmark clickbench_partitioned.json βββββ³ββββ³ββββ³ββββ β Query β HEAD β file-stats-cache βChange β β‘ββββββββββββββββ© β QQuery 0 β 1.21 / 4.53 Β±6.48 / 17.50 ms β17.33 / 17.53 Β±0.23 / 17.98 ms β 3.87x slower β β QQuery 1 β14.59 / 14.93 Β±0.20 / 15.09 ms β30.67 / 31.65 Β±0.61 / 32.32 ms β 2.12x slower β β QQuery 2 β44.19 / 44.59 Β±0.29 / 44.86 ms β60.88 / 61.86 Β±0.65 / 62.88 ms β 1.39x slower β β QQuery 3 β44.33 / 47.04 Β±2.16 / 49.51 ms β61.76 / 64.10 Β±2.28 / 66.98 ms β 1.36x slower β β QQuery 4 β 303.29 / 313.85 Β±9.30 / 326.83 ms β 322.38 / 327.04 Β±3.20 / 331.66 ms β no change β β QQuery 5 β 358.04 / 360.84 Β±2.17 / 363.75 ms β 376.83 / 387.56 Β±6.42 / 396.69 ms β 1.07x slower β β QQuery 6 β 5.84 / 6.98 Β±0.95 / 8.18 ms β21.35 / 23.37 Β±1.78 / 26.52 ms β 3.35x slower β β QQuery 7 β17.26 / 17.76 Β±0.68 / 19.11 ms β33.96 / 35.59 Β±1.21 / 37.11 ms β 2.00x slower β β QQuery 8 β 440.93 / 452.26 Β±8.97 / 466.73 ms β 457.56 / 470.54 Β±8.33 / 479.19 ms β no change β β QQuery 9 β 690.41 / 704.52 Β±8.51 / 712.50 ms β 707.69 / 722.72 Β±9.59 / 736.12 ms β no change β β QQuery 10 β92.89 / 94.97 Β±1.36 / 96.73 ms β 113.09 / 116.63 Β±2.62 / 119.67 ms β 1.23x slower β β QQuery 11 β 106.91 / 107.82 Β±1.06 / 109.88 ms β 124.36 / 125.87 Β±1.44 / 127.77 ms β 1.17x slower β β QQuery 12 β 355.12 / 358.54 Β±2.16 / 361.28 ms β 377.88 / 389.74 Β±7.10 / 396.88 ms β 1.09x slower β β QQuery 13 β462.19 / 482.73 Β±20.92 / 522.20 ms β 507.09 / 518.90 Β±9.90 / 533.63 ms β 1.07x slower β β QQuery 14 β 363.85 / 367.26 Β±3.03 / 372.32 ms β 390.10 / 393.15 Β±3.05 / 397.77 ms β 1.07x slower β β QQuery 15 β
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
adriangbot commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4229877329 π€ Benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4229873645) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-c4229873645-1091-dm6wp 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux` CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Comparing file-stats-cache (c49c90669333b9bd35a3c5d9ad64ce3b49935907) to 8cf70ec (merge-base) [diff](https://github.com/apache/datafusion/compare/8cf70ec76c6f9de52709577f43559a0c93dd26d7..c49c90669333b9bd35a3c5d9ad64ce3b49935907) using: clickbench_partitioned Results will be posted here when complete --- [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) against this benchmark runner -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
adriangb commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4229873645 run benchmark clickbench_partitioned -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
adriangbot commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4229863622 π€ Benchmark completed (GKE) | [trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4229831576) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Details ``` Comparing HEAD and file-stats-cache Benchmark clickbench_partitioned.json βββββ³ββββ³ββββ³ββββ β Query β HEAD β file-stats-cache βChange β β‘ββββββββββββββββ© β QQuery 0 β 1.30 / 4.72 Β±6.59 / 17.91 ms β16.67 / 16.91 Β±0.33 / 17.54 ms β 3.58x slower β β QQuery 1 β14.65 / 15.36 Β±0.50 / 16.20 ms β29.65 / 30.60 Β±0.50 / 31.12 ms β 1.99x slower β β QQuery 2 β44.38 / 44.92 Β±0.54 / 45.87 ms β59.24 / 60.16 Β±0.65 / 60.80 ms β 1.34x slower β β QQuery 3 β43.47 / 46.97 Β±2.24 / 49.29 ms β57.55 / 60.30 Β±2.65 / 64.24 ms β 1.28x slower β β QQuery 4 β 329.53 / 337.16 Β±4.30 / 342.23 ms β 297.91 / 304.08 Β±4.96 / 311.45 ms β +1.11x faster β β QQuery 5 β 370.65 / 378.86 Β±7.38 / 387.77 ms β 357.48 / 364.60 Β±6.14 / 372.17 ms β no change β β QQuery 6 β 5.56 / 7.47 Β±2.44 / 12.14 ms β22.10 / 23.10 Β±0.84 / 24.28 ms β 3.09x slower β β QQuery 7 β16.74 / 18.10 Β±1.68 / 21.35 ms β32.82 / 33.16 Β±0.32 / 33.71 ms β 1.83x slower β β QQuery 8 β452.99 / 474.46 Β±16.45 / 495.91 ms β 439.01 / 450.72 Β±9.82 / 468.47 ms β +1.05x faster β β QQuery 9 β 731.49 / 742.40 Β±6.52 / 750.18 ms β689.90 / 699.61 Β±10.62 / 716.40 ms β +1.06x faster β β QQuery 10 β 94.37 / 98.95 Β±2.72 / 101.62 ms β 109.09 / 112.41 Β±3.01 / 117.26 ms β 1.14x slower β β QQuery 11 β 106.48 / 108.64 Β±2.36 / 113.15 ms β 120.00 / 123.95 Β±2.67 / 126.62 ms β 1.14x slower β β QQuery 12 β 374.05 / 380.20 Β±4.86 / 388.29 ms β 362.45 / 374.59 Β±7.21 / 383.02 ms β no change β β QQuery 13 β493.79 / 516.90 Β±16.59 / 535.83 ms β476.63 / 494.16 Β±10.23 / 505.06 ms β no change β β QQuery 14 β 375.25 / 379.26 Β±3.48 / 385.17 ms β 359.14 / 369.75 Β±6.43 / 376.08 ms β no change β β QQuery 15 β
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
adriangbot commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4229863906 π€ Benchmark completed (GKE) | [trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4229831576) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Details ``` Comparing HEAD and file-stats-cache Benchmark tpcds_sf1.json βββββ³βββ³βββ³ββββ β Query β HEAD β file-stats-cache βChange β β‘ββββββββββββββ© β QQuery 1 β 6.58 / 7.03 Β±0.82 / 8.67 ms β 6.85 / 7.22 Β±0.68 / 8.58 ms β no change β β QQuery 2 β144.77 / 146.06 Β±1.15 / 147.79 ms β146.32 / 147.72 Β±0.82 / 148.80 ms β no change β β QQuery 3 β113.28 / 114.21 Β±0.73 / 115.36 ms β114.43 / 115.16 Β±1.10 / 117.34 ms β no change β β QQuery 4 β 1329.81 / 1344.00 Β±9.90 / 1354.79 ms β1323.90 / 1354.62 Β±29.12 / 1397.33 ms β no change β β QQuery 5 β171.23 / 173.76 Β±1.44 / 175.18 ms β171.81 / 174.91 Β±1.95 / 177.02 ms β no change β β QQuery 6 β 837.56 / 855.54 Β±16.82 / 879.95 ms β844.15 / 859.49 Β±9.51 / 869.10 ms β no change β β QQuery 7 β341.05 / 345.35 Β±2.45 / 348.44 ms β342.94 / 345.55 Β±2.45 / 349.44 ms β no change β β QQuery 8 β115.26 / 117.32 Β±1.61 / 120.16 ms β115.32 / 116.88 Β±0.96 / 118.31 ms β no change β β QQuery 9 β 101.50 / 108.71 Β±10.42 / 129.31 ms β 101.93 / 110.30 Β±10.86 / 131.81 ms β no change β β QQuery 10 β106.11 / 107.63 Β±1.06 / 108.81 ms β105.63 / 107.30 Β±1.44 / 109.82 ms β no change β β QQuery 11 β 924.17 / 945.49 Β±11.69 / 955.93 ms β 901.97 / 925.97 Β±14.31 / 939.91 ms β no change β β QQuery 12 β 45.55 / 46.26 Β±0.62 / 47.07 ms β 44.08 / 46.08 Β±1.72 / 48.53 ms β no change β β QQuery 13 β400.62 / 404.33 Β±2.62 / 408.77 ms β400.05 / 402.54 Β±2.15 / 406.06 ms β no change β β QQuery 14 β 998.37 / 1002.10 Β±3.72 / 1008.21 ms β 995.99 / 999.43 Β±2.77 / 1004.26 ms β no change β β QQuery 15 β 15.29 / 16.22 Β±
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
adriangbot commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4229835339 π€ Benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4229831576) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-c4229831576-1089-4qn6f 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux` CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Comparing file-stats-cache (c49c90669333b9bd35a3c5d9ad64ce3b49935907) to 8cf70ec (merge-base) [diff](https://github.com/apache/datafusion/compare/8cf70ec76c6f9de52709577f43559a0c93dd26d7..c49c90669333b9bd35a3c5d9ad64ce3b49935907) using: tpcds Results will be posted here when complete --- [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) against this benchmark runner -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
adriangbot commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4229835291 π€ Benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4229831576) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-c4229831576-1088-qfzb5 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux` CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Comparing file-stats-cache (c49c90669333b9bd35a3c5d9ad64ce3b49935907) to 8cf70ec (merge-base) [diff](https://github.com/apache/datafusion/compare/8cf70ec76c6f9de52709577f43559a0c93dd26d7..c49c90669333b9bd35a3c5d9ad64ce3b49935907) using: clickbench_partitioned Results will be posted here when complete --- [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) against this benchmark runner -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]
adriangbot commented on PR #20047: URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4229835382 π€ Benchmark running (GKE) | [trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4229831576) **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux bench-c4229831576-1090-64z4q 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux` CPU Details (lscpu) ``` Architecture:aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: ARM Model name: Neoverse-V2 Model: 1 Thread(s) per core: 1 Core(s) per cluster: 16 Socket(s): - Cluster(s): 1 Stepping:r0p1 BogoMIPS:2000.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti L1d cache: 1 MiB (16 instances) L1i cache: 1 MiB (16 instances) L2 cache:32 MiB (16 instances) L3 cache:80 MiB (1 instance) NUMA node(s):1 NUMA node0 CPU(s): 0-15 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling:Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1:Mitigation; __user pointer sanitization Vulnerability Spectre v2:Mitigation; CSV2, BHB Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected ``` Comparing file-stats-cache (c49c90669333b9bd35a3c5d9ad64ce3b49935907) to 8cf70ec (merge-base) [diff](https://github.com/apache/datafusion/compare/8cf70ec76c6f9de52709577f43559a0c93dd26d7..c49c90669333b9bd35a3c5d9ad64ce3b49935907) using: tpch Results will be posted here when complete --- [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) against this benchmark runner -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
