Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-20 Thread via GitHub


alamb commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4500085124

   πŸŽ‰  -- amazing - Thank you so much @mkleen and @nuno-faria and @kosiew 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-18 Thread via GitHub


alamb merged PR #20047:
URL: https://github.com/apache/datafusion/pull/20047


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-18 Thread via GitHub


alamb commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4482080876

   All right, adding to the merge queue to get this in for 54. Thanks again 
@mkleen 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-18 Thread via GitHub


nuno-faria commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4481397868

   > @nuno-faria and @kosiew -- is there anything else we need to do before 
merging this PR?
   
   Not on my end. Thanks again @mkleen for the hard work!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-18 Thread via GitHub


alamb commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4481195950

   @nuno-faria and @kosiew  -- is there anything else we need to do before 
merging this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-18 Thread via GitHub


alamb commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4481189144

   > @alamb It looks to me that the slowdowns are not real, but only noise.
   
   Agreed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-16 Thread via GitHub


mkleen commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4466796224

   @alamb It looks to me the benchmark regression is not real, but only noise.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-15 Thread via GitHub


adriangbot commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4462833622

   πŸ€– Benchmark completed (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4462702395)
   
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB)
   
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Details
   
   
   ```
   Comparing HEAD and file-stats-cache
   
   Benchmark clickbench_partitioned.json
   
   
┏━━━┳━━━┳━━━┳━━━┓
   ┃ Query ┃  HEAD ┃  
file-stats-cache ┃Change ┃
   
┑━━━╇━━━╇━━━╇━━━┩
   β”‚ QQuery 0  β”‚  1.21 / 4.71 Β±6.86 / 18.44 ms β”‚  1.30 / 4.84 
Β±6.93 / 18.69 ms β”‚ no change β”‚
   β”‚ QQuery 1  β”‚12.53 / 12.78 Β±0.18 / 12.99 ms β”‚12.60 / 13.00 
Β±0.21 / 13.17 ms β”‚ no change β”‚
   β”‚ QQuery 2  β”‚35.58 / 35.85 Β±0.28 / 36.38 ms β”‚35.98 / 36.25 
Β±0.32 / 36.72 ms β”‚ no change β”‚
   β”‚ QQuery 3  β”‚31.46 / 31.91 Β±0.46 / 32.79 ms β”‚31.57 / 32.23 
Β±0.41 / 32.73 ms β”‚ no change β”‚
   β”‚ QQuery 4  β”‚ 241.91 / 244.85 Β±1.66 / 246.64 ms β”‚ 244.34 / 247.59 
Β±4.37 / 256.22 ms β”‚ no change β”‚
   β”‚ QQuery 5  β”‚ 285.58 / 286.91 Β±1.81 / 290.48 ms β”‚ 288.66 / 290.41 
Β±1.83 / 293.24 ms β”‚ no change β”‚
   β”‚ QQuery 6  β”‚   5.91 / 7.20 Β±0.68 / 7.73 ms β”‚  6.60 / 8.43 
Β±2.02 / 12.36 ms β”‚  1.17x slower β”‚
   β”‚ QQuery 7  β”‚13.95 / 14.13 Β±0.16 / 14.37 ms β”‚14.08 / 14.20 
Β±0.09 / 14.32 ms β”‚ no change β”‚
   β”‚ QQuery 8  β”‚ 339.54 / 342.56 Β±2.50 / 347.08 ms β”‚ 345.51 / 348.46 
Β±2.50 / 351.80 ms β”‚ no change β”‚
   β”‚ QQuery 9  β”‚ 465.89 / 473.60 Β±5.72 / 480.85 ms β”‚ 455.62 / 469.15 
Β±8.05 / 480.71 ms β”‚ no change β”‚
   β”‚ QQuery 10 β”‚71.54 / 74.50 Β±2.23 / 78.03 ms β”‚70.03 / 73.47 
Β±4.08 / 81.48 ms β”‚ no change β”‚
   β”‚ QQuery 11 β”‚82.55 / 83.09 Β±0.47 / 83.84 ms β”‚80.75 / 81.63 
Β±0.67 / 82.35 ms β”‚ no change β”‚
   β”‚ QQuery 12 β”‚ 282.26 / 286.51 Β±3.69 / 292.47 ms β”‚ 284.36 / 287.10 
Β±2.21 / 289.95 ms β”‚ no change β”‚
   β”‚ QQuery 13 β”‚ 398.79 / 404.83 Β±5.96 / 415.54 ms β”‚ 404.20 / 409.11 
Β±5.41 / 419.41 ms β”‚ no change β”‚
   β”‚ QQuery 14 β”‚ 293.48 / 298.78 Β±6.53 / 311.32 ms β”‚ 295.85 / 299.17 
Β±3.20 / 305.00 ms β”‚ no change β”‚
   β”‚ QQuery 15 β”‚ 

Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-15 Thread via GitHub


adriangbot commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4462717687

   πŸ€– Benchmark running (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4462702395)
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux 
bench-c4462702395-151-55fcl 6.12.68+ #1 SMP Wed Apr  1 02:23:28 UTC 2026 
aarch64 GNU/Linux`
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Comparing file-stats-cache (f648bee41c1d50502b9b1fda63dd2e86dc8e41ea) to 
9d92944 (merge-base) 
[diff](https://github.com/apache/datafusion/compare/9d92944900f8a50b688a4bd8685629ebd76999f3..f648bee41c1d50502b9b1fda63dd2e86dc8e41ea)
 using: clickbench_partitioned
   Results will be posted here when complete
   
   ---
   [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) 
against this benchmark runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-15 Thread via GitHub


alamb commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4462706714

   I want to see if some of these slowdowns are reproduced
   
   ```
   β”‚ QQuery 35 β”‚277.44 / 311.98 Β±42.77 / 392.86 ms β”‚ 309.15 / 317.20 
Β±7.72 / 327.72 ms β”‚no change β”‚
   β”‚ QQuery 36 β”‚66.14 / 70.52 Β±2.72 / 74.25 ms β”‚70.52 / 75.12 
Β±4.11 / 81.54 ms β”‚ 1.07x slower β”‚
   β”‚ QQuery 37 β”‚35.02 / 35.94 Β±0.57 / 36.58 ms β”‚38.04 / 40.48 
Β±2.45 / 44.69 ms β”‚ 1.13x slower β”‚
   β”‚ QQuery 38 β”‚39.71 / 42.03 Β±1.39 / 43.54 ms β”‚42.75 / 46.11 
Β±2.71 / 50.99 ms β”‚ 1.10x slower β”‚
   β”‚ QQuery 39 β”‚125.16 / 140.06 Β±10.52 / 149.46 ms β”‚ 143.78 / 149.00 
Β±7.79 / 164.51 ms β”‚ 1.06x slower β”‚
   β”‚ QQuery 40 β”‚13.80 / 14.09 Β±0.27 / 14.59 ms β”‚15.42 / 16.94 
Β±2.43 / 21.77 ms β”‚ 1.20x slower β”‚
   β”‚ QQuery 41 β”‚13.52 / 13.55 Β±0.03 / 13.60 ms β”‚14.99 / 15.20 
Β±0.19 / 15.50 ms β”‚ 1.12x slower β”‚
   β”‚ QQuery 42 β”‚12.96 / 13.16 Β±0.14 / 13.28 ms β”‚14.28 / 16.59 
Β±4.25 / 25.09 ms β”‚ 1.26x slower β”‚
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-15 Thread via GitHub


alamb commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4462702395

   
   run benchmark clickbench_partitioned
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-15 Thread via GitHub


mkleen commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4459100954

   @kosiew Thank you for all the work and the reviews! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-15 Thread via GitHub


kosiew commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4458125471

   @mkleen 
   I will add this to the merge queue after 24 hours if there are no further 
issues.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-15 Thread via GitHub


adriangbot commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4458055885

   πŸ€– Benchmark completed (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4457908563)
   
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB)
   
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Details
   
   
   ```
   Comparing HEAD and file-stats-cache
   
   Benchmark clickbench_partitioned.json
   
   
┏━━━┳━━━┳━━━┳━━┓
   ┃ Query ┃  HEAD ┃  
file-stats-cache ┃   Change ┃
   
┑━━━╇━━━╇━━━╇━━┩
   β”‚ QQuery 0  β”‚  1.21 / 4.58 Β±6.67 / 17.93 ms β”‚  1.27 / 4.69 
Β±6.79 / 18.26 ms β”‚no change β”‚
   β”‚ QQuery 1  β”‚12.46 / 12.62 Β±0.12 / 12.78 ms β”‚12.80 / 12.90 
Β±0.10 / 13.04 ms β”‚no change β”‚
   β”‚ QQuery 2  β”‚35.22 / 35.57 Β±0.29 / 36.08 ms β”‚35.57 / 35.83 
Β±0.18 / 36.12 ms β”‚no change β”‚
   β”‚ QQuery 3  β”‚30.78 / 32.84 Β±1.91 / 35.66 ms β”‚30.74 / 31.63 
Β±0.85 / 32.91 ms β”‚no change β”‚
   β”‚ QQuery 4  β”‚ 230.28 / 234.55 Β±3.18 / 238.52 ms β”‚ 226.34 / 231.59 
Β±3.27 / 234.96 ms β”‚no change β”‚
   β”‚ QQuery 5  β”‚ 275.85 / 277.08 Β±1.61 / 280.21 ms β”‚ 271.34 / 275.08 
Β±2.60 / 278.63 ms β”‚no change β”‚
   β”‚ QQuery 6  β”‚   6.78 / 7.05 Β±0.27 / 7.55 ms β”‚   6.94 / 7.18 
Β±0.20 / 7.50 ms β”‚no change β”‚
   β”‚ QQuery 7  β”‚13.43 / 13.52 Β±0.06 / 13.59 ms β”‚13.58 / 13.81 
Β±0.16 / 14.06 ms β”‚no change β”‚
   β”‚ QQuery 8  β”‚ 324.80 / 328.39 Β±2.52 / 331.95 ms β”‚ 323.04 / 327.41 
Β±2.72 / 331.44 ms β”‚no change β”‚
   β”‚ QQuery 9  β”‚ 448.01 / 452.30 Β±3.96 / 457.57 ms β”‚ 443.43 / 449.73 
Β±6.36 / 460.10 ms β”‚no change β”‚
   β”‚ QQuery 10 β”‚70.36 / 71.07 Β±0.63 / 72.11 ms β”‚69.35 / 70.94 
Β±2.41 / 75.74 ms β”‚no change β”‚
   β”‚ QQuery 11 β”‚81.10 / 81.51 Β±0.38 / 81.97 ms β”‚79.23 / 82.35 
Β±3.64 / 89.48 ms β”‚no change β”‚
   β”‚ QQuery 12 β”‚ 268.41 / 270.50 Β±2.01 / 273.33 ms β”‚ 268.41 / 271.12 
Β±3.02 / 276.25 ms β”‚no change β”‚
   β”‚ QQuery 13 β”‚ 381.12 / 389.06 Β±5.76 / 396.66 ms β”‚ 384.90 / 392.02 
Β±9.20 / 409.94 ms β”‚no change β”‚
   β”‚ QQuery 14 β”‚ 280.97 / 283.81 Β±3.36 / 290.36 ms β”‚ 281.68 / 284.83 
Β±3.82 / 292.30 ms β”‚no change β”‚
   β”‚ QQuery 15 β”‚ 275.82 / 277.7

Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-15 Thread via GitHub


mkleen commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4457998524

   @kosiew Thank you! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-15 Thread via GitHub


adriangbot commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4457928339

   πŸ€– Benchmark running (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4457908563)
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux 
bench-c4457908563-122-6mxc6 6.12.68+ #1 SMP Wed Apr  1 02:23:28 UTC 2026 
aarch64 GNU/Linux`
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Comparing file-stats-cache (f648bee41c1d50502b9b1fda63dd2e86dc8e41ea) to 
9d92944 (merge-base) 
[diff](https://github.com/apache/datafusion/compare/9d92944900f8a50b688a4bd8685629ebd76999f3..f648bee41c1d50502b9b1fda63dd2e86dc8e41ea)
 using: clickbench_partitioned
   Results will be posted here when complete
   
   ---
   [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) 
against this benchmark runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-15 Thread via GitHub


kosiew commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4457908563

   run benchmark clickbench_partitioned
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-14 Thread via GitHub


mkleen commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4457457101

   Hi @kosiew This is ready for another review. I fixed all remaining open 
issues.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-11 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3221422772


##
datafusion/core/src/execution/context/mod.rs:
##
@@ -1420,14 +1429,19 @@ impl SessionContext {
 && table_provider.table_type() == table_type
 {
 schema.deregister_table(&table)?;
-if table_type == TableType::Base
-&& let Some(lfc) = 
self.runtime_env().cache_manager.get_list_files_cache()
-{
-lfc.drop_table_entries(&Some(table_ref))?;
+if table_type == TableType::Base {
+if let Some(lfc) = 
self.runtime_env().cache_manager.get_list_files_cache()
+{
+lfc.drop_table_entries(&Some(table_ref.clone()))?;
+}
+if let Some(fsc) =
+self.runtime_env().cache_manager.get_file_statistic_cache()
+{

Review Comment:
   See 
https://github.com/apache/datafusion/pull/20047/changes/63fafe41ba315149dad37980a7370f8995e53499



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-11 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3220859959


##
datafusion/core/src/execution/context/mod.rs:
##
@@ -1879,6 +1881,7 @@ impl SessionContext {
 ) -> Result>> {
 let table_ref = table_ref.into();
 let table = table_ref.table().to_owned();
+self.invalidate_caches(&Some(table_ref.clone()))?;

Review Comment:
   This invalidates the caches when `deregister_table()` is called.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-11 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3220859959


##
datafusion/core/src/execution/context/mod.rs:
##
@@ -1879,6 +1881,7 @@ impl SessionContext {
 ) -> Result>> {
 let table_ref = table_ref.into();
 let table = table_ref.table().to_owned();
+self.invalidate_caches(&Some(table_ref.clone()))?;

Review Comment:
   This invalidates the caches when `deregister_table()` is called.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-11 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3220859959


##
datafusion/core/src/execution/context/mod.rs:
##
@@ -1879,6 +1881,7 @@ impl SessionContext {
 ) -> Result>> {
 let table_ref = table_ref.into();
 let table = table_ref.table().to_owned();
+self.invalidate_caches(&Some(table_ref.clone()))?;

Review Comment:
   This invalidates the caches when `deregister_table` is called.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-09 Thread via GitHub


mkleen commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4413607400

   run benchmark clickbench_partitioned
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-09 Thread via GitHub


mkleen commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4413107933

   run benchmark clickbench_partitioned


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-08 Thread via GitHub


kosiew commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3212246881


##
datafusion/core/src/execution/context/mod.rs:
##
@@ -1420,14 +1429,19 @@ impl SessionContext {
 && table_provider.table_type() == table_type
 {
 schema.deregister_table(&table)?;
-if table_type == TableType::Base
-&& let Some(lfc) = 
self.runtime_env().cache_manager.get_list_files_cache()
-{
-lfc.drop_table_entries(&Some(table_ref))?;
+if table_type == TableType::Base {
+if let Some(lfc) = 
self.runtime_env().cache_manager.get_list_files_cache()
+{
+lfc.drop_table_entries(&Some(table_ref.clone()))?;
+}
+if let Some(fsc) =
+self.runtime_env().cache_manager.get_file_statistic_cache()
+{

Review Comment:
   Nice improvement adding cache cleanup through `find_and_deregister()`. One 
thing still missing though is the public `SessionContext::deregister_table()` 
path.
   
   Right now `register_listing_table()` can populate table-scoped list-files 
and file-statistics cache entries, but calling the public `deregister_table()` 
API only removes the provider and leaves those cache entries behind.
   
   That means API users can still end up with stale table-scoped entries 
visible in cache introspection until they age out via LRU eviction.
   
   Could we mirror the same cleanup logic used in `find_and_deregister()` here 
by removing entries from both `get_list_files_cache()` and 
`get_file_statistic_cache()` after a successful deregistration?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-08 Thread via GitHub


kosiew commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3212211721


##
datafusion/sqllogictest/test_files/set_variable.slt:
##
@@ -605,6 +611,15 @@ SHOW datafusion.runtime.max_temp_directory_size
 
 datafusion.runtime.max_temp_directory_size 10G
 
+# Test SET and SHOW runtime.file_statistics_cache_limit
+statement ok
+SET datafusion.runtime.file_statistics_cache_limit = '42M'

Review Comment:
   This SLT block sets `datafusion.runtime.file_statistics_cache_limit` and 
verifies it with `SHOW`, but it does not reset the value afterward.
   
   To avoid order-dependent behavior in later SLT statements, could we add:
   
   ```sql
   statement ok
   RESET datafusion.runtime.file_statistics_cache_limit
   ```
   
   after the `SHOW` verification block?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-08 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3208656616


##
datafusion-cli/src/main.rs:
##
@@ -687,68 +682,17 @@ mod tests {
 .await?;
 }
 
-// When the cache manager creates a StatisticsCache by default,
-// the contents will show up here
-let sql = "SELECT split_part(path, '/', -1) as filename, 
file_size_bytes, num_rows, num_columns, table_size_bytes from 
statistics_cache() order by filename";
+let sql = "SELECT split_part(path, '/', -1) as filename, table, 
file_size_bytes, num_rows, num_columns, table_size_bytes from 
statistics_cache() order by filename";
 let df = ctx.sql(sql).await?;
 let rbs = df.collect().await?;
 assert_snapshot!(batches_to_string(&rbs),@r"
-++
-++
-");
-
-Ok(())
-}
-
-// Can be removed when https://github.com/apache/datafusion/issues/19217 
is resolved
-#[tokio::test]
-async fn test_statistics_cache_override() -> Result<(), DataFusionError> {
-// Install a specific StatisticsCache implementation
-let file_statistics_cache = 
Arc::new(DefaultFileStatisticsCache::default());
-let cache_config = CacheManagerConfig::default()
-.with_files_statistics_cache(Some(file_statistics_cache.clone()));
-let runtime = RuntimeEnvBuilder::new()
-.with_cache_manager(cache_config)
-.build()?;
-let config = SessionConfig::new().with_collect_statistics(true);
-let ctx = SessionContext::new_with_config_rt(config, 
Arc::new(runtime));
-
-ctx.register_udtf(
-"statistics_cache",
-Arc::new(StatisticsCacheFunc::new(
-ctx.task_ctx().runtime_env().cache_manager.clone(),
-)),
-);
-
-for filename in [
-"alltypes_plain",
-"alltypes_tiny_pages",
-"lz4_raw_compressed_larger",
-] {
-ctx.sql(
-format!(
-"create external table {filename}
-stored as parquet
-location '../parquet-testing/data/{filename}.parquet'",
-)
-.as_str(),
-)
-.await?
-.collect()
-.await?;
-}
-
-let sql = "SELECT split_part(path, '/', -1) as filename, 
file_size_bytes, num_rows, num_columns, table_size_bytes from 
statistics_cache() order by filename";
-let df = ctx.sql(sql).await?;
-let rbs = df.collect().await?;
-assert_snapshot!(batches_to_string(&rbs),@r"
-
+---+-+--+-+--+
-| filename  | file_size_bytes | num_rows | 
num_columns | table_size_bytes |
-
+---+-+--+-+--+
-| alltypes_plain.parquet| 1851| Exact(8) | 
11  | Absent   |
-| alltypes_tiny_pages.parquet   | 454233  | Exact(7300)  | 
13  | Absent   |
-| lz4_raw_compressed_larger.parquet | 380836  | Exact(1) | 
1   | Absent   |
-
+---+-+--+-+--+
+  
+---+---+-+--+-+--+
+  | filename  | table | 
file_size_bytes | num_rows | num_columns | table_size_bytes |
+  
+---+---+-+--+-+--+
+  | alltypes_plain.parquet| alltypes_plain| 
1851| Exact(8) | 11  | Absent   |

Review Comment:
   Since the file statistics cache is now tablescoped, this should be reflected 
in the output. This also tests that the table references are always included in 
the key now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-08 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3208656616


##
datafusion-cli/src/main.rs:
##
@@ -687,68 +682,17 @@ mod tests {
 .await?;
 }
 
-// When the cache manager creates a StatisticsCache by default,
-// the contents will show up here
-let sql = "SELECT split_part(path, '/', -1) as filename, 
file_size_bytes, num_rows, num_columns, table_size_bytes from 
statistics_cache() order by filename";
+let sql = "SELECT split_part(path, '/', -1) as filename, table, 
file_size_bytes, num_rows, num_columns, table_size_bytes from 
statistics_cache() order by filename";
 let df = ctx.sql(sql).await?;
 let rbs = df.collect().await?;
 assert_snapshot!(batches_to_string(&rbs),@r"
-++
-++
-");
-
-Ok(())
-}
-
-// Can be removed when https://github.com/apache/datafusion/issues/19217 
is resolved
-#[tokio::test]
-async fn test_statistics_cache_override() -> Result<(), DataFusionError> {
-// Install a specific StatisticsCache implementation
-let file_statistics_cache = 
Arc::new(DefaultFileStatisticsCache::default());
-let cache_config = CacheManagerConfig::default()
-.with_files_statistics_cache(Some(file_statistics_cache.clone()));
-let runtime = RuntimeEnvBuilder::new()
-.with_cache_manager(cache_config)
-.build()?;
-let config = SessionConfig::new().with_collect_statistics(true);
-let ctx = SessionContext::new_with_config_rt(config, 
Arc::new(runtime));
-
-ctx.register_udtf(
-"statistics_cache",
-Arc::new(StatisticsCacheFunc::new(
-ctx.task_ctx().runtime_env().cache_manager.clone(),
-)),
-);
-
-for filename in [
-"alltypes_plain",
-"alltypes_tiny_pages",
-"lz4_raw_compressed_larger",
-] {
-ctx.sql(
-format!(
-"create external table {filename}
-stored as parquet
-location '../parquet-testing/data/{filename}.parquet'",
-)
-.as_str(),
-)
-.await?
-.collect()
-.await?;
-}
-
-let sql = "SELECT split_part(path, '/', -1) as filename, 
file_size_bytes, num_rows, num_columns, table_size_bytes from 
statistics_cache() order by filename";
-let df = ctx.sql(sql).await?;
-let rbs = df.collect().await?;
-assert_snapshot!(batches_to_string(&rbs),@r"
-
+---+-+--+-+--+
-| filename  | file_size_bytes | num_rows | 
num_columns | table_size_bytes |
-
+---+-+--+-+--+
-| alltypes_plain.parquet| 1851| Exact(8) | 
11  | Absent   |
-| alltypes_tiny_pages.parquet   | 454233  | Exact(7300)  | 
13  | Absent   |
-| lz4_raw_compressed_larger.parquet | 380836  | Exact(1) | 
1   | Absent   |
-
+---+-+--+-+--+
+  
+---+---+-+--+-+--+
+  | filename  | table | 
file_size_bytes | num_rows | num_columns | table_size_bytes |
+  
+---+---+-+--+-+--+
+  | alltypes_plain.parquet| alltypes_plain| 
1851| Exact(8) | 11  | Absent   |

Review Comment:
   Since the file statistics cache is now tablescoped, this should be reflected 
in the output. This also tests that the tablereferences are always included.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-08 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3208663879


##
datafusion/catalog-listing/src/helpers.rs:
##
@@ -416,8 +419,15 @@ pub async fn pruned_partition_list<'a>(
 table_path
 );
 
-// if no partition col => simply list all the files
-Ok(objects.map_ok(|object_meta| object_meta.into()).boxed())
+// if no partition col => list all the files
+Ok(objects
+.try_filter_map(|object_meta| {
+futures::future::ready(object_meta_to_partitioned_file(
+object_meta,
+table_path.get_table_ref(),

Review Comment:
   Table-reference needs to be always passed on the `PartitionedFile` because 
file statistics cache is now table-scoped and need the Table-reference for the 
caching for the key.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-08 Thread via GitHub


mkleen commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4406461353

   I also made sure https://github.com/apache/datafusion/pull/22014 is now part 
of this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-08 Thread via GitHub


mkleen commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4406450456

   @nuno-faria @kosiew @martin-g  
https://github.com/apache/datafusion/commit/d09a9195e711abe60d25fdfe3f7b3ddb1d7a78ab
 revealed up for this pr. The were cache entries which had no table reference 
and would therefore not be removed when the table is dropped. This is now 
fixed. Could you do another review please?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-08 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3208663879


##
datafusion/catalog-listing/src/helpers.rs:
##
@@ -416,8 +419,15 @@ pub async fn pruned_partition_list<'a>(
 table_path
 );
 
-// if no partition col => simply list all the files
-Ok(objects.map_ok(|object_meta| object_meta.into()).boxed())
+// if no partition col => list all the files
+Ok(objects
+.try_filter_map(|object_meta| {
+futures::future::ready(object_meta_to_partitioned_file(
+object_meta,
+table_path.get_table_ref(),

Review Comment:
   Table-reference needs to be always passed on the `PartitionedFile` because 
file statistics cache is now table-scoped and need the Table-reference for the 
caching.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-08 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3208656616


##
datafusion-cli/src/main.rs:
##
@@ -687,68 +682,17 @@ mod tests {
 .await?;
 }
 
-// When the cache manager creates a StatisticsCache by default,
-// the contents will show up here
-let sql = "SELECT split_part(path, '/', -1) as filename, 
file_size_bytes, num_rows, num_columns, table_size_bytes from 
statistics_cache() order by filename";
+let sql = "SELECT split_part(path, '/', -1) as filename, table, 
file_size_bytes, num_rows, num_columns, table_size_bytes from 
statistics_cache() order by filename";
 let df = ctx.sql(sql).await?;
 let rbs = df.collect().await?;
 assert_snapshot!(batches_to_string(&rbs),@r"
-++
-++
-");
-
-Ok(())
-}
-
-// Can be removed when https://github.com/apache/datafusion/issues/19217 
is resolved
-#[tokio::test]
-async fn test_statistics_cache_override() -> Result<(), DataFusionError> {
-// Install a specific StatisticsCache implementation
-let file_statistics_cache = 
Arc::new(DefaultFileStatisticsCache::default());
-let cache_config = CacheManagerConfig::default()
-.with_files_statistics_cache(Some(file_statistics_cache.clone()));
-let runtime = RuntimeEnvBuilder::new()
-.with_cache_manager(cache_config)
-.build()?;
-let config = SessionConfig::new().with_collect_statistics(true);
-let ctx = SessionContext::new_with_config_rt(config, 
Arc::new(runtime));
-
-ctx.register_udtf(
-"statistics_cache",
-Arc::new(StatisticsCacheFunc::new(
-ctx.task_ctx().runtime_env().cache_manager.clone(),
-)),
-);
-
-for filename in [
-"alltypes_plain",
-"alltypes_tiny_pages",
-"lz4_raw_compressed_larger",
-] {
-ctx.sql(
-format!(
-"create external table {filename}
-stored as parquet
-location '../parquet-testing/data/{filename}.parquet'",
-)
-.as_str(),
-)
-.await?
-.collect()
-.await?;
-}
-
-let sql = "SELECT split_part(path, '/', -1) as filename, 
file_size_bytes, num_rows, num_columns, table_size_bytes from 
statistics_cache() order by filename";
-let df = ctx.sql(sql).await?;
-let rbs = df.collect().await?;
-assert_snapshot!(batches_to_string(&rbs),@r"
-
+---+-+--+-+--+
-| filename  | file_size_bytes | num_rows | 
num_columns | table_size_bytes |
-
+---+-+--+-+--+
-| alltypes_plain.parquet| 1851| Exact(8) | 
11  | Absent   |
-| alltypes_tiny_pages.parquet   | 454233  | Exact(7300)  | 
13  | Absent   |
-| lz4_raw_compressed_larger.parquet | 380836  | Exact(1) | 
1   | Absent   |
-
+---+-+--+-+--+
+  
+---+---+-+--+-+--+
+  | filename  | table | 
file_size_bytes | num_rows | num_columns | table_size_bytes |
+  
+---+---+-+--+-+--+
+  | alltypes_plain.parquet| alltypes_plain| 
1851| Exact(8) | 11  | Absent   |

Review Comment:
   Since the file statistics cache is now tablescoped, this should be reflected 
in the output.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-06 Thread via GitHub


mkleen commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4388401223

   The filter pushdown does not work as expected anymore when the stats are 
cached. The commit 
https://github.com/apache/datafusion/commit/d09a9195e711abe60d25fdfe3f7b3ddb1d7a78ab
 changes the stats handling for this case. I will look into it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-06 Thread via GitHub


mkleen commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4388347487

   I would say it's a bug which wasn't exposed before.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-06 Thread via GitHub


mkleen commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4388341446

   > We'll probably have to figure out what is causing that failure / if we 
should update the expected output or if there is a bug to fix
   
   Rhe query in the test is executed twice, once for the output and once for 
the plan (EXPLAIN) and now the caching kicks in, which wasn't the case before. 
I will look into the details.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-06 Thread via GitHub


alamb commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4388315179

   We'll probably have to figure out what is causing that failure / if we 
should update the expected output or if there is a bug to fix


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-06 Thread via GitHub


alamb commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4388308017

   > This is related to the this commit 
https://github.com/apache/datafusion/commit/d09a9195e711abe60d25fdfe3f7b3ddb1d7a78ab
 If I revert this commit everything is passing. How should i proceed here?
   
   I recommend merging up from main and hopefully it passes; I'll click the 
button and see what happens


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-06 Thread via GitHub


mkleen commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4388281109

   > @nuno-faria and @kosiew -- shall we merge this PR? Or is it waiting on 
anything else?
   
   @alamb No, there is a blocker. While merging in the master branch i noticed 
on failing test: 
   
   ```
   at 
/Users/mkleen/Test/datafusion/datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt:700
   
   
   2. query result mismatch:
   [SQL] EXPLAIN SELECT b FROM (
 SELECT a, b FROM t_union_mem
 UNION ALL
 SELECT a, b FROM t_union_parquet
   ) WHERE b > 2;
   [Diff] (-expected|+actual)
   logical_plan
   01)Projection: b
   02)--Filter: b > Int64(2)
   03)Union
   04)--Projection: t_union_mem.a, t_union_mem.b
   05)TableScan: t_union_mem
   06)--Projection: t_union_parquet.a, t_union_parquet.b
   -   07)TableScan: t_union_parquet
   -   physical_plan
   -   01)UnionExec
   -   02)--FilterExec: b@0 > 2
   -   03)DataSourceExec: partitions=1, partition_sizes=[1]
   -   04)--DataSourceExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet]]},
 projection=[b], file_type=parquet, predicate=b@1 > 2, 
pruning_predicate=b_null_count@1 != row_count@2 AND b_max@0 > 2, 
required_guarantees=[]
   +   07)TableScan: t_union_parquet
   at 
/Users/mkleen/Test/datafusion/datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt:713
   ```
   
   This is related to the this commit 
https://github.com/apache/datafusion/commit/d09a9195e711abe60d25fdfe3f7b3ddb1d7a78ab
 If revert this commit everything is passing. How should i proceed here? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-06 Thread via GitHub


alamb commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4388041815

   @nuno-faria  and @kosiew  -- shall we merge this PR? Or is it waiting on 
anything else?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-05 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3187757931


##
datafusion/common/src/heap_size.rs:
##
@@ -0,0 +1,561 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::stats::Precision;
+use crate::{ColumnStatistics, ScalarValue, Statistics, TableReference};
+use arrow::array::{
+Array, FixedSizeListArray, LargeListArray, LargeListViewArray, ListArray,
+ListViewArray, MapArray, StructArray,
+};
+use arrow::datatypes::{
+DataType, Field, Fields, IntervalDayTime, IntervalMonthDayNano, 
IntervalUnit,
+TimeUnit, UnionFields, UnionMode, i256,
+};
+use chrono::{DateTime, Utc};
+use half::f16;
+use hashbrown::HashSet;
+use std::collections::HashMap;
+use std::fmt::Debug;
+use std::sync::Arc;
+
+/// This is a temporary solution until 
 and
+///  are resolved.
+/// Trait for calculating the size of various containers
+pub trait DFHeapSize {
+/// Return the size of any bytes allocated on the heap by this object,
+/// including heap memory in those structures
+///
+/// Note that the size of the type itself is not included in the result --
+/// instead, that size is added by the caller (e.g. container).
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize;
+}
+
+#[derive(Default)]
+pub struct DFHeapSizeCtx {
+seen: HashSet,
+}
+
+impl DFHeapSize for Statistics {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.num_rows.heap_size(ctx)
++ self.total_byte_size.heap_size(ctx)
++ self.column_statistics.heap_size(ctx)
+}
+}
+
+impl DFHeapSize for TableReference {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+match self {
+TableReference::Bare { table } => table.heap_size(ctx),
+TableReference::Partial { schema, table } => {
+schema.heap_size(ctx) + table.heap_size(ctx)
+}
+TableReference::Full {
+catalog,
+schema,
+table,
+} => catalog.heap_size(ctx) + schema.heap_size(ctx) + 
table.heap_size(ctx),
+}
+}
+}
+
+impl DFHeapSize
+for Precision
+{
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.get_value().map_or_else(|| 0, |v| v.heap_size(ctx))
+}
+}
+
+impl DFHeapSize for ColumnStatistics {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.null_count.heap_size(ctx)
++ self.max_value.heap_size(ctx)
++ self.min_value.heap_size(ctx)
++ self.sum_value.heap_size(ctx)
++ self.distinct_count.heap_size(ctx)
++ self.byte_size.heap_size(ctx)
+}
+}
+
+impl DFHeapSize for ScalarValue {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+use crate::scalar::ScalarValue::*;
+match self {
+Null => 0,
+Boolean(b) => b.heap_size(ctx),
+Float16(f) => f.heap_size(ctx),
+Float32(f) => f.heap_size(ctx),
+Float64(f) => f.heap_size(ctx),
+Decimal32(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) + 
c.heap_size(ctx),
+Decimal64(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) + 
c.heap_size(ctx),
+Decimal128(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) + 
c.heap_size(ctx),
+Decimal256(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) + 
c.heap_size(ctx),
+Int8(i) => i.heap_size(ctx),
+Int16(i) => i.heap_size(ctx),
+Int32(i) => i.heap_size(ctx),
+Int64(i) => i.heap_size(ctx),
+UInt8(u) => u.heap_size(ctx),
+UInt16(u) => u.heap_size(ctx),
+UInt32(u) => u.heap_size(ctx),
+UInt64(u) => u.heap_size(ctx),
+Utf8(u) => u.heap_size(ctx),
+Utf8View(u) => u.heap_size(ctx),
+LargeUtf8(l) => l.heap_size(ctx),
+Binary(b) => b.heap_size(ctx),
+BinaryView(b) => b.heap_size(ctx),
+FixedSizeBinary(a, b) => a.heap_size(ctx) + b.heap_size(ctx),
+Large

Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-05 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3187716765


##
datafusion/common/src/heap_size.rs:
##
@@ -0,0 +1,561 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::stats::Precision;
+use crate::{ColumnStatistics, ScalarValue, Statistics, TableReference};
+use arrow::array::{
+Array, FixedSizeListArray, LargeListArray, LargeListViewArray, ListArray,
+ListViewArray, MapArray, StructArray,
+};
+use arrow::datatypes::{
+DataType, Field, Fields, IntervalDayTime, IntervalMonthDayNano, 
IntervalUnit,
+TimeUnit, UnionFields, UnionMode, i256,
+};
+use chrono::{DateTime, Utc};
+use half::f16;
+use hashbrown::HashSet;
+use std::collections::HashMap;
+use std::fmt::Debug;
+use std::sync::Arc;
+
+/// This is a temporary solution until 
 and
+///  are resolved.
+/// Trait for calculating the size of various containers
+pub trait DFHeapSize {
+/// Return the size of any bytes allocated on the heap by this object,
+/// including heap memory in those structures
+///
+/// Note that the size of the type itself is not included in the result --
+/// instead, that size is added by the caller (e.g. container).
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize;
+}
+
+#[derive(Default)]
+pub struct DFHeapSizeCtx {
+seen: HashSet,
+}
+
+impl DFHeapSize for Statistics {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.num_rows.heap_size(ctx)
++ self.total_byte_size.heap_size(ctx)
++ self.column_statistics.heap_size(ctx)
+}
+}
+
+impl DFHeapSize for TableReference {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+match self {
+TableReference::Bare { table } => table.heap_size(ctx),
+TableReference::Partial { schema, table } => {
+schema.heap_size(ctx) + table.heap_size(ctx)
+}
+TableReference::Full {
+catalog,
+schema,
+table,
+} => catalog.heap_size(ctx) + schema.heap_size(ctx) + 
table.heap_size(ctx),
+}
+}
+}
+
+impl DFHeapSize
+for Precision
+{
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.get_value().map_or_else(|| 0, |v| v.heap_size(ctx))
+}
+}
+
+impl DFHeapSize for ColumnStatistics {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.null_count.heap_size(ctx)
++ self.max_value.heap_size(ctx)
++ self.min_value.heap_size(ctx)
++ self.sum_value.heap_size(ctx)
++ self.distinct_count.heap_size(ctx)
++ self.byte_size.heap_size(ctx)
+}
+}
+
+impl DFHeapSize for ScalarValue {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+use crate::scalar::ScalarValue::*;
+match self {
+Null => 0,
+Boolean(b) => b.heap_size(ctx),
+Float16(f) => f.heap_size(ctx),
+Float32(f) => f.heap_size(ctx),
+Float64(f) => f.heap_size(ctx),
+Decimal32(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) + 
c.heap_size(ctx),
+Decimal64(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) + 
c.heap_size(ctx),
+Decimal128(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) + 
c.heap_size(ctx),
+Decimal256(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) + 
c.heap_size(ctx),
+Int8(i) => i.heap_size(ctx),
+Int16(i) => i.heap_size(ctx),
+Int32(i) => i.heap_size(ctx),
+Int64(i) => i.heap_size(ctx),
+UInt8(u) => u.heap_size(ctx),
+UInt16(u) => u.heap_size(ctx),
+UInt32(u) => u.heap_size(ctx),
+UInt64(u) => u.heap_size(ctx),
+Utf8(u) => u.heap_size(ctx),
+Utf8View(u) => u.heap_size(ctx),
+LargeUtf8(l) => l.heap_size(ctx),
+Binary(b) => b.heap_size(ctx),
+BinaryView(b) => b.heap_size(ctx),
+FixedSizeBinary(a, b) => a.heap_size(ctx) + b.heap_size(ctx),
+Large

Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-05 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3186723956


##
docs/source/library-user-guide/upgrading/54.0.0.md:
##
@@ -460,3 +460,42 @@ impl Default for MyTreeNode {
 }
 }
 ```
+
+[20047]: https://github.com/apache/datafusion/pull/20047
+
+### File statistics cache is now memory-limited and managed by the 
`CacheManager`
+
+The file statistics cache used by `ListingTable` is now memory-limited and
+centrally managed through the `CacheManager`.
+
+To configure the cache size use the `file_statistics_cache_limit` setting:
+
+```sql
+SET datafusion.runtime.file_statistics_cache_limit = '10MB'
+```
+
+To disable the file statistics cache, set the limit to 0.
+
+The file statistics cache is no longer created inside the `ListingTable`.
+Instead, it is created within the `CacheManager` and must be passed to the 
`ListingTable`.
+
+**Who is affected:**
+
+- Users who want to limit the memory usage of the file statistics cache.
+- Users who want to disable the file statistics cache.
+- Users who want to create a `ListingTable` programmatically with a file 
statistics cache.
+
+**Migration guide:**
+
+Disable the cache by setting the configuration value to 0:

Review Comment:
   See https://github.com/apache/datafusion/pull/22014



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-05 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3187716765


##
datafusion/common/src/heap_size.rs:
##
@@ -0,0 +1,561 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::stats::Precision;
+use crate::{ColumnStatistics, ScalarValue, Statistics, TableReference};
+use arrow::array::{
+Array, FixedSizeListArray, LargeListArray, LargeListViewArray, ListArray,
+ListViewArray, MapArray, StructArray,
+};
+use arrow::datatypes::{
+DataType, Field, Fields, IntervalDayTime, IntervalMonthDayNano, 
IntervalUnit,
+TimeUnit, UnionFields, UnionMode, i256,
+};
+use chrono::{DateTime, Utc};
+use half::f16;
+use hashbrown::HashSet;
+use std::collections::HashMap;
+use std::fmt::Debug;
+use std::sync::Arc;
+
+/// This is a temporary solution until 
 and
+///  are resolved.
+/// Trait for calculating the size of various containers
+pub trait DFHeapSize {
+/// Return the size of any bytes allocated on the heap by this object,
+/// including heap memory in those structures
+///
+/// Note that the size of the type itself is not included in the result --
+/// instead, that size is added by the caller (e.g. container).
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize;
+}
+
+#[derive(Default)]
+pub struct DFHeapSizeCtx {
+seen: HashSet,
+}
+
+impl DFHeapSize for Statistics {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.num_rows.heap_size(ctx)
++ self.total_byte_size.heap_size(ctx)
++ self.column_statistics.heap_size(ctx)
+}
+}
+
+impl DFHeapSize for TableReference {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+match self {
+TableReference::Bare { table } => table.heap_size(ctx),
+TableReference::Partial { schema, table } => {
+schema.heap_size(ctx) + table.heap_size(ctx)
+}
+TableReference::Full {
+catalog,
+schema,
+table,
+} => catalog.heap_size(ctx) + schema.heap_size(ctx) + 
table.heap_size(ctx),
+}
+}
+}
+
+impl DFHeapSize
+for Precision
+{
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.get_value().map_or_else(|| 0, |v| v.heap_size(ctx))
+}
+}
+
+impl DFHeapSize for ColumnStatistics {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.null_count.heap_size(ctx)
++ self.max_value.heap_size(ctx)
++ self.min_value.heap_size(ctx)
++ self.sum_value.heap_size(ctx)
++ self.distinct_count.heap_size(ctx)
++ self.byte_size.heap_size(ctx)
+}
+}
+
+impl DFHeapSize for ScalarValue {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+use crate::scalar::ScalarValue::*;
+match self {
+Null => 0,
+Boolean(b) => b.heap_size(ctx),
+Float16(f) => f.heap_size(ctx),
+Float32(f) => f.heap_size(ctx),
+Float64(f) => f.heap_size(ctx),
+Decimal32(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) + 
c.heap_size(ctx),
+Decimal64(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) + 
c.heap_size(ctx),
+Decimal128(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) + 
c.heap_size(ctx),
+Decimal256(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) + 
c.heap_size(ctx),
+Int8(i) => i.heap_size(ctx),
+Int16(i) => i.heap_size(ctx),
+Int32(i) => i.heap_size(ctx),
+Int64(i) => i.heap_size(ctx),
+UInt8(u) => u.heap_size(ctx),
+UInt16(u) => u.heap_size(ctx),
+UInt32(u) => u.heap_size(ctx),
+UInt64(u) => u.heap_size(ctx),
+Utf8(u) => u.heap_size(ctx),
+Utf8View(u) => u.heap_size(ctx),
+LargeUtf8(l) => l.heap_size(ctx),
+Binary(b) => b.heap_size(ctx),
+BinaryView(b) => b.heap_size(ctx),
+FixedSizeBinary(a, b) => a.heap_size(ctx) + b.heap_size(ctx),
+Large

Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-05 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3187716765


##
datafusion/common/src/heap_size.rs:
##
@@ -0,0 +1,561 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::stats::Precision;
+use crate::{ColumnStatistics, ScalarValue, Statistics, TableReference};
+use arrow::array::{
+Array, FixedSizeListArray, LargeListArray, LargeListViewArray, ListArray,
+ListViewArray, MapArray, StructArray,
+};
+use arrow::datatypes::{
+DataType, Field, Fields, IntervalDayTime, IntervalMonthDayNano, 
IntervalUnit,
+TimeUnit, UnionFields, UnionMode, i256,
+};
+use chrono::{DateTime, Utc};
+use half::f16;
+use hashbrown::HashSet;
+use std::collections::HashMap;
+use std::fmt::Debug;
+use std::sync::Arc;
+
+/// This is a temporary solution until 
 and
+///  are resolved.
+/// Trait for calculating the size of various containers
+pub trait DFHeapSize {
+/// Return the size of any bytes allocated on the heap by this object,
+/// including heap memory in those structures
+///
+/// Note that the size of the type itself is not included in the result --
+/// instead, that size is added by the caller (e.g. container).
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize;
+}
+
+#[derive(Default)]
+pub struct DFHeapSizeCtx {
+seen: HashSet,
+}
+
+impl DFHeapSize for Statistics {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.num_rows.heap_size(ctx)
++ self.total_byte_size.heap_size(ctx)
++ self.column_statistics.heap_size(ctx)
+}
+}
+
+impl DFHeapSize for TableReference {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+match self {
+TableReference::Bare { table } => table.heap_size(ctx),
+TableReference::Partial { schema, table } => {
+schema.heap_size(ctx) + table.heap_size(ctx)
+}
+TableReference::Full {
+catalog,
+schema,
+table,
+} => catalog.heap_size(ctx) + schema.heap_size(ctx) + 
table.heap_size(ctx),
+}
+}
+}
+
+impl DFHeapSize
+for Precision
+{
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.get_value().map_or_else(|| 0, |v| v.heap_size(ctx))
+}
+}
+
+impl DFHeapSize for ColumnStatistics {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.null_count.heap_size(ctx)
++ self.max_value.heap_size(ctx)
++ self.min_value.heap_size(ctx)
++ self.sum_value.heap_size(ctx)
++ self.distinct_count.heap_size(ctx)
++ self.byte_size.heap_size(ctx)
+}
+}
+
+impl DFHeapSize for ScalarValue {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+use crate::scalar::ScalarValue::*;
+match self {
+Null => 0,
+Boolean(b) => b.heap_size(ctx),
+Float16(f) => f.heap_size(ctx),
+Float32(f) => f.heap_size(ctx),
+Float64(f) => f.heap_size(ctx),
+Decimal32(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) + 
c.heap_size(ctx),
+Decimal64(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) + 
c.heap_size(ctx),
+Decimal128(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) + 
c.heap_size(ctx),
+Decimal256(a, b, c) => a.heap_size(ctx) + b.heap_size(ctx) + 
c.heap_size(ctx),
+Int8(i) => i.heap_size(ctx),
+Int16(i) => i.heap_size(ctx),
+Int32(i) => i.heap_size(ctx),
+Int64(i) => i.heap_size(ctx),
+UInt8(u) => u.heap_size(ctx),
+UInt16(u) => u.heap_size(ctx),
+UInt32(u) => u.heap_size(ctx),
+UInt64(u) => u.heap_size(ctx),
+Utf8(u) => u.heap_size(ctx),
+Utf8View(u) => u.heap_size(ctx),
+LargeUtf8(l) => l.heap_size(ctx),
+Binary(b) => b.heap_size(ctx),
+BinaryView(b) => b.heap_size(ctx),
+FixedSizeBinary(a, b) => a.heap_size(ctx) + b.heap_size(ctx),
+Large

Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-05 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3186723956


##
docs/source/library-user-guide/upgrading/54.0.0.md:
##
@@ -460,3 +460,42 @@ impl Default for MyTreeNode {
 }
 }
 ```
+
+[20047]: https://github.com/apache/datafusion/pull/20047
+
+### File statistics cache is now memory-limited and managed by the 
`CacheManager`
+
+The file statistics cache used by `ListingTable` is now memory-limited and
+centrally managed through the `CacheManager`.
+
+To configure the cache size use the `file_statistics_cache_limit` setting:
+
+```sql
+SET datafusion.runtime.file_statistics_cache_limit = '10MB'
+```
+
+To disable the file statistics cache, set the limit to 0.
+
+The file statistics cache is no longer created inside the `ListingTable`.
+Instead, it is created within the `CacheManager` and must be passed to the 
`ListingTable`.
+
+**Who is affected:**
+
+- Users who want to limit the memory usage of the file statistics cache.
+- Users who want to disable the file statistics cache.
+- Users who want to create a `ListingTable` programmatically with a file 
statistics cache.
+
+**Migration guide:**
+
+Disable the cache by setting the configuration value to 0:

Review Comment:
   https://github.com/apache/datafusion/pull/22014



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-05-04 Thread via GitHub


martin-g commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3181769626


##
docs/source/library-user-guide/upgrading/54.0.0.md:
##
@@ -460,3 +460,42 @@ impl Default for MyTreeNode {
 }
 }
 ```
+
+[20047]: https://github.com/apache/datafusion/pull/20047
+
+### File statistics cache is now memory-limited and managed by the 
`CacheManager`
+
+The file statistics cache used by `ListingTable` is now memory-limited and
+centrally managed through the `CacheManager`.
+
+To configure the cache size use the `file_statistics_cache_limit` setting:
+
+```sql
+SET datafusion.runtime.file_statistics_cache_limit = '10MB'

Review Comment:
   ```suggestion
   SET datafusion.runtime.file_statistics_cache_limit = '10M'
   ```



##
docs/source/library-user-guide/upgrading/54.0.0.md:
##
@@ -460,3 +460,42 @@ impl Default for MyTreeNode {
 }
 }
 ```
+
+[20047]: https://github.com/apache/datafusion/pull/20047
+
+### File statistics cache is now memory-limited and managed by the 
`CacheManager`
+
+The file statistics cache used by `ListingTable` is now memory-limited and
+centrally managed through the `CacheManager`.
+
+To configure the cache size use the `file_statistics_cache_limit` setting:
+
+```sql
+SET datafusion.runtime.file_statistics_cache_limit = '10MB'
+```
+
+To disable the file statistics cache, set the limit to 0.
+
+The file statistics cache is no longer created inside the `ListingTable`.
+Instead, it is created within the `CacheManager` and must be passed to the 
`ListingTable`.
+
+**Who is affected:**
+
+- Users who want to limit the memory usage of the file statistics cache.
+- Users who want to disable the file statistics cache.
+- Users who want to create a `ListingTable` programmatically with a file 
statistics cache.
+
+**Migration guide:**
+
+Disable the cache by setting the configuration value to 0:
+
+```sql
+SET datafusion.runtime.file_statistics_cache_limit = '0k'

Review Comment:
   ```suggestion
   SET datafusion.runtime.file_statistics_cache_limit = '0K'
   ```



##
datafusion/common/src/heap_size.rs:
##
@@ -0,0 +1,561 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::stats::Precision;
+use crate::{ColumnStatistics, ScalarValue, Statistics, TableReference};
+use arrow::array::{
+Array, FixedSizeListArray, LargeListArray, LargeListViewArray, ListArray,
+ListViewArray, MapArray, StructArray,
+};
+use arrow::datatypes::{
+DataType, Field, Fields, IntervalDayTime, IntervalMonthDayNano, 
IntervalUnit,
+TimeUnit, UnionFields, UnionMode, i256,
+};
+use chrono::{DateTime, Utc};
+use half::f16;
+use hashbrown::HashSet;
+use std::collections::HashMap;
+use std::fmt::Debug;
+use std::sync::Arc;
+
+/// This is a temporary solution until 
 and
+///  are resolved.
+/// Trait for calculating the size of various containers
+pub trait DFHeapSize {
+/// Return the size of any bytes allocated on the heap by this object,
+/// including heap memory in those structures
+///
+/// Note that the size of the type itself is not included in the result --
+/// instead, that size is added by the caller (e.g. container).
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize;
+}
+
+#[derive(Default)]
+pub struct DFHeapSizeCtx {
+seen: HashSet,
+}
+
+impl DFHeapSize for Statistics {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.num_rows.heap_size(ctx)
++ self.total_byte_size.heap_size(ctx)
++ self.column_statistics.heap_size(ctx)
+}
+}
+
+impl DFHeapSize for TableReference {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+match self {
+TableReference::Bare { table } => table.heap_size(ctx),
+TableReference::Partial { schema, table } => {
+schema.heap_size(ctx) + table.heap_size(ctx)
+}
+TableReference::Full {
+catalog,
+schema,
+table,
+} => catalog.heap_size(ctx) + schema.heap_size(ctx) + 
table.heap_size(ctx),
+}
+ 

Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-30 Thread via GitHub


github-actions[bot] commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4351882236

   
   Thank you for opening this pull request!
   
   Reviewer note: 
[cargo-semver-checks](https://github.com/obi1kenobi/cargo-semver-checks) 
reported the current version number is not SemVer-compatible with the changes 
in this pull request (compared against the base branch).
   
   
   Details
   
   ```
Cloning origin/main
   Building datafusion v53.1.0 (current)
  Built [  82.893s] (current)
Parsing datafusion v53.1.0 (current)
 Parsed [   0.033s] (current)
   Building datafusion v53.1.0 (baseline)
  Built [  81.011s] (baseline)
Parsing datafusion v53.1.0 (baseline)
 Parsed [   0.033s] (baseline)
   Checking datafusion v53.1.0 -> v53.1.0 (no change; assume patch)
Checked [   0.750s] 222 checks: 222 pass, 30 skip
Summary no semver update required
   Finished [ 167.572s] datafusion
   Building datafusion-catalog-listing v53.1.0 (current)
  Built [  41.429s] (current)
Parsing datafusion-catalog-listing v53.1.0 (current)
 Parsed [   0.011s] (current)
   Building datafusion-catalog-listing v53.1.0 (baseline)
  Built [  41.688s] (baseline)
Parsing datafusion-catalog-listing v53.1.0 (baseline)
 Parsed [   0.011s] (baseline)
   Checking datafusion-catalog-listing v53.1.0 -> v53.1.0 (no change; 
assume patch)
Checked [   0.103s] 222 checks: 222 pass, 30 skip
Summary no semver update required
   Finished [  85.122s] datafusion-catalog-listing
   Building datafusion-cli v53.1.0 (current)
  Built [ 138.721s] (current)
Parsing datafusion-cli v53.1.0 (current)
 Parsed [   0.031s] (current)
   Building datafusion-cli v53.1.0 (baseline)
  Built [ 137.963s] (baseline)
Parsing datafusion-cli v53.1.0 (baseline)
 Parsed [   0.031s] (baseline)
   Checking datafusion-cli v53.1.0 -> v53.1.0 (no change; assume patch)
Checked [   0.145s] 222 checks: 222 pass, 30 skip
Summary no semver update required
   Finished [ 280.601s] datafusion-cli
   Building datafusion-common v53.1.0 (current)
   error: running cargo-doc on crate 'datafusion-common' failed with output:
   -
  Compiling proc-macro2 v1.0.106
  Compiling unicode-ident v1.0.24
  Compiling quote v1.0.45
  Compiling libc v0.2.186
  Compiling libm v0.2.16
  Compiling autocfg v1.5.0
   Checking cfg-if v1.0.4
  Compiling num-traits v0.2.19
   Checking memchr v2.8.0
  Compiling syn v2.0.117
  Compiling shlex v1.3.0
  Compiling find-msvc-tools v0.1.9
  Compiling serde_core v1.0.228
  Compiling zerocopy v0.8.48
   Checking itoa v1.0.18
   Checking bytes v1.11.1
  Compiling zmij v1.0.21
  Compiling jobserver v0.1.34
  Compiling serde_json v1.0.149
  Compiling cc v1.2.61
   Checking num-integer v0.1.46
   Checking siphasher v1.0.2
  Compiling getrandom v0.3.4
  Compiling version_check v0.9.5
   Checking iana-time-zone v0.1.65
   Checking stable_deref_trait v1.2.1
   Checking chrono v0.4.44
  Compiling ahash v0.8.12
   Checking phf_shared v0.12.1
   Checking num-bigint v0.4.6
  Compiling chrono-tz v0.10.4
   Checking phf v0.12.1
   Checking once_cell v1.21.4
  Compiling synstructure v0.13.2
   Checking arrow-schema v58.1.0
   Checking num-complex v0.4.6
   Checking hashbrown v0.16.1
   Checking writeable v0.6.3
   Checking litemap v0.8.2
   Checking lexical-util v1.0.7
  Compiling pkg-config v0.3.33
  Compiling icu_properties_data v2.2.0
  Compiling icu_normalizer_data v2.2.0
  Compiling zerocopy-derive v0.8.48
  Compiling zerofrom-derive v0.1.7
  Compiling yoke-derive v0.8.2
  Compiling zerovec-derive v0.11.3
   Checking zerofrom v0.1.7
   Checking yoke v0.8.2
  Compiling displaydoc v0.2.5
   Checking zerotrie v0.2.4
  Compiling zstd-sys v2.0.16+zstd.1.5.7
   Checking zerovec v0.11.6
   Checking utf8_iter v1.0.4
  Compiling object v0.37.3
   Checking smallvec v1.15.1
   Checking tinystr v0.8.3
   Checking potential_utf v0.1.5
   Checking icu_collections v2.2.0
   Checking icu_locale_core v2.2.0
   Checking icu_provider v2.2.0
  Compiling semver v1.0.28
  Compiling rustc_version v0.4.1
   Checking lexical-write-integer v1.0.6
   Checking lexical-parse-integer v1.0.6
  Compiling zstd-safe v7.2.4
   Checking lexical-parse-float v1.0.6
   Checking lexical-write-float v1.0.6
   Checking icu_normalizer v2.2.0
   Checking half v2.7.1
   Checking arrow-buffer v58.1.0
   Checking icu_properties v2.2.0
   Checking arrow-data v58.1.0
   Checking arrow-array v58.1.0
  Compiling flatbuffers v25.12.19
   Checking aho-cora

Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-30 Thread via GitHub


mkleen commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4351783039

   > hi @mkleen Can you merge with latest `main` to get #21913 which would fix 
the [CI 
error](https://github.com/apache/datafusion/actions/runs/25147830630/job/73712482487?pr=20047)
   
   Done


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-30 Thread via GitHub


kosiew commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4351739294

   hi @mkleen 
   Can you merge with latest `main` to get #21913 which would fix the [CI 
error](https://github.com/apache/datafusion/actions/runs/25147830630/job/73712482487?pr=20047)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-29 Thread via GitHub


mkleen commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4342271621

   @kosiew Thanks a lot, for all these excellent review, this was really a 
fantastic job. I made one more iteration adressing the last outstanding issue 
and I hope this pr is now ready-to-go.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-29 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3159660521


##
datafusion/execution/src/cache/file_statistics_cache.rs:
##
@@ -0,0 +1,745 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::cache::cache_manager::{
+CachedFileMetadata, FileStatisticsCache, FileStatisticsCacheEntry,
+};
+use crate::cache::{CacheAccessor, TableScopedPath};
+use std::collections::HashMap;
+use std::sync::Mutex;
+
+pub use crate::cache::DefaultFilesMetadataCache;
+use crate::cache::lru_queue::LruQueue;
+use datafusion_common::TableReference;
+use datafusion_common::heap_size::{DFHeapSize, DFHeapSizeCtx};
+
+/// Default implementation of [`FileStatisticsCache`]
+///
+/// Stores cached file metadata (statistics and orderings) for files.
+///
+/// The typical usage pattern is:
+/// 1. Call `get(path)` to check for cached value
+/// 2. If `Some(cached)`, validate with `cached.is_valid_for(Β€t_meta)`
+/// 3. If invalid or missing, compute new value and call `put(path, new_value)`
+///
+/// # Internal details
+///
+/// The `memory_limit` controls the maximum size of the cache, which uses a
+/// Least Recently Used eviction algorithm. When adding a new entry, if the 
total
+/// size of the cached entries exceeds `memory_limit`, the least recently used 
entries
+/// are evicted until the total size is lower than `memory_limit`.
+///
+///
+/// [`FileStatisticsCache`]: crate::cache::cache_manager::FileStatisticsCache
+#[derive(Default)]
+pub struct DefaultFileStatisticsCache {
+state: Mutex,
+}
+
+impl DefaultFileStatisticsCache {
+pub fn new(memory_limit: usize) -> Self {
+Self {
+state: 
Mutex::new(DefaultFileStatisticsCacheState::new(memory_limit)),
+}
+}
+
+/// Returns the size of the cached memory, in bytes.
+pub fn memory_used(&self) -> usize {
+let state = self.state.lock().unwrap();
+state.memory_used
+}
+}
+
+struct DefaultFileStatisticsCacheState {
+lru_queue: LruQueue,
+memory_limit: usize,
+memory_used: usize,
+}
+
+pub const DEFAULT_FILE_STATISTICS_MEMORY_LIMIT: usize = 20 * 1024 * 1024; // 
20MiB
+
+impl Default for DefaultFileStatisticsCacheState {
+fn default() -> Self {
+Self {
+lru_queue: LruQueue::new(),
+memory_limit: DEFAULT_FILE_STATISTICS_MEMORY_LIMIT,
+memory_used: 0,
+}
+}
+}
+
+impl DefaultFileStatisticsCacheState {
+fn new(memory_limit: usize) -> Self {
+Self {
+lru_queue: LruQueue::new(),
+memory_limit,
+memory_used: 0,
+}
+}
+fn get(&mut self, key: &TableScopedPath) -> Option {
+self.lru_queue.get(key).cloned()
+}
+
+fn put(
+&mut self,
+key: &TableScopedPath,
+value: CachedFileMetadata,
+) -> Option {
+let mut ctx = DFHeapSizeCtx::default();

Review Comment:
   Thank you, i made one more iteration to streamline the memory accounting for 
inserting.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-29 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3159664174


##
datafusion/common/src/heap_size.rs:
##
@@ -0,0 +1,551 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::stats::Precision;
+use crate::{ColumnStatistics, ScalarValue, Statistics, TableReference};
+use arrow::array::{
+Array, FixedSizeListArray, LargeListArray, LargeListViewArray, ListArray,
+ListViewArray, MapArray, StructArray,
+};
+use arrow::datatypes::{
+DataType, Field, Fields, IntervalDayTime, IntervalMonthDayNano, 
IntervalUnit,
+TimeUnit, UnionFields, UnionMode, i256,
+};
+use chrono::{DateTime, Utc};
+use half::f16;
+use hashbrown::HashSet;
+use std::collections::HashMap;
+use std::fmt::Debug;
+use std::sync::Arc;
+
+/// This is a temporary solution until 
 and
+///  are resolved.
+/// Trait for calculating the size of various containers
+pub trait DFHeapSize {
+/// Return the size of any bytes allocated on the heap by this object,
+/// including heap memory in those structures
+///
+/// Note that the size of the type itself is not included in the result --
+/// instead, that size is added by the caller (e.g. container).
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize;
+}
+
+#[derive(Default)]
+pub struct DFHeapSizeCtx {
+seen: HashSet,
+}
+
+impl DFHeapSize for Statistics {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.num_rows.heap_size(ctx)
++ self.total_byte_size.heap_size(ctx)
++ self.column_statistics.heap_size(ctx)
+}
+}
+
+impl DFHeapSize for TableReference {
+fn heap_size(&self, ctx: &mut DFHeapSizeCtx) -> usize {
+self.table().heap_size(ctx)

Review Comment:
   This was a good catch, i improved the heapsize estimation. It was actually 
really off.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-28 Thread via GitHub


kosiew commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4341379343

   I ran the recommended `clickbench_partitioned.json` benchmark comparing HEAD 
with `file-stats-cache`. The total runtime was effectively unchanged: 
`20045.25ms` on HEAD vs `20005.27ms` with the PR. 
   
   Most queries showed no change, with 4 reported faster and 2 reported slower. 
   The two slower queries were very small absolute regressions on short-running 
queries and had relatively high variance, so I don’t think they indicate a 
meaningful regression. 
   Based on this, I don’t see evidence that further performance investigation 
is needed before merging.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-28 Thread via GitHub


adriangbot commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4340870470

   πŸ€– Benchmark completed (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4340783583)
   
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB)
   
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Details
   
   
   ```
   Comparing HEAD and file-stats-cache
   
   Benchmark clickbench_partitioned.json
   
   
┏━━━┳━━━┳━━━┳━━━┓
   ┃ Query ┃  HEAD ┃  
file-stats-cache ┃Change ┃
   
┑━━━╇━━━╇━━━╇━━━┩
   β”‚ QQuery 0  β”‚  1.20 / 4.59 Β±6.72 / 18.03 ms β”‚  1.25 / 4.74 
Β±6.87 / 18.47 ms β”‚ no change β”‚
   β”‚ QQuery 1  β”‚12.57 / 12.84 Β±0.18 / 13.11 ms β”‚12.06 / 12.46 
Β±0.20 / 12.60 ms β”‚ no change β”‚
   β”‚ QQuery 2  β”‚36.40 / 36.87 Β±0.32 / 37.33 ms β”‚36.41 / 36.58 
Β±0.26 / 37.09 ms β”‚ no change β”‚
   β”‚ QQuery 3  β”‚31.68 / 32.60 Β±0.73 / 33.90 ms β”‚30.88 / 31.57 
Β±0.49 / 32.32 ms β”‚ no change β”‚
   β”‚ QQuery 4  β”‚ 240.46 / 244.16 Β±3.00 / 248.49 ms β”‚ 241.65 / 243.43 
Β±1.70 / 246.27 ms β”‚ no change β”‚
   β”‚ QQuery 5  β”‚ 282.44 / 284.79 Β±2.09 / 287.37 ms β”‚ 281.03 / 283.28 
Β±1.21 / 284.50 ms β”‚ no change β”‚
   β”‚ QQuery 6  β”‚   7.14 / 7.38 Β±0.18 / 7.67 ms β”‚   6.84 / 7.15 
Β±0.31 / 7.72 ms β”‚ no change β”‚
   β”‚ QQuery 7  β”‚13.73 / 13.83 Β±0.09 / 13.97 ms β”‚13.53 / 15.53 
Β±3.92 / 23.37 ms β”‚  1.12x slower β”‚
   β”‚ QQuery 8  β”‚ 328.30 / 330.07 Β±1.61 / 332.38 ms β”‚ 328.45 / 332.13 
Β±2.78 / 335.14 ms β”‚ no change β”‚
   β”‚ QQuery 9  β”‚ 443.05 / 450.06 Β±8.21 / 465.68 ms β”‚430.37 / 447.80 
Β±10.49 / 462.69 ms β”‚ no change β”‚
   β”‚ QQuery 10 β”‚74.68 / 76.06 Β±1.08 / 77.86 ms β”‚72.95 / 73.83 
Β±0.83 / 75.18 ms β”‚ no change β”‚
   β”‚ QQuery 11 β”‚85.29 / 87.00 Β±1.64 / 89.50 ms β”‚83.83 / 84.81 
Β±0.85 / 86.37 ms β”‚ no change β”‚
   β”‚ QQuery 12 β”‚ 272.49 / 276.98 Β±4.74 / 283.90 ms β”‚ 274.31 / 277.15 
Β±2.31 / 280.87 ms β”‚ no change β”‚
   β”‚ QQuery 13 β”‚ 394.14 / 402.44 Β±5.78 / 409.35 ms β”‚ 396.80 / 404.30 
Β±6.96 / 415.39 ms β”‚ no change β”‚
   β”‚ QQuery 14 β”‚ 284.33 / 291.75 Β±7.44 / 304.65 ms β”‚ 285.16 / 286.64 
Β±1.49 / 289.31 ms β”‚ no change β”‚
   β”‚ QQuery 15 β”‚ 

Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-28 Thread via GitHub


kosiew commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3158625956


##
datafusion/execution/src/cache/file_statistics_cache.rs:
##
@@ -0,0 +1,745 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::cache::cache_manager::{
+CachedFileMetadata, FileStatisticsCache, FileStatisticsCacheEntry,
+};
+use crate::cache::{CacheAccessor, TableScopedPath};
+use std::collections::HashMap;
+use std::sync::Mutex;
+
+pub use crate::cache::DefaultFilesMetadataCache;
+use crate::cache::lru_queue::LruQueue;
+use datafusion_common::TableReference;
+use datafusion_common::heap_size::{DFHeapSize, DFHeapSizeCtx};
+
+/// Default implementation of [`FileStatisticsCache`]
+///
+/// Stores cached file metadata (statistics and orderings) for files.
+///
+/// The typical usage pattern is:
+/// 1. Call `get(path)` to check for cached value
+/// 2. If `Some(cached)`, validate with `cached.is_valid_for(Β€t_meta)`
+/// 3. If invalid or missing, compute new value and call `put(path, new_value)`
+///
+/// # Internal details
+///
+/// The `memory_limit` controls the maximum size of the cache, which uses a
+/// Least Recently Used eviction algorithm. When adding a new entry, if the 
total
+/// size of the cached entries exceeds `memory_limit`, the least recently used 
entries
+/// are evicted until the total size is lower than `memory_limit`.
+///
+///
+/// [`FileStatisticsCache`]: crate::cache::cache_manager::FileStatisticsCache
+#[derive(Default)]
+pub struct DefaultFileStatisticsCache {
+state: Mutex,
+}
+
+impl DefaultFileStatisticsCache {
+pub fn new(memory_limit: usize) -> Self {
+Self {
+state: 
Mutex::new(DefaultFileStatisticsCacheState::new(memory_limit)),
+}
+}
+
+/// Returns the size of the cached memory, in bytes.
+pub fn memory_used(&self) -> usize {
+let state = self.state.lock().unwrap();
+state.memory_used
+}
+}
+
+struct DefaultFileStatisticsCacheState {
+lru_queue: LruQueue,
+memory_limit: usize,
+memory_used: usize,
+}
+
+pub const DEFAULT_FILE_STATISTICS_MEMORY_LIMIT: usize = 20 * 1024 * 1024; // 
20MiB
+
+impl Default for DefaultFileStatisticsCacheState {
+fn default() -> Self {
+Self {
+lru_queue: LruQueue::new(),
+memory_limit: DEFAULT_FILE_STATISTICS_MEMORY_LIMIT,
+memory_used: 0,
+}
+}
+}
+
+impl DefaultFileStatisticsCacheState {
+fn new(memory_limit: usize) -> Self {
+Self {
+lru_queue: LruQueue::new(),
+memory_limit,
+memory_used: 0,
+}
+}
+fn get(&mut self, key: &TableScopedPath) -> Option {
+self.lru_queue.get(key).cloned()
+}
+
+fn put(
+&mut self,
+key: &TableScopedPath,
+value: CachedFileMetadata,
+) -> Option {
+let mut ctx = DFHeapSizeCtx::default();

Review Comment:
   This logic is correct today, but it feels a bit fragile.
   
   Right now it works because `TableScopedPath::heap_size` does not interact 
with the context, so recomputing `key.heap_size` with a fresh `DFHeapSizeCtx` 
gives the same result.
   
   However, if the key ever includes something like an `Arc` that *does* 
register in the context, the second computation could return 0 and silently 
skew `memory_used`.
   
   It might be safer to reuse the already computed `key_size` instead of 
recomputing it with a different context. That would make the intent clearer and 
avoid future footguns.



##
datafusion/execution/src/cache/cache_manager.rs:
##


Review Comment:
   This comment says "Default is disabled", but in practice the cache is 
created when `list_files_cache_limit > 0` (which it is by default).
   
   So effectively this is enabled by default, similar to the earlier issue we 
discussed. Might be worth tweaking the comment to reflect the actual behavior 
so it does not surprise future readers.



##
datafusion/common/src/heap_size.rs:
##
@@ -0,0 +1,551 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for a

Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-28 Thread via GitHub


kosiew commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3158582525


##
datafusion/execution/src/cache/cache_manager.rs:
##
@@ -410,8 +454,10 @@ pub const DEFAULT_METADATA_CACHE_LIMIT: usize = 50 * 1024 
* 1024; // 50M
 pub struct CacheManagerConfig {
 /// Enable caching of file statistics when listing files.
 /// Enabling the cache avoids repeatedly reading file statistics in a 
DataFusion session.
-/// Default is disabled. Currently only Parquet files are supported.
-pub table_files_statistics_cache: Option>,
+/// Default is enabled with 1MiB. Currently only Parquet files are 
supported.

Review Comment:
   Small doc nit here. The comment is a bit out of sync with the current 
behavior.
   
   First, "Default is enabled" is misleading. The field itself defaults to 
None, and the cache only gets created in CacheManager::try_new when 
file_statistics_cache_limit > 0.
   
   Second, the "1MiB" value is stale. The default limit was bumped to 20 MiB, 
and the nearby docs already reflect that.
   
   Could you update this comment to match the current behavior and default? It 
will save future readers a bit of head scratching πŸ™‚



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-28 Thread via GitHub


adriangbot commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4340794653

   πŸ€– Benchmark running (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4340783583)
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux 
bench-c4340783583-1903-vbrzn 6.12.55+ #1 SMP Sun Feb  1 08:59:41 UTC 2026 
aarch64 GNU/Linux`
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Comparing file-stats-cache (6808feffadc88aa18be810e48ee32a08ee865591) to 
54a5515 (merge-base) 
[diff](https://github.com/apache/datafusion/compare/54a551583bb13356c42e3c1d9e11b63079d2e40b..6808feffadc88aa18be810e48ee32a08ee865591)
 using: clickbench_partitioned
   Results will be posted here when complete
   
   ---
   [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) 
against this benchmark runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-28 Thread via GitHub


kosiew commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4340783583

   run benchmark clickbench_partitioned


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-27 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3146316357


##
datafusion/common/src/heap_size.rs:
##
@@ -0,0 +1,476 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::stats::Precision;
+use crate::{ColumnStatistics, ScalarValue, Statistics, TableReference};
+use arrow::array::{
+Array, FixedSizeListArray, LargeListArray, LargeListViewArray, ListArray,
+ListViewArray, MapArray, StructArray,
+};
+use arrow::datatypes::{
+DataType, Field, Fields, IntervalDayTime, IntervalMonthDayNano, 
IntervalUnit,
+TimeUnit, UnionFields, UnionMode, i256,
+};
+use chrono::{DateTime, Utc};
+use half::f16;
+use std::collections::HashMap;
+use std::fmt::Debug;
+use std::sync::Arc;
+
+/// This is a temporary solution until 
 and
+///  are resolved.
+/// Trait for calculating the size of various containers
+pub trait DFHeapSize {
+/// Return the size of any bytes allocated on the heap by this object,
+/// including heap memory in those structures
+///
+/// Note that the size of the type itself is not included in the result --
+/// instead, that size is added by the caller (e.g. container).
+fn heap_size(&self) -> usize;
+}
+
+impl DFHeapSize for Statistics {
+fn heap_size(&self) -> usize {
+self.num_rows.heap_size()
++ self.total_byte_size.heap_size()
++ self.column_statistics.heap_size()
+}
+}
+
+impl DFHeapSize for TableReference {
+fn heap_size(&self) -> usize {
+self.table().heap_size()
+}
+}
+
+impl DFHeapSize
+for Precision
+{
+fn heap_size(&self) -> usize {
+self.get_value().map_or_else(|| 0, |v| v.heap_size())
+}
+}
+
+impl DFHeapSize for ColumnStatistics {
+fn heap_size(&self) -> usize {
+self.null_count.heap_size()
++ self.max_value.heap_size()
++ self.min_value.heap_size()
++ self.sum_value.heap_size()
++ self.distinct_count.heap_size()
++ self.byte_size.heap_size()
+}
+}
+
+impl DFHeapSize for ScalarValue {
+fn heap_size(&self) -> usize {
+use crate::scalar::ScalarValue::*;
+match self {
+Null => 0,
+Boolean(b) => b.heap_size(),
+Float16(f) => f.heap_size(),
+Float32(f) => f.heap_size(),
+Float64(f) => f.heap_size(),
+Decimal32(a, b, c) => a.heap_size() + b.heap_size() + 
c.heap_size(),
+Decimal64(a, b, c) => a.heap_size() + b.heap_size() + 
c.heap_size(),
+Decimal128(a, b, c) => a.heap_size() + b.heap_size() + 
c.heap_size(),
+Decimal256(a, b, c) => a.heap_size() + b.heap_size() + 
c.heap_size(),
+Int8(i) => i.heap_size(),
+Int16(i) => i.heap_size(),
+Int32(i) => i.heap_size(),
+Int64(i) => i.heap_size(),
+UInt8(u) => u.heap_size(),
+UInt16(u) => u.heap_size(),
+UInt32(u) => u.heap_size(),
+UInt64(u) => u.heap_size(),
+Utf8(u) => u.heap_size(),
+Utf8View(u) => u.heap_size(),
+LargeUtf8(l) => l.heap_size(),
+Binary(b) => b.heap_size(),
+BinaryView(b) => b.heap_size(),
+FixedSizeBinary(a, b) => a.heap_size() + b.heap_size(),
+LargeBinary(l) => l.heap_size(),
+FixedSizeList(f) => f.heap_size(),
+List(l) => l.heap_size(),
+LargeList(l) => l.heap_size(),
+Struct(s) => s.heap_size(),
+Map(m) => m.heap_size(),
+Date32(d) => d.heap_size(),
+Date64(d) => d.heap_size(),
+Time32Second(t) => t.heap_size(),
+Time32Millisecond(t) => t.heap_size(),
+Time64Microsecond(t) => t.heap_size(),
+Time64Nanosecond(t) => t.heap_size(),
+TimestampSecond(a, b) => a.heap_size() + b.heap_size(),
+TimestampMillisecond(a, b) => a.heap_size() + b.heap_size(),
+TimestampMicrosecond(a, b) => a.heap_size() + b.heap_size(),
+TimestampNanosecond(a, b)

Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-27 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3146316357


##
datafusion/common/src/heap_size.rs:
##
@@ -0,0 +1,476 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::stats::Precision;
+use crate::{ColumnStatistics, ScalarValue, Statistics, TableReference};
+use arrow::array::{
+Array, FixedSizeListArray, LargeListArray, LargeListViewArray, ListArray,
+ListViewArray, MapArray, StructArray,
+};
+use arrow::datatypes::{
+DataType, Field, Fields, IntervalDayTime, IntervalMonthDayNano, 
IntervalUnit,
+TimeUnit, UnionFields, UnionMode, i256,
+};
+use chrono::{DateTime, Utc};
+use half::f16;
+use std::collections::HashMap;
+use std::fmt::Debug;
+use std::sync::Arc;
+
+/// This is a temporary solution until 
 and
+///  are resolved.
+/// Trait for calculating the size of various containers
+pub trait DFHeapSize {
+/// Return the size of any bytes allocated on the heap by this object,
+/// including heap memory in those structures
+///
+/// Note that the size of the type itself is not included in the result --
+/// instead, that size is added by the caller (e.g. container).
+fn heap_size(&self) -> usize;
+}
+
+impl DFHeapSize for Statistics {
+fn heap_size(&self) -> usize {
+self.num_rows.heap_size()
++ self.total_byte_size.heap_size()
++ self.column_statistics.heap_size()
+}
+}
+
+impl DFHeapSize for TableReference {
+fn heap_size(&self) -> usize {
+self.table().heap_size()
+}
+}
+
+impl DFHeapSize
+for Precision
+{
+fn heap_size(&self) -> usize {
+self.get_value().map_or_else(|| 0, |v| v.heap_size())
+}
+}
+
+impl DFHeapSize for ColumnStatistics {
+fn heap_size(&self) -> usize {
+self.null_count.heap_size()
++ self.max_value.heap_size()
++ self.min_value.heap_size()
++ self.sum_value.heap_size()
++ self.distinct_count.heap_size()
++ self.byte_size.heap_size()
+}
+}
+
+impl DFHeapSize for ScalarValue {
+fn heap_size(&self) -> usize {
+use crate::scalar::ScalarValue::*;
+match self {
+Null => 0,
+Boolean(b) => b.heap_size(),
+Float16(f) => f.heap_size(),
+Float32(f) => f.heap_size(),
+Float64(f) => f.heap_size(),
+Decimal32(a, b, c) => a.heap_size() + b.heap_size() + 
c.heap_size(),
+Decimal64(a, b, c) => a.heap_size() + b.heap_size() + 
c.heap_size(),
+Decimal128(a, b, c) => a.heap_size() + b.heap_size() + 
c.heap_size(),
+Decimal256(a, b, c) => a.heap_size() + b.heap_size() + 
c.heap_size(),
+Int8(i) => i.heap_size(),
+Int16(i) => i.heap_size(),
+Int32(i) => i.heap_size(),
+Int64(i) => i.heap_size(),
+UInt8(u) => u.heap_size(),
+UInt16(u) => u.heap_size(),
+UInt32(u) => u.heap_size(),
+UInt64(u) => u.heap_size(),
+Utf8(u) => u.heap_size(),
+Utf8View(u) => u.heap_size(),
+LargeUtf8(l) => l.heap_size(),
+Binary(b) => b.heap_size(),
+BinaryView(b) => b.heap_size(),
+FixedSizeBinary(a, b) => a.heap_size() + b.heap_size(),
+LargeBinary(l) => l.heap_size(),
+FixedSizeList(f) => f.heap_size(),
+List(l) => l.heap_size(),
+LargeList(l) => l.heap_size(),
+Struct(s) => s.heap_size(),
+Map(m) => m.heap_size(),
+Date32(d) => d.heap_size(),
+Date64(d) => d.heap_size(),
+Time32Second(t) => t.heap_size(),
+Time32Millisecond(t) => t.heap_size(),
+Time64Microsecond(t) => t.heap_size(),
+Time64Nanosecond(t) => t.heap_size(),
+TimestampSecond(a, b) => a.heap_size() + b.heap_size(),
+TimestampMillisecond(a, b) => a.heap_size() + b.heap_size(),
+TimestampMicrosecond(a, b) => a.heap_size() + b.heap_size(),
+TimestampNanosecond(a, b)

Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-27 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3146316357


##
datafusion/common/src/heap_size.rs:
##
@@ -0,0 +1,476 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::stats::Precision;
+use crate::{ColumnStatistics, ScalarValue, Statistics, TableReference};
+use arrow::array::{
+Array, FixedSizeListArray, LargeListArray, LargeListViewArray, ListArray,
+ListViewArray, MapArray, StructArray,
+};
+use arrow::datatypes::{
+DataType, Field, Fields, IntervalDayTime, IntervalMonthDayNano, 
IntervalUnit,
+TimeUnit, UnionFields, UnionMode, i256,
+};
+use chrono::{DateTime, Utc};
+use half::f16;
+use std::collections::HashMap;
+use std::fmt::Debug;
+use std::sync::Arc;
+
+/// This is a temporary solution until 
 and
+///  are resolved.
+/// Trait for calculating the size of various containers
+pub trait DFHeapSize {
+/// Return the size of any bytes allocated on the heap by this object,
+/// including heap memory in those structures
+///
+/// Note that the size of the type itself is not included in the result --
+/// instead, that size is added by the caller (e.g. container).
+fn heap_size(&self) -> usize;
+}
+
+impl DFHeapSize for Statistics {
+fn heap_size(&self) -> usize {
+self.num_rows.heap_size()
++ self.total_byte_size.heap_size()
++ self.column_statistics.heap_size()
+}
+}
+
+impl DFHeapSize for TableReference {
+fn heap_size(&self) -> usize {
+self.table().heap_size()
+}
+}
+
+impl DFHeapSize
+for Precision
+{
+fn heap_size(&self) -> usize {
+self.get_value().map_or_else(|| 0, |v| v.heap_size())
+}
+}
+
+impl DFHeapSize for ColumnStatistics {
+fn heap_size(&self) -> usize {
+self.null_count.heap_size()
++ self.max_value.heap_size()
++ self.min_value.heap_size()
++ self.sum_value.heap_size()
++ self.distinct_count.heap_size()
++ self.byte_size.heap_size()
+}
+}
+
+impl DFHeapSize for ScalarValue {
+fn heap_size(&self) -> usize {
+use crate::scalar::ScalarValue::*;
+match self {
+Null => 0,
+Boolean(b) => b.heap_size(),
+Float16(f) => f.heap_size(),
+Float32(f) => f.heap_size(),
+Float64(f) => f.heap_size(),
+Decimal32(a, b, c) => a.heap_size() + b.heap_size() + 
c.heap_size(),
+Decimal64(a, b, c) => a.heap_size() + b.heap_size() + 
c.heap_size(),
+Decimal128(a, b, c) => a.heap_size() + b.heap_size() + 
c.heap_size(),
+Decimal256(a, b, c) => a.heap_size() + b.heap_size() + 
c.heap_size(),
+Int8(i) => i.heap_size(),
+Int16(i) => i.heap_size(),
+Int32(i) => i.heap_size(),
+Int64(i) => i.heap_size(),
+UInt8(u) => u.heap_size(),
+UInt16(u) => u.heap_size(),
+UInt32(u) => u.heap_size(),
+UInt64(u) => u.heap_size(),
+Utf8(u) => u.heap_size(),
+Utf8View(u) => u.heap_size(),
+LargeUtf8(l) => l.heap_size(),
+Binary(b) => b.heap_size(),
+BinaryView(b) => b.heap_size(),
+FixedSizeBinary(a, b) => a.heap_size() + b.heap_size(),
+LargeBinary(l) => l.heap_size(),
+FixedSizeList(f) => f.heap_size(),
+List(l) => l.heap_size(),
+LargeList(l) => l.heap_size(),
+Struct(s) => s.heap_size(),
+Map(m) => m.heap_size(),
+Date32(d) => d.heap_size(),
+Date64(d) => d.heap_size(),
+Time32Second(t) => t.heap_size(),
+Time32Millisecond(t) => t.heap_size(),
+Time64Microsecond(t) => t.heap_size(),
+Time64Nanosecond(t) => t.heap_size(),
+TimestampSecond(a, b) => a.heap_size() + b.heap_size(),
+TimestampMillisecond(a, b) => a.heap_size() + b.heap_size(),
+TimestampMicrosecond(a, b) => a.heap_size() + b.heap_size(),
+TimestampNanosecond(a, b)

Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-27 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3146323961


##
datafusion/execution/src/cache/cache_manager.rs:
##
@@ -330,8 +356,19 @@ pub struct CacheManager {
 
 impl CacheManager {
 pub fn try_new(config: &CacheManagerConfig) -> Result> {
-let file_statistic_cache =
-config.table_files_statistics_cache.as_ref().map(Arc::clone);
+let file_statistic_cache = match &config.file_statistics_cache {
+Some(fsc) if config.file_statistics_cache_limit > 0 => {
+fsc.update_cache_limit(config.file_statistics_cache_limit);
+Some(Arc::clone(fsc))
+}
+None if config.file_statistics_cache_limit > 0 => {

Review Comment:
   The behaviour change is on purpose see 
https://github.com/apache/datafusion/issues/19217 for more details. I 
documented now the change in the upgrade guide. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-27 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3146316357


##
datafusion/common/src/heap_size.rs:
##
@@ -0,0 +1,476 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::stats::Precision;
+use crate::{ColumnStatistics, ScalarValue, Statistics, TableReference};
+use arrow::array::{
+Array, FixedSizeListArray, LargeListArray, LargeListViewArray, ListArray,
+ListViewArray, MapArray, StructArray,
+};
+use arrow::datatypes::{
+DataType, Field, Fields, IntervalDayTime, IntervalMonthDayNano, 
IntervalUnit,
+TimeUnit, UnionFields, UnionMode, i256,
+};
+use chrono::{DateTime, Utc};
+use half::f16;
+use std::collections::HashMap;
+use std::fmt::Debug;
+use std::sync::Arc;
+
+/// This is a temporary solution until 
 and
+///  are resolved.
+/// Trait for calculating the size of various containers
+pub trait DFHeapSize {
+/// Return the size of any bytes allocated on the heap by this object,
+/// including heap memory in those structures
+///
+/// Note that the size of the type itself is not included in the result --
+/// instead, that size is added by the caller (e.g. container).
+fn heap_size(&self) -> usize;
+}
+
+impl DFHeapSize for Statistics {
+fn heap_size(&self) -> usize {
+self.num_rows.heap_size()
++ self.total_byte_size.heap_size()
++ self.column_statistics.heap_size()
+}
+}
+
+impl DFHeapSize for TableReference {
+fn heap_size(&self) -> usize {
+self.table().heap_size()
+}
+}
+
+impl DFHeapSize
+for Precision
+{
+fn heap_size(&self) -> usize {
+self.get_value().map_or_else(|| 0, |v| v.heap_size())
+}
+}
+
+impl DFHeapSize for ColumnStatistics {
+fn heap_size(&self) -> usize {
+self.null_count.heap_size()
++ self.max_value.heap_size()
++ self.min_value.heap_size()
++ self.sum_value.heap_size()
++ self.distinct_count.heap_size()
++ self.byte_size.heap_size()
+}
+}
+
+impl DFHeapSize for ScalarValue {
+fn heap_size(&self) -> usize {
+use crate::scalar::ScalarValue::*;
+match self {
+Null => 0,
+Boolean(b) => b.heap_size(),
+Float16(f) => f.heap_size(),
+Float32(f) => f.heap_size(),
+Float64(f) => f.heap_size(),
+Decimal32(a, b, c) => a.heap_size() + b.heap_size() + 
c.heap_size(),
+Decimal64(a, b, c) => a.heap_size() + b.heap_size() + 
c.heap_size(),
+Decimal128(a, b, c) => a.heap_size() + b.heap_size() + 
c.heap_size(),
+Decimal256(a, b, c) => a.heap_size() + b.heap_size() + 
c.heap_size(),
+Int8(i) => i.heap_size(),
+Int16(i) => i.heap_size(),
+Int32(i) => i.heap_size(),
+Int64(i) => i.heap_size(),
+UInt8(u) => u.heap_size(),
+UInt16(u) => u.heap_size(),
+UInt32(u) => u.heap_size(),
+UInt64(u) => u.heap_size(),
+Utf8(u) => u.heap_size(),
+Utf8View(u) => u.heap_size(),
+LargeUtf8(l) => l.heap_size(),
+Binary(b) => b.heap_size(),
+BinaryView(b) => b.heap_size(),
+FixedSizeBinary(a, b) => a.heap_size() + b.heap_size(),
+LargeBinary(l) => l.heap_size(),
+FixedSizeList(f) => f.heap_size(),
+List(l) => l.heap_size(),
+LargeList(l) => l.heap_size(),
+Struct(s) => s.heap_size(),
+Map(m) => m.heap_size(),
+Date32(d) => d.heap_size(),
+Date64(d) => d.heap_size(),
+Time32Second(t) => t.heap_size(),
+Time32Millisecond(t) => t.heap_size(),
+Time64Microsecond(t) => t.heap_size(),
+Time64Nanosecond(t) => t.heap_size(),
+TimestampSecond(a, b) => a.heap_size() + b.heap_size(),
+TimestampMillisecond(a, b) => a.heap_size() + b.heap_size(),
+TimestampMicrosecond(a, b) => a.heap_size() + b.heap_size(),
+TimestampNanosecond(a, b)

Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-23 Thread via GitHub


kosiew commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3135429326


##
datafusion/execution/src/cache/cache_manager.rs:
##
@@ -330,8 +356,19 @@ pub struct CacheManager {
 
 impl CacheManager {
 pub fn try_new(config: &CacheManagerConfig) -> Result> {
-let file_statistic_cache =
-config.table_files_statistics_cache.as_ref().map(Arc::clone);
+let file_statistic_cache = match &config.file_statistics_cache {
+Some(fsc) if config.file_statistics_cache_limit > 0 => {
+fsc.update_cache_limit(config.file_statistics_cache_limit);
+Some(Arc::clone(fsc))
+}
+None if config.file_statistics_cache_limit > 0 => {

Review Comment:
   This change subtly alters the behavior of 
`with_file_statistics_cache(None)`. It no longer disables the cache if 
`file_statistics_cache_limit > 0`, because `try_new` will create a default 
cache in that case.
   
   Previously, passing `None` was enough to disable the cache. Now callers also 
need to know to set the limit to 0, which is not obvious.
   
   This feels like a silent behavior change and it also contradicts the current 
builder documentation. It would be good to either restore the original behavior 
or make the contract explicit in both code and docs.



##
datafusion/common/src/heap_size.rs:
##
@@ -0,0 +1,476 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::stats::Precision;
+use crate::{ColumnStatistics, ScalarValue, Statistics, TableReference};
+use arrow::array::{
+Array, FixedSizeListArray, LargeListArray, LargeListViewArray, ListArray,
+ListViewArray, MapArray, StructArray,
+};
+use arrow::datatypes::{
+DataType, Field, Fields, IntervalDayTime, IntervalMonthDayNano, 
IntervalUnit,
+TimeUnit, UnionFields, UnionMode, i256,
+};
+use chrono::{DateTime, Utc};
+use half::f16;
+use std::collections::HashMap;
+use std::fmt::Debug;
+use std::sync::Arc;
+
+/// This is a temporary solution until 
 and
+///  are resolved.
+/// Trait for calculating the size of various containers
+pub trait DFHeapSize {
+/// Return the size of any bytes allocated on the heap by this object,
+/// including heap memory in those structures
+///
+/// Note that the size of the type itself is not included in the result --
+/// instead, that size is added by the caller (e.g. container).
+fn heap_size(&self) -> usize;
+}
+
+impl DFHeapSize for Statistics {
+fn heap_size(&self) -> usize {
+self.num_rows.heap_size()
++ self.total_byte_size.heap_size()
++ self.column_statistics.heap_size()
+}
+}
+
+impl DFHeapSize for TableReference {
+fn heap_size(&self) -> usize {
+self.table().heap_size()
+}
+}
+
+impl DFHeapSize
+for Precision
+{
+fn heap_size(&self) -> usize {
+self.get_value().map_or_else(|| 0, |v| v.heap_size())
+}
+}
+
+impl DFHeapSize for ColumnStatistics {
+fn heap_size(&self) -> usize {
+self.null_count.heap_size()
++ self.max_value.heap_size()
++ self.min_value.heap_size()
++ self.sum_value.heap_size()
++ self.distinct_count.heap_size()
++ self.byte_size.heap_size()
+}
+}
+
+impl DFHeapSize for ScalarValue {
+fn heap_size(&self) -> usize {
+use crate::scalar::ScalarValue::*;
+match self {
+Null => 0,
+Boolean(b) => b.heap_size(),
+Float16(f) => f.heap_size(),
+Float32(f) => f.heap_size(),
+Float64(f) => f.heap_size(),
+Decimal32(a, b, c) => a.heap_size() + b.heap_size() + 
c.heap_size(),
+Decimal64(a, b, c) => a.heap_size() + b.heap_size() + 
c.heap_size(),
+Decimal128(a, b, c) => a.heap_size() + b.heap_size() + 
c.heap_size(),
+Decimal256(a, b, c) => a.heap_size() + b.heap_size() + 
c.heap_size(),
+Int8(i) => i.heap_size(),
+Int16(i) => i.heap_size(),
+Int32(i) => i.heap_size(),
+Int64(i) => i.he

Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-23 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r312248


##
datafusion/execution/src/cache/file_statistics_cache.rs:
##
@@ -0,0 +1,739 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::cache::cache_manager::{
+CachedFileMetadata, FileStatisticsCache, FileStatisticsCacheEntry,
+};
+use crate::cache::{CacheAccessor, TableScopedPath};
+use object_store::path::Path;
+use std::collections::HashMap;
+use std::sync::Mutex;
+
+pub use crate::cache::DefaultFilesMetadataCache;
+use crate::cache::lru_queue::LruQueue;
+use datafusion_common::TableReference;
+use datafusion_common::heap_size::DFHeapSize;
+
+/// Default implementation of [`FileStatisticsCache`]
+///
+/// Stores cached file metadata (statistics and orderings) for files.
+///
+/// The typical usage pattern is:
+/// 1. Call `get(path)` to check for cached value
+/// 2. If `Some(cached)`, validate with `cached.is_valid_for(Β€t_meta)`
+/// 3. If invalid or missing, compute new value and call `put(path, new_value)`
+///
+/// # Internal details
+///
+/// The `memory_limit` controls the maximum size of the cache, which uses a
+/// Least Recently Used eviction algorithm. When adding a new entry, if the 
total
+/// size of the cached entries exceeds `memory_limit`, the least recently used 
entries
+/// are evicted until the total size is lower than `memory_limit`.
+///
+///
+/// [`FileStatisticsCache`]: crate::cache::cache_manager::FileStatisticsCache
+#[derive(Default)]
+pub struct DefaultFileStatisticsCache {
+state: Mutex,
+}
+
+impl DefaultFileStatisticsCache {
+pub fn new(memory_limit: usize) -> Self {
+Self {
+state: 
Mutex::new(DefaultFileStatisticsCacheState::new(memory_limit)),
+}
+}
+
+/// Returns the size of the cached memory, in bytes.
+pub fn memory_used(&self) -> usize {
+let state = self.state.lock().unwrap();
+state.memory_used
+}
+}
+
+struct DefaultFileStatisticsCacheState {
+lru_queue: LruQueue,
+memory_limit: usize,
+memory_used: usize,
+}
+
+pub const DEFAULT_FILE_STATISTICS_MEMORY_LIMIT: usize = 20 * 1024 * 1024; // 
20MiB
+
+impl Default for DefaultFileStatisticsCacheState {
+fn default() -> Self {
+Self {
+lru_queue: LruQueue::new(),
+memory_limit: DEFAULT_FILE_STATISTICS_MEMORY_LIMIT,
+memory_used: 0,
+}
+}
+}
+
+impl DefaultFileStatisticsCacheState {
+fn new(memory_limit: usize) -> Self {
+Self {
+lru_queue: LruQueue::new(),
+memory_limit,
+memory_used: 0,
+}
+}
+fn get(&mut self, key: &TableScopedPath) -> Option {
+self.lru_queue.get(key).cloned()
+}
+
+fn put(
+&mut self,
+key: &TableScopedPath,
+value: CachedFileMetadata,
+) -> Option {
+let key_size = key.path.as_ref().heap_size();
+let entry_size = value.heap_size();
+
+if entry_size + key_size > self.memory_limit {
+// Remove potential stale entry
+self.remove(key);
+return None;
+}
+
+let old_value = self.lru_queue.put(key.clone(), value);
+self.memory_used += entry_size;
+self.memory_used += key.path.as_ref().heap_size();
+
+if let Some(old_entry) = &old_value {
+self.memory_used -= old_entry.heap_size();
+self.memory_used -= key.path.as_ref().heap_size();
+}
+
+self.evict_entries();
+
+old_value
+}
+
+fn remove(&mut self, k: &TableScopedPath) -> Option {
+if let Some(old_entry) = self.lru_queue.remove(k) {
+self.memory_used -= k.path.as_ref().heap_size();
+self.memory_used -= old_entry.heap_size();
+Some(old_entry)
+} else {
+None
+}
+}
+
+fn contains_key(&self, k: &TableScopedPath) -> bool {
+self.lru_queue.contains_key(k)
+}
+
+fn len(&self) -> usize {
+self.lru_queue.len()
+}
+
+fn clear(&mut self) {
+self.lru_queue.clear();
+self.memory_used = 0;
+}
+
+fn evict_entries(&mut self) {
+while 

Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-23 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3129697841


##
datafusion/execution/src/cache/file_statistics_cache.rs:
##
@@ -0,0 +1,739 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::cache::cache_manager::{
+CachedFileMetadata, FileStatisticsCache, FileStatisticsCacheEntry,
+};
+use crate::cache::{CacheAccessor, TableScopedPath};
+use object_store::path::Path;
+use std::collections::HashMap;
+use std::sync::Mutex;
+
+pub use crate::cache::DefaultFilesMetadataCache;
+use crate::cache::lru_queue::LruQueue;
+use datafusion_common::TableReference;
+use datafusion_common::heap_size::DFHeapSize;
+
+/// Default implementation of [`FileStatisticsCache`]
+///
+/// Stores cached file metadata (statistics and orderings) for files.
+///
+/// The typical usage pattern is:
+/// 1. Call `get(path)` to check for cached value
+/// 2. If `Some(cached)`, validate with `cached.is_valid_for(Β€t_meta)`
+/// 3. If invalid or missing, compute new value and call `put(path, new_value)`
+///
+/// # Internal details
+///
+/// The `memory_limit` controls the maximum size of the cache, which uses a
+/// Least Recently Used eviction algorithm. When adding a new entry, if the 
total
+/// size of the cached entries exceeds `memory_limit`, the least recently used 
entries
+/// are evicted until the total size is lower than `memory_limit`.
+///
+///
+/// [`FileStatisticsCache`]: crate::cache::cache_manager::FileStatisticsCache
+#[derive(Default)]
+pub struct DefaultFileStatisticsCache {
+state: Mutex,
+}
+
+impl DefaultFileStatisticsCache {
+pub fn new(memory_limit: usize) -> Self {
+Self {
+state: 
Mutex::new(DefaultFileStatisticsCacheState::new(memory_limit)),
+}
+}
+
+/// Returns the size of the cached memory, in bytes.
+pub fn memory_used(&self) -> usize {
+let state = self.state.lock().unwrap();
+state.memory_used
+}
+}
+
+struct DefaultFileStatisticsCacheState {
+lru_queue: LruQueue,
+memory_limit: usize,
+memory_used: usize,
+}
+
+pub const DEFAULT_FILE_STATISTICS_MEMORY_LIMIT: usize = 20 * 1024 * 1024; // 
20MiB
+
+impl Default for DefaultFileStatisticsCacheState {
+fn default() -> Self {
+Self {
+lru_queue: LruQueue::new(),
+memory_limit: DEFAULT_FILE_STATISTICS_MEMORY_LIMIT,
+memory_used: 0,
+}
+}
+}
+
+impl DefaultFileStatisticsCacheState {
+fn new(memory_limit: usize) -> Self {
+Self {
+lru_queue: LruQueue::new(),
+memory_limit,
+memory_used: 0,
+}
+}
+fn get(&mut self, key: &TableScopedPath) -> Option {
+self.lru_queue.get(key).cloned()
+}
+
+fn put(
+&mut self,
+key: &TableScopedPath,
+value: CachedFileMetadata,
+) -> Option {
+let key_size = key.path.as_ref().heap_size();

Review Comment:
   Good catch, thank you.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-23 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3129581031


##
datafusion/core/src/execution/context/mod.rs:
##


Review Comment:
   Yes, thank you.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-23 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3129343757


##
datafusion/execution/src/cache/file_statistics_cache.rs:
##
@@ -0,0 +1,739 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::cache::cache_manager::{
+CachedFileMetadata, FileStatisticsCache, FileStatisticsCacheEntry,
+};
+use crate::cache::{CacheAccessor, TableScopedPath};
+use object_store::path::Path;
+use std::collections::HashMap;
+use std::sync::Mutex;
+
+pub use crate::cache::DefaultFilesMetadataCache;
+use crate::cache::lru_queue::LruQueue;
+use datafusion_common::TableReference;
+use datafusion_common::heap_size::DFHeapSize;
+
+/// Default implementation of [`FileStatisticsCache`]
+///
+/// Stores cached file metadata (statistics and orderings) for files.
+///
+/// The typical usage pattern is:
+/// 1. Call `get(path)` to check for cached value
+/// 2. If `Some(cached)`, validate with `cached.is_valid_for(Β€t_meta)`
+/// 3. If invalid or missing, compute new value and call `put(path, new_value)`
+///
+/// # Internal details
+///
+/// The `memory_limit` controls the maximum size of the cache, which uses a
+/// Least Recently Used eviction algorithm. When adding a new entry, if the 
total
+/// size of the cached entries exceeds `memory_limit`, the least recently used 
entries
+/// are evicted until the total size is lower than `memory_limit`.
+///
+///
+/// [`FileStatisticsCache`]: crate::cache::cache_manager::FileStatisticsCache
+#[derive(Default)]
+pub struct DefaultFileStatisticsCache {
+state: Mutex,
+}
+
+impl DefaultFileStatisticsCache {
+pub fn new(memory_limit: usize) -> Self {
+Self {
+state: 
Mutex::new(DefaultFileStatisticsCacheState::new(memory_limit)),
+}
+}
+
+/// Returns the size of the cached memory, in bytes.
+pub fn memory_used(&self) -> usize {
+let state = self.state.lock().unwrap();
+state.memory_used
+}
+}
+
+struct DefaultFileStatisticsCacheState {
+lru_queue: LruQueue,
+memory_limit: usize,
+memory_used: usize,
+}
+
+pub const DEFAULT_FILE_STATISTICS_MEMORY_LIMIT: usize = 20 * 1024 * 1024; // 
20MiB
+
+impl Default for DefaultFileStatisticsCacheState {
+fn default() -> Self {
+Self {
+lru_queue: LruQueue::new(),
+memory_limit: DEFAULT_FILE_STATISTICS_MEMORY_LIMIT,
+memory_used: 0,
+}
+}
+}
+
+impl DefaultFileStatisticsCacheState {
+fn new(memory_limit: usize) -> Self {
+Self {
+lru_queue: LruQueue::new(),
+memory_limit,
+memory_used: 0,
+}
+}
+fn get(&mut self, key: &TableScopedPath) -> Option {
+self.lru_queue.get(key).cloned()
+}
+
+fn put(
+&mut self,
+key: &TableScopedPath,
+value: CachedFileMetadata,
+) -> Option {
+let key_size = key.path.as_ref().heap_size();

Review Comment:
   Yes, let's do this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-23 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3129341511


##
datafusion/core/src/execution/context/mod.rs:
##


Review Comment:
   Yes, this makes sense!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-22 Thread via GitHub


kosiew commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3128191420


##
datafusion/execution/src/cache/file_statistics_cache.rs:
##
@@ -0,0 +1,739 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::cache::cache_manager::{
+CachedFileMetadata, FileStatisticsCache, FileStatisticsCacheEntry,
+};
+use crate::cache::{CacheAccessor, TableScopedPath};
+use object_store::path::Path;
+use std::collections::HashMap;
+use std::sync::Mutex;
+
+pub use crate::cache::DefaultFilesMetadataCache;
+use crate::cache::lru_queue::LruQueue;
+use datafusion_common::TableReference;
+use datafusion_common::heap_size::DFHeapSize;
+
+/// Default implementation of [`FileStatisticsCache`]
+///
+/// Stores cached file metadata (statistics and orderings) for files.
+///
+/// The typical usage pattern is:
+/// 1. Call `get(path)` to check for cached value
+/// 2. If `Some(cached)`, validate with `cached.is_valid_for(Β€t_meta)`
+/// 3. If invalid or missing, compute new value and call `put(path, new_value)`
+///
+/// # Internal details
+///
+/// The `memory_limit` controls the maximum size of the cache, which uses a
+/// Least Recently Used eviction algorithm. When adding a new entry, if the 
total
+/// size of the cached entries exceeds `memory_limit`, the least recently used 
entries
+/// are evicted until the total size is lower than `memory_limit`.
+///
+///
+/// [`FileStatisticsCache`]: crate::cache::cache_manager::FileStatisticsCache
+#[derive(Default)]
+pub struct DefaultFileStatisticsCache {
+state: Mutex,
+}
+
+impl DefaultFileStatisticsCache {
+pub fn new(memory_limit: usize) -> Self {
+Self {
+state: 
Mutex::new(DefaultFileStatisticsCacheState::new(memory_limit)),
+}
+}
+
+/// Returns the size of the cached memory, in bytes.
+pub fn memory_used(&self) -> usize {
+let state = self.state.lock().unwrap();
+state.memory_used
+}
+}
+
+struct DefaultFileStatisticsCacheState {
+lru_queue: LruQueue,
+memory_limit: usize,
+memory_used: usize,
+}
+
+pub const DEFAULT_FILE_STATISTICS_MEMORY_LIMIT: usize = 20 * 1024 * 1024; // 
20MiB
+
+impl Default for DefaultFileStatisticsCacheState {
+fn default() -> Self {
+Self {
+lru_queue: LruQueue::new(),
+memory_limit: DEFAULT_FILE_STATISTICS_MEMORY_LIMIT,
+memory_used: 0,
+}
+}
+}
+
+impl DefaultFileStatisticsCacheState {
+fn new(memory_limit: usize) -> Self {
+Self {
+lru_queue: LruQueue::new(),
+memory_limit,
+memory_used: 0,
+}
+}
+fn get(&mut self, key: &TableScopedPath) -> Option {
+self.lru_queue.get(key).cloned()
+}
+
+fn put(
+&mut self,
+key: &TableScopedPath,
+value: CachedFileMetadata,
+) -> Option {
+let key_size = key.path.as_ref().heap_size();

Review Comment:
   The cache key is now `TableScopedPath { table, path }`, but the memory 
accounting here still only charges for the path bytes. That leaves the new 
`table_reference` portion effectively unaccounted for, so `memory_used` can 
undercount in multi-table scenarios.
   
   Would it make sense to include the `TableReference` heap usage as well so 
the cache limit is enforced more accurately?



##
datafusion/execution/src/cache/file_statistics_cache.rs:
##
@@ -0,0 +1,739 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing pe

Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-22 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3123770486


##
datafusion/core/src/execution/context/mod.rs:
##
@@ -1752,7 +1766,9 @@ impl SessionContext {
 let config = ListingTableConfig::new(table_path)
 .with_listing_options(options)
 .with_schema(resolved_schema);
-let table = 
ListingTable::try_new(config)?.with_definition(sql_definition);
+let table = ListingTable::try_new(config)?
+.with_definition(sql_definition)
+
.with_cache(self.runtime_env().cache_manager.get_file_statistic_cache());

Review Comment:
   @kosiew I now added the cache on the listing table in `_read_type()` and on 
all locations where a listing table is created with `with_collect_stat(true)` 
where therefore caching makes sense. Could you please do another review?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-16 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3093035238


##
datafusion/execution/src/runtime_env.rs:
##
@@ -514,6 +535,7 @@ impl RuntimeEnvBuilder {
 Some("50M".to_owned()),
 Some("1M".to_owned()),
 None,
+Some("1M".to_owned()),

Review Comment:
   Yes, good point.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-16 Thread via GitHub


mkleen commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3093032919


##
datafusion/core/src/execution/context/mod.rs:
##
@@ -1752,7 +1766,9 @@ impl SessionContext {
 let config = ListingTableConfig::new(table_path)
 .with_listing_options(options)
 .with_schema(resolved_schema);
-let table = 
ListingTable::try_new(config)?.with_definition(sql_definition);
+let table = ListingTable::try_new(config)?
+.with_definition(sql_definition)
+
.with_cache(self.runtime_env().cache_manager.get_file_statistic_cache());

Review Comment:
   No this was not intentional. This needs to be fixed. Thanks a lot!  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-16 Thread via GitHub


kosiew commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3092631607


##
datafusion/execution/src/runtime_env.rs:
##
@@ -514,6 +535,7 @@ impl RuntimeEnvBuilder {
 Some("50M".to_owned()),
 Some("1M".to_owned()),
 None,
+Some("1M".to_owned()),

Review Comment:
   I see that `DEFAULT_FILE_STATISTICS_MEMORY_LIMIT` was increased to 20 MiB, 
but `RuntimeEnvBuilder::entries()` still hard-codes `Some("1M".to_owned())` for 
`datafusion.runtime.file_statistics_cache_limit`.
   
   Would it make sense to update `entries()` to match the new default?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-16 Thread via GitHub


kosiew commented on code in PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#discussion_r3092631605


##
datafusion/core/src/execution/context/mod.rs:
##
@@ -1752,7 +1766,9 @@ impl SessionContext {
 let config = ListingTableConfig::new(table_path)
 .with_listing_options(options)
 .with_schema(resolved_schema);
-let table = 
ListingTable::try_new(config)?.with_definition(sql_definition);
+let table = ListingTable::try_new(config)?
+.with_definition(sql_definition)
+
.with_cache(self.runtime_env().cache_manager.get_file_statistic_cache());

Review Comment:
   I noticed that only the registered-listing-table path was updated to call 
`.with_cache(...)`. Was it intentional that the generic reader path in 
`_read_type()` still uses `ListingTable::try_new(config)?` without attaching 
the runtime cache?
   
   With `ListingTable` now defaulting `collected_statistics` to `None`, could 
this mean that `SessionContext::read_parquet`, `read_csv`, `read_json`, 
`read_avro`, and similar methods no longer benefit from the file statistics 
cache?
   
   If so, would this be considered a regression in the default-cache rollout? 
How do you see this affecting users who rely on those reader paths?



##
datafusion/execution/src/runtime_env.rs:
##
@@ -514,6 +535,7 @@ impl RuntimeEnvBuilder {
 Some("50M".to_owned()),
 Some("1M".to_owned()),
 None,
+Some("1M".to_owned()),

Review Comment:
   I see that `DEFAULT_FILE_STATISTICS_MEMORY_LIMIT` was increased to 20 MiB, 
but `RuntimeEnvBuilder::entries()` still hard-codes `Some("1M".to_owned())` for 
`datafusion.runtime.file_statistics_cache_limit`.
   
   Should these two defaults be aligned? As it stands, could this create 
confusion where generated configuration docs and callers of `entries()` still 
reflect the old default while runtime behavior uses the new one?
   
   Would it make sense to update `entries()` to match the new default?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-15 Thread via GitHub


mkleen commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4253678703

   Ok, the benchmarks are looking good now again. At the moment we have 20 mb 
default cache size, with 10 mb the stats for clickhouse_partitioned did not fit 
yet. I wonder what would be a reasonable default value?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-15 Thread via GitHub


adriangbot commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4252997551

   πŸ€– Benchmark completed (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4252876807)
   
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB)
   
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Details
   
   
   ```
   Comparing HEAD and file-stats-cache
   
   Benchmark clickbench_partitioned.json
   
   
┏━━━┳━━━┳━━━┳━━━┓
   ┃ Query ┃  HEAD ┃  
file-stats-cache ┃Change ┃
   
┑━━━╇━━━╇━━━╇━━━┩
   β”‚ QQuery 0  β”‚  1.20 / 4.41 Β±6.34 / 17.09 ms β”‚  1.23 / 4.47 
Β±6.37 / 17.22 ms β”‚ no change β”‚
   β”‚ QQuery 1  β”‚14.66 / 14.94 Β±0.24 / 15.25 ms β”‚14.41 / 14.57 
Β±0.22 / 15.00 ms β”‚ no change β”‚
   β”‚ QQuery 2  β”‚44.00 / 44.14 Β±0.13 / 44.37 ms β”‚43.38 / 43.86 
Β±0.28 / 44.25 ms β”‚ no change β”‚
   β”‚ QQuery 3  β”‚42.56 / 44.96 Β±1.66 / 47.06 ms β”‚39.38 / 39.80 
Β±0.27 / 40.11 ms β”‚ +1.13x faster β”‚
   β”‚ QQuery 4  β”‚ 277.84 / 285.17 Β±4.27 / 288.65 ms β”‚ 277.58 / 283.29 
Β±3.88 / 287.93 ms β”‚ no change β”‚
   β”‚ QQuery 5  β”‚ 337.90 / 341.30 Β±2.18 / 343.91 ms β”‚ 331.05 / 336.37 
Β±3.60 / 342.10 ms β”‚ no change β”‚
   β”‚ QQuery 6  β”‚   5.81 / 6.72 Β±0.74 / 7.97 ms β”‚   5.61 / 6.57 
Β±1.06 / 8.58 ms β”‚ no change β”‚
   β”‚ QQuery 7  β”‚16.68 / 17.12 Β±0.25 / 17.45 ms β”‚16.62 / 16.71 
Β±0.10 / 16.90 ms β”‚ no change β”‚
   β”‚ QQuery 8  β”‚ 412.71 / 420.72 Β±8.99 / 438.06 ms β”‚401.33 / 417.04 
Β±12.01 / 434.91 ms β”‚ no change β”‚
   β”‚ QQuery 9  β”‚ 628.36 / 639.32 Β±6.84 / 648.31 ms β”‚ 621.94 / 634.97 
Β±6.93 / 640.84 ms β”‚ no change β”‚
   β”‚ QQuery 10 β”‚91.30 / 92.43 Β±1.55 / 95.45 ms β”‚89.73 / 91.25 
Β±1.11 / 92.92 ms β”‚ no change β”‚
   β”‚ QQuery 11 β”‚ 102.80 / 104.42 Β±1.63 / 106.92 ms β”‚ 102.82 / 103.85 
Β±0.81 / 105.11 ms β”‚ no change β”‚
   β”‚ QQuery 12 β”‚ 330.37 / 342.33 Β±7.28 / 353.21 ms β”‚ 329.29 / 339.39 
Β±8.93 / 353.37 ms β”‚ no change β”‚
   β”‚ QQuery 13 β”‚ 442.00 / 457.80 Β±9.56 / 469.38 ms β”‚446.65 / 468.59 
Β±18.24 / 501.15 ms β”‚ no change β”‚
   β”‚ QQuery 14 β”‚ 340.00 / 343.67 Β±3.22 / 348.81 ms β”‚ 338.85 / 341.73 
Β±2.69 / 346.04 ms β”‚ no change β”‚
   β”‚ QQuery 15 β”‚ 

Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-15 Thread via GitHub


adriangbot commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4252893228

   πŸ€– Benchmark running (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4252876807)
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux 
bench-c4252876807-1300-wbhhh 6.12.55+ #1 SMP Sun Feb  1 08:59:41 UTC 2026 
aarch64 GNU/Linux`
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Comparing file-stats-cache (3b64005590fb19acfdc4ae3e7cf05403733830bb) to 
244f891 (merge-base) 
[diff](https://github.com/apache/datafusion/compare/244f89120b254f824b6fed0f8a88c23a74bf09d7..3b64005590fb19acfdc4ae3e7cf05403733830bb)
 using: clickbench_partitioned
   Results will be posted here when complete
   
   ---
   [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) 
against this benchmark runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-15 Thread via GitHub


adriangb commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4252900371

   @mkleen you should be able to trigger benchmark runs now fwiw


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-15 Thread via GitHub


adriangb commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4252876807

   run benchmark clickbench_partitioned


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-15 Thread via GitHub


mkleen commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4252380128

   @kosiew I added all the changes you requested. Could you do another review 
please?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-15 Thread via GitHub


mkleen commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4252322673

   This looks much better now: 
   
   ```
   
┏━━━┳━━━┳┳━━━┓
   ┃ Query ┃  HEAD ┃   
file-stats-cache ┃Change ┃
   
┑━━━╇━━━╇╇━━━┩
   β”‚ QQuery 0  β”‚  1.21 / 4.43 Β±6.35 / 17.13 ms β”‚   1.24 / 4.48 
Β±6.39 / 17.27 ms β”‚ no change β”‚
   ```
   
   There are a few minor regressions which could be just noise. Could we do a 
second run?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-15 Thread via GitHub


adriangbot commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4252033177

   πŸ€– Benchmark completed (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4251865503)
   
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB)
   
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Details
   
   
   ```
   Comparing HEAD and file-stats-cache
   
   Benchmark clickbench_partitioned.json
   
   
┏━━━┳━━━┳┳━━━┓
   ┃ Query ┃  HEAD ┃   
file-stats-cache ┃Change ┃
   
┑━━━╇━━━╇╇━━━┩
   β”‚ QQuery 0  β”‚  1.21 / 4.43 Β±6.35 / 17.13 ms β”‚   1.24 / 4.48 
Β±6.39 / 17.27 ms β”‚ no change β”‚
   β”‚ QQuery 1  β”‚14.71 / 14.92 Β±0.18 / 15.23 ms β”‚ 14.17 / 14.68 
Β±0.32 / 15.17 ms β”‚ no change β”‚
   β”‚ QQuery 2  β”‚43.45 / 43.82 Β±0.38 / 44.52 ms β”‚ 43.72 / 44.07 
Β±0.33 / 44.61 ms β”‚ no change β”‚
   β”‚ QQuery 3  β”‚40.71 / 44.62 Β±2.98 / 48.23 ms β”‚ 42.83 / 45.30 
Β±1.54 / 47.53 ms β”‚ no change β”‚
   β”‚ QQuery 4  β”‚ 279.65 / 290.03 Β±5.81 / 297.44 ms β”‚ 278.66 / 297.03 
Β±16.87 / 321.40 ms β”‚ no change β”‚
   β”‚ QQuery 5  β”‚ 334.32 / 342.47 Β±4.70 / 347.62 ms β”‚  341.14 / 347.92 
Β±5.80 / 357.84 ms β”‚ no change β”‚
   β”‚ QQuery 6  β”‚   5.96 / 6.35 Β±0.28 / 6.69 ms β”‚   6.50 / 8.50 
Β±1.55 / 10.30 ms β”‚  1.34x slower β”‚
   β”‚ QQuery 7  β”‚16.21 / 16.97 Β±0.71 / 18.27 ms β”‚ 16.43 / 19.59 
Β±5.88 / 31.34 ms β”‚  1.15x slower β”‚
   β”‚ QQuery 8  β”‚ 407.88 / 418.65 Β±8.08 / 429.99 ms β”‚ 413.70 / 422.51 
Β±11.66 / 445.28 ms β”‚ no change β”‚
   β”‚ QQuery 9  β”‚ 630.52 / 639.24 Β±5.76 / 647.14 ms β”‚  636.38 / 644.87 
Β±5.68 / 653.84 ms β”‚ no change β”‚
   β”‚ QQuery 10 β”‚91.55 / 93.38 Β±2.06 / 97.40 ms β”‚ 91.34 / 93.25 
Β±2.69 / 98.47 ms β”‚ no change β”‚
   β”‚ QQuery 11 β”‚ 104.43 / 106.09 Β±1.48 / 108.07 ms β”‚  103.53 / 104.35 
Β±0.69 / 105.38 ms β”‚ no change β”‚
   β”‚ QQuery 12 β”‚ 336.53 / 340.35 Β±2.76 / 342.96 ms β”‚ 334.51 / 344.06 
Β±10.30 / 363.61 ms β”‚ no change β”‚
   β”‚ QQuery 13 β”‚442.59 / 461.29 Β±10.16 / 469.46 ms β”‚ 453.08 / 464.30 
Β±11.72 / 486.13 ms β”‚ no change β”‚
   β”‚ QQuery 14 β”‚ 341.31 / 347.52 Β±4.45 / 355.08 ms β”‚  331.97 / 342.74 
Β±6.03 / 349.90 ms β”‚ no change β”‚

Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-15 Thread via GitHub


adriangbot commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4251910707

   πŸ€– Benchmark running (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4251865503)
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux 
bench-c4251865503-1285-n44mp 6.12.55+ #1 SMP Sun Feb  1 08:59:41 UTC 2026 
aarch64 GNU/Linux`
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Comparing file-stats-cache (3b64005590fb19acfdc4ae3e7cf05403733830bb) to 
244f891 (merge-base) 
[diff](https://github.com/apache/datafusion/compare/244f89120b254f824b6fed0f8a88c23a74bf09d7..3b64005590fb19acfdc4ae3e7cf05403733830bb)
 using: clickbench_partitioned
   Results will be posted here when complete
   
   ---
   [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) 
against this benchmark runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-15 Thread via GitHub


mkleen commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4251884799

   > run benchmark clickbench_partitioned
   
   @Dandandan Thank you, i broke the linter. It's fixed now. Does this affect 
the benchmark run?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-15 Thread via GitHub


Dandandan commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4251865503

   
   run benchmark clickbench_partitioned
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-15 Thread via GitHub


mkleen commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4251546359

   @adriangb Could please do another benchmark run?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-12 Thread via GitHub


nuno-faria commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4231834426

   > There seem to be reproducible slowdowns
   
   ```
   
┏━━━┳━━━┳━━━┳━━━┓
   ┃ Query ┃  HEAD ┃  
file-stats-cache ┃Change ┃
   
┑━━━╇━━━╇━━━╇━━━┩
   β”‚ QQuery 0  β”‚  1.30 / 4.72 Β±6.59 / 17.91 ms β”‚16.67 / 16.91 
Β±0.33 / 17.54 ms β”‚  3.58x slower β”‚
   ```
   
   The first query in clickbench is a simple `count(*)`, which should complete 
immediately after getting the statistics as far as I'm aware. @mkleen I would 
start by checking if the statistics are being collected in the benchmark. Maybe 
the cache is not enabled there.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-12 Thread via GitHub


mkleen commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4231263601

   > Thanks @mkleen, LGTM after applying @kosiew suggestions. @alamb or 
@adriangb could you run some benchmarks here?
   
   @nuno-faria Thank you for your review! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-11 Thread via GitHub


mkleen commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4230402657

   > There seem to be reproducible slowdowns
   
   @adriangb Thanks, i will look into it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-11 Thread via GitHub


adriangb commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4229900383

   There seem to be reproducible slowdowns


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-11 Thread via GitHub


adriangbot commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4229898903

   πŸ€– Benchmark completed (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4229873645)
   
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB)
   
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Details
   
   
   ```
   Comparing HEAD and file-stats-cache
   
   Benchmark clickbench_partitioned.json
   
   
┏━━━┳━━━┳━━━┳━━━┓
   ┃ Query ┃  HEAD ┃  
file-stats-cache ┃Change ┃
   
┑━━━╇━━━╇━━━╇━━━┩
   β”‚ QQuery 0  β”‚  1.21 / 4.53 Β±6.48 / 17.50 ms β”‚17.33 / 17.53 
Β±0.23 / 17.98 ms β”‚  3.87x slower β”‚
   β”‚ QQuery 1  β”‚14.59 / 14.93 Β±0.20 / 15.09 ms β”‚30.67 / 31.65 
Β±0.61 / 32.32 ms β”‚  2.12x slower β”‚
   β”‚ QQuery 2  β”‚44.19 / 44.59 Β±0.29 / 44.86 ms β”‚60.88 / 61.86 
Β±0.65 / 62.88 ms β”‚  1.39x slower β”‚
   β”‚ QQuery 3  β”‚44.33 / 47.04 Β±2.16 / 49.51 ms β”‚61.76 / 64.10 
Β±2.28 / 66.98 ms β”‚  1.36x slower β”‚
   β”‚ QQuery 4  β”‚ 303.29 / 313.85 Β±9.30 / 326.83 ms β”‚ 322.38 / 327.04 
Β±3.20 / 331.66 ms β”‚ no change β”‚
   β”‚ QQuery 5  β”‚ 358.04 / 360.84 Β±2.17 / 363.75 ms β”‚ 376.83 / 387.56 
Β±6.42 / 396.69 ms β”‚  1.07x slower β”‚
   β”‚ QQuery 6  β”‚   5.84 / 6.98 Β±0.95 / 8.18 ms β”‚21.35 / 23.37 
Β±1.78 / 26.52 ms β”‚  3.35x slower β”‚
   β”‚ QQuery 7  β”‚17.26 / 17.76 Β±0.68 / 19.11 ms β”‚33.96 / 35.59 
Β±1.21 / 37.11 ms β”‚  2.00x slower β”‚
   β”‚ QQuery 8  β”‚ 440.93 / 452.26 Β±8.97 / 466.73 ms β”‚ 457.56 / 470.54 
Β±8.33 / 479.19 ms β”‚ no change β”‚
   β”‚ QQuery 9  β”‚ 690.41 / 704.52 Β±8.51 / 712.50 ms β”‚ 707.69 / 722.72 
Β±9.59 / 736.12 ms β”‚ no change β”‚
   β”‚ QQuery 10 β”‚92.89 / 94.97 Β±1.36 / 96.73 ms β”‚ 113.09 / 116.63 
Β±2.62 / 119.67 ms β”‚  1.23x slower β”‚
   β”‚ QQuery 11 β”‚ 106.91 / 107.82 Β±1.06 / 109.88 ms β”‚ 124.36 / 125.87 
Β±1.44 / 127.77 ms β”‚  1.17x slower β”‚
   β”‚ QQuery 12 β”‚ 355.12 / 358.54 Β±2.16 / 361.28 ms β”‚ 377.88 / 389.74 
Β±7.10 / 396.88 ms β”‚  1.09x slower β”‚
   β”‚ QQuery 13 β”‚462.19 / 482.73 Β±20.92 / 522.20 ms β”‚ 507.09 / 518.90 
Β±9.90 / 533.63 ms β”‚  1.07x slower β”‚
   β”‚ QQuery 14 β”‚ 363.85 / 367.26 Β±3.03 / 372.32 ms β”‚ 390.10 / 393.15 
Β±3.05 / 397.77 ms β”‚  1.07x slower β”‚
   β”‚ QQuery 15 β”‚ 

Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-11 Thread via GitHub


adriangbot commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4229877329

   πŸ€– Benchmark running (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4229873645)
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux 
bench-c4229873645-1091-dm6wp 6.12.55+ #1 SMP Sun Feb  1 08:59:41 UTC 2026 
aarch64 GNU/Linux`
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Comparing file-stats-cache (c49c90669333b9bd35a3c5d9ad64ce3b49935907) to 
8cf70ec (merge-base) 
[diff](https://github.com/apache/datafusion/compare/8cf70ec76c6f9de52709577f43559a0c93dd26d7..c49c90669333b9bd35a3c5d9ad64ce3b49935907)
 using: clickbench_partitioned
   Results will be posted here when complete
   
   ---
   [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) 
against this benchmark runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-11 Thread via GitHub


adriangb commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4229873645

   run benchmark clickbench_partitioned


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-11 Thread via GitHub


adriangbot commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4229863622

   πŸ€– Benchmark completed (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4229831576)
   
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB)
   
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Details
   
   
   ```
   Comparing HEAD and file-stats-cache
   
   Benchmark clickbench_partitioned.json
   
   
┏━━━┳━━━┳━━━┳━━━┓
   ┃ Query ┃  HEAD ┃  
file-stats-cache ┃Change ┃
   
┑━━━╇━━━╇━━━╇━━━┩
   β”‚ QQuery 0  β”‚  1.30 / 4.72 Β±6.59 / 17.91 ms β”‚16.67 / 16.91 
Β±0.33 / 17.54 ms β”‚  3.58x slower β”‚
   β”‚ QQuery 1  β”‚14.65 / 15.36 Β±0.50 / 16.20 ms β”‚29.65 / 30.60 
Β±0.50 / 31.12 ms β”‚  1.99x slower β”‚
   β”‚ QQuery 2  β”‚44.38 / 44.92 Β±0.54 / 45.87 ms β”‚59.24 / 60.16 
Β±0.65 / 60.80 ms β”‚  1.34x slower β”‚
   β”‚ QQuery 3  β”‚43.47 / 46.97 Β±2.24 / 49.29 ms β”‚57.55 / 60.30 
Β±2.65 / 64.24 ms β”‚  1.28x slower β”‚
   β”‚ QQuery 4  β”‚ 329.53 / 337.16 Β±4.30 / 342.23 ms β”‚ 297.91 / 304.08 
Β±4.96 / 311.45 ms β”‚ +1.11x faster β”‚
   β”‚ QQuery 5  β”‚ 370.65 / 378.86 Β±7.38 / 387.77 ms β”‚ 357.48 / 364.60 
Β±6.14 / 372.17 ms β”‚ no change β”‚
   β”‚ QQuery 6  β”‚  5.56 / 7.47 Β±2.44 / 12.14 ms β”‚22.10 / 23.10 
Β±0.84 / 24.28 ms β”‚  3.09x slower β”‚
   β”‚ QQuery 7  β”‚16.74 / 18.10 Β±1.68 / 21.35 ms β”‚32.82 / 33.16 
Β±0.32 / 33.71 ms β”‚  1.83x slower β”‚
   β”‚ QQuery 8  β”‚452.99 / 474.46 Β±16.45 / 495.91 ms β”‚ 439.01 / 450.72 
Β±9.82 / 468.47 ms β”‚ +1.05x faster β”‚
   β”‚ QQuery 9  β”‚ 731.49 / 742.40 Β±6.52 / 750.18 ms β”‚689.90 / 699.61 
Β±10.62 / 716.40 ms β”‚ +1.06x faster β”‚
   β”‚ QQuery 10 β”‚   94.37 / 98.95 Β±2.72 / 101.62 ms β”‚ 109.09 / 112.41 
Β±3.01 / 117.26 ms β”‚  1.14x slower β”‚
   β”‚ QQuery 11 β”‚ 106.48 / 108.64 Β±2.36 / 113.15 ms β”‚ 120.00 / 123.95 
Β±2.67 / 126.62 ms β”‚  1.14x slower β”‚
   β”‚ QQuery 12 β”‚ 374.05 / 380.20 Β±4.86 / 388.29 ms β”‚ 362.45 / 374.59 
Β±7.21 / 383.02 ms β”‚ no change β”‚
   β”‚ QQuery 13 β”‚493.79 / 516.90 Β±16.59 / 535.83 ms β”‚476.63 / 494.16 
Β±10.23 / 505.06 ms β”‚ no change β”‚
   β”‚ QQuery 14 β”‚ 375.25 / 379.26 Β±3.48 / 385.17 ms β”‚ 359.14 / 369.75 
Β±6.43 / 376.08 ms β”‚ no change β”‚
   β”‚ QQuery 15 β”‚ 

Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-11 Thread via GitHub


adriangbot commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4229863906

   πŸ€– Benchmark completed (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4229831576)
   
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB)
   
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Details
   
   
   ```
   Comparing HEAD and file-stats-cache
   
   Benchmark tpcds_sf1.json
   
   
┏━━━┳━━┳━━┳━━━┓
   ┃ Query ┃ HEAD ┃ 
file-stats-cache ┃Change ┃
   
┑━━━╇━━╇━━╇━━━┩
   β”‚ QQuery 1  β”‚  6.58 / 7.03 Β±0.82 / 8.67 ms β”‚  6.85 / 
7.22 Β±0.68 / 8.58 ms β”‚ no change β”‚
   β”‚ QQuery 2  β”‚144.77 / 146.06 Β±1.15 / 147.79 ms β”‚146.32 / 
147.72 Β±0.82 / 148.80 ms β”‚ no change β”‚
   β”‚ QQuery 3  β”‚113.28 / 114.21 Β±0.73 / 115.36 ms β”‚114.43 / 
115.16 Β±1.10 / 117.34 ms β”‚ no change β”‚
   β”‚ QQuery 4  β”‚ 1329.81 / 1344.00 Β±9.90 / 1354.79 ms β”‚1323.90 / 
1354.62 Β±29.12 / 1397.33 ms β”‚ no change β”‚
   β”‚ QQuery 5  β”‚171.23 / 173.76 Β±1.44 / 175.18 ms β”‚171.81 / 
174.91 Β±1.95 / 177.02 ms β”‚ no change β”‚
   β”‚ QQuery 6  β”‚   837.56 / 855.54 Β±16.82 / 879.95 ms β”‚844.15 / 
859.49 Β±9.51 / 869.10 ms β”‚ no change β”‚
   β”‚ QQuery 7  β”‚341.05 / 345.35 Β±2.45 / 348.44 ms β”‚342.94 / 
345.55 Β±2.45 / 349.44 ms β”‚ no change β”‚
   β”‚ QQuery 8  β”‚115.26 / 117.32 Β±1.61 / 120.16 ms β”‚115.32 / 
116.88 Β±0.96 / 118.31 ms β”‚ no change β”‚
   β”‚ QQuery 9  β”‚   101.50 / 108.71 Β±10.42 / 129.31 ms β”‚   101.93 / 
110.30 Β±10.86 / 131.81 ms β”‚ no change β”‚
   β”‚ QQuery 10 β”‚106.11 / 107.63 Β±1.06 / 108.81 ms β”‚105.63 / 
107.30 Β±1.44 / 109.82 ms β”‚ no change β”‚
   β”‚ QQuery 11 β”‚   924.17 / 945.49 Β±11.69 / 955.93 ms β”‚   901.97 / 
925.97 Β±14.31 / 939.91 ms β”‚ no change β”‚
   β”‚ QQuery 12 β”‚   45.55 / 46.26 Β±0.62 / 47.07 ms β”‚   44.08 / 
46.08 Β±1.72 / 48.53 ms β”‚ no change β”‚
   β”‚ QQuery 13 β”‚400.62 / 404.33 Β±2.62 / 408.77 ms β”‚400.05 / 
402.54 Β±2.15 / 406.06 ms β”‚ no change β”‚
   β”‚ QQuery 14 β”‚  998.37 / 1002.10 Β±3.72 / 1008.21 ms β”‚   995.99 / 
999.43 Β±2.77 / 1004.26 ms β”‚ no change β”‚
   β”‚ QQuery 15 β”‚   15.29 / 16.22 Β±

Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-11 Thread via GitHub


adriangbot commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4229835339

   πŸ€– Benchmark running (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4229831576)
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux 
bench-c4229831576-1089-4qn6f 6.12.55+ #1 SMP Sun Feb  1 08:59:41 UTC 2026 
aarch64 GNU/Linux`
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Comparing file-stats-cache (c49c90669333b9bd35a3c5d9ad64ce3b49935907) to 
8cf70ec (merge-base) 
[diff](https://github.com/apache/datafusion/compare/8cf70ec76c6f9de52709577f43559a0c93dd26d7..c49c90669333b9bd35a3c5d9ad64ce3b49935907)
 using: tpcds
   Results will be posted here when complete
   
   ---
   [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) 
against this benchmark runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-11 Thread via GitHub


adriangbot commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4229835291

   πŸ€– Benchmark running (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4229831576)
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux 
bench-c4229831576-1088-qfzb5 6.12.55+ #1 SMP Sun Feb  1 08:59:41 UTC 2026 
aarch64 GNU/Linux`
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Comparing file-stats-cache (c49c90669333b9bd35a3c5d9ad64ce3b49935907) to 
8cf70ec (merge-base) 
[diff](https://github.com/apache/datafusion/compare/8cf70ec76c6f9de52709577f43559a0c93dd26d7..c49c90669333b9bd35a3c5d9ad64ce3b49935907)
 using: clickbench_partitioned
   Results will be posted here when complete
   
   ---
   [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) 
against this benchmark runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Add a memory bound FileStatisticsCache for the Listing Table [datafusion]

2026-04-11 Thread via GitHub


adriangbot commented on PR #20047:
URL: https://github.com/apache/datafusion/pull/20047#issuecomment-4229835382

   πŸ€– Benchmark running (GKE) | 
[trigger](https://github.com/apache/datafusion/pull/20047#issuecomment-4229831576)
   **Instance:** `c4a-highmem-16` (12 vCPU / 65 GiB) | `Linux 
bench-c4229831576-1090-64z4q 6.12.55+ #1 SMP Sun Feb  1 08:59:41 UTC 2026 
aarch64 GNU/Linux`
   CPU Details (lscpu)
   
   ```
   Architecture:aarch64
   CPU op-mode(s):  64-bit
   Byte Order:  Little Endian
   CPU(s):  16
   On-line CPU(s) list: 0-15
   Vendor ID:   ARM
   Model name:  Neoverse-V2
   Model:   1
   Thread(s) per core:  1
   Core(s) per cluster: 16
   Socket(s):   -
   Cluster(s):  1
   Stepping:r0p1
   BogoMIPS:2000.00
   Flags:   fp asimd evtstrm aes pmull sha1 
sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 
sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 
sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm 
bf16 dgh rng bti
   L1d cache:   1 MiB (16 instances)
   L1i cache:   1 MiB (16 instances)
   L2 cache:32 MiB (16 instances)
   L3 cache:80 MiB (1 instance)
   NUMA node(s):1
   NUMA node0 CPU(s):   0-15
   Vulnerability Gather data sampling:  Not affected
   Vulnerability Indirect target selection: Not affected
   Vulnerability Itlb multihit: Not affected
   Vulnerability L1tf:  Not affected
   Vulnerability Mds:   Not affected
   Vulnerability Meltdown:  Not affected
   Vulnerability Mmio stale data:   Not affected
   Vulnerability Reg file data sampling:Not affected
   Vulnerability Retbleed:  Not affected
   Vulnerability Spec rstack overflow:  Not affected
   Vulnerability Spec store bypass: Mitigation; Speculative Store 
Bypass disabled via prctl
   Vulnerability Spectre v1:Mitigation; __user pointer 
sanitization
   Vulnerability Spectre v2:Mitigation; CSV2, BHB
   Vulnerability Srbds: Not affected
   Vulnerability Tsa:   Not affected
   Vulnerability Tsx async abort:   Not affected
   Vulnerability Vmscape:   Not affected
   ```
   
   
   
   Comparing file-stats-cache (c49c90669333b9bd35a3c5d9ad64ce3b49935907) to 
8cf70ec (merge-base) 
[diff](https://github.com/apache/datafusion/compare/8cf70ec76c6f9de52709577f43559a0c93dd26d7..c49c90669333b9bd35a3c5d9ad64ce3b49935907)
 using: tpch
   Results will be posted here when complete
   
   ---
   [File an issue](https://github.com/adriangb/datafusion-benchmarking/issues) 
against this benchmark runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



  1   2   >