Re: Should io_method=worker remain the default?

2025-09-03 Thread Tomas Vondra
case it makes sense, because the reads are random enough to prevent I/O combining. But for a sequential workload I'd expect I/O combining to help. Could it be that it ends up evicting buffers randomly, which (I guess) might interfere with the combining? What's shared_buffers set to? Have you watched how large the I/O requests are? iostat, iosnoop or strace would tell you. regards -- Tomas Vondra

Re: Changing the state of data checksums in a running cluster

2025-09-01 Thread Tomas Vondra
On 8/29/25 16:38, Tomas Vondra wrote: > On 8/29/25 16:26, Tomas Vondra wrote: >> ... >> >> I've seen these failures after changing checksums in both directions, >> both after enabling and disabling. But I've only ever seen this after >> immediate

Re: Adding skip scan (including MDAM style range skip scan) to nbtree

2025-08-29 Thread Tomas Vondra
On 8/29/25 21:03, Peter Geoghegan wrote: > On Fri, Aug 29, 2025 at 9:10 AM Tomas Vondra wrote: >> Peter, any thoughts on this. Do you think it's reasonable / feasible to >> push the fix? > > I don't feel comfortable pushing that fix today. > Understood. >

Re: Changing the state of data checksums in a running cluster

2025-08-29 Thread Tomas Vondra
On 8/29/25 16:26, Tomas Vondra wrote: > ... > > I've seen these failures after changing checksums in both directions, > both after enabling and disabling. But I've only ever seen this after > immediate shutdown, never after fast shutdown. (It's interesting the > pg

Re: Changing the state of data checksums in a running cluster

2025-08-29 Thread Tomas Vondra
On 8/27/25 14:42, Tomas Vondra wrote: > On 8/27/25 14:39, Tomas Vondra wrote: >> ... >> >> And this happened on Friday: >> >> commit c13070a27b63d9ce4850d88a63bf889a6fde26f0 >> Author: Alexander Korotkov >> Date: Fri Aug 22 18:44:39 2025 +0300 &

Re: Adding skip scan (including MDAM style range skip scan) to nbtree

2025-08-29 Thread Tomas Vondra
of index access > methods' handler_function output to const static, from dynamic in > memctx. > IIRC both approaches address the issue. I'd go with Peter's patch for 18. The other patch is much more invasive / bigger, and we're right before RC1 freeze. Maybe it's a good idea, but I'd say it's for 19. Peter, any thoughts on this. Do you think it's reasonable / feasible to push the fix? regards -- Tomas Vondra

Re: index prefetching

2025-08-28 Thread Tomas Vondra
On 8/29/25 01:57, Peter Geoghegan wrote: > On Thu, Aug 28, 2025 at 7:52 PM Tomas Vondra wrote: >> Use this branch: >> >> https://github.com/tvondra/postgres/commits/index-prefetch-master/ >> >> and then Thomas' patch that increases the prefetch distanc

Re: index prefetching

2025-08-28 Thread Tomas Vondra
On 8/29/25 01:27, Andres Freund wrote: > Hi, > > On 2025-08-29 01:00:58 +0200, Tomas Vondra wrote: >> I'm not sure how to determine what concurrency it "wants". All I know is >> that for "warm" runs [1], the basic index prefetch patch uses dista

Re: index prefetching

2025-08-28 Thread Tomas Vondra
On 8/28/25 21:52, Andres Freund wrote: > Hi, > > On 2025-08-28 19:08:40 +0200, Tomas Vondra wrote: >> On 8/28/25 18:16, Andres Freund wrote: >>>> So I think the IPC overhead with "worker" can be quite significant, >>>> especially for cases

Re: index prefetching

2025-08-28 Thread Tomas Vondra
On 8/28/25 23:50, Thomas Munro wrote: > On Fri, Aug 29, 2025 at 7:52 AM Andres Freund wrote: >> On 2025-08-28 19:08:40 +0200, Tomas Vondra wrote: >>> From the 2x regression (compared to master) it might seem like that, but >>> even with the increased distance it

Re: index prefetching

2025-08-28 Thread Tomas Vondra
On 8/28/25 18:16, Andres Freund wrote: > Hi, > > On 2025-08-28 14:45:24 +0200, Tomas Vondra wrote: >> On 8/26/25 17:06, Tomas Vondra wrote: >> I kept thinking about this, and in the end I decided to try to measure >> this IPC overhead. The backend/ioworker communicate

Re: Changing the state of data checksums in a running cluster

2025-08-28 Thread Tomas Vondra
even without primary shutdown. But the standby "fast" shutdown is always there. But this also shows a limitation of the TAP test - it never triggers the shutdowns while flipping the checksums (in flip_data_checksums). I think that's something worth testing. regards -- Tomas Vondra

Re: index prefetching

2025-08-28 Thread Tomas Vondra
On 8/26/25 17:06, Tomas Vondra wrote: > > > On 8/26/25 01:48, Andres Freund wrote: >> Hi, >> >> On 2025-08-25 15:00:39 +0200, Tomas Vondra wrote: >>> >>> ... >>> >>> I'm not sure what's causing this, but almost all regr

Re: Changing the state of data checksums in a running cluster

2025-08-27 Thread Tomas Vondra
On 8/27/25 14:39, Tomas Vondra wrote: > ... > > And this happened on Friday: > > commit c13070a27b63d9ce4850d88a63bf889a6fde26f0 > Author: Alexander Korotkov > Date: Fri Aug 22 18:44:39 2025 +0300 > > Revert "Get rid of WALBufMappingLo

Re: Changing the state of data checksums in a running cluster

2025-08-27 Thread Tomas Vondra
On 8/27/25 13:00, Daniel Gustafsson wrote: >> On 27 Aug 2025, at 11:39, Tomas Vondra wrote: > >> Just to be clear - I don't see any pg_checksums failures either. I only >> see failures in the standby log, and I don't think the script checks >> that (it prob

Re: Changing the state of data checksums in a running cluster

2025-08-27 Thread Tomas Vondra
On 8/27/25 10:30, Daniel Gustafsson wrote: >> On 26 Aug 2025, at 01:06, Tomas Vondra wrote: > >> I think this TAP looks very nice, but there's a couple issues with it. >> See the attached patch fixing those. > > Thanks, I have incorporated (most of) your pa

Re: index prefetching

2025-08-26 Thread Tomas Vondra
On 8/26/25 01:48, Andres Freund wrote: > Hi, > > On 2025-08-25 15:00:39 +0200, Tomas Vondra wrote: >> Thanks. Based on the testing so far, the patch seems to be a substantial >> improvement. What's needed to make this prototype committable? > > Mainly some

Re: index prefetching

2025-08-26 Thread Tomas Vondra
On 8/26/25 03:08, Peter Geoghegan wrote: > On Mon Aug 25, 2025 at 10:18 AM EDT, Tomas Vondra wrote: >> The attached patch is a PoC implementing this. The core idea is that if >> we measure "miss probability" for a chunk of requests, we can use that >> to estimate
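The idea sketched in that message can be put in back-of-the-envelope form. If a miss doubles the read-ahead distance and each hit shrinks it by one, the distance is in balance when the doubling from one miss (+d) is cancelled by the run of hits that follows (-1 each), giving d ~ hits-per-miss = (1 - p)/p for miss probability p. A minimal illustration (plain Python, not PostgreSQL code; the function name is made up):

```python
def estimated_distance(miss_probability):
    """Equilibrium read-ahead distance, assuming a miss doubles the
    distance (+d) and each following hit subtracts one: the distance
    is stable when d == hits-per-miss == (1 - p) / p."""
    p = miss_probability
    return (1 - p) / p

print(estimated_distance(0.2))  # 20% misses -> distance ~4.0
print(estimated_distance(0.5))  # 50% misses -> distance ~1.0
```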

Re: Changing the state of data checksums in a running cluster

2025-08-25 Thread Tomas Vondra
On 8/25/25 20:32, Daniel Gustafsson wrote: >> On 20 Aug 2025, at 16:37, Tomas Vondra wrote: > >> This happens quite regularly, it's not hard to hit. But I've only seen >> it to happen on a FSM, and only right after immediate shutdown. I don't >> think

Re: index prefetching

2025-08-25 Thread Tomas Vondra
On 8/25/25 19:57, Peter Geoghegan wrote: > On Mon, Aug 25, 2025 at 10:18 AM Tomas Vondra wrote: >> Almost all regressions (at least the top ones) now look like this, i.e. >> distance collapses to ~2.0, which essentially disables prefetching. > > Good to know. > >&

Re: index prefetching

2025-08-25 Thread Tomas Vondra
On 8/25/25 17:43, Thomas Munro wrote: > On Tue, Aug 26, 2025 at 2:18 AM Tomas Vondra wrote: >> Of course, this can happen even with other hit ratios, there's nothing >> special about 50%. > > Right, that's what this patch was attacking directly, basically only

Re: index prefetching

2025-08-25 Thread Tomas Vondra
On 8/25/25 16:18, Tomas Vondra wrote: > ... > > But with more hits, the hit/miss ratio simply determines the "stable" > distance. Let's say there's 80% hits, so 4 hits to 1 miss. Then the > stable distance is ~4, because we get a miss, double to 8, and then 4
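The "80% hits, stable distance ~4" arithmetic in that snippet is easy to check with a toy event replay (plain Python; the clamp values are illustrative, and this is a sketch of the heuristic as described, not the actual read_stream code):

```python
def run(distance, events, floor=1, cap=256):
    """Replay cache events against the heuristic described above:
    'M' (miss) doubles the distance, 'H' (hit) shrinks it by one,
    clamped to [floor, cap]."""
    for e in events:
        if e == "M":
            distance = min(cap, distance * 2)
        else:
            distance = max(floor, distance - 1)
    return distance

# One miss followed by four hits (80% hit rate) returns to the
# starting distance, so it keeps oscillating between 4 and 8:
print(run(4, "MHHHH"))       # -> 4
print(run(4, "MHHHH" * 10))  # -> 4
```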

Re: index prefetching

2025-08-25 Thread Tomas Vondra
) So it's more a case of "mitigating a regression" (finding regressions like this is the purpose of my script). Still, I believe the questions about the distance heuristics are valid. (Another interesting detail is that the regression happens only with io_method=worker, not with io_uring

Re: index prefetching

2025-08-25 Thread Tomas Vondra
437)) Index Searches: 1 Prefetch Distance: 2.032 Prefetch Count: 868165 Prefetch Stalls: 2140228 Prefetch Skips: 6039906 Prefetch Resets: 0 Stream Ungets: 0 Stream Forwarded: 4 Prefetch Histogram: [2,4) => 855753, [4,8) => 12412 Buffers: shar

Re: Changing the state of data checksums in a running cluster

2025-08-20 Thread Tomas Vondra
e that as "off", i.e. error out. regards -- Tomas Vondra

Re: Changing the state of data checksums in a running cluster

2025-08-20 Thread Tomas Vondra
tgresql.org/message-id/f528413c-477a-4ec3-a0df-e22a80ffb...@vondra.me -- Tomas Vondra

Re: index prefetching

2025-08-19 Thread Tomas Vondra
imal to only initialize read_stream after reading the next batch. For some indexes a batch can have hundreds of items, and that certainly could benefit from prefetching. I suppose it should be possible to initialize the read_stream half-way though a batch, right? Or is there a reason why that can't work? regards [1] https://github.com/tvondra/postgres/tree/index-prefetch-master/query-stress-test -- Tomas Vondra

Re: Enable data checksums by default

2025-08-19 Thread Tomas Vondra
On 7/29/25 20:24, Tomas Vondra wrote: > Hi! > > So, what should we do with the PG18 open item? We (the RMT team) would > like to know if we shall keep the checksums enabled by default, and if > there's something that still needs to be done for PG18. > > We don't h

Re: VM corruption on standby

2025-08-19 Thread Tomas Vondra
2dc0e0d. regards -- Tomas Vondra

Re: index prefetching

2025-08-14 Thread Tomas Vondra
On 8/15/25 01:05, Peter Geoghegan wrote: > On Thu, Aug 14, 2025 at 6:24 PM Tomas Vondra wrote: >> FWIW I'm not claiming this explains all odd things we're investigating >> in this thread, it's more a confirmation that the scan direction may >> matter if it t

Re: index prefetching

2025-08-14 Thread Tomas Vondra
cate. It might > make sense to at least place that much of the burden on the > callback/client side. > I don't recall all the details, but IIRC my impression was it'd be best to do this "caching" entirely in the read_stream.c (so the next_block callbacks would probably not need to worry about lastBlock at all), enabled when creating the stream. And then there would be something like read_stream_release_buffer() that'd do the right thing to release the buffer when it's not needed. regards -- Tomas Vondra

Re: index prefetching

2025-08-14 Thread Tomas Vondra
On 8/14/25 01:19, Andres Freund wrote: > Hi, > > On 2025-08-14 01:11:07 +0200, Tomas Vondra wrote: >> On 8/13/25 23:57, Peter Geoghegan wrote: >>> On Wed, Aug 13, 2025 at 5:19 PM Tomas Vondra wrote: >>>> It's also not very surprising this happens w

Re: index prefetching

2025-08-13 Thread Tomas Vondra
On 8/14/25 01:50, Peter Geoghegan wrote: > On Wed Aug 13, 2025 at 5:19 PM EDT, Tomas Vondra wrote: >> I did investigate this, and I don't think there's anything broken in >> read_stream. It happens because ReadStream has a concept of "ungetting" >> a blo

Re: index prefetching

2025-08-13 Thread Tomas Vondra
On 8/13/25 23:36, Peter Geoghegan wrote: > On Wed, Aug 13, 2025 at 1:01 PM Tomas Vondra wrote: >> This seems rather bizarre, considering the two tables are exactly the >> same, except that in t2 the first column is negative, and the rows are >> fixed-length. Even heap_page_

Re: index prefetching

2025-08-13 Thread Tomas Vondra
On 8/13/25 23:57, Peter Geoghegan wrote: > On Wed, Aug 13, 2025 at 5:19 PM Tomas Vondra wrote: >> It's also not very surprising this happens with backwards scans more. >> The I/O is apparently much slower (due to missing OS prefetch), so we're >> much more likely

Re: index prefetching

2025-08-13 Thread Tomas Vondra
On 8/13/25 23:37, Andres Freund wrote: > Hi, > > On 2025-08-13 23:07:07 +0200, Tomas Vondra wrote: >> On 8/13/25 16:44, Andres Freund wrote: >>> On 2025-08-13 14:15:37 +0200, Tomas Vondra wrote: >>>> In fact, I believe this is about io_method. I initiall

Re: index prefetching

2025-08-13 Thread Tomas Vondra
again. It may seem as if read_stream_get_block() produced the same block twice, but it's really just the block from the last round. All duplicates produced by read_stream_look_ahead were caused by this. I suspected it's a bug in lastBlock optimization, but that's not the case, it happens entirely within read_stream. And it's expected. It's also not very surprising this happens with backwards scans more. The I/O is apparently much slower (due to missing OS prefetch), so we're much more likely to hit the I/O limits (max_ios and various other limits in read_stream_start_pending_read). regards -- Tomas Vondra

Re: index prefetching

2025-08-13 Thread Tomas Vondra
On 8/13/25 16:44, Andres Freund wrote: > Hi, > > On 2025-08-13 14:15:37 +0200, Tomas Vondra wrote: >> In fact, I believe this is about io_method. I initially didn't see the >> difference you described, and then I realized I set io_method=sync to >> make it easie

Re: index prefetching

2025-08-13 Thread Tomas Vondra
On 8/13/25 18:36, Peter Geoghegan wrote: > On Wed, Aug 13, 2025 at 8:15 AM Tomas Vondra wrote: >> 1) created a second table with an "inverse pattern" that's decreasing: >> >> create table t2 (like t) with (fillfactor = 20); >> insert into t2 select

Re: Adding basic NUMA awareness

2025-08-13 Thread Tomas Vondra
On 8/13/25 17:16, Andres Freund wrote: > Hi, > > On 2025-08-07 11:24:18 +0200, Tomas Vondra wrote: >> The patch does a much simpler thing - treat the weight as a "budget", >> i.e. number of buffers to allocate before proceeding to the "next" >> part

Re: index prefetching

2025-08-13 Thread Tomas Vondra
On 8/13/25 01:33, Peter Geoghegan wrote: > On Tue, Aug 12, 2025 at 7:10 PM Tomas Vondra wrote: >> Actually, this might be a consequence of how backwards scans work (at >> least in btree). I logged the block in index_scan_stream_read_next, and >> this is what I see in the

Re: index prefetching

2025-08-12 Thread Tomas Vondra
On 8/12/25 23:52, Tomas Vondra wrote: > > On 8/12/25 23:22, Peter Geoghegan wrote: >> ... >> >> It looks like the patch does significantly better with the forwards scan, >> compared to the backwards scan (though both are improved by a lot). But >> that

Re: index prefetching

2025-08-12 Thread Tomas Vondra
ise, since it looks > like > OS readahead remains a big factor with direct I/O. Did I just miss something > obvious? > I don't think you missed anything. It does seem the assumption relies on the OS handling the underlying I/O patterns equally, and unfortunately that does not seem to be the case. Maybe we could "invert" the data set, i.e. make it "descending" instead of "ascending"? That would make the heap access direction "forward" again ... regards -- Tomas Vondra

Re: index prefetching

2025-08-12 Thread Tomas Vondra
On 8/12/25 18:53, Tomas Vondra wrote: > ... > > EXPLAIN (ANALYZE, COSTS OFF) > SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a ASC; > > QUERY PLAN > > I

Re: index prefetching

2025-08-12 Thread Tomas Vondra
On 8/12/25 13:22, Nazir Bilal Yavuz wrote: > Hi, > > On Tue, 12 Aug 2025 at 08:07, Thomas Munro wrote: >> >> On Tue, Aug 12, 2025 at 11:42 AM Peter Geoghegan wrote: >>> On Mon, Aug 11, 2025 at 5:07 PM Tomas Vondra wrote: >>>> I can do some tests with f

Re: Adding basic NUMA awareness

2025-08-12 Thread Tomas Vondra
On 8/12/25 16:24, Andres Freund wrote: > Hi, > > On 2025-08-12 13:04:07 +0200, Tomas Vondra wrote: >> Right. I don't think the current patch would crash - I can't test it, >> but I don't see why it would crash. In the worst case it'd end up with >&

Re: Adding basic NUMA awareness

2025-08-12 Thread Tomas Vondra
On 8/9/25 02:25, Andres Freund wrote: > Hi, > > On 2025-08-07 11:24:18 +0200, Tomas Vondra wrote: >> 2) I'm a bit unsure what "NUMA nodes" actually means. The patch mostly >> assumes each core / piece of RAM is assigned to a particular NUMA node. > >

Re: index prefetching

2025-08-11 Thread Tomas Vondra
On 8/11/25 22:14, Peter Geoghegan wrote: > On Mon, Aug 11, 2025 at 10:16 AM Tomas Vondra wrote: >> Perhaps. For me benchmarks are a way to learn about stuff and better >> understand the pros/cons of approaches. It's possible some of the >> changes will impact the chara

Re: index prefetching

2025-08-11 Thread Tomas Vondra
On 8/9/25 01:47, Andres Freund wrote: > Hi, > > On 2025-08-06 16:12:53 +0200, Tomas Vondra wrote: >> That's quite possible. What concerns me about using tables like pgbench >> accounts table is reproducibility - initially it's correlated, and then >> it g

Re: Adding basic NUMA awareness

2025-08-07 Thread Tomas Vondra
On 8/7/25 11:24, Tomas Vondra wrote: > Hi! > > Here's a slightly improved version of the patch series. > Ah, I made a mistake when generating the patches. The 0001 and 0002 patches are not part of the NUMA stuff, it's just something related to benchmarking (addressing unr

Re: index prefetching

2025-08-06 Thread Tomas Vondra
On 8/5/25 23:35, Peter Geoghegan wrote: > On Tue, Aug 5, 2025 at 4:56 PM Tomas Vondra wrote: >> Probably. It was hard to predict which values will be interesting, maybe >> we can pick some subset now. I'll start by just doing larger steps, I >> think. Maybe increase by

Re: Bug in brin_minmax_multi_distance_numeric()

2025-08-06 Thread Tomas Vondra
On 8/5/25 22:17, Tom Lane wrote: > Tomas Vondra writes: >> On 8/5/25 20:11, Tom Lane wrote: >>> Yes, I think it ought to be committed/backpatched separately. >>> I was expecting Tomas to do that, but I can if he's busy ... > >> Sorry, I didn't realiz

Re: index prefetching

2025-08-05 Thread Tomas Vondra
On 8/5/25 19:19, Peter Geoghegan wrote: > On Tue, Aug 5, 2025 at 10:52 AM Tomas Vondra wrote: >> I ran some more tests, comparing the two patches, using data sets >> generated in a way to have a more gradual transition between correlated >> and random cases. > >

Re: Bug in brin_minmax_multi_distance_numeric()

2025-08-05 Thread Tomas Vondra
as expecting Tomas to do that, but I can if he's busy ... > Sorry, I didn't realize that - it seemed you're handling this. I can take care of this in the next couple days, if still needed. regards -- Tomas Vondra

Re: Bug in brin_minmax_multi_distance_numeric()

2025-08-01 Thread Tomas Vondra
On 7/31/25 20:35, Tom Lane wrote: > Tomas Vondra writes: >> On 7/31/25 19:33, Tom Lane wrote: >>> ... It is certainly broken on >>> 32-bit machines where the Datum result of numeric_float8 will >>> be a pointer, so that we will convert the numeric pointer

Re: Bug in brin_minmax_multi_distance_numeric()

2025-07-31 Thread Tomas Vondra
ng incorrect results". The distance functions determine in what order we merge points into ranges, and if the distances are bogus then we can build a summary that is less efficient. regards -- Tomas Vondra

Re: Enable data checksums by default

2025-07-31 Thread Tomas Vondra
On 7/31/25 15:39, Greg Burd wrote: > > >> On Jul 30, 2025, at 8:09 AM, Daniel Gustafsson wrote: >> >>> On 30 Jul 2025, at 11:58, Laurenz Albe wrote: >>> >>> On Tue, 2025-07-29 at 20:24 +0200, Tomas Vondra wrote: >>>> So, what shou

Re: Fix tab completion in v18 for ALTER DATABASE/USER/ROLE ... RESET

2025-07-31 Thread Tomas Vondra
"CONSTRAINTS", >> >> "TRANSACTION", > > Instead of adding another !TailMatches() call, why not just change > "DATABASE" to "DATABASE|ROLE|USER"? It seemed to me separate calls would be easier to understand, but I see combine it like this in many other places, so done that way ... Pushed. Thanks for the fixes! regards -- Tomas Vondra

Re: Adding basic NUMA awareness

2025-07-30 Thread Tomas Vondra
On 7/30/25 10:29, Jakub Wartak wrote: > On Mon, Jul 28, 2025 at 4:22 PM Tomas Vondra wrote: > > Hi Tomas, > > just a quick look here: > >> 2) The PGPROC part introduces a similar registry, [..] >> >> There's also a view pg_buffercache_pgproc. The pg_bu

Re: Reduce "Var IS [NOT] NULL" quals during constant folding

2025-07-29 Thread Tomas Vondra
rds [1] https://www.postgresql.org/message-id/602561.1744314879%40sss.pgh.pa.us [2] https://www.postgresql.org/message-id/1514756.1747925490%40sss.pgh.pa.us -- Tomas Vondra

Re: Enable data checksums by default

2025-07-29 Thread Tomas Vondra
41ff1 [2] https://www.postgresql.org/message-id/brdaw5wke274lubirrl4v2k4qdacylvgwwqztifn7m27pkth3s%40rh7wie47pfcp [3] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e6eed40e44419e3268d01fe0d2daec08a7df68f7 -- Tomas Vondra

Re: Fix tab completion in v18 for ALTER DATABASE/USER/ROLE ... RESET

2025-07-29 Thread Tomas Vondra
e role, it was offering all matching variables anyway. I believe that's because of the block at line ~5022. The "database" case was already excluded, so I made 0002 to do that for ROLE too. I plan to push the attached fixes soon ... regards -- Tomas Vondra From fa26b62298d7a4221d9bc

Re: should we have a fast-path planning for OLTP starjoins?

2025-07-28 Thread Tomas Vondra
On 2/4/25 22:55, Tom Lane wrote: > Tomas Vondra writes: >>> The interesting thing about this is we pretty much have all the >>> infrastructure for detecting such FK-related join conditions >>> already. Possibly the join order forcing could be done with >&

Re: PoC: adding CustomJoin, separate from CustomScan

2025-07-25 Thread Tomas Vondra
are simply part of the regular join search. We generate all the various paths for a joinrel, and then give the set_join_pathlist_hook hook a chance to add some more. AFAIK it doesn't affect the join order search, or anything like that. At least not directly. regards -- Tomas Vondra

Re: Adding basic NUMA awareness

2025-07-25 Thread Tomas Vondra
On 7/25/25 12:27, Jakub Wartak wrote: > On Thu, Jul 17, 2025 at 11:15 PM Tomas Vondra wrote: >> >> On 7/4/25 20:12, Tomas Vondra wrote: >>> On 7/4/25 13:05, Jakub Wartak wrote: >>>> ... >>>> >>>> 8. v1-0005 2x + /* if (numa_procs_inte

Re: index prefetching

2025-07-24 Thread Tomas Vondra
On 7/24/25 16:40, Peter Geoghegan wrote: > On Thu, Jul 24, 2025 at 7:19 AM Tomas Vondra wrote: >> I got a bit bored yesterday, so I gave this a try and whipped up a patch >> that adds two pgstattuple functins that I think could be useful for >> analyzing index metrics that m

Re: PoC: adding CustomJoin, separate from CustomScan

2025-07-24 Thread Tomas Vondra
On 7/24/25 15:57, Robert Haas wrote: > On Thu, Jul 24, 2025 at 9:04 AM Tomas Vondra wrote: >> With this patch, my custom join can simply do >> >> econtext->ecxt_outertuple = outer; >> econtext->ecxt_innertuple = inner; >> >> return ExecP

PoC: adding CustomJoin, separate from CustomScan

2025-07-24 Thread Tomas Vondra
ow. Note: I mentioned some extensions implementing SmoothScan/G-join. I plan to publish those once I polish that a bit more. It's more a research rather than something ready to use right now. regards [1] https://scholar.harvard.edu/files/stratos/files/smooth_vldbj.pdf [2] https://dl.g

Re: index prefetching

2025-07-24 Thread Tomas Vondra
On 7/23/25 02:37, Tomas Vondra wrote: > ... > >>> Thanks. I wonder how difficult would it be to add something like this to >>> pgstattuple. I mean, it shouldn't be difficult to look at leaf pages and >>> count distinct blocks, right? Seems quite useful.

Re: index prefetching

2025-07-23 Thread Tomas Vondra
On 7/23/25 17:09, Andres Freund wrote: > Hi, > > On 2025-07-23 14:50:15 +0200, Tomas Vondra wrote: >> On 7/23/25 02:59, Andres Freund wrote: >>> Hi, >>> >>> On 2025-07-23 02:50:04 +0200, Tomas Vondra wrote: >>>> But I don't see why woul

Re: index prefetching

2025-07-23 Thread Tomas Vondra
On 7/23/25 02:59, Andres Freund wrote: > Hi, > > On 2025-07-23 02:50:04 +0200, Tomas Vondra wrote: >> But I don't see why would this have any effect on the prefetch distance, >> queue depth etc. Or why decreasing INDEX_SCAN_MAX_BATCHES should improve >> tha

Re: index prefetching

2025-07-23 Thread Tomas Vondra
On 7/23/25 03:31, Peter Geoghegan wrote: > On Tue, Jul 22, 2025 at 8:37 PM Tomas Vondra wrote: >>> I happen to think that that's a very unrealistic assumption. Most >>> standard benchmarks have indexes that almost all look fairly similar >>> to pgbench_accou

Re: index prefetching

2025-07-22 Thread Tomas Vondra
On 7/23/25 02:59, Andres Freund wrote: > Hi, > > On 2025-07-23 02:50:04 +0200, Tomas Vondra wrote: >> But I don't see why would this have any effect on the prefetch distance, >> queue depth etc. Or why decreasing INDEX_SCAN_MAX_BATCHES should improve >> tha

Re: index prefetching

2025-07-22 Thread Tomas Vondra
processing a page takes much more time. Because it reads the page, and passes it to other operators in the query plan, some of which may do CPU stuff, some will trigger some synchronous I/O, etc. Which means T1 grows, and the "minimal" queue depth decreases. Which part of this is not quite right? -- Tomas Vondra
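A Little's-law style sketch of that argument (illustrative numbers only, not taken from the thread): if each in-flight read takes io_latency to complete and the backend spends per_page_work consuming each page (the T1 above), then roughly io_latency / per_page_work reads must be in flight to hide the latency — so as T1 grows, the minimal queue depth shrinks.

```python
import math

def min_queue_depth(io_latency_us, per_page_work_us):
    """Rough concurrency needed to hide I/O latency: while one page
    is being processed, the other outstanding reads make progress,
    so depth ~= I/O latency / per-page processing time."""
    return math.ceil(io_latency_us / per_page_work_us)

print(min_queue_depth(100, 25))  # cheap pages  -> depth 4
print(min_queue_depth(100, 50))  # costly pages -> depth 2
```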

Re: index prefetching

2025-07-22 Thread Tomas Vondra
ically just yet. > I think I mostly picked a value high enough to make it unlikely to hit it in realistic cases, while also not using too much memory, and 64 seemed like a good value. But I don't see why would this have any effect on the prefetch distance, queue depth etc. Or why decreasing INDEX_SCAN_MAX_BATCHES should improve that. I'd have expected exactly the opposite behavior. Could be bug, of course. But it'd be helpful to see the dataset/query. regards -- Tomas Vondra

Re: index prefetching

2025-07-22 Thread Tomas Vondra
On 7/22/25 23:35, Peter Geoghegan wrote: > On Tue, Jul 22, 2025 at 4:50 PM Tomas Vondra wrote: >>> Obviously, whatever advantage that the "complex" patch has is bound to >>> be limited to cases where index characteristics are naturally the >>>

Re: index prefetching

2025-07-22 Thread Tomas Vondra
On 7/22/25 19:35, Peter Geoghegan wrote: > On Tue, Jul 22, 2025 at 9:06 AM Tomas Vondra wrote: >> Real workloads are likely to have multiple misses in a row, which indeed >> ramps up the distance quickly. So maybe it's not that bad. Could we >> track a longer history of

Re: index prefetching

2025-07-22 Thread Tomas Vondra
d look-ahead distance better in cases like that. Needs more > exploration... thoughts/ideas welcome... Thanks! I'll rerun the tests with these patches once the current round of tests (with the simple distance restore after a reset) completes. -- Tomas Vondra

Re: index prefetching

2025-07-19 Thread Tomas Vondra
On 7/19/25 06:03, Thomas Munro wrote: > On Sat, Jul 19, 2025 at 6:31 AM Tomas Vondra wrote: >> Perhaps the ReadStream should do something like this? Of course, the >> simple patch resets the stream very often, likely much more often than >> anything else in the code. But woul

Re: Adding basic NUMA awareness

2025-07-18 Thread Tomas Vondra
On 7/18/25 18:46, Andres Freund wrote: > Hi, > > On 2025-07-17 23:11:16 +0200, Tomas Vondra wrote: >> Here's a v2 of the patch series, with a couple changes: > > Not a deep look at the code, just a quick reply. > > >> * I changed the freelist partitio

Re: index prefetching

2025-07-18 Thread Tomas Vondra
ps the ReadStream should do something like this? Of course, the simple patch resets the stream very often, likely much more often than anything else in the code. But wouldn't it be beneficial for streams reset because of a rescan? Possibly needs to be optional. regards -- Tomas Vondra From

Re: index prefetching

2025-07-18 Thread Tomas Vondra
_stream_reset(). (Will share results from a couple experiments in a separate message later.) This is the context of the benchmarks I've been sharing - me trying to understand the practical implications/limits of the simple approach. Not an attempt to somehow prove it's better, or anything like that. I'm not opposed to continuing work on the "complex" approach, but as I said, I'm sure I can't pull that off on my own. With your help, I think the chance of success would be considerably higher. Does this clarify how I think about the complex patch? regards [1] https://www.postgresql.org/message-id/32c15a30-6e25-4f6d-9191-76a19482c556%40vondra.me -- Tomas Vondra

Re: Adding basic NUMA awareness

2025-07-17 Thread Tomas Vondra
On 7/4/25 20:12, Tomas Vondra wrote: > On 7/4/25 13:05, Jakub Wartak wrote: >> ... >> >> 8. v1-0005 2x + /* if (numa_procs_interleave) */ >> >>Ha! it's a TRAP! I've uncommented it because I wanted to try it out >> without it (just

Re: index prefetching

2025-07-16 Thread Tomas Vondra
On 7/16/25 19:56, Tomas Vondra wrote: > On 7/16/25 18:39, Peter Geoghegan wrote: >> On Wed, Jul 16, 2025 at 11:29 AM Peter Geoghegan wrote: >>> For example, with "linear_10 / eic=16 / sync", it looks like "complex" >>> has about half the latency o

Re: index prefetching

2025-07-16 Thread Tomas Vondra
On 7/16/25 20:18, Peter Geoghegan wrote: > On Wed, Jul 16, 2025 at 1:42 PM Tomas Vondra wrote: >> On 7/16/25 16:45, Peter Geoghegan wrote: >>> I get that index characteristics could be the limiting factor, >>> especially in a world where we're not yet eagerly

Re: index prefetching

2025-07-16 Thread Tomas Vondra
ions. If you copy the first couple lines, you'll get scans.db, with nice column names and all that. The selectivity is calculated as (rows / total_rows) where rows is the rowcount returned by the query, and total_rows is reltuples. I also had charts with "page selectivity", but that often got a bunch of 100% points squashed on the right edge, so I stopped generating those. regards -- Tomas Vondra

Re: index prefetching

2025-07-16 Thread Tomas Vondra
On 7/16/25 17:29, Peter Geoghegan wrote: > On Wed, Jul 16, 2025 at 4:40 AM Tomas Vondra wrote: >> For "uniform" data set, both prefetch patches do much better than master >> (for low selectivities it's clearer in the log-scale chart). The >> "complex&qu

Re: index prefetching

2025-07-16 Thread Tomas Vondra
On 7/16/25 16:45, Peter Geoghegan wrote: > On Wed, Jul 16, 2025 at 10:37 AM Tomas Vondra wrote: >> What sounds weird? That the read_stream works like a stream of blocks, >> or that it can't do "pause" and we use "reset" as a workaround? > > The fact

Re: index prefetching

2025-07-16 Thread Tomas Vondra
On 7/16/25 16:29, Peter Geoghegan wrote: > On Wed, Jul 16, 2025 at 10:20 AM Tomas Vondra wrote: >> The read stream can only return blocks generated by the "next" callback. >> When we return the block for the last item on a leaf page, we can only >> return "Inva

Re: index prefetching

2025-07-16 Thread Tomas Vondra
On 7/16/25 16:07, Peter Geoghegan wrote: > On Wed, Jul 16, 2025 at 9:58 AM Tomas Vondra wrote: >>> The "simple" patch has _bt_readpage reset the read stream. That >>> doesn't make any sense to me. Though it does explain why the "complex" >>>

Re: index prefetching

2025-07-16 Thread Tomas Vondra
On 7/16/25 15:36, Peter Geoghegan wrote: > On Wed, Jul 16, 2025 at 4:40 AM Tomas Vondra wrote: >> But the thing I don't really understand it the "cyclic" dataset (for >> example). And the "simple" patch performs really badly here. This data >> set

Re: AIO v2.5

2025-07-14 Thread Tomas Vondra
On 7/14/25 20:36, Andres Freund wrote: > Hi, > > On 2025-07-11 23:03:53 +0200, Tomas Vondra wrote: >> I've been running some benchmarks comparing the io_methods, to help with >> resolving this PG18 open item. So here are some results, and my brief >> analysis o

Re: AIO v2.5

2025-07-14 Thread Tomas Vondra
On 7/14/25 20:44, Andres Freund wrote: > On 2025-07-13 20:04:51 +0200, Tomas Vondra wrote: >> On 7/11/25 23:03, Tomas Vondra wrote: >>> ... >>> >>> e) indexscan regression (ryzen-indexscan-uniform-pg17-checksums.png) >>> >>> There's an

Re: index prefetching

2025-07-13 Thread Tomas Vondra
On 7/13/25 01:50, Peter Geoghegan wrote: > On Thu, May 1, 2025 at 7:02 PM Tomas Vondra wrote: >> There's two "fix" patches trying to make this work - it does not crash, >> and almost all the "incorrect" query results are actually stats about >> b

Re: Proposal: Role Sandboxing for Secure Impersonation

2025-07-13 Thread Tomas Vondra
t in an explicit transaction. If the pooler starts wrapping everything in a transaction, this would break. Not sure if this might cause some issues with "idle in transaction" sessions. Maybe not ... So I don't think the pooler should be starting transactions, unless the user actually started a transaction. regards -- Tomas Vondra

Re: Proposal: Role Sandboxing for Secure Impersonation

2025-07-13 Thread Tomas Vondra
l too. Possibly by making the SET ROLE protocol "thing" generic enough to cover this kind of use case too. The other thing is that out-of-core extensions are not great for managed systems (e.g. set_user is nice, but how many systems can use that?). While that's really a problem of providers of those systems, I wonder if we should have something like this built-in. regards [1] https://www.postgresql.org/docs/current/ddl-rowsecurity.html [2] https://github.com/tvondra/signed_context -- Tomas Vondra

Re: AIO v2.5

2025-07-13 Thread Tomas Vondra
On 7/11/25 23:03, Tomas Vondra wrote: > ... > > e) indexscan regression (ryzen-indexscan-uniform-pg17-checksums.png) > > There's an interesting difference I noticed in the run with > checksums on PG17. The full PDF is available here: > > https://github.co

Re: Adding basic NUMA awareness

2025-07-10 Thread Tomas Vondra
nse to partition the numBufferAllocs too, though? I don't remember if my hacky experimental NUMA-partitioning patch did that or I just thought about doing that, but why wouldn't that be enough? Places that need the "total" count would have to sum the counters, but it seemed to me most of the places would be fine with the "local" count for that partition. If we also make sure to "sync" the clocksweeps so as to not work on just a single partition, that might be enough ... regards -- Tomas Vondra

Re: Adding basic NUMA awareness - Preliminary feedback and outline for an extensible approach

2025-07-10 Thread Tomas Vondra
and all as you do is great, and combo of both approach probably > of great interest. There is also this weighted interleave discussed and > probably much more to come in this area in Linux. > > I think some points raised already about possible distinct policies, I > am precisely claiming that it is hard to come with one good policy with > limited setup options, thus requirement to keep that flexible enough > (hooks, api, 100 GUc ?). > I'm sorry, I don't want to sound too negative, but "I want arbitrary extensibility" is not a very useful feedback. I've asked you to give some examples of policies that'd customize some of the NUMA stuff. > There is an EPYC story here also, given the NUMA setup can vary > depending on BIOS setup, associated NUMA policy must probably take that > into account (L3 can be either real cache or 4 extra "local" NUMA nodes > - with highly distinct access cost from a RAM module). > Does that change how PostgreSQL will place memory and process? Is it > important or of interest ? > So how exactly would the policy handle this? Right now we're entirely oblivious to L3, or on-CPU caches in general. We don't even consider the size of L3 when sizing hash tables in a hashjoin etc. regards -- Tomas Vondra

Re: index prefetching

2025-07-09 Thread Tomas Vondra
me very helpful chats about this with Peter Geoghegan, and I'm still open to the possibility of making it work. This simpler version is partially a hedge to have at least something in case the complex patch does not make it. regards [1] https://www.postgresql.org/message-id/t5aqjhkj6xdkido535
