On 8/26/25 03:08, Peter Geoghegan wrote:
> On Mon Aug 25, 2025 at 10:18 AM EDT, Tomas Vondra wrote:
>> The attached patch is a PoC implementing this. The core idea is that if
>> we measure "miss probability" for a chunk of requests, we can use that
>> to estimate the distance needed to generate e_i_c IOs.
> 
> I noticed an assertion failure when the tests run. Looks like something about
> the patch breaks the read stream from the point of view of VACUUM:
> 
> TRAP: failed Assert("stream->pinned_buffers + stream->pending_read_nblocks <= 
> stream->max_pinned_buffers"), File: 
> "../source/src/backend/storage/aio/read_stream.c", Line: 402, PID: 1238204
> [0x55e71f653d29] read_stream_start_pending_read: 
> /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/storage/aio/read_stream.c:401

Seems the distance adjustment was not quite right, didn't enforce the
limit on pinned buffers, and the distance could get too high. The
attached version should fix that ...

But there's still something wrong. I tried running check-world, and I
see 027_stream_regress.pl is getting stuck in join.sql, for the query on
line 417.

I haven't figured this out yet, but there's a mergejoin. It does reset
the stream a lot, so maybe there's something wrong there ... It's
strange, though. Why would a different distance make the query stuck?

Anyway, Thomas' patch from [1] doesn't seem to have this issue. And
maybe it's a better / more elegant approach in general?


[1]
https://www.postgresql.org/message-id/CA%2BhUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw%40mail.gmail.com

-- 
Tomas Vondra
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index ed5feac2d39..dc57ff0c640 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -116,6 +116,15 @@ struct ReadStream
 	int64		forwarded_count;
 	int64		distance_hist[16];
 
+	/* acceptable distance range */
+	int16		distance_min;
+	int16		distance_max;
+
+	/* number of hits / misses */
+	int16		num_reads;
+	int16		num_hits;
+	int16		threshold;
+
 	/*
 	 * One-block buffer to support 'ungetting' a block number, to resolve flow
 	 * control problems when I/Os are split.
@@ -235,6 +244,119 @@ read_stream_get_block(ReadStream *stream, void *per_buffer_data)
 	return blocknum;
 }
 
+/*
+ * read_stream_maybe_adjust_distance
+ *		adjust distance based on the number of hits/misses
+ *
+ * This adjusts three parameters used to pick the distance, based on hits and
+ * misses when reading the buffers:
+ *
+ * - distance - the "actual distance"
+ *
+ * - distance_min and distance_max - this restricts the accetable range
+ *   for the "actual distance"
+ *
+ * The range is adjusted with every miss. It starts with [1,1], and every miss
+ * increases the limits up to [max_ios, PG_INT16_MAX]. The "actual distance" is
+ * kept within the range. This means the distance "ramps up" gradually, and does
+ * not drop all the way back to 1.
+ *
+ * The "actual distance" is adjusted less frequently, after seeing a chunk of
+ * requests. We calculate the "probability of a miss" and use it to estimate how
+ * many requests to look ahead to keep "max_ios" in the queue. The calculated
+ * distance is still kept in the min/max range.
+ */
+static inline void
+read_stream_adjust_distance(ReadStream *stream, bool miss)
+{
+	int16	max_distance = stream->max_pinned_buffers;
+
+	/*
+	 * Count hits/misses and (maybe) widen the distance range.
+	 *
+	 * XXX Maybe the range should be adjusted always, not just for a miss?
+	 *
+	 * XXX The min distance is capped to max_ios, because that's the maximum
+	 * number we know we can handle for 100% miss rate.
+	 */
+	if (miss)
+	{
+		stream->num_reads++;
+		stream->distance_min = Min(stream->distance_min * 2,
+								   max_distance);
+		stream->distance_max = Min(stream->distance_max * 2,
+								   max_distance);
+	}
+	else
+	{
+		stream->num_hits++;
+	}
+
+	/*
+	 * Adjust the actual distance, based on miss ratio.
+	 *
+	 * We only do this once in a while, after seeing "threshold" requests, so
+	 * that we have somewhat accurate estimate of miss ratio. We can still see
+	 * miss_prob=0.0, so be careful about it.
+	 */
+	if ((stream->num_hits + stream->num_reads) >= stream->threshold)
+	{
+		/*
+		 * If we saw any misses, estimate how far to look ahead to see max_ios
+		 * I/Os (which considers effective_io_concurrency, our goal).
+		 */
+		if (stream->num_reads > 0)
+		{
+			/* probability of a miss */
+			double	miss_prob
+				= stream->num_reads * 1.0 / (stream->num_hits + stream->num_reads);
+
+			/* number of requests to get max_ios misses */
+			stream->distance = Min(stream->max_ios / miss_prob,
+								   PG_INT16_MAX);
+
+			stream->distance = Min(stream->distance, max_distance);
+		}
+		else
+		{
+			/*
+			 * With no misses, we simply use the current minimal distance.
+			 *
+			 * XXX Maybe we should use the maximum instead?
+			 */
+			stream->distance = stream->distance_min;
+		}
+
+		/* reset the counters, to start a new interval */
+		stream->num_hits = 0;
+		stream->num_reads = 0;
+
+		/*
+		 * When to re-calculate the distance? Not too often, to get a good
+		 * of miss probability (we need to see enough requests), but also
+		 * not too infrequently (we want this to be adaptive).
+		 *
+		 * XXX Seems reasonable to base this on the distance. It means we
+		 * expect to see max_ios misses, because that's how we calculated the
+		 * distance.
+		 *
+		 * XXX But don't do this too infrequently. The distance can get quite
+		 * high, so cap to 10x the I/Os. Arbitrary value, maybe needs more
+		 * thought.
+		 */
+		stream->threshold = Max(stream->distance, stream->max_ios * 10);
+	}
+
+	/*
+	 * in any case, respect distance_min, distance_max
+	 *
+	 * XXX This means we actually adjust the distance after every miss, not just
+	 * after every stream->threshold requests. Is this a good idea?
+	 */
+	stream->distance = Max(stream->distance, stream->distance_min);
+	stream->distance = Min(stream->distance, stream->distance_max);
+}
+
 /*
  * In order to deal with buffer shortages and I/O limits after short reads, we
  * sometimes need to defer handling of a block we've already consumed from the
@@ -398,8 +520,7 @@ read_stream_start_pending_read(ReadStream *stream)
 	if (!need_wait)
 	{
 		/* Look-ahead distance decays, no I/O necessary. */
-		if (stream->distance > 1)
-			stream->distance--;
+		read_stream_adjust_distance(stream, false);
 	}
 	else
 	{
@@ -758,6 +879,13 @@ read_stream_begin_impl(int flags,
 	else
 		stream->distance = 1;
 
+	stream->num_hits = 0;
+	stream->num_reads = 0;
+
+	stream->distance_min = 1;
+	stream->distance_max = 1;
+	stream->threshold = stream->max_ios;	/* XXX rethink? */
+
 	/*
 	 * Since we always access the same relation, we can initialize parts of
 	 * the ReadBuffersOperation objects and leave them that way, to avoid
@@ -971,7 +1099,6 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index)
 	{
 		int16		io_index = stream->oldest_io_index;
-		int32		distance;	/* wider temporary value, clamped below */
 
 		/* Sanity check that we still agree on the buffers. */
 		Assert(stream->ios[io_index].op.buffers ==
@@ -985,9 +1112,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 			stream->oldest_io_index = 0;
 
 		/* Look-ahead distance ramps up rapidly after we do I/O. */
-		distance = stream->distance * 2;
-		distance = Min(distance, stream->max_pinned_buffers);
-		stream->distance = distance;
+		read_stream_adjust_distance(stream, true);
 
 		/*
 		 * If we've reached the first block of a sequential region we're
@@ -1146,6 +1271,13 @@ read_stream_reset(ReadStream *stream)
 	stream->distance = Max(1, stream->distance_old);
 	stream->distance_old = 0;
 
+	stream->distance_min = stream->distance;
+	stream->distance_max = stream->distance;
+
+	stream->num_hits = 0;
+	stream->num_reads = 0;
+	stream->threshold = stream->max_ios;
+
 	/* track the number of resets */
 	stream->reset_count += 1;
 }

Reply via email to