I define a table user_ranks as such:

CREATE TABLE user_ranks (
  id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  rank INTEGER NOT NULL,
  CONSTRAINT "by (rank, id)" UNIQUE (rank, id)
);

INSERT INTO user_ranks (user_id, rank) SELECT generate_series(1, 10000),
generate_series(1, 10000);

Here's a query I'd like to optimize:

explain (analyze,verbose)
SELECT
  t3_0."id" AS "id",
  t3_0."rank" AS "rank"
FROM
  LATERAL (
    SELECT
      t4_0."rank" AS "rank"
    FROM
      user_ranks AS t4_0
    WHERE
      (t4_0."id" = 4732455)
  ) AS t3_1
  INNER JOIN user_ranks AS t3_0 ON true
WHERE
  (
    ((t3_0."rank", t3_0."id") <= (t3_1."rank", 4732455))
    AND true
  )
ORDER BY
  t3_0."rank" DESC,
  t3_0."id" DESC
LIMIT
  10

It compiles to the following plan:

 Limit  (cost=0.56..250.94 rows=10 width=12) (actual time=8.078..8.078
rows=1 loops=1)
   Output: t3_0.id, t3_0.rank
   ->  Nested Loop  (cost=0.56..41763.27 rows=1668 width=12) (actual
time=8.075..8.076 rows=1 loops=1)
         Output: t3_0.id, t3_0.rank
         Inner Unique: true
         Join Filter: (ROW(t3_0.rank, t3_0.id) <= ROW(t4_0.rank, 4732455))
         Rows Removed by Join Filter: 5002
         ->  Index Only Scan Backward using "by (rank,id)" on
public.user_ranks t3_0  (cost=0.28..163.33 rows=5003 width=12) (actual
time=0.023..0.638 rows=5003 loops=1)
               Output: t3_0.rank, t3_0.id
               Heap Fetches: 0
         ->  Index Scan using "by id" on public.user_ranks t4_0
 (cost=0.28..8.30 rows=1 width=8) (actual time=0.001..0.001 rows=1
loops=5003)
               Output: t4_0.id, t4_0.rating, t4_0.rank
               Index Cond: (t4_0.id = 4732455)

As you can see, there are a lot of rows returned by t3_0, which are then
filtered by Join Filter. But it would have been better if instead of the
filter, the  t3_0 table would have an Index Cond. Similar to how it happens
when a correlated subquery is used (or a CTE)

explain (analyze,verbose)
SELECT
  t3_0."id" AS "id",
  t3_0."rank" AS "rank"
FROM
  user_ranks AS t3_0
WHERE
  (
    ((t3_0."rank", t3_0."id") <= (
    SELECT
      t4_0."rank" AS "rank",
      t4_0."id" AS "id"
    FROM
      user_ranks AS t4_0
    WHERE
      (t4_0."id" = 4732455)
    ))
    AND true
  )
ORDER BY
  t3_0."rank" DESC,
  t3_0."id" DESC
LIMIT
  10

 Limit  (cost=8.58..8.95 rows=10 width=12) (actual time=0.062..0.064 rows=1
loops=1)
   Output: t3_0.id, t3_0.rank
   InitPlan 1 (returns $0,$1)
     ->  Index Scan using "by id" on public.user_ranks t4_0
 (cost=0.28..8.30 rows=1 width=12) (actual time=0.024..0.025 rows=1 loops=1)
           Output: t4_0.rank, t4_0.id
           Index Cond: (t4_0.id = 4732455)
   ->  Index Only Scan Backward using "by (rank,id)" on public.user_ranks
t3_0  (cost=0.28..61.47 rows=1668 width=12) (actual time=0.061..0.062
rows=1 loops=1)
         Output: t3_0.id, t3_0.rank
         Index Cond: (ROW(t3_0.rank, t3_0.id) <= ROW($0, $1))
         Heap Fetches: 0


I'm an opposite of a PostgreSQL expert, but it was surprising to me to see
that a correlated subquery behaves better than a join. Is this normal? Is
it something worth fixing/easy to fix?

Sincerely,
Bakhtiyar

Reply via email to