Hi,
I'm trying to implement a "streaming" version of the classic "select the
latest version of a record" query.
By "streaming" I mean a query that executes by streaming what it needs out
of tables and indexes as it needs it rather than using temporary b-trees or
materializing anything up front.
I'm looking for a query that I can run and then just consume as many
results as I want without worrying about the size of the entire result set.
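By "consume" I mean driving the statement row by row, e.g. (Python's sqlite3 shown; the tiny table here is just a stand-in for my real one):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k TEXT, v INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [("a", 1), ("b", 2), ("c", 3)])

# Iterating the cursor steps the statement one row at a time; nothing
# here forces the whole result set to be computed up front -- unless
# the query plan itself materialises a temporary table first.
for row in conn.execute("SELECT k, v FROM t ORDER BY k"):
    print(row)  # stop whenever enough rows have been consumed
```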
Here's the schema I'm working with:
-----
CREATE TABLE "entrys" ("log-id" INTEGER NOT NULL , "entry-number" INTEGER
NOT NULL , "region" TEXT NOT NULL , "key" TEXT NOT NULL , "timestamp"
INTEGER NOT NULL , PRIMARY KEY ("log-id", "entry-number"))
CREATE UNIQUE INDEX "entrys-log-id-region-key-entry-number" ON "entrys" (
"log-id" ASC, "region" ASC, "key" ASC, "entry-number" ASC)
-----
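To show the semantics I'm after -- one row per key, at the greatest "entry-number" not exceeding some ceiling -- here is a toy version of the grouped subselect (Python's sqlite3, with made-up rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE "entrys" ("log-id" INTEGER NOT NULL, "entry-number" INTEGER
    NOT NULL, "region" TEXT NOT NULL, "key" TEXT NOT NULL, "timestamp"
    INTEGER NOT NULL, PRIMARY KEY ("log-id", "entry-number"))
""")
# Two versions of key 'a' and one of key 'b'.
conn.executemany('INSERT INTO "entrys" VALUES (?, ?, ?, ?, ?)', [
    (1, 1, "user", "a", 100),
    (1, 2, "user", "b", 101),
    (1, 3, "user", "a", 102),  # newer 'a', above the ceiling used below
])

# Latest version of each key at or below entry-number 2.
cur = conn.execute("""
    SELECT "key", MAX("entry-number") FROM "entrys"
    WHERE "log-id" = 1 AND "region" = 'user' AND "entry-number" <= 2
    GROUP BY "key" ORDER BY "key"
""")
print(cur.fetchall())  # [('a', 1), ('b', 2)]
```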
There are only a couple of million rows in "entrys", yet the query takes
2 or 3 seconds of startup time before the first row is returned.
Here's my query:
-----
-- explain query plan
SELECT
"entrys"."log-id" AS "log-id",
"entrys"."entry-number" AS "entry-number",
"entrys"."region" AS "region",
"entrys"."key" AS "key",
"entrys"."timestamp" AS "timestamp"
FROM
(SELECT
MAX("entry-number") AS "entry-number",
"key"
FROM "entrys"
WHERE
"log-id" = 1 AND
"region" = "user" AND
"entry-number" <= 1700108
AND key > "G"
GROUP BY "key"
ORDER BY "key" DESC
LIMIT 20 -- (1)
) AS "specific-entrys"
INNER JOIN "entrys"
ON
1 = "entrys"."log-id" AND
"specific-entrys"."key" = "entrys"."key" AND
"user" = "entrys"."region" AND
"specific-entrys"."entry-number" = "entrys"."entry-number"
AND "entrys"."key" > "G"
WHERE
"entrys"."log-id" = 1
ORDER BY "key" ASC
;
-----
...which has this query plan in SQLite version 3.31.0:
-----
QUERY PLAN
|--MATERIALIZE 1
| `--SEARCH TABLE entrys USING COVERING INDEX
entrys-log-id-region-key-entry-number (log-id=? AND region=? AND key<?)
|--SEARCH TABLE entrys USING INDEX entrys-log-id-region-key-entry-number
(log-id=? AND region=? AND key<?)
`--SEARCH SUBQUERY 1 AS specific-entrys USING AUTOMATIC COVERING INDEX
(key=?)
-----
My problem is with the MATERIALIZE.
The query produces just shy of 2 million rows.
It takes several seconds to start up but is then pretty quick when fetching
each row.
What I want to do is get rid of the startup costs so that I can paginate it
efficiently.
If I run the subselect on its own then there is no startup cost. The
results just get streamed straight out of the index.
Its query plan is
-----
QUERY PLAN
`--SEARCH TABLE entrys USING COVERING INDEX
entrys-log-id-region-key-entry-number (log-id=? AND region=? AND key<?)
-----
Is there anything I can do to make the original version of the query stream
the results in this way?
The best I have come up with is to insert a LIMIT clause at the point
denoted with "-- (1)". This keeps the subselect small, so materialising
it and generating the automatic covering index become cheap.
For pagination I then feed in the key from the last row of the previous
batch at the points denoted with "-- (2)" and "-- (3)".
If I do this then it seems equally cheap to access batches at the start and
end of the complete result set. The per-query cost is determined by the
batch size as set by the LIMIT clause.
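Concretely, the batching loop I have in mind looks like this (Python's sqlite3; the empty seed key and the batch size are illustrative, and only the inner grouped query is sketched -- the join back for the remaining columns works the same way):

```python
import sqlite3

def latest_in_batches(conn, log_id, region, max_entry, batch_size):
    """Yield (key, entry-number) pairs in key order, one LIMITed query per batch."""
    last_key = ""  # seeds below every real key; the only state carried between batches
    while True:
        batch = conn.execute("""
            SELECT "key", MAX("entry-number") FROM "entrys"
            WHERE "log-id" = ? AND "region" = ? AND "entry-number" <= ?
              AND "key" > ?  -- the pagination predicate: points (2)/(3)
            GROUP BY "key" ORDER BY "key" LIMIT ?
        """, (log_id, region, max_entry, last_key, batch_size)).fetchall()
        if not batch:
            return
        yield from batch
        last_key = batch[-1][0]  # feed the last key into the next query
```

Each query then materialises at most LIMIT grouped rows, so the startup cost is paid per batch rather than once for the whole result set.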
However, I have been under the impression that LIMIT is supposed to be a
"debugging" extension to the language and not recommended for use in
queries that end up in one's program. LIMIT also only hides the latency; it
amortises it over each batch, but I still end up with memory requirements
in each "client" thread that are larger than I'd like; ideally I'd just
store the current row.
Thanks for any tips!
Best wishes,
@ndy
--
andy...@ashurst.eu.org
http://www.ashurst.eu.org/
0x7EBA75FF
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users