[sqlite] Deterministic random sampling via SELECT

Merijn Verstraaten Thu, 07 Nov 2019 03:55:58 -0800

I'm trying sample a (deterministically) random subset of a SELECT query, the 
most common solution on the internet to get random samples seems to be "SELECT 
* FROM (...) ORDER BY RANDOM() LIMIT n;" (this already has some question marks, 
since it relies on seeding RANDOM and knowing the RANDOM function is always 
evaluated in the same order every query run), but looking at the query plans 
this materialises the entire result set in memory for the sort (no surprise, I 
can't think of anyway that could work otherwise) which is rather undesirable if 
the sample size becomes large (i.e. several million rows).


Now, I already know different ways to implement a predicate function that can 
deterministically keep elements from a stream, however that relies on having a 
deterministic order for the stream. Which brings us to SQLite. I can easily 
write something like:

SELECT *
FROM (...)
WHERE my_predicate_fun()
ORDER BY column1, column2,...

And this *seems* to evaluate the where clause for each row in the order 
determined by ORDER BY, but this doesn't seem at all guaranteed by the SQL 
spec. So, is this behaviour documented/guaranteed somewhere? If not, is there 
some way to guarantee my where clause is evaluated for each row in a 
deterministic order?

In the simple case like above I could always just evaluate the query without 
the ORDER BY, step through the entire query, and evaluate the predicate in the 
application, but if I want to use this random selection as a subquery, then 
that doesn't work.

And while I'm asking questions: What if I want to do the above, but selecting 
groups of rows? So, sort of like:

SELECT *
FROM (...)
GROUP BY groupColumn
HAVING my_predicate_fun();

But where I want to return all rows in the group, rather than an aggregate.

Thanks in advance,
Merijn
_______________________________________________
sqlite-users mailing list
[email protected]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

[sqlite] Deterministic random sampling via SELECT

Reply via email to