Re: [sqlite] Deterministic random sampling via SELECT

David Raymond Thu, 07 Nov 2019 05:57:00 -0800

"So, is this behaviour documented/guaranteed somewhere?"
Short version is: Nope. The engine is free to do whatever it wants as long as 
it gives the correct result in the end.


Consider a simple
select * from foo where predicate order by non_indexed_field;
Since there is no nice ordering of the data already it's going to have to sort 
it. In which case it's probably going to check the predicate against records on 
the way _in_ to the sorter rather than _after_ sorting. Think "I might only 
have to sort 5 things instead of 5 million, so let's filter out as much as we 
can as soon as possible." And since the same data could be on the disk with its 
pages in any order you could conceive a situation where the same data could be 
processed differently depending on the specific file that it's in. The same 
query could process the data in a different order before and after a vacuum for 
example.

Or maybe it does sort first then check. But that's an internal detail which 
could change every version. And all that is all considered fine, as the end 
result of the query will still be correct and in the order specified.

The closest thing you can do is limit a query to using a specific index during 
a query, but even then you're basically relying on implementation details, and 
not a guarantee.

Others will correct me if I'm wrong on that.



-----Original Message-----
From: sqlite-users <sqlite-users-boun...@mailinglists.sqlite.org> On Behalf Of 
Merijn Verstraaten
Sent: Thursday, November 7, 2019 6:55 AM
To: sqlite-users@mailinglists.sqlite.org
Subject: [sqlite] Deterministic random sampling via SELECT

I'm trying sample a (deterministically) random subset of a SELECT query, the 
most common solution on the internet to get random samples seems to be "SELECT 
* FROM (...) ORDER BY RANDOM() LIMIT n;" (this already has some question marks, 
since it relies on seeding RANDOM and knowing the RANDOM function is always 
evaluated in the same order every query run), but looking at the query plans 
this materialises the entire result set in memory for the sort (no surprise, I 
can't think of anyway that could work otherwise) which is rather undesirable if 
the sample size becomes large (i.e. several million rows).

Now, I already know different ways to implement a predicate function that can 
deterministically keep elements from a stream, however that relies on having a 
deterministic order for the stream. Which brings us to SQLite. I can easily 
write something like:

SELECT *
FROM (...)
WHERE my_predicate_fun()
ORDER BY column1, column2,...

And this *seems* to evaluate the where clause for each row in the order 
determined by ORDER BY, but this doesn't seem at all guaranteed by the SQL 
spec. So, is this behaviour documented/guaranteed somewhere? If not, is there 
some way to guarantee my where clause is evaluated for each row in a 
deterministic order?

In the simple case like above I could always just evaluate the query without 
the ORDER BY, step through the entire query, and evaluate the predicate in the 
application, but if I want to use this random selection as a subquery, then 
that doesn't work.

And while I'm asking questions: What if I want to do the above, but selecting 
groups of rows? So, sort of like:

SELECT *
FROM (...)
GROUP BY groupColumn
HAVING my_predicate_fun();

But where I want to return all rows in the group, rather than an aggregate.

Thanks in advance,
Merijn
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Deterministic random sampling via SELECT

Reply via email to