"So, is this behaviour documented/guaranteed somewhere?" Short version is: Nope. The engine is free to do whatever it wants as long as it gives the correct result in the end.
Consider a simple select * from foo where predicate order by non_indexed_field; Since there is no nice ordering of the data already it's going to have to sort it. In which case it's probably going to check the predicate against records on the way _in_ to the sorter rather than _after_ sorting. Think "I might only have to sort 5 things instead of 5 million, so let's filter out as much as we can as soon as possible." And since the same data could be on the disk with its pages in any order you could conceive a situation where the same data could be processed differently depending on the specific file that it's in. The same query could process the data in a different order before and after a vacuum for example. Or maybe it does sort first then check. But that's an internal detail which could change every version. And all that is all considered fine, as the end result of the query will still be correct and in the order specified. The closest thing you can do is limit a query to using a specific index during a query, but even then you're basically relying on implementation details, and not a guarantee. Others will correct me if I'm wrong on that. -----Original Message----- From: sqlite-users <sqlite-users-boun...@mailinglists.sqlite.org> On Behalf Of Merijn Verstraaten Sent: Thursday, November 7, 2019 6:55 AM To: sqlite-users@mailinglists.sqlite.org Subject: [sqlite] Deterministic random sampling via SELECT I'm trying sample a (deterministically) random subset of a SELECT query, the most common solution on the internet to get random samples seems to be "SELECT * FROM (...) ORDER BY RANDOM() LIMIT n;" (this already has some question marks, since it relies on seeding RANDOM and knowing the RANDOM function is always evaluated in the same order every query run), but looking at the query plans this materialises the entire result set in memory for the sort (no surprise, I can't think of anyway that could work otherwise) which is rather undesirable if the sample size becomes large (i.e. several million rows). Now, I already know different ways to implement a predicate function that can deterministically keep elements from a stream, however that relies on having a deterministic order for the stream. Which brings us to SQLite. I can easily write something like: SELECT * FROM (...) WHERE my_predicate_fun() ORDER BY column1, column2,... And this *seems* to evaluate the where clause for each row in the order determined by ORDER BY, but this doesn't seem at all guaranteed by the SQL spec. So, is this behaviour documented/guaranteed somewhere? If not, is there some way to guarantee my where clause is evaluated for each row in a deterministic order? In the simple case like above I could always just evaluate the query without the ORDER BY, step through the entire query, and evaluate the predicate in the application, but if I want to use this random selection as a subquery, then that doesn't work. And while I'm asking questions: What if I want to do the above, but selecting groups of rows? So, sort of like: SELECT * FROM (...) GROUP BY groupColumn HAVING my_predicate_fun(); But where I want to return all rows in the group, rather than an aggregate. Thanks in advance, Merijn _______________________________________________ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users _______________________________________________ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users