We went off on a tangent, apologies.

If you have contiguous integer primary keys, you could randomly sample that 
range of integers, then pull the records with those keys.

Or in your external language of choice, sample the integers from 1 to the 
record count deterministically, select ordered by the primary key, and take the 
ones with the sampled offsets. (Stepping through 1 query and NOT doing a bunch 
of ...order by pk limit 1 offset n... queries)

Making something quick in Python I might do something like:


import random
import sqlite3

conn = sqlite3.connect(dbFile, isolation_level = None)
cur = conn.cursor()
cur.execute("select count(*) from foo;")
numRecords = cur.fetchone()[0]
sampleSize = 10
random.seed(5) #Your deterministic seed here
SampleOffsets = random.sample(range(1, numRecords + 1), sampleSize)
SampleOffsets.sort()
cur.execute("select * from foo order by primary_key;")
currentOffset = 0
for selectedOffset in SampleOffsets:
    for _ in range(selectedOffset - currentOffset - 1):
        cur.fetchone()
    nextSampleRecord = cur.fetchone()
    currentOffset = selectedOffset
    doSomethingWithSample(nextSampleRecord)



-----Original Message-----
From: sqlite-users <sqlite-users-boun...@mailinglists.sqlite.org> On Behalf Of 
Merijn Verstraaten
Sent: Thursday, November 7, 2019 2:16 PM
To: SQLite mailing list <sqlite-users@mailinglists.sqlite.org>
Subject: Re: [sqlite] Deterministic random sampling via SELECT


> On 7 Nov 2019, at 19:16, David Raymond <david.raym...@tomtom.com> wrote:
> 
> Along those lines SQLite includes the reverse_unordered_selects pragma
> https://www.sqlite.org/pragma.html#pragma_reverse_unordered_selects
> which will flip the order it sends rows in queries that don't explicitly 
> specify an ordering. It's there to assist you in finding spots in your code 
> where you might be relying on implicit ordering when you really shouldn't be.

Like the rest of this threads, this is just pointing out why the things in my 
initial email don't work, but I already knew that. Which is why I asked for 
help to see if there is a way to do what I want that *does* work. I don't care 
particularly about the details of "can I control the order the condition is 
evaluated", it's just that all reasonable ways to sample large streams that I 
know would require a deterministic order.

If someone has a different/better idea on how to return just a random sample 
from a query in a repeatable way, I'm all ears.

So far the only suggestion was "use some non-deterministic random sampling 
method and store the result", but since my samples are large and I have lots of 
them, this would balloon my storage by >100x and I don't have the available 
storage to make that work.

- Merijn
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to