Re: [sqlite] Efficient random sampling in a large table using builtin functions.

Black, Michael (IS) Thu, 08 Mar 2012 06:20:33 -0800

You don't say what language you are working in.  IN C++ I would just declare a 
"set" and put random row numbers in it until I had enough.  Then use that set 
to build the SQL.




SQLite's random() doesn't have a seed function so you don't really get very 
random numbers from run-to-run and have no good way of controlling it that I 
can find in the docs.  You want to use your language's random function if you 
want anything close to real randomness.



Hopefully your language has a similar data structure you can use.





Michael D. Black

Senior Scientist

Advanced Analytics Directorate

Advanced GEOINT Solutions Operating Unit

Northrop Grumman Information Systems

________________________________
From: sqlite-users-boun...@sqlite.org [sqlite-users-boun...@sqlite.org] on 
behalf of Benoit Mortgat [mort...@gmail.com]
Sent: Thursday, March 08, 2012 4:10 AM
To: General Discussion of SQLite Database
Subject: EXT :[sqlite] Efficient random sampling in a large table using builtin 
functions.

I have a table with millions of records.

When I run a query with

    ORDER BY random() LIMIT N;

the RANDOM() function is evaluated against all rows of my result set,
then sorting occurs, and as a result the query is slow.

In this case the query could be rewritten as:

 * Generate N, random, row numbers between 1 and (SELECT COUNT(*) FROM
   the_table_name). Maybe using remainder operator % and builtin ABS()
   and RANDOM() functions can help (see below)
 * SELECT FROM the_table WHERE rowid IN (those random numbers)

For the moment the most simple query I can think of is:

   SELECT *
     FROM my_table
    WHERE rowid IN
        ( SELECT 1 + (ABS(r.x) % c.num_rows)
            FROM (SELECT COUNT(*) AS num_rows
                    FROM my_table
                 ) AS c
           CROSS JOIN
                 (SELECT random() x
                    FROM my_table
                   LIMIT N
                 ) AS r
        );

This can however return less than N rows if by chance two random numbers
have the same remainder modulo COUNT(*) FROM my_table.

Note that the generation of N random numbers is quick because there is
no ORDER BY involved that would require computation of as many random
numbers as there are rows in the original table.

This could maybe be optimized inside SQLite if those conditions are met:

 * random() hasn't been overridden with sqlite_create_function() nor
   sqlite_create_function_v2()
 * the user SELECTs FROM a table without joins and w/o WHERE conditions.

--
Benoit Mortgat
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Efficient random sampling in a large table using builtin functions.

Reply via email to