Hi,

as I side-project I put together an implementation of Hyperminhash as an
extension for SQLite3. It provides fast, constant-memory cardinality
approximation, including the union- and intersection-set. 

For example, querying an in-memory table of two million (INT, INT)-rows,
having no indexes, the query `SELECT COUNT(*) FROM (SELECT DISTINCT foo, bar
FROM foobar)` completes in close to five seconds. The equivalent query
`SELECT hyperminhash(foo, bar) FROM foobar` completes in less than 400ms and
deviates from the true result by less than 0.5%.

The Hyperminhash-functions shine when indexes can't be used and there are
lots of unique elements, requiring sqlite to build large temporary tables to
hold unique elements in traditional `DISTINCT`-queries. There is also
significant memory savings, as the data structure used by Hyperminhash is
just 32kb and never grows while counting unique elements.

People may find this useful, the code is found at
https://github.com/lukaslueg/sqlite3_hyperminhash



--
Sent from: http://sqlite.1065341.n5.nabble.com/
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to