[Preamble: I've been told that the hackers list is appropriate for extension-related topics like this, even if it's not about contributing to core. If I'm misappropriating, please let me know.]

Goal: Personalized, context-relevant query results

We are building a deeply personalized site; think "OKCupid for product recommendations" or "Pinterest for people with your tastes". We use psych research to measure and predict your personality and traits along a number of scales (dimensions), and then we connect you with people, products and content we think you'll like.

I won't go into the design history, but you can read a little here:

http://parapoetica.wordpress.com/2012/02/15/feature-space-similarity-search-in-postgresql/

Suffice to say, this ends up needing something like KNN-GiST cubes, only:

- The overall concept is more like N-dimensional vectors than cubes
- But a dimension might be in any domain, not just floats
- All vectors have the same number of dimensions with the same meanings
- The distance along each dimension is a domain-specific function
- NULLs are allowed (the distance function will handle the semantics)
- The distance between two vectors is a function that aggregates the distances of each dimension, along with arbitrary other arguments - for instances, it might take the weighted average of the dimensions

That aggregation (which may not literally be an aggregate; I'm not sure yet) needs to happen in a SELECT list, which means it needs to be fast, which means all this (or at least much of it) has to be C.

The "simplest thing that works" is probably to hack up the cube extension, declare that everything (except inner pages) must be a zero-volume cube (cube_is_point()), map our non-float features onto floats somehow, and hard-code all the distance functions and the aggregation function.

But I think this sort of similarity-search engine has general utility, and I also want to make it easy for us to add and subtract dimensions without too much pain; that should be DDL, not code. So thinking about how this might evolve...

- I'm not sure how to represent arbitrary column-like features without reinventing the wheel and putting a database in the database. hstore only stores text, probably for this reason; I took a look at the earlier json patch and saw that it handled only a few core data types. Have there been any other PoCs that involved collections of hetereogenous data? I almost want an actual instance of an "anyarray".

- Alternatively, is there a way to index an entire, arbitrary row, rather than on a column on that row? I'm fine with this extension requiring its own table, so I leave the data where it is in the row, and only worry about indexing it. I can't just use functional indexes, because I'll need to provide operators and support functions to GiST. Maybe I have a fake sentinel column, where all the operators use SPI to introspect the row, treat each column as a feature dimension, call the underlying operators on each column's data type, etc.

- Can domains have operators, or are operators defined on types?

- Does KNN-GiST run into problems when <-> returns values that don't "make sense" in the physical world? For instance, let's say NULL <-> NULL returns a distance of 1.0. That means that NULL1 <-> NULL2 = 1.0, and NULL2 <-> NULL3 = 1.0, but NULL1 <-> NULL3 = 1.0 as well. I think that's fine - that could even describe a triangle - but my spidey sense is tingling on this.

- Are there previous discussions, patches, abandoned projects, etc. that this reminds you of and that I should go research?

Thanks for any thoughts, and I'd love collaborators or even mentors - we plan to open source whatever we produce here, and I don't have quite the theoretical background it takes to do this properly.

Jay Levitt

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to