> On Dec 18, 2017, at 6:59 AM, Craig Ringer <cr...@2ndquadrant.com> wrote:
>
> If you think it'd make logical sense to extend seg with a string descriptor
> of some sort and could come up with a name/use case that's not quite so
> narrowly focused as genetics alone, then I could see adding it as a secondary
> type in the same extension.
>
> But it's more likely that the best course would be to extract the seg
> extension from core, rename it, hack it as desired, and build it as an
> extension maintained out-of-tree.

That is exactly what I’ve been doing for a few days now, and the process is testing my sanity. Attaching a string to an interval seems like an easy enough undertaking, and I have got it to work at the UI level, at least. The queries I wanted to be able to make against it run without problems and produce the desired results. Here is my first attempt:

https://github.com/selkovjr/locus

Problems arise around queries I didn’t expect to be making, and there are issues around indexing that I am not sure how to solve. The main problem is that attaching a tag to an interval makes it incommensurate with intervals having a different tag. That makes them hard to index with an access method based on containment, such as GiST.

Problem 1. What is a union of ‘1:6000-7000’ and ‘X:10000-20000’? Intuitively, it should be NULL; however, I am not sure the method allows for that, since it was developed for objects living in the same metric space. I have mechanically reproduced the indexing methods of seg, but the resulting index is broken. All queries against an indexed table return a null result.

Problem 2. While the intersection (overlap, &c.) of any two loci produces obvious results, non-intersection does not. When I query for all loci not overlapping ‘1:6000-7000’, I expect to find all non-overlapping loci on contig 1. I don’t want the query to return anything from other contigs, because it is obvious that features on different contigs do not overlap. I may be able to fix that by making separate functions for non-overlaps and adding a constraint to them, but that seems like a kludge.

Problem 3 (alternative to 1). I realize that any clustering can help build an efficient index, no matter how bizarre. So I could, for example, ignore the contigs altogether and build a single index tree, using only position co-ordinates and pretending that all positions are on the same contig; the question then is whether and how such a lossy index will affect the ordering of query results.
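
To make Problems 2 and 3 concrete, here is a minimal Python sketch. It is not code from the locus extension. It models a same-contig non-overlap test and a position-only bounding key that has to be rechecked against the exact overlap test. The names Locus, parse, misses_on_same_contig, bounding_key, key_may_overlap and search_overlaps are hypothetical, and I am assuming integer positions with inclusive bounds and the ‘contig:start-end’ text form used in the examples above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Locus:
    """Hypothetical stand-in for a locus value: a contig tag plus an interval."""
    contig: str
    lo: int
    hi: int


def parse(text: str) -> Locus:
    """Parse 'contig:lo-hi', e.g. '1:6000-7000' (format assumed from the examples)."""
    contig, _, span = text.rpartition(":")
    lo, _, hi = span.partition("-")
    return Locus(contig, int(lo), int(hi))


def overlaps(a: Locus, b: Locus) -> bool:
    """Exact test: same contig and intersecting position ranges."""
    return a.contig == b.contig and a.lo <= b.hi and b.lo <= a.hi


def misses_on_same_contig(a: Locus, b: Locus) -> bool:
    """Problem 2: 'does not overlap', restricted to loci on the same contig."""
    return a.contig == b.contig and not overlaps(a, b)


def bounding_key(loci: Iterable[Locus]) -> Tuple[int, int]:
    """Problem 3: lossy union that drops the contig and keeps only positions."""
    loci = list(loci)
    return min(l.lo for l in loci), max(l.hi for l in loci)


def key_may_overlap(key: Tuple[int, int], query: Locus) -> bool:
    """Position-only test; it can say yes for loci on the wrong contig."""
    return key[0] <= query.hi and query.lo <= key[1]


def search_overlaps(rows: Iterable[Locus], query: Locus) -> List[Locus]:
    """Filter on positions alone, then recheck with the exact test."""
    candidates = [r for r in rows if key_may_overlap((r.lo, r.hi), query)]
    return [r for r in candidates if overlaps(r, query)]


rows = [parse(s) for s in
        ("1:6000-7000", "1:8000-9000", "X:10000-20000", "X:6500-6600")]
query = parse("1:6500-7500")
```

In this toy version the position-only filter lets ‘X:6500-6600’ through as a candidate and the recheck against the exact test drops it, so the result set stays exact even though the key is lossy. As far as I understand, GiST’s consistent function has a recheck flag meant for this kind of lossy match. The sketch says nothing about ordering, which still needs an experiment.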

Can I use a separate function for ordering? I have yet to try the experiment. Not that this would be equivalent to indexing the attributes of a composite type separately (if I understood it correctly).

An alternative to neglecting the contig element might be to use it as a second dimension. Expressed that way, a union of several loci might consist of a set of contig names attached to the bounding interval. I am not sure whether that makes any sense; to a first approximation, I imagine something equivalent to storing each contig’s data in a separate table with a separate index, except derived from a single actual database table, but I have no clue how to go about doing that.

Thanks,

—Gene