Re: genomic locus

Gene Selkov Thu, 21 Dec 2017 17:40:38 -0800

> On Dec 18, 2017, at 6:59 AM, Craig Ringer <cr...@2ndquadrant.com> wrote:
> 
> If you think it'd make logical sense to extend seg with a string descriptor 
> of some sort and could come up with a name/use case that's not quite so 
> narrowly focused as genetics alone, then I could see adding it as a secondary 
> type in the same extension.
> 
> But it's more likely that the best course would be to extract the seg 
> extension from core, rename it, hack it as desired, and build it as an 
> extension maintained out-of-tree.


That is exactly what I’ve been doing for a few days now, and the process is 
testing my sanity.

Attaching a string to an interval seems like an easy enough undertaking, and I 
have got it to work at the UI level, at least. The queries I wanted to be able 
to make against it run without problems and produce the desired results. Here 
is my first attempt: https://github.com/selkovjr/locus

Problems arise around queries I didn’t expect to be making and there are issues 
around indexing that I am not sure how to solve.

The main problem is that attaching a tag to an interval makes it incommensurate 
with intervals having a different tag. That makes them hard to index with an 
access method based on containment, such as GiST.

Problem 1. What is a union of ‘1:6000-7000’ and ‘X:10000-20000’? Intuitively, 
it should be NULL; however, I am not sure the method allows for that, since it 
was developed for objects living in the same metric space. I have mechanically 
reproduced the indexing methods of seg, but the resulting index is broken. All 
queries against an indexed table return a null result.
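
To make the difficulty concrete, here is a minimal sketch in Python rather than 
in the extension's C. The names Locus, parse_locus and union are hypothetical 
and are not taken from the locus repository; the 'contig:lower-upper' syntax is 
the one used in the examples above. The point is only that a union defined this 
way has nothing to return for loci on different contigs, while an index tree 
presumably needs a key that covers every entry beneath it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Locus:
    """A tagged interval: a contig name plus integer bounds (hypothetical type)."""
    contig: str
    lower: int
    upper: int


def parse_locus(text: str) -> Locus:
    """Parse 'contig:lower-upper', e.g. '1:6000-7000'."""
    contig, _, span = text.partition(":")
    lower, _, upper = span.partition("-")
    return Locus(contig, int(lower), int(upper))


def union(a: Locus, b: Locus) -> Optional[Locus]:
    """Bounding interval of two loci, or None when the contigs differ."""
    if a.contig != b.contig:
        # Different contigs share no coordinate space, so there is no
        # single interval that bounds both.
        return None
    return Locus(a.contig, min(a.lower, b.lower), max(a.upper, b.upper))
```

With these definitions, the union of '1:6000-7000' and 'X:10000-20000' is None, 
which is the intuitive answer but not one an internal index page can store.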

Problem 2. While the intersection (overlap, &c.) of any two loci produces 
obvious results, non-intersection does not. When I query for all loci not 
overlapping ‘1:6000-7000’, I expect to find all non-overlapping loci on contig 
1. I don’t want the query to return anything from other contigs, because it is 
obvious that features on different contigs do not overlap. I may be able to fix 
that by making separate functions for non-overlaps and adding a constraint to 
them, but that seems like a kludge.
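
The difference between the two readings of "not overlapping" can be sketched as 
follows. This is an illustration in Python under stated assumptions, not the 
extension's code: Locus and the three function names are hypothetical, and the 
intervals are assumed to be closed at both ends.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locus:
    """A tagged interval (hypothetical type, closed bounds assumed)."""
    contig: str
    lower: int
    upper: int


def overlaps(a: Locus, b: Locus) -> bool:
    """True when both loci are on the same contig and their intervals meet."""
    return a.contig == b.contig and a.lower <= b.upper and b.lower <= a.upper


def not_overlaps_naive(a: Locus, b: Locus) -> bool:
    """Plain negation: also true for any pair on different contigs."""
    return not overlaps(a, b)


def not_overlaps_same_contig(a: Locus, b: Locus) -> bool:
    """Constrained form: disjoint intervals, but only within one contig."""
    return a.contig == b.contig and not overlaps(a, b)
```

Against a query of '1:6000-7000', the naive form accepts both '1:1000-2000' and 
'X:100-200', whereas the constrained form accepts only '1:1000-2000'.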

Problem 3 (alternative to 1). I realize that any clustering can help build an 
efficient index, no matter how bizarre. So I could, for example, ignore the 
contigs altogether and build a single index tree, using only position 
co-ordinates and pretending that all positions are on the same contig; the 
question then is whether and how such a lossy index will affect the ordering of 
query results. Can I use a separate function for ordering? I have yet to make 
an experiment. Note that this would be equivalent to indexing the attributes of 
a composite type separately (if I understood it correctly).
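
The lossy-index idea can be sketched as a two-step search: a coarse filter on 
positions alone, followed by an exact recheck that restores the contig test. 
The Python below is only a model of that idea, with hypothetical names (Locus, 
index_candidates, search) and a linear scan standing in for the index tree. 
GiST does let the consistent function mark a match as needing a recheck; 
whether that mechanism fits this case is an assumption here, not something 
tested.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Locus:
    """A tagged interval (hypothetical type, closed bounds assumed)."""
    contig: str
    lower: int
    upper: int


def index_candidates(table: List[Locus], query: Locus) -> List[Locus]:
    """Lossy step: compare positions only, as if all loci shared one contig."""
    return [r for r in table if r.lower <= query.upper and query.lower <= r.upper]


def search(table: List[Locus], query: Locus) -> List[Locus]:
    """Exact overlap search built on the lossy step."""
    candidates = index_candidates(table, query)
    # Recheck: discard false positives that came from other contigs.
    hits = [r for r in candidates if r.contig == query.contig]
    # Ordering is applied to the final hits by a separate key, so it does
    # not depend on how the lossy step grouped the entries.
    return sorted(hits, key=lambda r: (r.lower, r.upper))
```

In this model the lossy step changes only how many candidates must be 
rechecked, not which rows are finally returned or in what order.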

An alternative to neglecting the contig element might be to use it as a second 
dimension. Expressed that way, a union of several loci might consist of a set 
of contig names attached to the bounding interval. Not sure whether that makes 
any sense; in the first approximation, I imagine something equivalent to 
storing each contig’s data in a separate table with a separate index, except 
derived from a single actual database table, but I have no clue how to go 
about doing that.
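
One way to picture the second-dimension idea is a bounding key that carries a 
set of contig names alongside the bounding interval. The sketch below is in 
Python and entirely hypothetical (Locus, BoundingKey, key_of, union_keys and 
may_contain_overlap are invented for illustration); it assumes closed intervals 
and does not address how large the contig set may grow near the root of a tree.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class Locus:
    """A tagged interval (hypothetical type, closed bounds assumed)."""
    contig: str
    lower: int
    upper: int


@dataclass(frozen=True)
class BoundingKey:
    """Internal-node key: the contigs seen below, plus one bounding interval."""
    contigs: FrozenSet[str]
    lower: int
    upper: int


def key_of(locus: Locus) -> BoundingKey:
    """Leaf-level key for a single locus."""
    return BoundingKey(frozenset({locus.contig}), locus.lower, locus.upper)


def union_keys(keys: Iterable[BoundingKey]) -> BoundingKey:
    """Union that is always defined: merge contig sets, widen the interval."""
    keys = list(keys)  # assumed non-empty
    contigs = frozenset().union(*(k.contigs for k in keys))
    return BoundingKey(contigs, min(k.lower for k in keys), max(k.upper for k in keys))


def may_contain_overlap(key: BoundingKey, query: Locus) -> bool:
    """Descend into a subtree only if it could hold a locus overlapping query."""
    return (query.contig in key.contigs
            and key.lower <= query.upper
            and query.lower <= key.upper)
```

Under this scheme the union of '1:6000-7000' and 'X:10000-20000' is the key 
({'1', 'X'}, 6000, 20000). A query on contig 2 is pruned immediately, while a 
query such as 'X:8000-9000' still passes the test and would have to be rejected 
at the leaf level, so the key remains lossy.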


Thanks,

—Gene
